
Posts

Did the user really ask for Exactly Once? Fault Tolerance

Exactly Once Requirements
Exactly-once is very tricky and can cause performance degradation; if your users can live with at-least-once delivery, always go with that. Having data sinks like Kudu, where you can do an upsert, makes exactly-once less necessary.
https://docs.cloudera.com/csa/1.2.0/datastream-connectors/topics/csa-kafka.html
Apache Flink, Apache NiFi Stateless and Apache Kafka can participate in that.
For CDF Stream Processing and Analytics with Apache Flink 1.10 Streaming:
Both Kafka sources and sinks can be used with exactly once processing guarantees when checkpointing is enabled.

End-to-End Guaranteed Exactly-Once Record Delivery
The Data Source and Data Sink both need to support exactly-once state semantics and take part in checkpointing.

Data Sources: Apache Kafka - must have exactly-once selected, transactions enabled, and the correct driver.
Select: Semantic.EXACTLY_ONCE

Data Sinks: HDFS BucketingSink, Apache Kafka
For Kafka, please check that the transaction timeouts line up with your checkpoint intervals.   https://ci.apache.org/proje…
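As a back-of-the-envelope sketch of that timeout advice (the helper is invented for illustration, not Flink API code; the 15-minute broker default for transaction.max.timeout.ms is an assumption about a stock Kafka config): the exactly-once Kafka sink keeps a transaction open between checkpoints, so the producer's transaction.timeout.ms has to comfortably exceed the checkpoint interval while staying under the broker-side maximum.

```python
# Hypothetical helper (not part of Flink): sanity-check that the Kafka
# producer's transaction timeout lines up with Flink checkpointing.
# Flink's exactly-once Kafka sink holds a transaction open between
# checkpoints, so transaction.timeout.ms must exceed the checkpoint
# interval, and the broker rejects anything above its
# transaction.max.timeout.ms (15 minutes by default).

def transaction_timeouts_ok(checkpoint_interval_ms: int,
                            transaction_timeout_ms: int,
                            broker_max_timeout_ms: int = 900_000) -> bool:
    """True when the producer transaction timeout covers the checkpoint
    interval and still fits under the broker-side maximum."""
    return (checkpoint_interval_ms < transaction_timeout_ms
            <= broker_max_timeout_ms)

# A 60 s checkpoint interval with a 15 minute transaction timeout is fine.
print(transaction_timeouts_ok(60_000, 900_000))   # True
# A 10 minute checkpoint interval with a 1 minute timeout would abort
# transactions before the checkpoint completes.
print(transaction_timeouts_ok(600_000, 60_000))   # False
```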

FLaNK in the Cloud!!!! Huge Cloudera Data Platform Public Cloud Updates - July 2020 - Data Flow Releases

FLaNK in the Cloud!!!!   Huge Cloudera Data Platform Public Cloud Updates July 2020 - Data Flow Releases
With the promotion of Cloud Runtime 7.2.1 to Public Cloud, the CDF team is pleased to announce three key updates that were also promoted to production today and are available to customers.  These are:
https://docs.cloudera.com/runtime/7.2.1/index.html
https://docs.cloudera.com/runtime/7.2.1/kafka-connector-reference/topics/kafka-connector-reference-hdfs.html
https://docs.cloudera.com/runtime/7.2.1/cctrl-rest-api-reference/index.html
https://docs.cloudera.com/runtime/7.2.1/smm-rest-api-reference/index.html
https://docs.cloudera.com/runtime/7.2.1/srm-rest-api-reference/index.html
https://docs.cloudera.com/runtime/7.2.1/cctrl-overview/topics/cctrl-how-it-works.html
Let's see what's new in 7.2.1!
https://docs.cloudera.com/cdf-datahub/7.2.1/release-notes/topics/cdf-datahub-whats-new.html
Flow (GA) - 2.0.3 is optimized for public cloud!  
https://docs.cloudera.com/cdf-datahub…

Phoenix / HBase / NiFi Resources

Sizing Your Apache NiFi Cluster For Production Workloads

Sizing Your Apache NiFi Cluster For Production Workloads
Cloudera Flow Management provides an enterprise edition of supported Apache NiFi managed by Cloudera Manager.  The official documentation provides a great guide for sizing your cluster.
https://docs.cloudera.com/cfm/2.0.1/nifi-sizing/topics/cfm-nifi-sizing.html
If the use case fits, the NiFi Stateless Engine may work and perform better, as it utilizes no disk.
Check out your heap usage and utilization; you may need to increase it.  24-32 gigabytes of RAM is a nice sweet spot for most instances.


Check out how your nodes, threads, and queues are doing.  If queues are not processing quickly or the thread count is high, you may need more cores, RAM, or nodes.


When you are managing your cluster in Cloudera Manager, make sure you increase the default JVM memory for Apache NiFi.  512 MB is not going to cut it for anything but single-user development.
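As a sketch of what that looks like on disk, the heap settings live in NiFi's bootstrap.conf (in Cloudera Manager you would set the equivalent through the NiFi configuration; the exact java.arg numbering can vary by install):

```properties
# conf/bootstrap.conf - stock heap, fine only for single-user development
java.arg.2=-Xms512m
java.arg.3=-Xmx512m

# production sizing per the guidance above (pick a value your nodes can spare)
java.arg.2=-Xms24g
java.arg.3=-Xmx24g
```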


Do this correctly and process a billion events!!!   https://blog.cloudera.com/benchmarking-nifi-performance-and-scal…

Report on This: Apache NiFi 1.11.4 - Monitor All The Things

The easiest way to grab monitoring data is via the NiFi REST API.  Everything in the NiFi UI is done through REST calls, which you can also make programmatically.   Please read the NiFi docs; they are linked directly from your running NiFi application and available on the web.   They are very thorough and have all the information you could want:   https://nifi.apache.org/docs/nifi-docs/.   If you are not running NiFi 1.11.4, I recommend you upgrade.   It is supported by Cloudera on multiple platforms.
NiFi REST API: https://nifi.apache.org/docs/nifi-docs/rest-api/
There's also an awesome Python wrapper for that REST API:  https://pypi.org/project/nipyapi/
Also, in NiFi flow programming, every time you produce data to Kafka you get metadata back in FlowFile attributes.   You can push those attributes directly to a Kafka topic if you want. So after your PublishKafkaRecord_2_0 1.11.4, on success, read the attributes for the number of records and other data, then use AttributesToJSON and push to another topic…
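A minimal sketch of pulling monitoring data from the REST API, assuming an unsecured NiFi on localhost:8080 and the overall controller stats from the /flow/status endpoint (field names here follow that response shape; adjust for your install):

```python
import json
from urllib.request import urlopen

# Assumed local, unsecured NiFi; secured clusters need auth headers instead.
NIFI_URL = "http://localhost:8080/nifi-api/flow/status"

def summarize_status(status: dict) -> str:
    """Pull a couple of headline numbers out of a flow/status response."""
    ctl = status["controllerStatus"]
    return (f"active threads: {ctl['activeThreadCount']}, "
            f"queued: {ctl['queued']}")

def fetch_status(url: str = NIFI_URL) -> dict:
    """Call the live endpoint; requires a running NiFi."""
    with urlopen(url) as resp:
        return json.loads(resp.read())

# With a running NiFi you would do:
#   print(summarize_status(fetch_status()))
```

For anything beyond a quick script, the nipyapi wrapper linked above handles authentication and pagination for you.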

Ingesting All The Weather Data With Apache NiFi

Ingesting All The Weather Data With Apache NiFi


Step By Step NiFi Flow
GenerateFlowFile - build a schedule matching when NOAA updates weather
InvokeHTTP - download all weather ZIP
CompressContent - decompress ZIP
UnpackContent - extract files from ZIP
*RouteOnAttribute - just give us the ones that are airports (${filename:startsWith('K')}). Optional.
*QueryRecord - XMLReader to JsonRecordSetWriter.   Query:  SELECT * FROM FLOWFILE WHERE NOT location LIKE '%Unknown%'.  This removes some locations that are not identified.  Optional.
Send it somewhere for storage.   Could be PutKudu, PutORC, PutHDFS, PutHiveStreaming, PutHBaseRecord, PutDatabaseRecord, PublishKafkaRecord_2_0, or others.
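The two optional filtering steps above can be sketched in plain Python (sample filenames and records are invented for illustration): keep airport stations, whose filenames start with 'K' as in the RouteOnAttribute expression, and drop records whose location is unknown, as in the QueryRecord SQL.

```python
# Rough stand-in for the RouteOnAttribute + QueryRecord steps in the flow.

def keep_station(filename: str, record: dict) -> bool:
    """Mirror ${filename:startsWith('K')} and
    WHERE NOT location LIKE '%Unknown%'."""
    return (filename.startswith("K")
            and "Unknown" not in record.get("location", ""))

# Made-up sample observations in the shape of the converted JSON records.
stations = [
    ("KJFK.xml", {"location": "New York, John F. Kennedy Intl Airport"}),
    ("KORD.xml", {"location": "Unknown Station"}),
    ("CYYZ.xml", {"location": "Toronto Pearson Intl Airport"}),
]

kept = [name for name, rec in stations if keep_station(name, rec)]
print(kept)  # ['KJFK.xml']
```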

URL For All US Data
invokehttp.request.url https://w1.weather.gov/xml/current_obs/all_xml.zip


Example Record As Converted JSON
[ {   "credit" : "NOAA's National Weather Service",   "credit_URL" : "http://weather.gov/",   "image" : {     "url" :…

Apache Flink SQL Demo (FLaNK Series)

Using Cloudera Data Platform with Flow Management and Streams on Azure

Using Cloudera Data Platform with Flow Management and Streams on Azure
Today I am going to be walking you through using Cloudera Data Platform (CDP) with Flow Management and Streams on Azure Cloud.  To see a streaming demo video, please join my webinar (or see it on demand) at Streaming Data Pipelines with CDF in Azure.  I'll share some additional how-to videos on using Apache NiFi and Apache Kafka in Azure very soon.   

In the above process group we are using QueryRecord to segment JSON records and only pick the ones where the temperature in Fahrenheit is over 80 degrees; then we pick out a few attributes to display from the record and send them to a Slack channel.
To become a Kafka producer, you set a Record Reader for the incoming type (JSON in my case) and then set a Record Writer for the type to send to the sensors topic.    In this case we kept it as JSON, but we could convert to Avro.   I usually do that if I am going to be reading it with Cloudera Kafka Connect.
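A toy sketch of that QueryRecord logic: keep only readings over 80 degrees Fahrenheit and format a short Slack-style message from a few fields (the field names and sample readings are invented for illustration):

```python
# Stand-in for the QueryRecord filter and the Slack message formatting.

def hot_readings(records, threshold_f=80.0):
    """SELECT-like filter: keep records whose temp_f exceeds the threshold."""
    return [r for r in records if r["temp_f"] > threshold_f]

def slack_message(record):
    """Pick a few fields out of the record for the channel post."""
    return (f"{record['station_id']}: {record['temp_f']}F "
            f"at {record['observation_time']}")

readings = [
    {"station_id": "KJFK", "temp_f": 84.2, "observation_time": "10:51 AM EDT"},
    {"station_id": "KBOS", "temp_f": 71.6, "observation_time": "10:54 AM EDT"},
]

for r in hot_readings(readings):
    print(slack_message(r))  # KJFK: 84.2F at 10:51 AM EDT
```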


Our security…

The Rise of the Mega Edge (FLaNK)

At one point edge devices were cheap, low energy, and low powered.   They may have had some old WiFi and a single-core CPU running pretty slow.    Now substantial power, memory, GPUs, and custom processors have come to the edge.
Sitting on my desk is the NVIDIA Jetson Xavier NX, a massively powerful machine that can easily be used for edge computing while sporting 8 GB of fast RAM, a GPU with 384 NVIDIA CUDA® cores and 48 Tensor cores, and a 6-core 64-bit ARM CPU.   This edge device would make a great workstation and is now something that can be affordably deployed in trucks, plants, sensors, and other Edge and IoT applications.
https://www.datainmotion.dev/2020/06/unboxing-most-amazing-edge-ai-device.html
Next to that titan device is the inexpensive hobby device, the Raspberry Pi 4, which now sports 8 GB of LPDDR4 RAM and a 4-core 64-bit ARM CPU and is speedy!   It can also be augmented with a Google Coral TPU or an Intel Movidius Neural Compute Stick 2.
https://dzone.com/articles/efm-series-…

Explore Enterprise Apache Flink with Cloudera Streaming Analytics - CSA 1.2

Explore Enterprise Apache Flink with Cloudera Streaming Analytics - CSA 1.2
What's New in Cloudera Streaming Analytics
https://docs.cloudera.com/csa/1.2.0/release-notes/topics/csa-what-new.html https://docs.cloudera.com/csa/1.2.0/index.html
Try out the tutorials now:   https://github.com/cloudera/flink-tutorials
So let's get our Apache Flink on; as part of my FLaNK Stack series, I'll show you some fun things we can do with Apache Flink + Apache Kafka + Apache NiFi.
We will look at some of the updates in Apache Flink 1.10, including the SQL Client and API.
We are working with Apache Flink 1.10, Apache NiFi 1.11.4 and Apache Kafka 2.4.1.
The SQL features are strong and we will take a look at what we can do.
https://docs.cloudera.com/csa/1.2.0/release-notes/topics/csa-supported-sql.html
Table connectors: Kafka, Kudu, Hive (through catalog)
Data formats (Kafka): JSON, Avro, CSV
Using Hive Catalog with Flink SQL: https://docs.cloudera.com/csa/1.2.0/flink-sql-table-api/topics/csa-hive-catalog.html
Use Kudu Cat…

Using Apache Kafka Using Cloudera Data Platform Data Center 7.1.1

Unboxing the Most Amazing Edge AI Device Part 1 of 3 - NVIDIA Jetson Xavier NX

Unboxing the Most Amazing Edge AI Device 
Fast, Intuitive, Powerful and Easy. Part 1 of 3 NVIDIA Jetson Xavier NX

This is the first in a series of articles on using the Jetson Xavier NX Developer Kit for Edge AI applications.   This will include running various TensorFlow, PyTorch, MXNet, and other frameworks.  I will also show how to use this amazing device with Apache projects, including the FLaNK Stack of Apache Flink, Apache Kafka, Apache NiFi, Apache MXNet, and Apache NiFi - MiNiFi.
These are not words that one would usually use to describe AI, Deep Learning, IoT, or Edge Devices.    They are now.    There is a new tool that turns what was incredibly slow and difficult into something you can easily get your hands on and develop with.  Supporting multiple models running simultaneously in containers at fast frame rates is not something I thought you could affordably run in robots and IoT devices.    Now it is, and this will drive some amazingly smart robots, drones, self-driving machines a…