FLaNK in the Cloud!!!! Huge Cloudera Data Platform Public Cloud Updates - July 2020 - Data Flow Releases



With the promotion of Cloud Runtime 7.2.1 to Public Cloud, the CDF team is pleased to announce three key updates that were also promoted to production today and are now available to customers.


Let's see what's new in 7.2.1!


Flow (GA) - 2.0.3 is optimized for public cloud!  



Streams  (GA) -  Now with Apache Kafka v2.5!


Streaming Analytics (TP) - v1.2.1, powered by Apache Flink 1.10, is in Technical Preview in CDP Public Cloud.


Start with Streaming Analytics Light Duty

Using Kudu with Flink

Using HBase with Flink

Using Kafka with Flink

General CSA Flink Docs

  • Data source reading from Kafka
  • Data sinks writing to Kafka, HBase and Kudu
  • Apache Atlas integration
  • SQL/Table API and SQL Client
  • Table connectors 
    • Kafka
    • Kudu
    • Hive (through catalog)
We can now run the FLaNK Stack in the Public Cloud automagically!

Sizing Your Apache NiFi Cluster For Production Workloads


Cloudera Flow Management provides an enterprise-supported edition of Apache NiFi managed by Cloudera Manager. The official documentation provides a great guide for sizing your cluster.

https://docs.cloudera.com/cfm/2.0.1/nifi-sizing/topics/cfm-nifi-sizing.html

If your use case fits, the NiFi Stateless Engine may perform better, since it uses no disk.

Check your heap usage and utilization; you may need to increase the heap. 24-32 GB of RAM is a nice sweet spot for most instances.



Check how your nodes, threads and queues are doing. If a queue is not draining quickly or the thread count is high, you may need more cores, RAM or nodes.
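As a rough illustration of the checks above, here is a minimal Python sketch. The NodeSnapshot fields, the sizing_hints helper and both thresholds are hypothetical values picked for this example (10,000 mirrors NiFi's default back pressure object threshold); tune them against your own cluster rather than treating them as official guidance.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeSnapshot:
    """Point-in-time numbers read off the NiFi UI or a monitoring tool."""
    active_threads: int
    max_timer_driven_threads: int
    queued_flowfiles: int
    heap_used_pct: float


def sizing_hints(snapshot: NodeSnapshot,
                 queue_limit: int = 10_000,
                 heap_limit_pct: float = 80.0) -> List[str]:
    """Return human-readable hints; the thresholds are assumptions, not rules."""
    hints = []
    if snapshot.heap_used_pct >= heap_limit_pct:
        hints.append("heap is nearly full: raise the NiFi JVM heap")
    if snapshot.active_threads >= snapshot.max_timer_driven_threads:
        hints.append("thread pool is saturated: add cores or nodes")
    if snapshot.queued_flowfiles >= queue_limit:
        hints.append("queues are backing up: add cores, RAM or nodes")
    return hints


if __name__ == "__main__":
    busy = NodeSnapshot(active_threads=10, max_timer_driven_threads=10,
                        queued_flowfiles=25_000, heap_used_pct=85.0)
    for hint in sizing_hints(busy):
        print(hint)
```

An empty list means none of the example thresholds were crossed, which is not the same as the cluster being correctly sized.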



When you are managing your cluster in Cloudera Manager, make sure you increase the default JVM memory for Apache NiFi. 512 MB is not going to cut it for anything but single-user development.



Do this correctly and process a billion events! See https://blog.cloudera.com/benchmarking-nifi-performance-and-scalability/ and notice the hardware and performance sections of that article.


General tips:

Make sure you use SSD for Provenance and other repositories.  Faster disk, happier user. https://docs.cloudera.com/cfm/2.0.1/nifi-sizing/topics/cfm-sizing-disk-configuration.html

Monitor your flows to see how many resources you need: https://www.datainmotion.dev/2020/07/report-on-this-apache-nifi-1114-monitor.html


Use Records. If your data is CSV, JSON, XML, Parquet or logs, use Record Readers and Writers; they are much faster, easier and cleaner. If it's semistructured, GrokReader can help. https://www.nifi.rocks/record-path-cheat-sheet/



Minimize use of CPU- or memory-intensive processors (or make a note of them during sizing): https://docs.cloudera.com/cfm/2.0.1/nifi-sizing/topics/cfm-sizing-resource-intensive-processors.html

There are a few decisions to make on repositories, so talk to your Cloudera friends. https://docs.cloudera.com/HDPDocuments/HDF3/HDF-3.5.1/nifi-configuration-best-practices/content/configuration-best-practices.html



Report on This: Apache NiFi 1.11.4 - Monitor All The Things

The easiest way to grab monitoring data is via the NiFi REST API. Everything in the NiFi UI is done through REST calls, which you can also make programmatically. Please read the NiFi docs; they are linked directly from your running NiFi application and are also on the web. They are very thorough and have all the information you could want: https://nifi.apache.org/docs/nifi-docs/. If you are not running NiFi 1.11.4, I recommend that you upgrade. This is supported by Cloudera on multiple platforms.

 

NiFi REST API

https://nifi.apache.org/docs/nifi-docs/rest-api/
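To make the REST approach concrete, here is a minimal Python sketch that uses only the standard library. It assumes the /nifi-api/flow/status endpoint and the controllerStatus fields named below, as I understand them from the REST API docs linked above; verify them against your NiFi version. The base URL and bearer token are placeholders you must supply, and a secured CDP cluster will need whatever authentication your environment requires.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def fetch_flow_status(base_url: str, token: Optional[str] = None) -> Dict[str, Any]:
    """GET <base_url>/nifi-api/flow/status and return the decoded JSON body."""
    request = urllib.request.Request(base_url.rstrip("/") + "/nifi-api/flow/status")
    if token:
        # Assumption: the cluster accepts a bearer token; adjust for your security setup.
        request.add_header("Authorization", "Bearer " + token)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def summarize_flow_status(payload: Dict[str, Any]) -> Dict[str, int]:
    """Pull a few headline numbers out of the response, defaulting to 0 if absent."""
    status = payload.get("controllerStatus", {})
    return {
        "active_threads": int(status.get("activeThreadCount", 0)),
        "flowfiles_queued": int(status.get("flowFilesQueued", 0)),
        "bytes_queued": int(status.get("bytesQueued", 0)),
    }


if __name__ == "__main__":
    # Hypothetical host; replace with your own NiFi node.
    print(summarize_flow_status(fetch_flow_status("http://localhost:8080")))
```

Keeping the parsing separate from the HTTP call makes it easy to run the summary against saved responses.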

 

There's also an awesome Python wrapper for that REST API:  https://pypi.org/project/nipyapi/

 

Also, in NiFi flow programming, every time you produce data to Kafka you get metadata back in FlowFile attributes. You can push those attributes directly to a Kafka topic if you want.

 

So after your PublishKafkaRecord_2_0 (1.11.4) processor, on the success relationship, read the attributes for the number of records and other data, convert them with AttributesToJSON and push the result to another topic. You may want a MergeRecord in there to aggregate a few of those together.
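Outside of NiFi, the same roll-up idea can be sketched in a few lines of Python. This is only an illustration of what the aggregated monitoring record could look like, not how AttributesToJSON or MergeRecord work internally. It assumes each FlowFile carries a msg.count attribute (the count PublishKafkaRecord reports on success); the function name and output field names are made up for this example.

```python
import json
from typing import Dict, Iterable


def summarize_publish_attributes(flowfiles: Iterable[Dict[str, str]]) -> str:
    """Roll several FlowFile attribute maps into one JSON summary record."""
    files = 0
    records = 0
    for attributes in flowfiles:
        files += 1
        # NiFi attributes are strings, so convert before summing.
        records += int(attributes.get("msg.count", "0"))
    return json.dumps({"flowfiles": files, "records_published": records},
                      sort_keys=True)


if __name__ == "__main__":
    batch = [
        {"filename": "a.json", "msg.count": "10"},
        {"filename": "b.json", "msg.count": "5"},
    ]
    print(summarize_publish_attributes(batch))
```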

 

If you are interested in Kafka metrics, record counts or monitoring, then you must use Cloudera Streams Messaging Manager. It provides a full web UI, monitoring tool, alerts, REST API and everything you need for monitoring every producer, consumer, broker, cluster, topic, message, offset and Kafka component.

 

The best way to get NiFi stats is to use the NiFi Reporting Tasks. I like the SQL Reporting Task.

 

SQL Reporting Tasks are very powerful and use standard SELECT * FROM JVM_METRICS style reporting, see my article:

https://www.datainmotion.dev/2020/04/sql-reporting-task-for-cloudera-flow.html

 

Monitoring Articles

 

https://www.datainmotion.dev/2019/04/monitoring-number-of-of-flow-files.html

https://www.datainmotion.dev/2019/03/apache-nifi-operations-and-monitoring.html

 

Other Resources

https://www.datainmotion.dev/2019/10/migrating-apache-flume-flows-to-apache_9.html

https://www.datainmotion.dev/2019/08/using-cloudera-streams-messaging.html

https://dev.to/tspannhw/apache-nifi-and-nifi-registry-administration-3c92

https://dev.to/tspannhw/using-nifi-cli-to-restore-nifi-flows-from-backups-18p9

https://nifi.apache.org/docs/nifi-docs/html/toolkit-guide.html

https://www.datainmotion.dev/p/links.html

https://www.tutorialspoint.com/apache_nifi/apache_nifi_monitoring.htm

https://community.cloudera.com/t5/Community-Articles/Building-a-Custom-Apache-NiFi-Operations-Dashboard-Part-1/ta-p/249060

https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-metrics-reporting-nar/1.11.4/org.apache.nifi.metrics.reporting.task.MetricsReportingTask/

https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-scripting-nar/1.11.4/org.apache.nifi.reporting.script.ScriptedReportingTask/index.html

 

Ingesting All The Weather Data With Apache NiFi


Ingesting All The Weather Data With Apache NiFi



Step By Step NiFi Flow

  1. GenerateFlowFile - build a schedule matching when NOAA updates weather
  2. InvokeHTTP - download all weather ZIP
  3. CompressContent - decompress ZIP
  4. UnpackContent - extract files from ZIP
  5. *RouteOnAttribute - just give us ones that are airports (${filename:startsWith('K')}). optional.
  6. *QueryRecord - XMLReader to JsonRecordSetWriter.   Query:  SELECT * FROM FLOWFILE WHERE NOT location LIKE '%Unknown%'.  This is to remove some locations that are not identified.  optional.
  7. Send it somewhere for storage.   Could use PutKudu, PutORC, PutHDFS, PutHiveStreaming, PutHBaseRecord, PutDatabaseRecord, PublishKafkaRecord_2_* or others.
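For readers who want to see the same logic outside of NiFi, here is a rough Python equivalent of steps 2 through 6 using only the standard library. It is a sketch, not the flow itself: the function names are my own, every value stays a string (no type inference like a record reader would do), and the .xml suffix check is an assumption about the archive layout. Nested elements such as the image block are skipped to keep the example short.

```python
import io
import urllib.request
import xml.etree.ElementTree as ET
import zipfile
from typing import Dict, Iterator, List

ALL_OBSERVATIONS_URL = "https://w1.weather.gov/xml/current_obs/all_xml.zip"


def download_all_observations(url: str = ALL_OBSERVATIONS_URL) -> bytes:
    """Step 2: download the ZIP of all current observations."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return response.read()


def extract_airport_files(zip_bytes: bytes) -> Iterator[bytes]:
    """Steps 3-5: unpack the ZIP and keep entries whose name starts with 'K'."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            base = name.rsplit("/", 1)[-1]
            if base.startswith("K") and base.endswith(".xml"):
                yield archive.read(name)


def xml_to_record(xml_bytes: bytes) -> Dict[str, str]:
    """Step 6a: flatten top-level leaf elements into a dict of strings."""
    root = ET.fromstring(xml_bytes)
    return {child.tag: (child.text or "").strip()
            for child in root if len(child) == 0}


def build_records(zip_bytes: bytes) -> List[Dict[str, str]]:
    """Step 6b: drop records whose location contains 'Unknown'.

    Unlike the SQL filter, a record with no location at all is kept here.
    """
    records = (xml_to_record(x) for x in extract_airport_files(zip_bytes))
    return [r for r in records if "Unknown" not in r.get("location", "")]


if __name__ == "__main__":
    observations = build_records(download_all_observations())
    print(len(observations), "airport observations")
```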








URL For All US Data

invokehttp.request.url
https://w1.weather.gov/xml/current_obs/all_xml.zip



Example Record As Converted JSON

[ {
  "credit" : "NOAA's National Weather Service",
  "credit_URL" : "http://weather.gov/",
  "image" : {
    "url" : "http://weather.gov/images/xml_logo.gif",
    "title" : "NOAA's National Weather Service",
    "link" : "http://weather.gov"
  },
  "suggested_pickup" : "15 minutes after the hour",
  "suggested_pickup_period" : 60,
  "location" : "Stanley Municipal Airport, ND",
  "station_id" : "K08D",
  "latitude" : 48.3008,
  "longitude" : -102.4064,
  "observation_time" : "Last Updated on Jul 10 2020, 9:55 am CDT",
  "observation_time_rfc822" : "Fri, 10 Jul 2020 09:55:00 -0500",
  "weather" : "Fair",
  "temperature_string" : "66.0 F (19.0 C)",
  "temp_f" : 66.0,
  "temp_c" : 19.0,
  "relative_humidity" : 83,
  "wind_string" : "South at 6.9 MPH (6 KT)",
  "wind_dir" : "South",
  "wind_degrees" : 180,
  "wind_mph" : 6.9,
  "wind_kt" : 6,
  "pressure_in" : 30.03,
  "dewpoint_string" : "60.8 F (16.0 C)",
  "dewpoint_f" : 60.8,
  "dewpoint_c" : 16.0,
  "visibility_mi" : 10.0,
  "icon_url_base" : "http://forecast.weather.gov/images/wtf/small/",
  "two_day_history_url" : "http://www.weather.gov/data/obhistory/K08D.html",
  "icon_url_name" : "skc.png",
  "ob_url" : "http://www.weather.gov/data/METAR/K08D.1.txt",
  "disclaimer_url" : "http://weather.gov/disclaimer.html",
  "copyright_url" : "http://weather.gov/disclaimer.html",
  "privacy_policy_url" : "http://weather.gov/notice.html"
} ]
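
Once a record looks like the JSON above, downstream code can slim it down. The following Python sketch keeps a handful of fields and converts the RFC 822 observation time to UTC. The field names come from the example record; the slim_observation helper and the choice of fields to keep are assumptions for illustration.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime
from typing import Any, Dict


def slim_observation(record: Dict[str, Any]) -> Dict[str, Any]:
    """Keep a few fields and normalize the observation time to UTC."""
    observed = parsedate_to_datetime(record["observation_time_rfc822"])
    return {
        "station_id": record["station_id"],
        "observed_utc": observed.astimezone(timezone.utc).isoformat(),
        "temp_f": float(record["temp_f"]),
        "relative_humidity": int(record["relative_humidity"]),
    }


if __name__ == "__main__":
    example = {
        "station_id": "K08D",
        "observation_time_rfc822": "Fri, 10 Jul 2020 09:55:00 -0500",
        "temp_f": 66.0,
        "relative_humidity": 83,
    }
    print(slim_observation(example))
```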



Using Cloudera Data Platform with Flow Management and Streams on Azure



Today I am going to walk you through using Cloudera Data Platform (CDP) with Flow Management and Streams on Azure Cloud.  To see a streaming demo video, please join my webinar (or see it on demand): Streaming Data Pipelines with CDF in Azure.  I'll share some additional how-to videos on using Apache NiFi and Apache Kafka in Azure very soon.



Apache NiFi on Azure CDP Data Hub
Sensors to ADLS/HDFS and Kafka




In the above process group we use QueryRecord to segment JSON records and pick only the ones where the temperature in Fahrenheit is over 80 degrees. We then pick out a few attributes from the record to display and send them to a Slack channel.
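
The filter-then-notify logic can be sketched in Python as follows. The field names sensor_id and temperature_f are hypothetical, since the actual sensor schema is not shown here, and the {"text": ...} payload follows the usual Slack incoming webhook shape without actually posting anything.

```python
import json
from typing import Any, Dict, Iterable, List, Optional, Sequence


def _as_float(value: Any) -> Optional[float]:
    """Convert to float, returning None for missing or malformed values."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def hot_readings(records: Iterable[Dict[str, Any]],
                 threshold_f: float = 80.0) -> List[Dict[str, Any]]:
    """Keep records whose temperature is strictly above the threshold."""
    kept = []
    for record in records:
        temperature = _as_float(record.get("temperature_f"))
        if temperature is not None and temperature > threshold_f:
            kept.append(record)
    return kept


def format_alert(record: Dict[str, Any],
                 fields: Sequence[str] = ("sensor_id", "temperature_f")) -> str:
    """Pick a few fields to display, in a fixed order."""
    return ", ".join(f"{name}={record.get(name)}" for name in fields)


def slack_payload(record: Dict[str, Any]) -> str:
    """Build the JSON body a Slack incoming webhook would expect."""
    return json.dumps({"text": format_alert(record)})


if __name__ == "__main__":
    rows = [{"sensor_id": "a", "temperature_f": 85.2},
            {"sensor_id": "b", "temperature_f": 72.0}]
    for row in hot_readings(rows):
        print(slack_payload(row))
```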

To become a Kafka producer, you set a Record Reader for the incoming type (JSON in my case) and a Record Writer for the type to send to the sensors topic. In this case we kept it as JSON, but we could convert to Avro. I usually do that if I am going to be reading it with Cloudera Kafka Connect.



Our security is automagic and requires little from you in NiFi. I put in my username and password from CDP. The SSL context is set up for me when I create my Data Hub.


When I am writing to our Real-Time Data Mart (Apache Kudu), I enter the Kudu servers that I copied from the Kudu Data Mart Hardware page, then put in my table name and login info. I recommend using UPSERT with a JSON Record Reader.


For real use cases, you will need to spin up:

Public Cloud Data Hubs:
  • Streams Messaging Heavy Duty for AWS
  • Streams Messaging Heavy Duty for Azure
  • Flow Management Heavy Duty for AWS
  • Flow Management Heavy Duty for Azure
Software:
  • Apache Kafka 2.4.1
  • Cloudera Schema Registry 0.8.1
  • Cloudera Streams Messaging Manager 2.1.0
  • Apache NiFi 1.11.4
  • Apache NiFi Registry 0.5.0
Demo Source Code:


Let's configure our Data Hubs in CDP in an Azure environment. It takes a few clicks and some naming, and then it builds.












Under the Azure Portal


In Azure, we can examine the files we uploaded to the Azure object store.





Under the Data Lake SDX


NiFi and Kafka are autoconfigured to work with Apache Atlas under our environment's Data Lake SDX.  We can browse through the lineage for all the Kafka topics we use.






We can also see the flow for NiFi, HDFS and Kudu.

SMM

We can examine all of our Kafka infrastructure from Kafka Brokers, Topics, Consumers, Producers, Latency and Messages.  We can also create and update topics.




Cloudera Manager

We still have access to all of our traditional items like Cloudera Manager to manage configuration of servers.



Under Real-Time Data Mart

We can view tables, create tables and query our tables. Hue is a great tool for accessing data in my Real-Time Data Mart in a Data Hub.



We can also look at table details in the Impala UI.


References
©2020 Timothy Spann



The Rise of the Mega Edge (FLaNK)

At one point edge devices were cheap, low energy and low powered. They might have had some old WiFi and a single-core CPU running pretty slowly. Now memory, GPUs, custom processors and substantial compute power have come to the edge.

Sitting on my desk is the NVIDIA Jetson Xavier NX, a massively powerful machine that can easily be used for edge computing. It sports 8 GB of fast RAM, a GPU with 384 NVIDIA CUDA® cores and 48 Tensor cores, and a fast 6-core 64-bit ARM CPU. This edge device would make a great workstation and is now something that can be affordably deployed in trucks, plants, sensors and other edge and IoT applications.


Next to that titan of a device is the inexpensive hobby device, the Raspberry Pi 4, which now sports 8 GB of LPDDR4 RAM and a 4-core 64-bit ARM CPU, and is speedy! It can also be augmented with a Google Coral TPU or an Intel Neural Compute Stick 2 (Movidius).


These boxes come with fast networking, Bluetooth and modern hardware in a small form factor, so they can now be deployed en masse, enabling edge computing, fast data capture, smart processing and integration with servers and cloud services. By adding the MiNiFi C++ and Java agents, a subproject of Apache NiFi, we can easily integrate these powerful devices into a streaming data pipeline. We can now build very powerful flows from edge to cloud with Apache Flink, Apache NiFi and Apache Kafka (FLaNK) plus Apache NiFi - MiNiFi. I can run AI, deep learning and machine learning, including Apache MXNet, DJL, H2O, TensorFlow, Apache OpenNLP and more, at any and all parts of my data pipeline. I can push models to my edge device, which now has a powerful GPU/TPU and adequate CPU, networking and RAM to do more than simple classification. The NVIDIA Jetson Xavier NX will run multiple real-time inference streams at 60 fps on multiple cameras.

I can run live SQL against these events at every segment of the data pipeline and combine with machine learning, alert checks and flow programming.   It's now easy to build and deploy applications from edge to cloud.

I'll be posting some simple examples in my next article.

By next year, 12 or 16 GB of RAM may be common on edge devices, perhaps with 2 CPUs of 8 cores each, multiple GPUs and large, fast SSD storage. My edge swarm may supply much of my computing power, while my flows running elastically on public and private cloud scale up and down based on demand in real time.


Explore Enterprise Apache Flink with Cloudera Streaming Analytics - CSA 1.2



What's New in Cloudera Streaming Analytics


Try out the tutorials now:   https://github.com/cloudera/flink-tutorials

So let's get our Apache Flink on. As part of my FLaNK Stack series, I'll show you some fun things we can do with Apache Flink + Apache Kafka + Apache NiFi.

We will look at some of the updates in Apache Flink 1.10, including the SQL Client and API.

We are working with Apache Flink 1.10, Apache NiFi 1.11.4 and Apache Kafka 2.4.1.

The SQL features are strong and we will take a look at what we can do.


Table connectors
  • Kafka
  • Kudu
  • Hive (through catalog)

Data formats (Kafka)
  • JSON
  • Avro
  • CSV



Building a DataStream Application in Flink

Build A Flink Project

mvn archetype:generate                               \
      -DarchetypeGroupId=org.apache.flink              \
      -DarchetypeArtifactId=flink-quickstart-java      \
      -DarchetypeVersion=1.10.0

References:










Using Apache Kafka Using Cloudera Data Platform Data Center 7.1.1


Primary Documentation Sources

Kafka Public API List

SMM REST API

SRM REST API
https://docs.cloudera.com/srm/1.0.0/rest-api-reference/index.html

Schema Registry

Kafka Using SR Libraries

Development Libraries and SDKs

Adding additional metrics to SMM

Maven Repositories

Clients

Schema Registry Usage

Tutorials

Kafka Crash Course

Details


FAQ

Unboxing the Most Amazing Edge AI Device Part 1 of 3 - NVIDIA Jetson Xavier NX


Fast, Intuitive, Powerful and Easy.


This is the first in a series of articles on using the Jetson Xavier NX Developer Kit for Edge AI applications. This will include running TensorFlow, PyTorch, MXNet and other frameworks. I will also show how to use this amazing device with Apache projects including the FLaNK Stack of Apache Flink, Apache Kafka, Apache NiFi, Apache MXNet and Apache NiFi - MiNiFi.

Fast, intuitive, powerful and easy: these are not words that one would usually use to describe AI, deep learning, IoT or edge devices. They are now. There is a new tool that turns what was incredibly slow and difficult into something that you can easily get your hands on and develop with. Running multiple models simultaneously in containers at fast frame rates is not something I thought you could affordably do in robots and IoT devices. Now you can, and this will drive some amazingly smart robots, drones, self-driving machines and applications that are not yet even prototypes.

Out of the box, this machine is sleek, lightweight and ready to go. And now it has built-in fast WiFi, yet another great upgrade! I added a 256 GB SSD, which took seconds and a few quick Linux commands. It's running Ubuntu 18.04 LTS, which supports all the deep learning and Python libraries you need and runs well. It has a powerful fan already attached, and judging by how fast it spun while I was running benchmarks, it probably needs it.










It was super easy to get working, just plugged in a USB mouse and keyboard and HDMI monitor. 

I ran the benchmarks and was massively impressed with the FPS that can be processed.   This machine has some serious power.  Basically, this device you are going to locate at the edge in a robot, drone, car or other edge point could be your desktop machine.

I ran a few graphics demos and tests to validate everything once my keyboard, mouse and HDMI monitor were connected.   The abilities are awesome.   I can see why NVIDIA GPUs are amazing for gaming.   


The specifications for the edge device are very impressive.   The 8GB of RAM makes this feel like a powerful desktop and not a low powered edge device.  




I ran the benchmarks and they were smoking fast.   I can see using this as a workstation as the FPS were nice as you can see below.




In part 2, I am going to show how to run some edge AI workloads at tremendous speed and stream the results and images to your cloud or big data environments using Apache open source frameworks including Apache Flink, Apache NiFi - MiNiFi and Apache Kafka.

In part 3, we will push the processing capabilities, amp up the workloads and test all the impressive features of this new killer edge device.

There are so many great tutorials and learning materials available for the NVIDIA Jetson Xavier NX. I have found that all my work for the Jetson Nano works here, only faster. So this is great; I'll have a few interesting demos, run-throughs and a video in the follow-up articles.

I added a standard USB hub and a Logitech C270 USB Web Camera which worked perfectly.   I will use that in the follow up articles and some edge applications.

Tutorials and Guides

References:

I highly recommend all AI, Deep Learning, IoT, IIoT, Edge and streaming developers obtain one or more of these developer kits.

This is a powerful machine in a small box. From edge applications to robotics to smart devices to anything that needs powerful processing at the edge, this is your device. It has a fast CPU, a fast GPU and all the interfaces you need. This should be part of any project. Together with the NVIDIA Jetson Nano, you now have some great affordable options for Edge AI applications. It is amazing to test drive the performance of this device. I will also be showing this at my online meetups, so join me or watch the video on YouTube later.

===

Jetson Xavier NX Developer Kit features:
 
Power:  10W (Max efficiency) | 15W (Max performance)
NVIDIA Volta architecture with 384 NVIDIA CUDA® cores and 48 Tensor cores
6-core NVIDIA Carmel ARM®v8.2 64-bit CPU 6 MB L2 + 4 MB L3
2x NVDLA Engines
8 GB 128-bit LPDDR4x @ 51.2GB/s
2x 4K @ 30 | 6x 1080p @ 60 | 14x 1080p @ 30 (H.265/H.264)
Gigabit Ethernet, M.2 Key E (WiFi/BT included), M.2 Key M (NVMe)
HDMI 
4x USB 3.1, USB 2.0 Micro-B
2x 4K @ 60! If you lower the resolution, the number of simultaneous streams scales up.

The Jetson Xavier NX Developer Kit is now available for $399 US at NVIDIA.com and from channel partners worldwide. I would recommend acquiring some ASAP, before current supplies run low and you have to wait.