
Using Cloudera Data Platform with Flow Management and Streams on Azure


Today I will walk you through using Cloudera Data Platform (CDP) with Flow Management and Streams on Azure. To see a streaming demo video, please join my webinar (or watch it on demand) at Streaming Data Pipelines with CDF in Azure. I'll share some additional how-to videos on using Apache NiFi and Apache Kafka in Azure very soon.

Apache NiFi on Azure CDP Data Hub
Sensors to ADLS/HDFS and Kafka

In the above process group we use QueryRecord to segment the JSON records and keep only those where the temperature in Fahrenheit is over 80 degrees. We then pick out a few attributes from each record to display and send them to a Slack channel.
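As a rough illustration of that filter, here is a minimal sketch that runs the same kind of SQL over a few JSON records using Python's built-in sqlite3. QueryRecord exposes the incoming records as a table named FLOWFILE; the field names (host, temperature_f, humidity, ts) and the sample values are assumptions for this example, not the schema from my flow.

```python
import json
import sqlite3

# Hypothetical sensor records; field names and values are assumptions.
records_json = """[
  {"host": "sensor-1", "temperature_f": 84.2, "humidity": 41.0, "ts": "2020-06-01 12:00:00"},
  {"host": "sensor-2", "temperature_f": 76.5, "humidity": 55.0, "ts": "2020-06-01 12:00:01"},
  {"host": "sensor-3", "temperature_f": 91.0, "humidity": 38.5, "ts": "2020-06-01 12:00:02"}
]"""

# Same shape as a QueryRecord query: filter rows, keep only a few columns.
QUERY = "SELECT host, temperature_f, ts FROM FLOWFILE WHERE temperature_f > 80"


def query_records(raw_json, query=QUERY):
    """Load JSON records into an in-memory table and run the query on them."""
    rows = json.loads(raw_json)
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.execute(
        "CREATE TABLE FLOWFILE (host TEXT, temperature_f REAL, humidity REAL, ts TEXT)"
    )
    conn.executemany(
        "INSERT INTO FLOWFILE VALUES (:host, :temperature_f, :humidity, :ts)", rows
    )
    result = [dict(row) for row in conn.execute(query)]
    conn.close()
    return result


hot = query_records(records_json)
for record in hot:
    # In the real flow these selected attributes go to a Slack channel.
    print(record)
```

sqlite3 only stands in for the SQL engine here; in NiFi the query is set as a dynamic property on the QueryRecord processor and the matching records are routed to a relationship of the same name.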

To become a Kafka producer, you set a Record Reader for the incoming type (JSON in my case) and a Record Writer for the type to send to the sensors topic. In this case we kept it as JSON, but we could convert to Avro. I usually do that if I am going to be reading it with Cloudera Kafka Connect.
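To picture what the reader and writer pair does, here is a small plain-Python stand-in. It is not NiFi or Kafka client code: the function names and sample fields are hypothetical, and it only shows the idea of reading a batch of JSON records and writing one message value per record.

```python
import json


def read_json_records(flowfile_content):
    """Stand-in for a JSON Record Reader: accept one object or an array."""
    parsed = json.loads(flowfile_content)
    return parsed if isinstance(parsed, list) else [parsed]


def write_json_messages(records):
    """Stand-in for a JSON Record Writer: one UTF-8 message value per record."""
    return [json.dumps(record, sort_keys=True).encode("utf-8") for record in records]


# Hypothetical flowfile content holding two sensor readings.
content = (
    '[{"host": "sensor-1", "temperature_f": 84.2},'
    ' {"host": "sensor-2", "temperature_f": 76.5}]'
)

messages = write_json_messages(read_json_records(content))
for value in messages:
    # Each value is what would be published to the sensors topic.
    print(value)
```

Swapping the writer for an Avro one changes only the second function, which is why converting formats in the flow is a configuration change rather than a rewrite.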

Our security is automagic and requires little from you in NiFi. I put in my username and password from CDP. The SSL context is set up for me when I create my Data Hub.

When I am writing to our Real-Time Data Mart (Apache Kudu), I enter the Kudu servers that I copied from the Kudu Data Mart Hardware page, then put in my table name and login info. I recommend UPSERT as the operation and a JSON Record Reader.
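The reason I like UPSERT is that replayed or late sensor readings do not fail on duplicate keys. Here is a minimal sketch of the semantics using a plain dictionary as the table. The key column, field names and the merge behavior for columns missing from a record are assumptions of this sketch, not Kudu client code.

```python
def upsert(table, key_column, record):
    """Insert the record if its key is new, otherwise update the existing row."""
    key = record[key_column]
    existed = key in table
    # Columns present in the record overwrite the stored values.
    table[key] = {**table.get(key, {}), **record}
    return "updated" if existed else "inserted"


# Hypothetical table keyed on a sensor reading id.
sensors = {}
first = upsert(sensors, "uuid", {"uuid": "a1", "host": "sensor-1", "temperature_f": 84.2})
second = upsert(sensors, "uuid", {"uuid": "a1", "host": "sensor-1", "temperature_f": 85.0})
third = upsert(sensors, "uuid", {"uuid": "b2", "host": "sensor-2", "temperature_f": 76.5})

print(first, second, third)
print(len(sensors), sensors["a1"]["temperature_f"])
```

With a plain INSERT the second call would be rejected as a duplicate key; with UPSERT the flow can be restarted or replayed without special handling.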

For real use cases, you will need to spin up:

Public Cloud Data Hubs:
  • Streams Messaging Heavy Duty for AWS
  • Streams Messaging Heavy Duty for Azure
  • Flow Management Heavy Duty for AWS
  • Flow Management Heavy Duty for Azure
  • Apache Kafka 2.4.1
  • Cloudera Schema Registry 0.8.1
  • Cloudera Streams Messaging Manager 2.1.0
  • Apache NiFi 1.11.4
  • Apache NiFi Registry 0.5.0
Demo Source Code:

Let's configure our Data Hubs in CDP in an Azure environment. It takes a few clicks and some naming, and then it builds.

Under the Azure Portal

In Azure, we can examine the files we uploaded to the Azure object store.

Under the Data Lake SDX

NiFi and Kafka are autoconfigured to work with Apache Atlas under our environment's Data Lake SDX. We can browse through the lineage for all the Kafka topics we use.

We can also see the flow for NiFi, HDFS and Kudu.


We can examine all of our Kafka infrastructure from Kafka Brokers, Topics, Consumers, Producers, Latency and Messages.  We can also create and update topics.

Cloudera Manager

We still have access to all of our traditional items like Cloudera Manager to manage configuration of servers.

Under Real-Time Data Mart

We can view tables, create tables and query our table.   Apache Hue is a great tool for accessing data in my Real-Time Data Mart in a datahub.

We can also look at table details in the Impala UI.

©2020 Timothy Spann

The Rise of the Mega Edge (FLaNK)

At one point edge devices were cheap, low energy and low powered. They might have had some old WiFi and a single-core CPU running pretty slowly. Now memory, GPUs, custom processors and substantial power have come to the edge.

Sitting on my desk is the NVIDIA Jetson Xavier NX, a massively powerful machine that can easily be used for edge computing. It sports 8 GB of fast RAM, a GPU with 384 NVIDIA CUDA® cores and 48 Tensor Cores, and a fast 6-core 64-bit ARM CPU. This edge device would make a great workstation and can now be affordably deployed in trucks, plants, sensors and other Edge and IoT applications.

Next to that titan device is the inexpensive hobby device, the Raspberry Pi 4, which now sports 8 GB of LPDDR4 RAM and a 4-core 64-bit ARM CPU, and is speedy! It can also be augmented with a Google Coral TPU or an Intel Movidius Neural Compute Stick 2.

These boxes come with fast networking, Bluetooth and modern hardware in small edge devices that can now be deployed en masse, enabling edge computing, fast data capture, smart processing and integration with servers and cloud services. By adding Apache NiFi's subproject MiNiFi C++ and Java agents we can easily integrate these powerful devices into a Streaming Data Pipeline. We can now build very powerful flows from edge to cloud with Apache NiFi, Apache Flink, Apache Kafka (FLaNK) and Apache NiFi - MiNiFi. I can run AI, Deep Learning and Machine Learning, including Apache MXNet, DJL, H2O, TensorFlow, Apache OpenNLP and more, at any and all parts of my data pipeline. I can push models to my edge device, which now has a powerful GPU/TPU and adequate CPU, networking and RAM to do more than simple classification. The NVIDIA Jetson Xavier NX will run multiple real-time inference streams at 60 fps on multiple cameras.

I can run live SQL against these events at every segment of the data pipeline and combine with machine learning, alert checks and flow programming.   It's now easy to build and deploy applications from edge to cloud.

I'll be posting some simple examples in my next article.

By next year, 12 or 16 GB of RAM may be common on edge devices, perhaps along with 2 CPUs with 8 cores, multiple GPUs and large fast SSD storage. My edge swarm may be running much of my computing power as my flows running elastically on public and private cloud scale up and down based on demand in real-time.

Time Series Analysis - Dataflow

In a first, we joined together the forces of NYC, New Jersey and Philly to power this meetup. A huge thanks to John Kuchmek, Amol Thacker and Paul Vidal for promoting and cross running a sweet meetup. John was an amazing meetup lead and made sure we kept moving. A giant thanks to Cloudera marketing for helping with logistics and some awesome giveaways! Hopefully next year we can do a Cinco De Mayo Taco Feast! Bill Brooks and Robert Hryniewicz were a great help! And thanks to Cloudera for providing CDP Public Cloud on AWS and CDP-DC on OpenStack for demos, development and general data fun. And thanks for the initial meetup suggestion and speaker to Bethann Noble and her awesome machine learning people.

Philly - NJ - NYC

To quote John Kuchmek:

The Internet of Things (IoT) is growing in popularity but it isn’t new. Connected devices have existed in manufacturing and utilities with Supervisory Control and Data Acquisition (SCADA) systems. Time series data has been looked at for sometime in these industries as well as the stock market. Time series analysis can bring valuable insight to businesses and individuals with smart homes. There are many parts and components to be able to collect data at the edge, store in a central location for initial analysis, model build, train and eventually deploy. Time series forecasting is one of the more challenging problems to solve in data science. Important factors in time series analysis and forecasting are seasonality, stationary nature of data and autocorrelation of target variables. We show you a platform, built on open source technology, that has this potential. Sensor data will be collected at the edge, off a Raspberry Pi, using Cloudera’s Edge Flow Manager (powered by MiNiFi). The data will then be pushed to a cluster containing Cloudera Flow Manager (powered by NiFi) so it can be manipulated, routed, and then be stored in Kudu on Cloudera’s Data Platform. Initial inspection can be done in Hue using Impala. The time series data will be analyzed with potential forecasting using an ARIMA model in CML (Cloudera Machine Learning). Time series analysis and forecasting can be applied to but not limited to stock market analysis, forecasting electricity loads, inventory studies, weather conditions, census analysis and sales forecasting.
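To make one of those factors concrete, here is a minimal sketch of lag autocorrelation in plain Python, using the common sample estimator (lagged covariance divided by the overall variance). The repeating series is made-up data for illustration, not readings from the meetup demo, and this is only a diagnostic you might look at before fitting a model, not the ARIMA model itself.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given positive lag."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    if variance == 0:
        raise ValueError("series is constant, autocorrelation is undefined")
    covariance = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return covariance / variance


# Made-up sensor-like series that repeats every 4 samples.
readings = [0, 1, 0, -1] * 6

acf_full_period = autocorrelation(readings, 4)
acf_half_period = autocorrelation(readings, 2)
print(acf_full_period, acf_half_period)
```

A strong positive value at the period length and a strong negative value at half the period is the kind of pattern that points to seasonality in the data.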

The main portion of our meetup was an amazing talk by data scientist Victor Dibia.

Analyzing Time Series Data with an ARIMA model

His talk comes right after mine and is about an hour of in-depth Data Science with many hard questions answered.   Also a cool demo.   Thanks again Victor.

We also had some really great attendees who asked some tough questions. My favorite question was by a Flink expert who joined from the West Coast who asked for a FLaNK sticker.

Time Series Analysis - Dataflow

For my small part I did a demo of ingesting data from MiNiFi to NiFi to CML and Kafka.   Flink reads from two Kafka topics, joins them and inserts into a third Kafka topic.   We call the ML model for classification as part of our ingest flow.   This is an example of my FLaNK Stack.
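The join step can be pictured with a small plain-Python stand-in. This is not Flink code: it is a simple in-memory hash join over two lists of messages, and the topic contents, the join key (host) and the field names are assumptions for illustration.

```python
from collections import defaultdict


def join_topics(left_msgs, right_msgs, key="host"):
    """Join two lists of messages on a shared key, like an inner join."""
    right_by_key = defaultdict(list)
    for msg in right_msgs:
        right_by_key[msg[key]].append(msg)

    joined = []
    for left in left_msgs:
        for right in right_by_key.get(left[key], []):
            # Merge the fields of both messages into one output record.
            joined.append({**right, **left})
    return joined


# Hypothetical messages from the two source topics.
energy = [{"host": "pi-1", "watts": 12.5}, {"host": "pi-2", "watts": 9.0}]
scada = [{"host": "pi-1", "temperature_f": 84.2}]

# These records are what would be written to the third topic.
joined = join_topics(energy, scada)
print(joined)
```

A real streaming join also has to deal with time (windows, late events, state cleanup), which is exactly the part Flink handles for us.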

MiNiFi reads data from sensors and a camera and sends it to a local NiFi gateway. That NiFi gateway sends a stream to my CDP hosted CFM NiFi cluster for processing. This cluster splits the data based on which set of sensors it came from (energy or SCADA) and then publishes to Kafka topics and populates Kudu tables with an UPSERT.
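The split by sensor set is simple routing logic. Here is a minimal sketch of it in plain Python; the sensor_type field and the sample records are hypothetical, and in the real flow this is done with NiFi processors rather than code.

```python
def route_by_sensor_type(records, type_field="sensor_type"):
    """Group records into energy, scada or unmatched routes."""
    routes = {"energy": [], "scada": [], "unmatched": []}
    for record in records:
        kind = str(record.get(type_field, "")).lower()
        target = kind if kind in ("energy", "scada") else "unmatched"
        routes[target].append(record)
    return routes


# Hypothetical records arriving from the local NiFi gateway.
incoming = [
    {"sensor_type": "energy", "host": "pi-1", "watts": 12.5},
    {"sensor_type": "SCADA", "host": "pi-2", "temperature_f": 84.2},
    {"host": "pi-3"},
]

routes = route_by_sensor_type(incoming)
for name, items in routes.items():
    # Each matched route would feed its own Kafka topic and Kudu table.
    print(name, len(items))
```

Keeping an explicit unmatched route is a design choice worth copying in the flow, so unexpected records are visible instead of silently dropped.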

We have great options for monitoring, querying and analyzing our data with the tools from CDP and CDP-DC.   These include Cloudera DAS, Apache Hue, Cloudera SMM for Kafka, Flink SQL console, Flink Dashboard, CML Notebooks, Jupyter Notebooks from CML and Apache Zeppelin.

As a separate way to investigate Kafka, I have created a Hive external table in beeline and connected that to a Kafka topic. I can now query the current state of that topic.

Video Walkthrough of FlinkSQL Application (and awesome Machine Learning Talk on Time Series)

Slides From Talk

Related Articles