Using Cloudera Data Platform with Flow Management and Streams on Azure
Today I am going to walk you through using Cloudera Data Platform (CDP) with Flow Management and Streams on Azure. To see a streaming demo video, please join my webinar (or watch it on demand), Streaming Data Pipelines with CDF in Azure. I'll share some additional how-to videos on using Apache NiFi and Apache Kafka in Azure very soon.
Apache NiFi on Azure CDP Data Hub
In the above process group, we use QueryRecord to segment the JSON records and pick only the ones where the temperature in Fahrenheit is over 80 degrees. We then pick out a few attributes from each record to display and send them to a Slack channel.
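To make that QueryRecord step concrete, here is a minimal Python sketch of the same filter-and-project logic run outside NiFi. The field names (`sensor_id`, `temperature_f`, `humidity`) and the helper `hot_readings` are assumptions for illustration only; the actual sensor schema is not shown in this post.

```python
import json

# Hypothetical field names; the real sensor schema is not shown in this post.
THRESHOLD_F = 80.0
DISPLAY_FIELDS = ("sensor_id", "temperature_f", "humidity")


def hot_readings(json_lines, threshold=THRESHOLD_F, fields=DISPLAY_FIELDS):
    """Keep records hotter than `threshold` and project a few fields."""
    selected = []
    for line in json_lines:
        record = json.loads(line)
        temp = record.get("temperature_f")
        # Skip records with a missing or non-numeric temperature.
        if not isinstance(temp, (int, float)) or temp <= threshold:
            continue
        # Keep only the handful of attributes we want to display.
        selected.append({name: record.get(name) for name in fields})
    return selected


if __name__ == "__main__":
    sample = [
        '{"sensor_id": "a1", "temperature_f": 84.2, "humidity": 40, "cpu": 3}',
        '{"sensor_id": "b2", "temperature_f": 71.0, "humidity": 55, "cpu": 9}',
    ]
    for row in hot_readings(sample):
        print(json.dumps(row))
```

In NiFi itself, QueryRecord expresses this as a SQL query over the records in the FlowFile, so no custom code is needed. The sketch is only meant to show what gets kept and what gets dropped.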
To become a Kafka producer, you set a Record Reader for the incoming type (JSON in my case) and a Record Writer for the type to send to the sensors topic. In this case we kept it as JSON, but we could convert to Avro. I usually do that if I am going to be reading it with Cloudera Kafka Connect.
Our security is automagic and requires little from you in NiFi. I put in my username and password from CDP. The SSL context is set up for me when I create my Data Hub.
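For readers who want to see the equivalent settings outside NiFi, here is a rough Python sketch of what a secured JSON producer might be given. It only builds the settings and serializes a record; it does not connect to anything. The keyword names follow the kafka-python `KafkaProducer` options. The choice of SASL_SSL with the PLAIN mechanism, the broker host and the credentials are all assumptions and placeholders, so check your own Data Hub's client configuration before relying on them.

```python
import json


def kafka_client_settings(brokers, username, password, ca_file=None):
    """Build keyword arguments in the style of kafka-python's KafkaProducer."""
    settings = {
        "bootstrap_servers": brokers,
        # Assumption: the Data Hub brokers accept SASL_SSL with PLAIN auth.
        "security_protocol": "SASL_SSL",
        "sasl_mechanism": "PLAIN",
        "sasl_plain_username": username,
        "sasl_plain_password": password,
    }
    if ca_file:
        # Optional CA bundle used to verify the brokers' TLS certificates.
        settings["ssl_cafile"] = ca_file
    return settings


def encode_record(record):
    """Serialize one record as UTF-8 JSON, like a JSON Record Writer would."""
    return json.dumps(record).encode("utf-8")


if __name__ == "__main__":
    settings = kafka_client_settings(
        brokers=["broker-1.example.com:9093"],  # placeholder host and port
        username="my-cdp-user",                 # placeholder
        password="change-me",                   # placeholder
    )
    print(sorted(settings))
    print(encode_record({"sensor_id": "a1", "temperature_f": 84.2}))
    # With kafka-python installed, these could be used roughly as:
    #   producer = KafkaProducer(value_serializer=encode_record, **settings)
    #   producer.send("sensors", {"sensor_id": "a1", "temperature_f": 84.2})
```

In the NiFi flow, the Record Reader, Record Writer and SSL context play these roles, which is why there is so little to type in.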
When I am writing to our Real-Time Data Mart (Apache Kudu), I enter the Kudu servers that I copied from the Kudu Data Mart Hardware page, then put in my table name and login info. I recommend UPSERT as the operation, with your JSON Record Reader.
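Here is a small sketch of why UPSERT is a comfortable default for sensor data that keeps arriving: a row with a new key is inserted, and a row with an existing key is updated rather than rejected as a duplicate. This is an in-memory stand-in written in Python, not the Kudu client API, and the key column `sensor_id` is a hypothetical name.

```python
def upsert(table, rows, key="sensor_id"):
    """Insert each row, or update the stored row that shares its key.

    `table` is a plain dict standing in for a table keyed by primary key.
    """
    for row in rows:
        existing = table.get(row[key])
        if existing is None:
            # New primary key: store a copy of the row.
            table[row[key]] = dict(row)
        else:
            # Known primary key: overwrite the columns present in this row.
            existing.update(row)
    return table


if __name__ == "__main__":
    readings = {}
    upsert(readings, [{"sensor_id": "a1", "temperature_f": 84.2}])
    # The same sensor reports again, and a second sensor shows up.
    upsert(readings, [
        {"sensor_id": "a1", "temperature_f": 90.0},
        {"sensor_id": "b2", "temperature_f": 70.0},
    ])
    for sensor_id, row in sorted(readings.items()):
        print(sensor_id, row)
```

With this behavior, a flow that replays or resends records does not have to decide between insert and update for each one.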
For real use cases, you will need to spin up:
Public Cloud Data Hubs:
- Streams Messaging Heavy Duty for AWS
- Streams Messaging Heavy Duty for Azure
- Flow Management Heavy Duty for AWS
- Flow Management Heavy Duty for Azure
- Apache Kafka 2.4.1
- Cloudera Schema Registry 0.8.1
- Cloudera Streams Messaging Manager 2.1.0
- Apache NiFi 1.11.4
- Apache NiFi Registry 0.5.0
Demo Source Code:
Let's configure our Data Hubs in CDP in an Azure environment. It takes a few clicks and some naming, and then it builds.
Under the Azure Portal
Under the Data Lake SDX
NiFi and Kafka are autoconfigured to work with Apache Atlas under our environment's Data Lake SDX. We can browse through the lineage for all the Kafka topics we use.
We can examine all of our Kafka infrastructure: brokers, topics, consumers, producers, latency and messages. We can also create and update topics.
We still have access to all of our traditional items like Cloudera Manager to manage configuration of servers.
Under Real-Time Data Mart
We can view tables, create tables and query our tables. Hue is a great tool for accessing data in my Real-Time Data Mart in a Data Hub.
We can also look at table details in the Impala UI.
©2020 Timothy Spann