
Using Cloudera Data Platform with Flow Management and Streams on Azure



This post walks you through using Cloudera Data Platform (CDP) with Flow Management and Streams on Azure. To see a streaming demo video, join my webinar (or watch it on demand) at Streaming Data Pipelines with CDF in Azure. I'll share additional how-to videos on using Apache NiFi and Apache Kafka in Azure soon.



Apache NiFi on Azure CDP Data Hub
Sensors to ADLS/HDFS and Kafka




In the process group above, we use QueryRecord to filter the JSON records, keeping only those where the temperature in Fahrenheit is over 80 degrees. We then pick a few attributes from each record and send them to a Slack channel.
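As a rough illustration of that filter, here is a minimal sketch that runs a QueryRecord-style SQL statement against an in-memory SQLite table. QueryRecord exposes the incoming records as a table named FLOWFILE; the field names (sensor_id, temperature_f, humidity) and the sample values are assumptions for illustration, not the schema used in the demo, and SQLite only stands in for the SQL engine inside NiFi.

```python
import json
import sqlite3

# Hypothetical sensor records; field names are assumptions for illustration.
records = [
    {"sensor_id": "s1", "temperature_f": 84.2, "humidity": 41.0},
    {"sensor_id": "s2", "temperature_f": 72.5, "humidity": 55.3},
    {"sensor_id": "s3", "temperature_f": 91.0, "humidity": 38.7},
]

# QueryRecord-style SQL: incoming records are queried as the FLOWFILE table.
QUERY = """
    SELECT sensor_id, temperature_f
    FROM FLOWFILE
    WHERE temperature_f > 80
"""


def query_records(rows, sql=QUERY):
    """Run the SQL against an in-memory table standing in for FLOWFILE."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.execute(
        "CREATE TABLE FLOWFILE (sensor_id TEXT, temperature_f REAL, humidity REAL)"
    )
    conn.executemany(
        "INSERT INTO FLOWFILE VALUES (:sensor_id, :temperature_f, :humidity)", rows
    )
    result = [dict(row) for row in conn.execute(sql)]
    conn.close()
    return result


hot = query_records(records)
print(json.dumps(hot))
```

Only the selected columns of the matching records come back, which is the shape of the message that would then be forwarded to Slack.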

To become a Kafka producer, you set a Record Reader for the incoming type (JSON in my case) and a Record Writer for the type to send to the sensors topic. In this case we kept it as JSON, but we could convert to Avro. I usually do that if I am going to read it with Cloudera Kafka Connect.



Our security is automagic and requires little work from you in NiFi. I put in my username and password from CDP. The SSL context is set up for me when I create my Data Hub.
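For a client outside NiFi, the same idea can be sketched as a small settings builder. This is a sketch under assumptions: it assumes the brokers accept SASL_SSL with the PLAIN mechanism and a CDP username and password, it uses librdkafka-style property names, and the broker hostnames and credentials are hypothetical placeholders. The second helper mimics what a JSON Record Writer does when publishing, one message per record.

```python
import json


def kafka_client_config(brokers, username, password):
    """Build librdkafka-style settings for a SASL_SSL (PLAIN) connection.

    Assumption: the cluster authenticates clients with username/password
    over TLS; check your own Data Hub before relying on these values.
    """
    return {
        "bootstrap.servers": ",".join(brokers),
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": "PLAIN",
        "sasl.username": username,
        "sasl.password": password,
    }


def to_messages(records):
    """Serialize each record as one JSON-encoded message value."""
    return [json.dumps(record, sort_keys=True).encode("utf-8") for record in records]


config = kafka_client_config(
    ["broker1.example.com:9093", "broker2.example.com:9093"],  # hypothetical hosts
    "my-cdp-user",       # hypothetical username
    "my-cdp-password",   # hypothetical password
)
messages = to_messages([{"sensor_id": "s1", "temperature_f": 84.2}])
```

The dictionary can be handed to a Kafka client library of your choice; no connection is opened by this sketch.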


When I write to our Real-Time Data Mart (Apache Kudu), I enter the Kudu servers that I copied from the Kudu Data Mart Hardware page, then put in my table name and login info. I recommend using UPSERT with your JSON Record Reader.
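To show why UPSERT is a comfortable choice for streaming writes, here is a minimal in-memory sketch of the semantics: a row with a new primary key is inserted, and a row with an existing key updates the stored columns instead of failing. The dictionary stands in for a Kudu table, and the key and column names are assumptions for illustration; this does not use the Kudu client.

```python
def upsert(table, row, key="sensor_id"):
    """Insert the row, or update the existing row that has the same primary key."""
    existing = table.get(row[key], {})
    # Columns present in the new row overwrite old values; others are kept.
    table[row[key]] = {**existing, **row}
    return table


table = {}
upsert(table, {"sensor_id": "s1", "temperature_f": 84.2, "humidity": 41.0})
# The same key arrives again (for example, a replayed record): update, no error.
upsert(table, {"sensor_id": "s1", "temperature_f": 85.0})
upsert(table, {"sensor_id": "s2", "temperature_f": 72.5})
```

With a plain insert, the second write for s1 would be a duplicate-key problem; with UPSERT the flow can replay records safely.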


For real use cases, you will need to spin up:

Public Cloud Data Hubs:
  • Streams Messaging Heavy Duty for AWS
  • Streams Messaging Heavy Duty for Azure
  • Flow Management Heavy Duty for AWS
  • Flow Management Heavy Duty for Azure
Software:
  • Apache Kafka 2.4.1
  • Cloudera Schema Registry 0.8.1
  • Cloudera Streams Messaging Manager 2.1.0
  • Apache NiFi 1.11.4
  • Apache NiFi Registry 0.5.0
Demo Source Code:


Let's configure our Data Hubs in CDP in an Azure environment. It takes a few clicks and some naming, and then it builds.












Under the Azure Portal


In Azure, we can examine the files we uploaded to the Azure object store.





Under the Data Lake SDX


NiFi and Kafka are autoconfigured to work with Apache Atlas under our environment's Data Lake SDX. We can browse through the lineage for all the Kafka topics we use.






We can also see the flow for NiFi, HDFS and Kudu.

SMM

We can examine all of our Kafka infrastructure, including brokers, topics, consumers, producers, latency and messages. We can also create and update topics.




Cloudera Manager

We still have access to all of our traditional items like Cloudera Manager to manage configuration of servers.



Under Real-Time Data Mart

We can view tables, create tables and query our tables. Hue is a great tool for accessing data in my Real-Time Data Mart in a Data Hub.



We can also look at table details in the Impala UI.


References
©2020 Timothy Spann



Commonly Used TCP/IP Ports in Streaming

Cloudera CDF and HDF Ports
NiFi and Friends
FLaNK Extended Stack


Note: 

All of these ports can be changed by administrators or by version updates. Also, if you are running Apache Knox, as in Cloudera Data Platform Public Cloud, these ports may be changed or hidden. This list is based on the version of CDF I am running and its defaults. It does not include standard Cloudera ports for Cloudera Manager, Hadoop, Atlas, Ranger and other necessary and fun services.


Cloudera Flow Management (CFM Powered by Apache NiFi)
  • Cloudera NiFi HTTP:    8080 or 9090
  • Cloudera NiFi HTTPS:  8443 or 9443
  • Cloudera NiFi RIP Socket: 10443 or 50999
  • Cloudera NiFi Node Protocol: 11443
  • Cloudera NiFi Load Balancing:  6342
  • Cloudera NiFi Registry: 18080
  • Cloudera NiFi Registry SSL: 18433
  • Cloudera NiFi Certificate Authority:  10443

Cloudera Edge Flow Management (CEM Powered by Apache NiFi - MiNiFi)

  • Cloudera EFM HTTP:  10080
  • Cloudera EFM CoAP:  8989

Cloudera Stream Processing (CSP Powered by Apache Kafka)
  • Cloudera Kafka: 9092
  • Cloudera Kafka SSL:  9093
  • Cloudera Kafka Connect:  38083
  • Cloudera Kafka Connect SSL:  38085
  • Cloudera Kafka Jetty Metrics: 38084
  • Cloudera Kafka JMX: 9393
  • Cloudera Kafka MirrorMaker JMX: 9394
  • Cloudera Kafka HTTP Metric: 24042
  • Cloudera Schema Registry: 7788
  • Cloudera Schema Registry Admin: 7789
  • Cloudera Schema Registry SSL:  7790
  • Cloudera Schema Registry Admin SSL:  7791
  • Cloudera Schema Registry Database (PostgreSQL):  5432
  • Cloudera SRM:  6669
  • Cloudera RPC: 8081
  • Cloudera SRM Rest: 6670
  • Cloudera SRM Rest SSL:  6671
  • Cloudera SMM Rest / UI: 9991
  • Cloudera SMM Manager:  8585
  • Cloudera SMM Manager SSL:  8587
  • Cloudera SMM Manager Admin:  8586
  • Cloudera SMM Manager Admin SSL: 8588
  • Cloudera SMM Service Monitor:  9997
  • Cloudera SMM Kafka Connect:  38083
  • Cloudera SMM Database (PostgreSQL):  5432

Cloudera Streaming Analytics (CSA Powered by Apache Flink)
  • Cloudera Flink Dashboard:  8082
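
A quick way to see which of these defaults are reachable on a given host is a plain TCP connect check. The sketch below uses only the Python standard library; the port numbers come from the lists above, while the selection of services and the host you pass in are up to you. A successful connect only means something is listening, not that it is the expected service.

```python
import socket

# A few of the defaults listed above; adjust for your own cluster.
DEFAULT_PORTS = {
    "NiFi HTTPS": 8443,
    "NiFi Registry": 18080,
    "Kafka": 9092,
    "Kafka SSL": 9093,
    "Schema Registry": 7788,
    "SMM REST / UI": 9991,
    "Flink Dashboard": 8082,
}


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_ports(host, ports=DEFAULT_PORTS):
    """Map each service name to whether its port accepted a connection."""
    return {name: is_port_open(host, port) for name, port in ports.items()}
```

For example, check_ports("my-nifi-host.example.com") (a hypothetical hostname) returns a dictionary of service names to True or False.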



References



Harnessing the Data Lifecycle for Customer Experience Optimization: Streaming Classifications On Twitter Streams

Harnessing the Data Lifecycle for Customer Experience Optimization: Streaming Classifications

For a deeper dive see this past webinar: available here.

In the use case solved for this webinar, I am a Streaming Engineer at an airline, CloudAir.   I need to find, filter and clean Twitter streams then perform sentiment analysis.


Score Models in the Stream to Act











As the Streaming Engineer at CloudAir, I am responsible for ingesting data from thousands of sources, operationalizing machine learning models as part of our streams, running real-time ELT/ETL processes and building event processing systems running from devices, servers and edge nodes. For today’s use case, one of our ML engineers has given me a model that was deployed into one of our production Cloudera Machine Learning (CML) environments. I logged into Cloudera Data Platform (CDP), found the model, tested it, and then extracted the information I needed to add this model to our streaming ingest flow for the social media team.




I have been given permissions to access the airline-sentiment workshop in CDP Public Cloud.




I can see all the models deployed in the project I have access to. I see that predict-sentiment is the one I am to use. It is deployed with 8 GB of RAM and 2 vCPUs.


   


I can see that it has been running successfully for a while and I can test it right from the project.


In the test request, you can see the model URL after the POST, and the accessKey is in the JSON body.


Using Cloudera Flow Management (CFM), I am ingesting real-time Twitter streams, which I filter for only airline-specific data. I then clean and transform these records in a few simple steps. The next pieces I will need are those two critical values from CML: the Access Key and the URL for the model. I will add them to an instance of an ExecuteClouderaML processor.
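To make those two values concrete, here is a minimal sketch of the kind of HTTP call the processor makes on my behalf, built with the Python standard library. It assumes the model endpoint accepts a POST whose JSON body carries the accessKey next to a request object; the URL, the access key and the "sentence" input field are hypothetical placeholders, so take the real ones from the model's test page in CML.

```python
import json
import urllib.request


def build_scoring_request(model_url, access_key, payload):
    """Build (but do not send) a POST request to a deployed model endpoint."""
    body = json.dumps({"accessKey": access_key, "request": payload}).encode("utf-8")
    return urllib.request.Request(
        model_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_scoring_request(
    "https://modelservice.example.com/model",     # hypothetical model URL
    "REPLACE_WITH_ACCESS_KEY",                    # hypothetical access key
    {"sentence": "My flight was delayed again"},  # hypothetical input field
)
```

Sending it is one more call, urllib.request.urlopen(req), which returns the scoring response as JSON when the endpoint is reachable.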






I am also sending the raw tweet (large JSON files) to a Kafka topic for further processing by other teams.


I also need to store this data to tables for ad-hoc queries. So I quickly spin up a virtual warehouse with Impala for reporting uses. I will put my data into S3 buckets as Parquet files, with an external Impala table on top, for these reports.






Once my environment is ready, which will only take a few minutes, I will launch Hue to create a table.
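The table I create in Hue is an external Impala table over the Parquet files. As a sketch, the helper below generates that kind of DDL; the table name, column list and S3 location are assumptions for illustration, not the schema used in the webinar.

```python
def external_parquet_ddl(table, columns, location):
    """Return Impala DDL for an external Parquet table at the given location."""
    cols = ",\n  ".join(f"{name} {ctype}" for name, ctype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


ddl = external_parquet_ddl(
    "tweets_enriched",                       # hypothetical table name
    [
        ("tweet_id", "STRING"),              # hypothetical columns
        ("msg", "STRING"),
        ("sentiment", "STRING"),
        ("followers_count", "BIGINT"),
    ],
    "s3a://my-bucket/tweets_enriched/",      # hypothetical bucket and path
)
print(ddl)
```

Because the table is external, dropping it in Impala leaves the Parquet files in the bucket untouched.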






From the virtual warehouse I can grab the JDBC URL that I will need to add to my Impala Connection pool in CFM for connecting to the warehouse. I will also need the JDBC driver.




From CFM I add a JDBC Controller and copy in the URL, the Impala driver name and a link to that JDBC jar. I will also set my user and password, or Kerberos credentials, to connect.






After calling CML from CFM, I can see the scoring results and use them to augment my Twitter data. The results are added to the attributes of each event and do not affect the current flowfile content.
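The split between attributes and content can be sketched with a tiny stand-in for a flowfile. This is an illustration only, not NiFi's API: the FlowFile class, the attribute prefix and the score field names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Simplified stand-in: content is the payload, attributes are metadata."""
    content: bytes
    attributes: dict = field(default_factory=dict)


def add_score(flowfile, score, prefix="sentiment"):
    """Copy scoring results into attributes; the content is left untouched."""
    for name, value in score.items():
        # Attribute values are stored as strings.
        flowfile.attributes[f"{prefix}.{name}"] = str(value)
    return flowfile


tweet = FlowFile(content=b'{"msg": "My flight was delayed again"}')
add_score(tweet, {"label": "negative", "confidence": 0.93})  # hypothetical result
```

Downstream steps can then route or filter on the new attributes while the original tweet JSON still flows through unchanged.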




Now that data is streaming into Impala, I can run ad-hoc queries and build charts on my sentiment-enriched, cleaned-up Twitter data.


For those of you who love the command line, we can also grab a link to the Impala command line tool for the virtual warehouse and query from there. This is good for quick checks.




In another section of our flow we are also storing our enriched tweets in a CDP Data Center (CDP-DC) Kudu table for additional analytics that we are running in Hue and in a Jupyter notebook that we spin up with our CDP-DC CML.



Jupyter notebooks spun up from Cloudera Machine Learning let me explore my data and do some charting, graphs and SQL work in Python3.




One of the amazing features that comes in handy when you have a complex flow spanning a hybrid environment is data management and governance. We can do that with Apache Atlas. We can navigate and search through Atlas to see how data travels through Apache NiFi, Apache Kafka, tables and Cloudera Machine Learning model activities like deployment.




Final DataFlow For Scoring




We have a QueryRecord processor in CFM that analyzes the streaming events and looks for negative sentiment from influencers. We then push those events to a Slack channel for our social media team to handle.




As we have seen, we are sending several different streams of data to Kafka topics for further processing with Spark Streaming, Flink, NiFi, Java and Kafka Streams applications. Using Cloudera Streams Messaging Manager, we can see all the components of our Kafka cluster and where our events are as they travel through topics in various brokers. You can see messages in all of the partitions, and you can build alerts for any part of your Kafka system. An important piece is that you can trace messages from all of the producers to all of the consumers and see any lag or latency that occurs in clients.






We can also push to our Operational Database (HBase) and easily scan through the rapidly inserted rows.




This demo was presented in the webinar,
Harnessing the Data Lifecycle for Customer Experience Optimization


Source Code Resources


Queries, Python, Models, Notebooks


Example Cloudera Machine Learning Connector

SQL