Harnessing the Data Lifecycle for Customer Experience Optimization: Streaming Classifications On Twitter Streams

For a deeper dive, see the past webinar, Harnessing the Data Lifecycle for Customer Experience Optimization.

In the use case solved in this webinar, I am a Streaming Engineer at an airline, CloudAir. I need to find, filter, and clean Twitter streams, then perform sentiment analysis on them.


Score Models in the Stream to Act











As the Streaming Engineer at CloudAir, I am responsible for ingesting data from thousands of sources, operationalizing machine learning models as part of our streams, running real-time ELT/ETL processes, and building event processing systems that run on devices, servers, and edge nodes. For today's use case, one of our ML engineers has given me a model deployed in one of our production Cloudera Machine Learning (CML) environments. I logged into Cloudera Data Platform (CDP), found the model, tested it, and then extracted the information I needed to add this model to our streaming ingest flow for the social media team.




I have been given permissions to access the airline-sentiment workshop in CDP Public Cloud.




I can see all the models deployed in the project I have access to. I see that predict-sentiment is the one I am to use. It is deployed with 8 GB of RAM and 2 vCPUs.


   


I can see that it has been running successfully for a while and I can test it right from the project.


In the sample request you can see the model's URL after the POST and the accessKey in the JSON body.


Using Cloudera Flow Management (CFM), I am ingesting real-time Twitter streams, which I filter for airline-specific data only. I then clean and transform these records in a few simple steps. The next pieces I need are those two critical values from CML: the Access Key and the URL for the model. I will add them to an instance of an ExecuteClouderaML processor.
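Behind the scenes, scoring against a deployed CML model is a plain REST call using exactly those two values. A minimal Python sketch, assuming the model URL and Access Key copied from CML, and assuming the model takes a single "sentence" input (the feature name is illustrative, not confirmed by the model's code):

```python
import json
import urllib.request

def build_payload(access_key: str, text: str) -> dict:
    """CML deployed models expect the access key plus a 'request' object
    carrying the model's input features ('sentence' is an assumption
    about this particular model's input name)."""
    return {"accessKey": access_key, "request": {"sentence": text}}

def score_tweet(model_url: str, access_key: str, text: str) -> dict:
    """POST one tweet to the deployed predict-sentiment model."""
    req = urllib.request.Request(
        model_url,
        data=json.dumps(build_payload(access_key, text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # CML wraps the model's answer in a 'response' field
        return json.loads(resp.read())["response"]
```

The ExecuteClouderaML processor makes an equivalent call for the records flowing through it.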






I am also sending the raw tweets (large JSON records) to a Kafka topic for further processing by other teams.


I also need to store this data in tables for ad-hoc queries, so I quickly spin up a virtual warehouse with Impala for reporting uses. I will put my data into S3 buckets as Parquet files, with an external Impala table on top, for these reports.
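That external table is a small piece of DDL. A hedged sketch, with an illustrative table name, columns, and S3 path (the real schema comes from the cleaned tweet records):

```python
# DDL for an external Impala table over Parquet files in S3.
# Table name, columns, and bucket path are illustrative assumptions.
CREATE_TWEETS_DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS tweets_sentiment (
  tweet_id  STRING,
  handle    STRING,
  text      STRING,
  followers INT,
  sentiment STRING
)
STORED AS PARQUET
LOCATION 's3a://cloudair-demo-bucket/tweets/'
"""

# With the impyla client (pip install impyla) you could run it like:
# from impala.dbapi import connect
# conn = connect(host='coordinator-host', port=21050)
# conn.cursor().execute(CREATE_TWEETS_DDL)
```

Because the table is external, dropping it later leaves the Parquet files in S3 untouched.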






Once my environment is ready, which only takes a few minutes, I launch Hue to create a table.






From the virtual warehouse I can grab the JDBC URL that I will need to add to my Impala Connection pool in CFM for connecting to the warehouse. I will also need the JDBC driver.




From CFM I add a JDBC controller service and copy in the URL, the Impala driver class name, and a link to the JDBC driver JAR. I will also set my user and password, or Kerberos credentials, to connect.






After calling CML from CFM, I can see the scoring results and can now use them to augment my Twitter data. The scores are added to the attributes of each event and do not affect the current flowfile content.




Now that data is streaming into Impala, I can run ad-hoc queries and build charts on my sentiment-enriched, cleaned-up Twitter data.


For those of you who love the command line, we can grab a link to the Impala command line tool for the virtual warehouse as well and query from there. It is good for quick checks.




In another section of our flow we are also storing our enriched tweets in a CDP Data Center (CDP-DC) Kudu table for additional analytics, which we run in Hue and in a Jupyter notebook that we spin up with our CDP-DC CML.



Jupyter notebooks spun up from Cloudera Machine Learning let me explore my data and do some charting, graphs, and SQL work in Python 3.
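A tiny sketch of that kind of notebook exploration, using invented rows in place of the Kudu table and plain Python so it runs anywhere (in the real notebook you would typically load a pandas DataFrame over an Impala connection and chart it):

```python
from collections import Counter

# Illustrative rows standing in for records read from the Kudu table.
tweets = [
    {"handle": "@flyer1", "sentiment": "NEGATIVE"},
    {"handle": "@flyer2", "sentiment": "POSITIVE"},
    {"handle": "@flyer3", "sentiment": "NEGATIVE"},
    {"handle": "@flyer4", "sentiment": "NEUTRAL"},
]

# Count tweets per sentiment class -- the kind of summary you would
# feed straight into a bar chart in the notebook.
tweets_per_class = Counter(t["sentiment"] for t in tweets)
print(tweets_per_class)  # Counter({'NEGATIVE': 2, 'POSITIVE': 1, 'NEUTRAL': 1})
```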




One of the amazing features that comes in handy when you have a complex flow spanning a hybrid environment is data management and governance, which we get with Apache Atlas. We can navigate and search through Atlas to see how data travels through Apache NiFi, Apache Kafka, tables, and Cloudera Machine Learning model activities such as deployment.




Final DataFlow For Scoring




We have a QueryRecord processor in CFM that analyzes the streaming events and looks for negative sentiment from influencers; we then push those events to a Slack channel for our social media team to handle.
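QueryRecord evaluates Apache Calcite SQL against the records in each flowfile. The sqlite sketch below mimics that routing query; the column names and the follower threshold that defines an "influencer" are illustrative assumptions:

```python
import sqlite3

# In-memory stand-in for the records inside one flowfile.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flowfile (handle TEXT, followers INT, sentiment TEXT)")
conn.executemany(
    "INSERT INTO flowfile VALUES (?, ?, ?)",
    [
        ("@bigvoice", 250000, "NEGATIVE"),
        ("@quietguy", 40, "NEGATIVE"),
        ("@happyfan", 500000, "POSITIVE"),
    ],
)

# Records matching this query would be routed to the Slack relationship.
alerts = conn.execute(
    "SELECT handle FROM flowfile "
    "WHERE sentiment = 'NEGATIVE' AND followers > 100000"
).fetchall()
print(alerts)  # [('@bigvoice',)]
```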




As we have seen, we are sending several different streams of data to Kafka topics for further processing with Spark Streaming, Flink, NiFi, Java, and Kafka Streams applications. Using Cloudera Streams Messaging Manager (SMM) we can see all the components of our Kafka cluster and where our events are as they travel through topics on the various brokers. You can see messages in all of the partitions, and you can also build alerts for any part of your Kafka system. An important piece is that you can trace messages from all of the consumers back to all of the producers and see any lag or latency that occurs in clients.






We can also push to our Operational Database (HBase) and easily scan through the rapidly inserted rows.




This demo was presented in the webinar,
Harnessing the Data Lifecycle for Customer Experience Optimization


Source Code Resources


Queries, Python, Models, Notebooks


Example Cloudera Machine Learning Connector

SQL


Predicting Sensor Readings with Time Series Machine Learning







Sensors:

Sensor Unit (https://shop.pimoroni.com/products/enviro?variant=31155658457171)
  • BME280 temperature, pressure, humidity sensor
  • LTR-559 light and proximity sensor
  • MICS6814 analog gas sensor
  • ADS1015 ADC with spare channel for adding another analog sensor
  • MEMS microphone
  • 0.96-inch, 160 x 80 color LCD
Unit:
  • Raspberry Pi 4
  • Intel Movidius 2
  • JDK 8
  • MiNIFi Java Agent 0.6.0
  • Python 3




Example Data

{"uuid": "rpi4_uuid_omi_20200417211935", "amplitude100": 0.3, "amplitude500": 0.1, "amplitude1000": 0.1, "lownoise": 0.1, "midnoise": 0.1, "highnoise": 0.1, "amps": 0.3, "ipaddress": "192.168.1.243", "host": "rp4", "host_name": "rp4", "macaddress": "dc:a6:32:03:a6:e9", "systemtime": "04/17/2020 17:19:36", "endtime": "1587158376.22", "runtime": "36.47", "starttime": "04/17/2020 17:18:58", "cpu": 0.0, "cpu_temp": "59.0", "diskusage": "46651.6 MB", "memory": 6.3, "id": "20200417211935_7b7ae5da-905b-418b-94f1-270a15dbc1df", "temperature": "38.7", "adjtemp": "29.7", "adjtempf": "65.5", "temperaturef": "81.7", "pressure": 1015.6, "humidity": 6.8, "lux": 1.2, "proximity": 0, "oxidising": 8.3, "reducing": 306.4, "nh3": 129.5, "gasKO": "Oxidising: 8300.63 Ohms\nReducing: 306352.94 Ohms\nNH3: 129542.17 Ohms"}
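The readings arrive as one JSON object per event; note that several numeric fields (temperature, cpu_temp) are encoded as strings. A quick parse of a trimmed copy of the sample record above:

```python
import json

# Trimmed copy of the example payload; string-typed numerics need casting.
record = json.loads(
    '{"uuid": "rpi4_uuid_omi_20200417211935", "temperature": "38.7", '
    '"pressure": 1015.6, "humidity": 6.8, "host": "rp4"}'
)

temperature_c = float(record["temperature"])  # cast string field to float
print(record["host"], record["pressure"], temperature_c)  # rp4 1015.6 38.7
```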

SQL Reporting Task for Cloudera Flow Management / HDF / Apache NiFi


Would you like reporting tasks that gather metrics and send them to your database or Kafka from NiFi, based on a query of NiFi provenance, bulletins, metrics, processor status, or other KPIs?

Now you can. If you are using HDF 3.5.0, this reporting task NAR is pre-installed and ready to go.

Let's add some reporting tasks that use SQL: QueryNiFiReportingTask.



The first query that interested me was against provenance for one processor that consumes from a certain topic; I decided to run it every 10 seconds. My query and some results are below.

So let's go to Controller Settings / Reporting Tasks and then add QueryNiFiReportingTask:


We add one per item we want to monitor. Each reporting task also needs a place to send the records (a sink): a JDBC database (DatabaseRecordSink), Kafka (KafkaRecordSink), Prometheus (PrometheusRecordSink), a script (ScriptedRecordSink), or Site-to-Site (SiteToSiteReportingRecordSink). I am going to use Kafka, but Prometheus, Database, and Site-to-Site are good options. If you use Site-to-Site, you can send the records to another NiFi cluster for that NiFi-ception processing route.


We have to write an Apache Calcite-compliant SQL query, set our sink, and decide whether we want to include zero-record results (false is a good idea).



One option is to query the BULLETINS table, which are NiFi Cluster bulletins (warnings/errors).


Another option is the CONNECTION_STATUS table.




How about NiFi JVM metrics? There is some good stuff in there.
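Inside NiFi these queries are written in Apache Calcite SQL; the sqlite sketch below only illustrates the shape of a monitoring query against a table like CONNECTION_STATUS, with a reduced set of columns and invented sample values:

```python
import sqlite3

# Stand-in for NiFi's CONNECTION_STATUS reporting table (subset of columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CONNECTION_STATUS "
    "(sourceName TEXT, destinationName TEXT, queuedCount INT, queuedBytes INT)"
)
conn.executemany(
    "INSERT INTO CONNECTION_STATUS VALUES (?, ?, ?, ?)",
    [
        ("ConsumeKafka", "QueryRecord", 12, 4096),
        ("QueryRecord", "PutSlack", 9500, 9500000),
    ],
)

# Flag connections whose queues are backing up (threshold is arbitrary).
backed_up = conn.execute(
    "SELECT sourceName, destinationName, queuedCount "
    "FROM CONNECTION_STATUS WHERE queuedCount > 1000"
).fetchall()
print(backed_up)  # [('QueryRecord', 'PutSlack', 9500)]
```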



Let's configure some Kafka record sinks. I am using Kafka 2.3, so I'll use the Kafka 2 sink. I set some brokers (default port of 9092), create a new topic for it, and choose the record writer. I chose JSON, but it could be CSV, Avro, Parquet, XML, or something else. Stick with JSON or Avro for Kafka.


Data then starts getting sent; by default the schedule is every 5 minutes. This can be adjusted; for provenance I set mine to every 10 seconds.

 

Let's look at the data as it passes through Kafka; in the follow-up to this article we'll land it in Impala/Kudu/Hive/Druid/Phoenix/HBase or HDFS/S3 and query it there. For now, we can examine the data within Kafka via Cloudera Streams Messaging Manager (SMM).







We can examine any of the topics in Kafka with SMM. We can then start consuming these records and build live metric-handling applications, or send them to live data marts and dashboards powered by Cloudera Data Platform (CDP).
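A small consumer sketch for one of those metrics topics, using the kafka-python client; the broker address, topic name, and the fields checked are assumptions:

```python
import json

def decode_metric(raw: bytes) -> dict:
    """Deserialize one JSON metric record written by the KafkaRecordSink."""
    return json.loads(raw.decode("utf-8"))

# With kafka-python (pip install kafka-python) the consuming loop would
# look roughly like this:
# from kafka import KafkaConsumer
# consumer = KafkaConsumer(
#     "nifi-connection-status",            # assumed topic name
#     bootstrap_servers=["broker1:9092"],  # assumed broker address
#     value_deserializer=decode_metric,
# )
# for msg in consumer:
#     if msg.value.get("queuedCount", 0) > 1000:
#         print("backlog on", msg.value.get("sourceName"))
```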


We have a full set of SQL to use for these reporting tasks, including selecting columns, aggregates like MAX and AVG, ordering, grouping, WHERE clauses, and row limits.



Provenance Query

SELECT eventId, durationMillis, lineageStart, timestampMillis,
updatedAttributes, entitySize, details
FROM PROVENANCE
WHERE componentName = 'Consume Kafka iot messages'
LIMIT 25

Provenance Results


[{"eventId":2724,"durationMillis":69,"lineageStart":1586294707989,"timestampMillis":1586294708002,"updatedAttributes":"{path=./, schema.protocol.version=1, filename=be499074-c595-46f5-a03a-482607fb9c8c, schema.identifier=1, kafka.partition=7, kafka.offset=36, schema.name=SensorReading, kafka.timestamp=1586294707933, kafka.topic=iot, schema.version=1, mime.type=application/json, uuid=be499074-c595-46f5-a03a-482607fb9c8c}","entitySize":246,"details":null},{"eventId":2736,"durationMillis":20,"lineageStart":1586294708905,"timestampMillis":1586294708916,"updatedAttributes":"{path=./, schema.protocol.version=1, filename=74c9e28d-82a4-4cea-8331-92b7a4bee1b3, schema.identifier=1, kafka.partition=3, kafka.offset=36, schema.name=SensorReading, kafka.timestamp=1586294708898, kafka.topic=iot, schema.version=1, mime.type=application/json, uuid=74c9e28d-82a4-4cea-8331-92b7a4bee1b3}","entitySize":247,"details":null},..]

I have classes on using this technology and some webinars coming to you on the East Coast of the US and virtually at my meetup.

https://www.meetup.com/futureofdata-princeton/

Cloudera HDF 3.5.0 Is Now Available For Download

https://docs.cloudera.com/HDPDocuments/HDF3/HDF-3.5.0/release-notes/content/whats-new.html

https://www.cloudera.com/downloads/hdf.html

HDF 3.5.0 includes the following components:
  • Apache Ambari 2.7.5
  • Apache Kafka 2.3.1
  • Apache NiFi 1.11.1
  • NiFi Registry 0.5.0
  • Apache Ranger 1.2.0
  • Apache Storm 1.2.1
  • Apache ZooKeeper 3.4.6
  • Apache MiNiFi Java Agent 0.6.0
  • Apache MiNiFi C++ 0.6.0
  • Hortonworks Schema Registry 0.8.1
  • Hortonworks Streaming Analytics Manager 0.6.0
  • Apache Knox 1.0.0
  • SmartSense 1.5.0

SQL reporting task: the QueryNiFiReportingTask allows users to execute SQL queries against tables containing information on connection status, processor status, bulletins, process group status, JVM metrics, provenance, and connection status predictions. In combination with Site-to-Site, it is particularly useful for defining fine-grained monitoring capabilities on top of running workflows.