Harnessing the Data Lifecycle for Customer Experience Optimization: Streaming Classifications On Twitter Streams


For a deeper dive, see the past webinar, Harnessing the Data Lifecycle for Customer Experience Optimization.

In the use case for this webinar, I am a Streaming Engineer at an airline, CloudAir. I need to find, filter and clean Twitter streams, then perform sentiment analysis.


Score Models in the Stream to Act











As the Streaming Engineer at CloudAir, I am responsible for ingesting data from thousands of sources, operationalizing machine learning models as part of our streams, running real-time ELT/ETL processes and building event processing systems that run on devices, servers and edge nodes. For today's use case, one of our ML engineers gave me a model that was deployed into one of our production Cloudera Machine Learning (CML) environments. I logged into Cloudera Data Platform (CDP), found the model, tested it, and then extracted the information I needed to add this model to our streaming ingest flow for the social media team.




I have been given permissions to access the airline-sentiment workshop in CDP Public Cloud.




I can see all the models deployed in the project I have access to. The one I am to use is predict-sentiment. It is deployed with 8 GB of RAM and 2 vCPUs.


   


I can see that it has been running successfully for a while and I can test it right from the project.


The test panel shows the model URL right after the POST, and the accessKey is in the JSON body.


Using Cloudera Flow Management (CFM), I am ingesting real-time Twitter streams, which I filter down to airline-specific data. I then clean and transform these records in a few simple steps. The next pieces I need are the two critical values from CML: the Access Key and the URL for the model. I will add them to an instance of an ExecuteClouderaML processor.
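To make it concrete what those two values are used for, here is a minimal Python sketch of calling a deployed model directly. It assumes the endpoint accepts a POST whose JSON body carries `accessKey` alongside a `request` object; the URL, the key and the `sentence` field name below are placeholders, not values from the workshop, and the function names are hypothetical.

```python
import json
import urllib.request


def build_scoring_request(model_url, access_key, features):
    """Build (but do not send) the POST request for a deployed model.

    Assumption: the body shape is {"accessKey": ..., "request": {...}}.
    """
    body = json.dumps({"accessKey": access_key, "request": features}).encode("utf-8")
    return urllib.request.Request(
        model_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def score(model_url, access_key, features, timeout=10):
    """Send the request and decode the JSON reply from the model."""
    req = build_scoring_request(model_url, access_key, features)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Placeholder values only; nothing is sent until score() is called.
example = build_scoring_request(
    "https://modelservice.example.com/model",
    "REPLACE_WITH_ACCESS_KEY",
    {"sentence": "text of one cleaned tweet"},
)
```

In the flow itself the processor does this work; a script like this is mainly useful for checking the key and URL before wiring them in.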






I am also sending the raw tweets (large JSON documents) to a Kafka topic for further processing by other teams.


I also need to store this data to tables for ad-hoc queries. So I quickly spin up a virtual warehouse with Impala for reporting uses. I will put my data into S3 buckets as Parquet files, with an external Impala table on top, for these reports.






Once my environment is ready, which will only take a few minutes, I will launch Hue to create a table.






From the virtual warehouse I can grab the JDBC URL that I will need to add to my Impala Connection pool in CFM for connecting to the warehouse. I will also need the JDBC driver.




From CFM I add a JDBC Controller and copy in the URL, the Impala driver name and a link to that JDBC jar. I will also set my user and password, or Kerberos credentials, to connect.






After calling CML from CFM, I can see the scoring results and use them to augment my Twitter data. The results are added to the attributes of each event and do not affect the current flowfile content.




Now that data is streaming into Impala, I can run ad-hoc queries and build charts on my sentiment-enriched, cleaned-up Twitter data.


For those of you that love the command line, we can grab a link to the Impala command line tool for the virtual warehouse as well, and query from there. Good for quick checks.




In another section of our flow we are also storing our enriched tweets in a CDP Data Center (CDP-DC) Kudu table for additional analytics that we are running in Hue and in a Jupyter notebook
that we spin up with our CDP-DC CML.



Jupyter notebooks spun up from Cloudera Machine Learning let me explore my data and do some charting, graphs and SQL work in Python 3.




One feature that comes in handy when a complex flow spans a hybrid environment is
data management and governance, which we get with Apache Atlas.
We can navigate and search through Atlas to see how data travels through Apache NiFi, Apache
Kafka, tables and Cloudera Machine Learning model activities like deployment.




Final DataFlow For Scoring




We have a QueryRecord processor in CFM that analyzes the streaming events and looks for
negative sentiment from influencers. We then push those events to a Slack channel for our social
media team to handle.
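The selection rule can be pictured as a simple predicate over each event. The Python sketch below is only an illustration of that logic, not the query used in the flow; the field names `sentiment` and `followers_count` and the 10,000 follower threshold are assumptions made up for the example.

```python
def is_negative_influencer(event, min_followers=10_000):
    """True when an event is scored negative and comes from a large account.

    Assumption: 'sentiment' and 'followers_count' are hypothetical field
    names; the real record schema may differ.
    """
    sentiment = str(event.get("sentiment", "")).lower()
    followers = int(event.get("followers_count") or 0)
    return sentiment == "negative" and followers >= min_followers


def select_alerts(events, min_followers=10_000):
    """Keep only the events that should be pushed to the alert channel."""
    return [e for e in events if is_negative_influencer(e, min_followers)]


# Made-up events, purely to show the filter in action.
events = [
    {"id": 1, "sentiment": "negative", "followers_count": 25_000},
    {"id": 2, "sentiment": "negative", "followers_count": 40},
    {"id": 3, "sentiment": "positive", "followers_count": 90_000},
]
alerts = select_alerts(events)
```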




As we have seen, we are sending several different streams of data to Kafka topics for further
processing with Spark Streaming, Flink, NiFi, Java and Kafka Streams applications. Using
Cloudera Streams Messaging Manager we can see all the components of our Kafka cluster
and where our events are as they travel through topics in various brokers. You can see
messages in all of the partitions, and you can also build alerts for any part of your Kafka system.
An important piece is that you can trace messages between all of the producers and all of the
consumers and see any lag or latency that occurs in clients.
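For readers new to the term, consumer lag on a partition is the gap between the newest offset written to the log and the offset the consumer group has committed. The Python sketch below shows only that arithmetic, with made-up offsets; it does not query Kafka or SMM, and treating a partition with no committed offset as fully unread is an assumption of this example.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag: log end offset minus committed offset.

    Both arguments map partition number -> offset. A partition with no
    committed offset is treated as unread from offset 0 (an assumption).
    """
    lag = {}
    for partition, end in end_offsets.items():
        committed = committed_offsets.get(partition, 0)
        lag[partition] = max(end - committed, 0)
    return lag


# Made-up offsets for a three-partition topic.
lag = consumer_lag({0: 120, 1: 80, 2: 15}, {0: 100, 1: 80})
total_lag = sum(lag.values())
```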






We can also push to our Operational Database (HBase) and easily scan through the rapidly inserted rows.




This demo was presented in the webinar,
Harnessing the Data Lifecycle for Customer Experience Optimization


Source Code Resources


Queries, Python, Models, Notebooks


Example Cloudera Machine Learning Connector

SQL


Predicting Sensor Readings with Time Series Machine Learning







Sensors:

Sensor Unit (https://shop.pimoroni.com/products/enviro?variant=31155658457171)
  • BME280 temperature, pressure, humidity sensor
  • LTR-559 light and proximity sensor
  • MICS6814 analog gas sensor
  • ADS1015 ADC with spare channel for adding another analog sensor
  • MEMS microphone
  • 0.96-inch, 160 x 80 color LCD
Unit
  • Raspberry Pi 4
  • Intel Movidius 2
  • JDK 8
  • MiNiFi Java Agent 0.6.0
  • Python 3




Example Data

{"uuid": "rpi4_uuid_omi_20200417211935", "amplitude100": 0.3, "amplitude500": 0.1, "amplitude1000": 0.1, "lownoise": 0.1, "midnoise": 0.1, "highnoise": 0.1, "amps": 0.3, "ipaddress": "192.168.1.243", "host": "rp4", "host_name": "rp4", "macaddress": "dc:a6:32:03:a6:e9", "systemtime": "04/17/2020 17:19:36", "endtime": "1587158376.22", "runtime": "36.47", "starttime": "04/17/2020 17:18:58", "cpu": 0.0, "cpu_temp": "59.0", "diskusage": "46651.6 MB", "memory": 6.3, "id": "20200417211935_7b7ae5da-905b-418b-94f1-270a15dbc1df", "temperature": "38.7", "adjtemp": "29.7", "adjtempf": "65.5", "temperaturef": "81.7", "pressure": 1015.6, "humidity": 6.8, "lux": 1.2, "proximity": 0, "oxidising": 8.3, "reducing": 306.4, "nh3": 129.5, "gasKO": "Oxidising: 8300.63 Ohms\nReducing: 306352.94 Ohms\nNH3: 129542.17 Ohms"}
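Note that several numeric readings in this record arrive as strings (for example `cpu_temp`, `temperature` and `runtime`), and the timestamps are formatted as month/day/year. Before feeding such records to a time series model, a small normalization step helps. The Python sketch below is one way to do that; the helper name and the extra `diskusage_mb` field are my own additions for illustration, and the sample uses a few fields copied from the record above.

```python
import json
from datetime import datetime

# Fields that the example record carries as strings but that hold numbers.
NUMERIC_STRING_FIELDS = (
    "endtime", "runtime", "cpu_temp", "temperature",
    "adjtemp", "adjtempf", "temperaturef",
)
TIME_FIELDS = ("systemtime", "starttime")
TIME_FORMAT = "%m/%d/%Y %H:%M:%S"


def normalize_reading(raw_json):
    """Parse one sensor record and convert string fields to typed values."""
    rec = json.loads(raw_json)
    for name in NUMERIC_STRING_FIELDS:
        if name in rec:
            rec[name] = float(rec[name])
    for name in TIME_FIELDS:
        if name in rec:
            rec[name] = datetime.strptime(rec[name], TIME_FORMAT)
    if "diskusage" in rec:
        # "46651.6 MB" -> 46651.6 (hypothetical derived field)
        rec["diskusage_mb"] = float(rec["diskusage"].split()[0])
    return rec


sample = (
    '{"cpu_temp": "59.0", "temperature": "38.7", '
    '"systemtime": "04/17/2020 17:19:36", '
    '"diskusage": "46651.6 MB", "pressure": 1015.6}'
)
reading = normalize_reading(sample)
```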

SQL Reporting Task for Cloudera Flow Management / HDF / Apache NiFi


Would you like to have reporting tasks gathering metrics and sending them to your database or Kafka from NiFi based on a query of NiFi provenance, bulletins, metrics, processor status or other KPI?

Now you can. If you are using HDF 3.5.0, this reporting task NAR is preinstalled and ready to go.

Let's add some reporting tasks that use SQL! The one we want is QueryNiFiReportingTask.



The first one that interested me was a query against provenance for one processor that consumes from a certain topic. I decided to run it every 10 seconds. My query and some results are below.

So let's go to Controller Settings / Reporting Tasks and then add QueryNiFiReportingTask:


We add one per item we want to monitor. The reporting task then needs a place to send its records (a sink): DatabaseRecordSink (for a JDBC database), KafkaRecordSink, PrometheusRecordSink, ScriptedRecordSink or SiteToSiteReportingRecordSink. I am going to use Kafka, but Prometheus, Database and S2S are good options. If you choose Site-to-Site, you can send to another NiFi cluster for processing, for that NiFinception processing route.


We have to write an Apache Calcite-compliant SQL query, set our sink and decide whether we want to include zero-record results (false is a good idea).



One option is to query the BULLETINS table, which holds the NiFi cluster bulletins (warnings/errors).


Another option is the CONNECTION_STATUS table.




How about NiFi JVM Metrics?  That has some good stuff in there.



Let's configure some Kafka record sinks. I am using Kafka 2.3, so I'll use the Kafka 2 sink. I set some brokers (default port of 9092), create a new topic for it, and choose the record writer. I chose JSON, but it could be CSV, Avro, Parquet, XML or something else. Stick with JSON or Avro for Kafka.


Data now starts getting sent. By default the schedule is every 5 minutes, but this can be adjusted; for provenance I set mine to every 10 seconds.

 

Let's look at the data as it passes through Kafka. In the follow-up to this article we'll land it in Impala/Kudu/Hive/Druid/Phoenix/HBase or HDFS/S3 and query it. For now, we can examine the data within Kafka via Cloudera Streams Messaging Manager (SMM).







We can examine any of the topics in Kafka with SMM.   We can then start consuming these records and build live metric handling applications or send to live data marts and dashboards powered by Cloudera Data Platform (CDP).


We have a full set of SQL to use for these reporting tasks including selecting columns, doing aggregates like MAX and AVG, ordering, grouping, where clauses and row limits.



Provenance Query

SELECT eventId, durationMillis, lineageStart, timestampMillis,
updatedAttributes, entitySize, details
FROM PROVENANCE
WHERE componentName = 'Consume Kafka iot messages'
LIMIT 25

Provenance Results


[{"eventId":2724,"durationMillis":69,"lineageStart":1586294707989,"timestampMillis":1586294708002,"updatedAttributes":"{path=./, schema.protocol.version=1, filename=be499074-c595-46f5-a03a-482607fb9c8c, schema.identifier=1, kafka.partition=7, kafka.offset=36, schema.name=SensorReading, kafka.timestamp=1586294707933, kafka.topic=iot, schema.version=1, mime.type=application/json, uuid=be499074-c595-46f5-a03a-482607fb9c8c}","entitySize":246,"details":null},{"eventId":2736,"durationMillis":20,"lineageStart":1586294708905,"timestampMillis":1586294708916,"updatedAttributes":"{path=./, schema.protocol.version=1, filename=74c9e28d-82a4-4cea-8331-92b7a4bee1b3, schema.identifier=1, kafka.partition=3, kafka.offset=36, schema.name=SensorReading, kafka.timestamp=1586294708898, kafka.topic=iot, schema.version=1, mime.type=application/json, uuid=74c9e28d-82a4-4cea-8331-92b7a4bee1b3}","entitySize":247,"details":null},..]
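A consumer of these records will notice that `updatedAttributes` is not nested JSON but a single string of `key=value` pairs wrapped in braces. The Python sketch below shows one way to split it back into a dictionary and pull out the Kafka position of each event. It assumes that no key or value contains the separator `, `, which holds for the results shown above but is not guaranteed in general. The sample is a shortened copy of the first event.

```python
import json


def parse_updated_attributes(text):
    """Turn '{a=1, b=2}' into {'a': '1', 'b': '2'}.

    Assumption: no key or value contains ', '.
    """
    inner = text.strip()
    if inner.startswith("{") and inner.endswith("}"):
        inner = inner[1:-1]
    attrs = {}
    for pair in inner.split(", "):
        key, sep, value = pair.partition("=")
        if sep:  # skip fragments that have no '='
            attrs[key] = value
    return attrs


def kafka_positions(results_json):
    """Return (eventId, topic, partition, offset) for each provenance event."""
    positions = []
    for event in json.loads(results_json):
        attrs = parse_updated_attributes(event["updatedAttributes"])
        positions.append((
            event["eventId"],
            attrs.get("kafka.topic"),
            int(attrs["kafka.partition"]),
            int(attrs["kafka.offset"]),
        ))
    return positions


sample = json.dumps([{
    "eventId": 2724,
    "updatedAttributes": "{path=./, kafka.partition=7, kafka.offset=36, kafka.topic=iot}",
}])
positions = kafka_positions(sample)
```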

I teach classes on this technology and have some webinars coming up, on the East Coast of the US and virtually at my meetup.

https://www.meetup.com/futureofdata-princeton/

Cloudera HDF 3.5.0 Is Now Available For Download

https://docs.cloudera.com/HDPDocuments/HDF3/HDF-3.5.0/release-notes/content/whats-new.html

https://www.cloudera.com/downloads/hdf.html

HDF 3.5.0 includes the following components:
  • Apache Ambari 2.7.5
  • Apache Kafka 2.3.1
  • Apache NiFi 1.11.1
  • NiFi Registry 0.5.0
  • Apache Ranger 1.2.0
  • Apache Storm 1.2.1
  • Apache ZooKeeper 3.4.6
  • Apache MiNiFi Java Agent 0.6.0
  • Apache MiNiFi C++ 0.6.0
  • Hortonworks Schema Registry 0.8.1
  • Hortonworks Streaming Analytics Manager 0.6.0
  • Apache Knox 1.0.0
  • SmartSense 1.5.0

SQL reporting task. The QueryNiFiReportingTask allows users to execute SQL queries against tables containing information on Connection Status, Processor Status, Bulletins, Process Group Status, JVM Metrics, Provenance and Connection Status Predictions. In combination with Site to Site, it is particularly useful to define fine-grained monitoring capabilities on top of the running workflows.

Using NiFi CLI to Restore NiFi Flows From Backups


Please note, Apache NiFi 1.11.4 is now available for download

https://cwiki.apache.org/confluence/display/NIFI/Release+Notes#ReleaseNotes-Version1.11.4




#> registry list-buckets -u http://somesite.compute-1.amazonaws.com:18080

#   Name   Id                                     Description 
-   ----   ------------------------------------   ----------- 
1   IOT    45834964-d022-4f4c-891f-695898e1e5f0   (empty)     
2   IoT    250a5ae5-ced8-4f4e-8b3b-01eb9d47a0d9   (empty)     
3   dev    46b7bab7-400f-44ae-a0e6-7340ff19c96f   (empty)     
4   iot    c594d6bc-7413-4f6a-ba9a-50b8020eec37   (empty)     
5   prod   0bf59d2e-1dd5-4d24-8aa0-0614bf991dc9   (empty)     


#> registry create-flow -verbose -u http://somesite.compute-1.amazonaws.com:18080 -b 250a5ae5-ced8-4f4e-8b3b-01eb9d47a0d9 --flowName iotFlow


a5a4ac59-9aeb-416e-937f-e601ca8beba9


#> registry import-flow-version -verbose -u http://somesite.compute-1.amazonaws.com:18080 -f a5a4ac59-9aeb-416e-937f-e601ca8beba9 -i iot-1.json


#> registry list-flows -u http://somesite.compute-1.amazonaws.com:18080 -b 250a5ae5-ced8-4f4e-8b3b-01eb9d47a0d9

#   Name      Id                                     Description 
-   -------   ------------------------------------   ----------- 
1   iotFlow   a5a4ac59-9aeb-416e-937f-e601ca8beba9   (empty)     




Fixing Linux Webcams


  # Find the camera and list the controls it exposes
  v4l2-ctl --list-devices
  v4l2-ctl -d /dev/video0 --list-ctrls
  # Turn off automatic white balance (a boolean control), then fix the temperature
  v4l2-ctl -d /dev/video0 --set-ctrl=white_balance_temperature_auto=0
  v4l2-ctl -d /dev/video0 --set-ctrl=white_balance_temperature=4000
  v4l2-ctl -d /dev/video0 --get-ctrl=white_balance_temperature
  # Exposure and gain; exposure_absolute shows as inactive while exposure_auto=3
  v4l2-ctl -d /dev/video0 --set-ctrl=exposure_auto=3
  v4l2-ctl -d /dev/video0 --set-ctrl=exposure_auto_priority=0
  v4l2-ctl -d /dev/video0 --set-ctrl=exposure_absolute=250
  v4l2-ctl -d /dev/video0 --set-ctrl=gain=0
  # Confirm the new values
  v4l2-ctl -d /dev/video0 --list-ctrls



This article is great:   https://www.kurokesu.com/main/2016/01/16/manual-usb-camera-settings-in-linux/


v4l2-ctl -d /dev/video0 --list-ctrls
                     brightness 0x00980900 (int)    : min=0 max=255 step=1 default=128 value=128
                       contrast 0x00980901 (int)    : min=0 max=255 step=1 default=128 value=128
                     saturation 0x00980902 (int)    : min=0 max=255 step=1 default=128 value=128
 white_balance_temperature_auto 0x0098090c (bool)   : default=1 value=0
                           gain 0x00980913 (int)    : min=0 max=255 step=1 default=0 value=0
           power_line_frequency 0x00980918 (menu)   : min=0 max=2 default=2 value=2
      white_balance_temperature 0x0098091a (int)    : min=2000 max=6500 step=1 default=4000 value=4000
                      sharpness 0x0098091b (int)    : min=0 max=255 step=1 default=128 value=128
         backlight_compensation 0x0098091c (int)    : min=0 max=1 step=1 default=0 value=0
                  exposure_auto 0x009a0901 (menu)   : min=0 max=3 default=3 value=3
              exposure_absolute 0x009a0902 (int)    : min=3 max=2047 step=1 default=250 value=83 flags=inactive
         exposure_auto_priority 0x009a0903 (bool)   : default=0 value=0
                   pan_absolute 0x009a0908 (int)    : min=-36000 max=36000 step=3600 default=0 value=0
                  tilt_absolute 0x009a0909 (int)    : min=-36000 max=36000 step=3600 default=0 value=0
                 focus_absolute 0x009a090a (int)    : min=0 max=250 step=5 default=0 value=0 flags=inactive
                     focus_auto 0x009a090c (bool)   : default=1 value=1
                  zoom_absolute 0x009a090d (int)    : min=100 max=500 step=1 default=100 value=100
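If you want to check these settings from a script rather than by eye, the listing is regular enough to parse. The Python sketch below turns `--list-ctrls` output into a dictionary keyed by control name. It is based only on the line layout shown above, so treat the pattern as an assumption; other cameras or v4l2-ctl versions may print lines it does not match, and those are simply skipped.

```python
import re

# name, hex id, (type), then space-separated key=value fields
CTRL_LINE = re.compile(
    r"^\s*(?P<name>\w+)\s+0x[0-9a-fA-F]+\s+\((?P<type>\w+)\)\s*:\s*(?P<fields>.*)$"
)


def parse_ctrls(text):
    """Parse `v4l2-ctl --list-ctrls` output into {name: {field: value}}."""
    controls = {}
    for line in text.splitlines():
        match = CTRL_LINE.match(line)
        if not match:
            continue  # command echo, headers, blank lines
        info = {"type": match.group("type")}
        for field in match.group("fields").split():
            key, sep, value = field.partition("=")
            if not sep:
                continue
            # 'flags' is textual (e.g. inactive); the other fields are integers
            info[key] = value if key == "flags" else int(value)
        controls[match.group("name")] = info
    return controls


sample = """
v4l2-ctl -d /dev/video0 --list-ctrls
 white_balance_temperature_auto 0x0098090c (bool)   : default=1 value=0
              exposure_absolute 0x009a0902 (int)    : min=3 max=2047 step=1 default=250 value=83 flags=inactive
                   pan_absolute 0x009a0908 (int)    : min=-36000 max=36000 step=3600 default=0 value=0
"""
controls = parse_ctrls(sample)
```

For live use, the text could come from `subprocess.run(["v4l2-ctl", "-d", "/dev/video0", "--list-ctrls"], capture_output=True, text=True).stdout`.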




v4l2-ctl --list-devices
HD Pro Webcam C920 (usb-70090000.xusb-2.2):
/dev/video0


ODPI's OpenDS4All - Open Source Data Science Content To Teach the World

OpenDS4All




Start learning now:


ODPI has officially announced this recently and it looks great.

There are a ton of amazing materials, including slides, notes, documentation, homework, exercises and Jupyter notebooks covering Data Wrangling, Data Science, the Basics and Apache Spark.


This "starter set" of training materials can help you build a Data Science program for yourself, your company, your university or your non-profit. I am going to bring some of these to my meetups and hopefully can help give back with new materials, updates and suggestions.

These are college-level materials developed by the University of Pennsylvania and open sourced via the ODPI, with IBM leading. The code and slides look great. I can see these helping the world add another million desperately needed Data Scientists, Data Engineers and Data Science Enabled professionals.

I have been running some of this via Cloudera Machine Learning in my CDP cluster in AWS and it works great.   This is really well made.   I am hoping to create a module on Streaming Data Science to contribute.