Data In Motion: January 2021

Automating Starting Services in Apache NiFi and Applying Parameters

Automate all the things! You can call these commands interactively or script all of them with awesome devops tools. Andre and Dan can tell you more about that.

Enable All NiFi Services on the Canvas

By running this three times, I get any stubborn ones or ones that needed something previously running. This could be put into a loop and check the status before trying again.

nifi pg-list
nifi pg-status
nifi pg-get-services

The NiFi CLI has interactive help available and also some good documentation:

https://nifi.apache.org/docs/nifi-docs/html/toolkit-guide.html#nifi_CLI

/opt/demo/nifi-toolkit-1.12.1/bin/cli.sh nifi pg-enable-services -u http://edge2ai-1.dim.local:8080 --processGroupId root

We could then start a process group if we wanted:

nifi pg-start -u http://edge2ai-1.dim.local:8080 -pgid 2c1860b3-7f21-36f4-a0b8-b415c652fc62

List all process groups

/opt/demo/nifi-toolkit-1.12.1/bin/cli.sh nifi pg-list -u http://edge2ai-1.dim.local:8080

List Parameters

/opt/demo/nifi-toolkit-1.12.1/bin/cli.sh nifi list-param-contexts -u http://edge2ai-1.dim.local:8080 -verbose

Set parameters to set parameter context for a process group, you can loop to do all.

pgid => parameter group id
pcid => parameter context id

I need to put this in a shell or python script:

/opt/demo/nifi-toolkit-1.12.1/bin/cli.sh nifi pg-set-param-context -u http://edge2ai-1.dim.local:8080 -verbose -pgid 2c1860b3-7f21-36f4-a0b8-b415c652fc62 -pcid 39f0f296-0177-1000-ffff-ffffdccb6d90

Example

https://github.com/tspannhw/ApacheConAtHome2020/blob/main/scripts/setupnifi.sh

You could also use the NiFi REST API or Dan's awesome Python API (https://nipyapi.readthedocs.io/en/latest/).

References

Migrating from Apache Storm to Apache Flink

The first thing you need to do is to not just pick up and dump to a new system, but to see what can be reconfigured, refactored or reimagined. For some routing, transformation or simple ingest type applications or solution parts you may want to use Apache NiFi.

For others Spark or Spark Streaming can quickly meet your needs. For simple Thing to Kafka or Kafka to Thing flows, a flow with Kafka Connect is appropriate. For things that need to run in individual devices, containers, pods you may want to move a small application to NiFi Stateless. There are also sometimes a simple Kafka Stream application will meet your needs.

For many use cases you can replace a compiled application with some solid Flink SQL code. For some discussions, check this out.

For some really good information on how to migrate Storm solutions to Flink, Cloudera has a well documented solution for you:

https://docs.cloudera.com/csa/1.2.0/stormflink-migration/topics/csa-stormflink-migration-process.html

Conceptual

https://docs.cloudera.com/csa/1.2.0/stormflink-migration/topics/csa-stormflink-concept.html

Architecture

https://docs.cloudera.com/csa/1.2.0/stormflink-migration/topics/csa-stormflink-architecture.html

Redistribution

https://docs.cloudera.com/csa/1.2.0/stormflink-migration/topics/csa-stormflink-redistribution.html

References

Upcoming Demo Jam and The Latest Articles - 13 January 2020

Upcoming Demo Jam (Ask Live Questions of Pierre) - 21 Jan

https://www.cloudera.com/about/events/webinars/demo-jam-live-returns-build-data-flow-with-apache-nifi.html?utm_medium=tspann

Your Top 5 Apache NiFi Questions Answered By Experts

https://blog.cloudera.com/top-5-questions-about-apache-nifi/

Apache NiFi - The Data Movement Enabler in Hybrid Cloud

https://blog.cloudera.com/apache-nifi-the-data-movement-enabler-in-a-hybrid-cloud-environment/

Real-Time Transit Information Ingest

https://www.datainmotion.dev/2021/01/flank-real-time-transit-information-for.html

Using Kudu as a Cache for REST Streams

https://www.datainmotion.dev/2021/01/flank-using-apache-kudu-as-cache-for.html

Simple Change Data Capture with SQL

https://www.datainmotion.dev/2020/12/simple-change-data-capture-cdc-with-sql.html

Ingesting Websocket Data Live

https://www.datainmotion.dev/2020/12/ingesting-websocket-data-for-live-stock.html

Smart Stock Processing with Apache NiFi

https://www.datainmotion.dev/2020/12/smart-stocks-with-flank-nifi-kafka.html?es_id=1fb9486166

More Apache NiFi Resources

https://github.com/tspannhw/EverythingApacheNiFi

FLaNK: Real-Time Transit Information For NY/NJ/CT (TRANSCOM)

SOURCE: XML/RSS REST ENDPOINT

FREQUENCY: Every Minute

DESTINATIONS: HDFS, Kudu/Impala, Cloud, Kafka

The main source of this real-time transit updates for New Jersey, New York and Connecticut is TRANSCOM. I will read from this datasource every minute to know about real-time traffic events that occurring on the roads and transportation systems near me. We will be reading the feed that is in XML/RSS format and parse out the hundreds of events that come with each minutes update.

I want to store the raw XML/RSS file in S3/ADLS2/HDFS or GCS, that's an easy step. I will also parse and enhance this data for easier querying and tracking.

I will add to all events a unique ID and a timestamp as the data is streaming by. I will store my data in Impala/Kudu for fast queries and upserts. I can then build some graphs, charts and tables with Apache Hue and Cloudera Visual Applications. I will also publish my data as AVRO enhanced with a schema to Kafka so that I can use it from Spark, Kafka Connect, Kafka Streams and Flink SQL applications.

GenerateFlowFile - optional scheduler
InvokeHTTP - call RSS endpoint
PutHDFS - store raw data to Object or File store on premise or in the cloud via HDFS / S3 / ADLSv2 / GCP / Ozone / ...
QueryRecord - convert XML to JSON
SplitJSON - break out individual events

UpdateAttribute - set schema name
UpdateRecord - generate an add a unique ID and timestamp
UpdateRecord - clean up the point field
UpdateRecord - remove garbage whitespace

PutKudu - upsert new data to our Impala / Kudu table.
RetryFlowFile - retry if network or other connectivity issue.

Send Messages to Kafka

Our flow has delivered many messages to our transcomevents topic as schema attached Apache Avro formatted messages.

SMM links into the Schema Registry and schema for this topic.

We use a schema for validation and as a contract between consumers and producers of these traffic events.

Since events are streaming into our Kafka topic and have a schema, we can query them with Continuous SQL with Flink SQL. We can then run some Continuous ETL.

We could also consume this data with Structured Spark Streaming applications, Spring Boot apps, Kafka Streams, Stateless NiFi and Kafka Connect applications.

We also stored our data in Impala / Kudu for permanent storage, ad-hoc queries, analytics, Cloudera Visualizations, reports, applications and more.

It is very easy to have fast data against our agile Cloud Data Lakehouse.

Source Code

https://github.com/tspannhw/SmartTransit

Resources

FLaNK: Using Apache Kudu As a Cache For FDA Updates

InvokeHTTP: We invoke the RSS feed to get our data.
QueryRecord: Convert RSS to JSON
SplitJson: Split one file into individual records. (Should refactor to ForkRecord)
EvaluateJSONPath: Extract attributes you need.
ProcessGroup for SPL Processing

We call and check the RSS feed frequently and we parse out the records from the RSS(XML) feed and check against our cache. We use Cloudera's Real-Time Datahub with Apache Impala/Apache Kudu as our cache to see if we already received a record for that. If it hasn't arrived yet, it will be added to the Kudu cache and processed.

SPL Processing

We use the extracted SPL to grab a detailed record for that SPL using InvokeHTTP.