Top 25 Use Cases of Cloudera Flow Management Powered by Apache NiFi

Cloudera Flow Management has proven immensely popular for solving many different use cases, so I thought I would list the top twenty-five that I have seen recently.

If you have never used CFM or Apache NiFi before, please check out these two quick resources: https://github.com/tspannhw/EverythingApacheNiFi and https://nifi.apache.org/docs/nifi-docs/.

21-25

25. Ingesting Data into Kafka in the Public Cloud

https://docs.cloudera.com/cdf-datahub/7.2.2/nifi-kafka-ingest/topics/cdf-datahub-fm-kafka-ingest-overview.html

24. Cybersecurity Data Collection and Filtering

https://www.datainmotion.dev/2020/10/monitoring-mac-laptops-with-apache-nifi.html

23. Ingesting Data into Hive in the Public Cloud

https://docs.cloudera.com/cdf-datahub/7.2.2/nifi-hive-ingest/topics/cdf-datahub-nifi-hive-ingest.html

22. Ingesting Data into HBase in the Public Cloud

https://docs.cloudera.com/cdf-datahub/7.2.2/nifi-hbase-ingest/topics/cdf-datahub-nifi-hbase-ingest.html

21. Ingesting Data into Kudu in the Public Cloud

https://docs.cloudera.com/cdf-datahub/7.2.2/nifi-kudu-ingest/topics/cdf-datahub-nifi-kudu-ingest.html


16-20

20. Ingesting Data into ADLS Storage

https://docs.cloudera.com/cdf-datahub/7.2.2/nifi-azure-ingest/topics/cdf-datahub-fm-adls-ingest-overview.html

19. Populate SOLR Indexes

https://www.datainmotion.dev/2020/04/building-search-indexes-with-apache.html

18. Hadoop Data to Kafka

https://www.datainmotion.dev/2020/04/read-apache-impala-apache-kudu-tables.html

17. Deep Learning and Machine Learning Pipelines

https://www.datainmotion.dev/2019/12/easy-deep-learning-in-apache-nifi-with.html

16. Intercepting JMS and SOA

https://www.datainmotion.dev/2019/10/migrating-apache-flume-flows-to-apache_42.html


11-15

15. Edge ML Model Integration

14. Migrate Data from On-Premises Private Cloud to Public Cloud

13. Converting XML to JSON

12. MQTT to HDFS

11. Ingesting REST Endpoints (Bulk)

6-10

10. Ingesting Data into AWS S3 Buckets

9. Ingesting REST Endpoints

8. Ingesting SaaS Products Like Salesforce

7. Automating Manual Tasks

6. Ingesting Social Media Data

Top 5

5. Logs, Logs, Logs

https://www.datainmotion.dev/2019/10/migrating-apache-flume-flows-to-apache_35.html

https://www.datainmotion.dev/2019/08/migrating-apache-flume-flows-to-apache.html

4. FLaNK Streaming Data Pipeline (Any Data to Kafka to Flink SQL)

https://www.flankstack.dev/

3. IoT - MiNiFi Agents Ingest, Store and Forward

https://www.datainmotion.dev/2020/02/edgeai-google-coral-with-coral.html

https://community.cloudera.com/t5/Community-Articles/IoT-Series-Sensors-Utilizing-Breakout-Garden-Hat-Part-2/ta-p/249380

2. Pseudo-CDC / Database Ingest

https://www.datainmotion.dev/2019/10/migrating-apache-flume-flows-to-apache_15.html

1. Doing 1,000 Different Ingest, Conversion, Routing and Transformation Flows

The most common use case is doing a lot of things with a lot of data: documents, XML, JSON, Avro, Parquet, CSV, PDFs, images, video, MongoDB documents, logs and more. Rarely do I see someone solve just one problem with NiFi and say that was enough. One simple use case leads to another and another, and before you know it every cron job, script, ETL, ELT and big data op is touched by NiFi. Keep it up; Cloudera will make it ever easier soon. Also check out NiFi Stateless for the more job/event-oriented things like File to Kafka, Kafka to Kafka and more.

https://community.cloudera.com/t5/Community-Articles/Scanning-Documents-into-Data-Lakes-via-Tesseract-MQTT-Python/ta-p/248492


[FLaNK]: Running Apache Flink SQL Against Kafka Using a Schema Registry Catalog

There are a few things you can do when sending data from Apache NiFi to Apache Kafka to maximize its availability to Flink SQL queries through the catalogs.


Producing Kafka Messages

Use a JSONReader to parse the incoming records and an AvroRecordSetWriter to write them to Kafka, and make sure you set a Message Key Field.

A great way to work with Flink SQL is to connect to the Cloudera Schema Registry. It lets you define your schema once, then use it in Apache NiFi, Apache Kafka Connect, Apache Spark and Java microservices.
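
If you want to sanity-check what is registered, here is a minimal Python sketch against the registry's REST API; the /schemaregistry/schemas path and the schema name "events" are assumptions to verify against your registry version:

import requests

REGISTRY = "http://edge2ai-1.dim.local:7788/api/v1"  # same URL used in sql-env.yaml below

# fetch the latest registered version of a schema by name
# ("events" is a hypothetical schema name registered from NiFi)
resp = requests.get(f"{REGISTRY}/schemaregistry/schemas/events/versions/latest")
resp.raise_for_status()
print(resp.json()["schemaText"])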

Setup



Make sure you set up your HDFS user directory for Flink, which keeps history and other important information in HDFS.

# create Flink's HDFS user directory as the hdfs superuser and hand ownership to root
HADOOP_USER_NAME=hdfs hdfs dfs -mkdir /user/root
HADOOP_USER_NAME=hdfs hdfs dfs -chown root:root /user/root


sql-env.yaml:

configuration:
  execution.target: yarn-session

catalogs:
  - name: registry
    type: cloudera-registry
    # Registry Client standard properties
    registry.properties.schema.registry.url: http://edge2ai-1.dim.local:7788/api/v1
    # registry.properties.key:
    # Registry Client SSL properties
    # Kafka Connector properties
    connector.properties.bootstrap.servers: edge2ai-1.dim.local:9092
    connector.startup-mode: earliest-offset
  - name: kudu
    type: kudu
    kudu.masters: edge2ai-1.dim.local:7051

CLI:

flink-sql-client embedded -e sql-env.yaml


We now have access to the Kudu and Schema Registry catalogs of tables. This lets us start querying, joining and filtering any of these tables without having to recreate or redefine them.


SELECT * FROM events


Automating the Building, Migration, Backup, Restore and Testing of Streaming Applications


One of the main things you will want to do as you restore flows from backup or migrate them between clusters is to apply the appropriate parameters.

So you can import the parameter contexts and then connect them to the correct process group(s).

nifi-toolkit-1.12.0/bin/cli.sh nifi import-param-context -u http://edge2ai-1.dim.local:8080 -i parameter.json

Note that values can be encrypted so the NiFi operator or developer doesn't have to see keys or protected values.
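
If you would rather script this without the toolkit, the same import can be done against the NiFi REST API. A minimal sketch, assuming a toolkit-style export in parameter.json and the standard /parameter-contexts endpoint:

import json
import requests

NIFI_API = "http://edge2ai-1.dim.local:8080/nifi-api"

# load a parameter context previously exported with the toolkit
with open("parameter.json") as f:
    exported = json.load(f)

# wrap the exported name/parameters in a new component with an initial revision
entity = {
    "revision": {"version": 0},
    "component": {
        "name": exported["name"],
        "description": exported.get("description", ""),
        "parameters": exported["parameters"],
    },
}

resp = requests.post(f"{NIFI_API}/parameter-contexts", json=entity)
resp.raise_for_status()
print("created parameter context", resp.json()["id"])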


See an example script:

https://github.com/tspannhw/ApacheConAtHome2020/blob/main/scripts/setupnifi.sh


Monitoring Mac Laptops With Apache NiFi and osquery


One way is to pass a SQL query to the osquery interpreter (for example, osqueryi --json "SELECT * FROM $1") and get the query results back as JSON.
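
As a minimal sketch of this ad hoc approach (assuming osquery is installed locally and Python 3 is available), you can shell out to osqueryi and parse the JSON it returns:

import json
import subprocess

def run_osquery(sql: str):
    # run a SQL query through the local osqueryi interpreter and parse the JSON results
    result = subprocess.run(
        ["osqueryi", "--json", sql],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)

# the same system_info query the scheduled configuration below runs hourly
for row in run_osquery("SELECT hostname, cpu_brand, physical_memory FROM system_info;"):
    print(row["hostname"], row["cpu_brand"], row["physical_memory"])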

The other way is to tail the main results file (/var/log/osquery/osqueryd.results.log) and send the JSON on to be used at scale as events. We can also grab any and all osquery logs like INFO, WARN and ERROR by tailing the osquery.+ file pattern.



Either download the installer or use Homebrew (brew cask install osquery). https://osquery.readthedocs.io/en/2.11.2/installation/install-osx/

I set up a simple configuration here: https://github.com/tspannhw/nifi-osquery

{
  "options": {
    "config_plugin": "filesystem",
    "logger_plugin": "filesystem",
    "logger_path": "/var/log/osquery",
    "disable_logging": "false",
    "disable_events": "false",
    "database_path": "/var/osquery/osquery.db",
    "utc": "true"
  },
  "schedule": {
    "system_info": {
      "query": "SELECT hostname, cpu_brand, physical_memory FROM system_info;",
      "interval": 3600
    }
  },
  "decorators": {
    "load": [
      "SELECT uuid AS host_uuid FROM system_info;",
      "SELECT user AS username FROM logged_in_users ORDER BY time DESC LIMIT 1;"
    ]
  },
  "packs": {
    "osquery-monitoring": "/var/osquery/packs/osquery-monitoring.conf",
    "incident-response": "/var/osquery/packs/incident-response.conf",
    "it-compliance": "/var/osquery/packs/it-compliance.conf",
    "osx-attacks": "/var/osquery/packs/osx-attacks.conf",
    "vuln-management": "/var/osquery/packs/vuln-management.conf",
    "hardware-monitoring": "/var/osquery/packs/hardware-monitoring.conf",
    "ossec-rootkit": "/var/osquery/packs/ossec-rootkit.conf"
  }
}



We then turn the JSON osquery records into records that can be used for routing, queries and aggregates, and ultimately push them to Impala/Kudu for rich Cloudera Visual Apps and to Kafka as schema-aware Avro for use in Kafka Connect and as a live continuous query feed to Flink SQL streaming analytic applications.

We could also have osquery push directly to Kafka, but since I am often disconnected from a Kafka server, in offline mode, or just want a local buffer for these events, let's use Apache NiFi, which can run as a single 2 GB node on my machine. I can also do local processing of the data and some local alerting if needed.

Once you have the data from one machine or a million, you can do log aggregation, anomaly detection, predictive maintenance or whatever else you might need to do. Sending this data to Cloudera Data Platform on AWS or Azure, where CML and Visual Apps can store, analyze, report, query and build apps, pipelines and ultimately production machine learning flows, makes this a simple example of how to take any data and bring it into a full data platform.


Tracking Satellites with Apache NiFi

Thanks to https://www.n2yo.com/ for awesome data feeds.


Again, these types of ingests are so easy in Apache NiFi.   


Step 1: schedule when we want these calls to run. There is a limit of 1,000 calls per hour, so let's keep it to 4 calls a minute for each of the three REST endpoints.

Let's get information on the satellites right above me.

We set parameters for your latitude, your longitude and your API key, and then just change up bits of the REST URL. Note that this endpoint uses SSL, so make sure you have an SSL context configured.
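
For reference, here is a minimal Python sketch of the kind of REST call NiFi makes here; the URL shape follows n2yo's published positions API, and the coordinates and key are placeholders:

import requests

API_KEY = "YOUR-N2YO-API-KEY"      # placeholder: your n2yo.com API key
LAT, LON, ALT = 40.2206, -74.0, 0  # placeholder: your latitude, longitude, altitude

# 25544 is the NORAD id of the ISS ("SPACE STATION" in the sample below);
# the trailing 2 asks for two seconds of predicted positions
url = (f"https://api.n2yo.com/rest/v1/satellite/positions/25544/"
       f"{LAT}/{LON}/{ALT}/2/&apiKey={API_KEY}")

resp = requests.get(url)
resp.raise_for_status()
data = resp.json()
print(data["info"]["satname"])
for pos in data["positions"]:
    print(pos["satlatitude"], pos["satlongitude"], pos["timestamp"])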





Now we have three streams of JSON data that have latitude and longitude, so we can plot this on a map with Cloudera Visual Apps, storing our data in Impala tables in Kudu.


Some example data:


{
  "info" : {
    "satname" : "SPACE STATION",
    "satid" : 25544,
    "transactionscount" : "5"
  },
  "positions" : [ {
    "satlatitude" : 37.46839338,
    "satlongitude" : 95.12767402,
    "sataltitude" : 422.01,
    "azimuth" : 8.37,
    "elevation" : -49.35,
    "ra" : 290.4714551,
    "dec" : 0.06300703,
    "timestamp" : 1602242926,
    "eclipsed" : false
  }, {
    "satlatitude" : 37.4278686,
    "satlongitude" : 95.18684731,
    "sataltitude" : 422.01,
    "azimuth" : 8.32,
    "elevation" : -49.37,
    "ra" : 290.50535165,
    "dec" : 0.04159856,
    "timestamp" : 1602242927,
    "eclipsed" : false
  } ]
}
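
To illustrate what the record-oriented processing does with this shape, here is a small Python sketch that flattens the nested positions array above into one flat row per position (the input file name is hypothetical):

import json

def flatten(payload: dict):
    # copy the shared "info" fields onto each entry of "positions"
    info = payload["info"]
    for pos in payload["positions"]:
        yield {
            "satname": info["satname"],
            "satid": info["satid"],
            "satlatitude": pos["satlatitude"],
            "satlongitude": pos["satlongitude"],
            "sataltitude": pos["sataltitude"],
            "ts": pos["timestamp"],
        }

with open("positions.json") as f:  # hypothetical file holding the payload above
    for row in flatten(json.load(f)):
        print(row)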

Unveiling the NVIDIA Jetson Nano 2GB and Other NVIDIA GTC 2020 Announcements

NVIDIA Jetson Nano 2GB Press Release

https://nvidianews.nvidia.com/news/nvidia-unveils-jetson-nano-2gb-the-ultimate-ai-and-robotics-starter-kit-for-students-educators-robotics-hobbyists



I have given this one a test run; it has all the features you like in a Jetson, with just 2 GB less RAM and two fewer USB ports. This is a very affordable device for building cool apps.


  • 128-core NVIDIA Maxwell™ GPU
  • 64-bit quad-core ARM A57 (1.43 GHz)
  • 2 GB 64-bit LPDDR4 (25.6 GB/s bandwidth)
  • Gigabit Ethernet
  • 1x USB 3.0 Type A port, 2x USB 2.0 Type A ports, 1x USB 2.0 Micro-B
  • HDMI
  • WiFi
  • GPIOs, I2C, I2S, SPI, PWM, UART
  • 1x MIPI CSI-2 connector
  • MicroSD connector
  • 12-pin header (power and related signals, UART)
  • 100mm x 80mm x 29mm
  • USB-C port for power

Depending on where or how you buy the package, you may need to buy a power supply and a USB WiFi adapter.

All of my existing workloads have been working fine on the 2 GB version, but with a very nice cost savings. The setup is easy and the system is fast; I highly recommend that anyone looking for a quick way to do Edge AI and other edge workloads give it a try. This could be a decent machine for learning. I hooked mine up to a monitor, keyboard and mouse and could use it right away for edge development and also as a basic desktop. Nice work! I might need to get 11 more of these. These will run MiNiFi agents, Python and deep learning classifications with ease.

NVIDIA didn't stop with the ultimate low-cost edge device; they have some serious enterprise updates as well:

Cloudera Supercharges the Enterprise Data Cloud with NVIDIA

https://blog.cloudera.com/cloudera-supercharges-the-enterprise-data-cloud-with-nvidia/

There seems to be a ton more news coming at this virtual event, so I recommend attending and watching for more detailed posts on new things coming out.

Product page: 

https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-nano/education-projects/


Unboxing video:

https://youtu.be/dVGEtWYkP2c


NVIDIA Jetson Developer Community AI Projects: 

https://youtu.be/2T8CG7lDkcU


Open-source projects on Jetson Nano 2GB: 

https://youtu.be/fIESu365Sb0


Dev Blog:

https://developer.nvidia.com/blog/ultimate-starter-ai-computer-jetson-nano-2gb-developer-kit/



DevOps: Working with Parameter Contexts in Apache NiFi 1.11.4+

nifi list-param-contexts -u http://localhost:8080 -ot simple


#   Id                                     Name             Description   

-   ------------------------------------   --------------   -----------   

1   3a801ff4-1f73-1836-b59c-b9fbc79ab030   backupregistry                 

2   7184b9f4-0171-1000-4627-967e118f3037   health                         

3   3a801faf-1f87-1836-54ba-3d913fa223ad   retail                         

4   3a801fde-1f73-1836-957b-a9f4d2c9b73d   sensors                        


#> nifi export-param-context -u http://localhost:8080 -verbose --paramContextId 3a801faf-1f87-1836-54ba-3d913fa223ad


{
  "name" : "retail",
  "description" : "",
  "parameters" : [ {
    "parameter" : {
      "name" : "allquery",
      "description" : "",
      "sensitive" : false,
      "value" : "SELECT * FROM FLOWFILE"
    }
  }, {
    "parameter" : {
      "name" : "allrecordssql",
      "description" : "",
      "sensitive" : false,
      "value" : "SELECT * FROM FLOWFILE"
    }
  }, {
    "parameter" : {
      "name" : "energytopic",
      "description" : "",
      "sensitive" : false,
      "value" : "energy"
    }
  }, {
    "parameter" : {
      "name" : "importantsql",
      "description" : "",
      "sensitive" : false,
      "value" : "SELECT * FROM FLOWFILE\nWHERE kernel_logs like '%SIGKILL%'"
    }
  }, {
    "parameter" : {
      "name" : "itempricetable",
      "description" : "",
      "sensitive" : false,
      "value" : "impala::default.itemprice"
    }
  }, {
    "parameter" : {
      "name" : "itsgettingHotInHere",
      "description" : "",
      "sensitive" : false,
      "value" : "SELECT * FROM\nFLOWFILE\nWHERE CAST (temp_f as DOUBLE) > 80\nAND UPPER(location) LIKE '%NJ%'"
    }
  },





You can now move that JSON to another server and import it with nifi import-param-context.


 bin/cli.sh nifi list-param-contexts -u http://localhost:8080 -ot json

Use -ot simple for a simple table listing instead.

 bin/cli.sh nifi export-param-context -u http://localhost:8080  --paramContextId 3a801ff4-1f73-1836-b59c-b9fbc79ab030 -ot json -o backupregistry.json

Example Shell Script

/Users/tspann/Downloads/nifi-toolkit-1.12.0/bin/cli.sh nifi export-param-context -u http://localhost:8080  --paramContextId a13e3764-134c-16f0-7c35-312b7ee4b182 -ot json -o financial.json
/Users/tspann/Downloads/nifi-toolkit-1.12.0/bin/cli.sh nifi export-param-context -u http://localhost:8080  --paramContextId 7184b9f4-0171-1000-4627-967e118f3037 -ot json -o health.json
/Users/tspann/Downloads/nifi-toolkit-1.12.0/bin/cli.sh nifi export-param-context -u http://localhost:8080  --paramContextId 3a801faf-1f87-1836-54ba-3d913fa223ad -ot json -o retail.json
/Users/tspann/Downloads/nifi-toolkit-1.12.0/bin/cli.sh nifi export-param-context -u http://localhost:8080  --paramContextId 3a801fde-1f73-1836-957b-a9f4d2c9b73d -ot json -o  sensors.json
/Users/tspann/Downloads/nifi-toolkit-1.12.0/bin/cli.sh nifi export-param-context -u http://localhost:8080  --paramContextId 3a801ff4-1f73-1836-b59c-b9fbc79ab030 -ot json -o backupregistry.json
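
If you would rather not hard-code context ids like the script above does, here is a minimal Python sketch that pulls every parameter context from the NiFi REST API and writes one JSON file per context (the /flow/parameter-contexts endpoint is part of the standard NiFi REST API):

import json
import requests

NIFI_API = "http://localhost:8080/nifi-api"

# list every parameter context, then back each one up to its own file
resp = requests.get(f"{NIFI_API}/flow/parameter-contexts")
resp.raise_for_status()

for ctx in resp.json()["parameterContexts"]:
    name = ctx["component"]["name"]
    with open(f"{name}.json", "w") as out:
        json.dump(ctx["component"], out, indent=2)
    print("exported", name)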

Reference

http://nifi.apache.org/docs/nifi-docs/html/toolkit-guide.html#nifi_CLI


https://docs.cloudera.com/HDPDocuments/HDF3/HDF-3.5.1/versioning-a-dataflow/content/parameters-in-versioned-flows.html