Technical Preview - Cloudera DataFlow

Import Flows Build in CDP Datahub Flow Management

https://docs.cloudera.com/dataflow/cloud/quick-start/topics/cdf-qs-definition.html

Deploy Flows

https://docs.cloudera.com/dataflow/cloud/quick-start/topics/cdf-qs-deploy.html

Import a Quick Flow

https://docs.cloudera.com/dataflow/cloud/qs-flow-definitions/topics/cdf-import-quick-flow.html

Monitoring

https://docs.cloudera.com/dataflow/cloud/kpi-overview/topics/cdf-introduction-to-kpi.html

Top DataFlow Resources

https://docs.cloudera.com/dataflow/cloud/index.html
https://docs.cloudera.com/dataflow/cloud/overview/topics/cdf-overview.html
https://docs.cloudera.com/dataflow/cloud/overview/topics/cdf-architecture.html
https://docs.cloudera.com/dataflow/cloud/enable-environment/topics/cdf-enable-your-environemnt.html
https://docs.cloudera.com/dataflow/cloud/manage-environment/topics/cdf-download-kube-config.html
https://docs.cloudera.com/dataflow/cloud/bp-flow-definition/topics/cdf-flow-isolation.html
https://docs.cloudera.com/dataflow/cloud/deploy-flows/topics/cdf-deploy-flow.html
https://blog.cloudera.com/top-three-requirements-for-data-flows/
https://www.cloudera.com/about/events/cloudera-now-cdp.html

Example Flows

https://github.com/tspannhw/DataFlowExperience

How about Some Free Cloud Training?

CDP Private Cloud Fundamentals

https://www.cloudera.com/about/training/courses/cdp-private-cloud-fundamentals.html

CDP Private Cloud Fundamentals

Cloudera, IBM, and Red Hat

Our CDP Private Cloud Fundamentals OnDemand course provides a solid introduction to CDP Private Cloud. In addition to learning what CDP Private Cloud is and how it fits into the Enterprise Data Cloud vision, you'll find out about its architecture and how it uses cloud-native design elements such as containerization in order to overcome limitations of the traditional bare metal cluster architecture. Following a summary of the system requirements, the course concludes with a demonstration of a CDP Private Cloud installation.

https://www.cloudera.com/about/training/courses/cloudera-ibm-redhat-cdp-pvc-fundamentals.html

Cloudera Essentials for CDP

https://www.cloudera.com/about/training/courses/cloudera-essentials-for-cdp.html

Introduction to Cloudera Manager

https://www.cloudera.com/about/training/courses/introduction-to-cloudera-manager.html

Introduction to Cloudera Data Warehouse: Self-Service Analytics in the Cloud with CDP

https://www.cloudera.com/about/training/courses/cdp-intro-to-cloudera-data-warehouse.html

Introduction to Cloudera Machine Learning

https://www.cloudera.com/about/training/courses/cdp-intro-to-cloudera-machine-learning.html

Introduction to Apache Impala

Enriching Data with Pyspark

Introduction to Hive

Demo Jam Build a Flow with Apache NiFi

Semantic Analysis

Apache Yunikorn

Introduction to Apache Ozone

Importing Data into the Cloud with Apache NiFi

Tutorial: Ingesting into Hive with NiFi

Processing Fixed Width and Complex Files

Pointers

The first decision you will have to make is if it's structured at all. If it is a known type like CSV, JSON, AVRO, XML or Parquet then just use a record.

If it's semi-structured like a log file, GrokReader may work or ExtractGrok.

If it's like CSV, you may be able to tweak the CSV reader to work (say header or no header) or try one of the two CSV parsers NiFi has (Jackson or Apache Commons).

If it's a format like PDF, Word, Excel, RTF or something like that, I have a custom processor that uses Apache Tika and that should be able to parse it into text. Once it is text you can probably work with it.

Examples

Documentation

https://nifi.apache.org/docs/nifi-docs/

Processors To Use For File Manipulation

AttributesToCSV
AttributesToJSON
ConvertExcelToCSVProcessor
ConvertRecord
ConvertText
CSVReader
EvaluateJSONPath
EvaluateXPath
EvaluateXQuery
ExecuteScript
ExecuteStreamCommand
ExtractGrok
ExtractText
FlattenJson
ForkRecord
GrokReader
JsonPathReader
JsonTreeReader
JoltTransformJSON
JoltTransformRecord
LookupAttribute
LookupRecord
MergeContent
MergeRecord
ModifyBytes
ParseSyslog*
PartitionRecord
QueryRecord
ReaderLookup
ReplaceText
ReplaceTextWithMapping
ScriptedReader
ScriptedRecordSink
ScriptedTransformRecord
SegmentContent
SplitContent
SplitJson
SplitRecord
SplitText
SplitXml
SyslogReader
TransformXml
UnpackContent
UpdateAttribute
UpdateRecord
ValidCsv
ValidateRecord
ValidateXml

Custom Processors

Helper Projects, SDK, Libraries and Services

https://tika.apache.org/ - Apache Tika can be integrated as a custom processor or called via REST and run as a seperate server/service.
Cloudera Machine Learning - call this service from REST and have AI do it. https://blog.cloudera.com/integrating-machine-learning-models-into-your-big-data-pipelines-in-real-time-with-no-coding/
REST Service - there may be a service you can run locally or use in the cloud that may be able to parse it. NiFi can call this
Python - execute a stream command and have Python or a shell script or OS executeable do it!
Spark - try custom Spark with Java, Python or Scala.
Flink - try custom Flink with Java.
XSLT
XPath
XQuery
JsonPath
Json
https://github.com/AbsaOSS/cobrix
https://github.com/tspannhw/EverythingApacheNiFi
You may need to use Cache: https://www.datainmotion.dev/2021/01/flank-using-apache-kudu-as-cache-for.html

Using Cloudera Flow Management (Powered by Apache NiFi) to Track Cloud Services

See:

https://community.cloudera.com/t5/Community-Articles/Using-Cloudera-Flow-Management-To-Ingest-and-Process-RSS/ta-p/313074

Price Comparisons Using Retail REST APIs with Apache NiFi, Kafka and Flink SQL

Part 1: NiFi Rest

Part 2: Kafka - Flink SQL

Part 3: Cloudera Visual Apps

Part 4: Smart Shelf Updates - MiNiFi Agents

MiNiFi Agent Update March 2021

Cloudera Agent Availability

Getting Started

https://docs.cloudera.com/cem/1.2.2/minifi-agent-quick-start/topics/cem-install-and-start-minifi-cpp.html

MiNiFi (C++)

Version cpp-0.9.0

Release Date: 1 March 2021

Highlights of 0.9.0 release include:

Added support for RocksDB-based content repository for better performance
Added SQL extension
Improved task scheduling
Various C2 improvements
Bug fixes and improvements to TailFile, ConsumeWindowsEventLog, MergeContent, CompressContent, PublishKafka, InvokeHTTP
Implemented RetryFlowFile and smart handling of loopback connections
Added a way to encrypt sensitive config properties and the flow configuration
Implemented full S3 support
Reduced memory footprint when working with many flow files

Build Notes:

It is advised that you use the bootstrap.sh when not building on windows.

https://cwiki.apache.org/confluence/display/MINIFI/Release+Notes#ReleaseNotes-Versioncpp-0.9.0

Download Now As Source or Pre-Build for Your Platform

https://nifi.apache.org/minifi/download.html

Kafka Replication with Cloudera Streams Replication Manager

Source Code: https://github.com/tspannhw/EmergingTechnologyDay2020

Source Event: https://www.cloudera.com/about/events/virtual-events/cloudera-emerging-technology-day.html

Spring Data JPA Against Big Data Sources

Download Hive JDBC Driver

https://www.cloudera.com/downloads/connectors/hive/jdbc/2-6-5.html

References

https://docs.cloudera.com/documentation/other/connectors/hive-jdbc/2-6-9.html

https://community.cloudera.com/t5/Community-Articles/Writing-Spring-Boot-Microservices-To-Access-Hive/ta-p/247250

https://shalishvj.wordpress.com/2018/06/02/hive-jdbc-spring-boot-restful-webservice-in-pivotal-cloud-foundry/

https://github.com/tspannhw/hive/blob/master/pom.xml

https://github.com/firecodeman/Cloudera-Impala-Hive-JDBC-Example/blob/master/src/main/java/com/cloudera/example/ClouderaHiveJdbcExample.java

https://docs.cloudera.com/documentation/other/connectors/hive-jdbc/Cloudera-JDBC-Driver-for-Apache-Hive-Release-Notes.txt

https://start.spring.io/

Using the Mm FLaNK Stack for Edge AI

New Features of Apache NiFi 1.13.2

Check it out: https://twitter.com/pvillard31/status/1361569608327716867?s=27

Download today: https://nifi.apache.org/download.html

Release Notes: https://cwiki.apache.org/confluence/display/NIFI/Release+Notes#ReleaseNotes-Version1.13.0

Migration: https://cwiki.apache.org/confluence/display/NIFI/Migration+Guidance

New Features

ListenFTP
UpdateHiveTable - Hive DDL changes -Hive Update Schema ie Data Drift ie Hive Schema Migration!!!!
SampleRecord - different sampling approaches to records (Interval Sampling, Probabilistic Sampling, Reservoir Sampling)
CDC Updates
Kudu updates
AMQP and MQTT Integration Upgrades
ConsumeMQTT - readers and writers added
HTTP access to NiFi by default is now configured to accept connections to 127.0.0.1/localhost only. If you want to allow broader access for some reason for HTTP and you understand the security implications you can still control that as always by changing the 'nifi.web.http.host' property in nifi.properties as always. That said, please take the time to configure proper HTTPS. We offer detailed instructions and tooling to assist.
ConsumeMQTT - add record reader/writer
The ability to run NiFi with no GUI as MiNiFi/NiFi combined code base continues.
Support for Kudu Dates (https://kudu.apache.org/releases/1.12.0/docs/release_notes.html)
Updated GRPC versions
Apache Calcite update
PutDatabaseRecord update

Here is an example NiFi ETL Flow:

Example NiFi 1.13.2 Flow:

ConsumeMQTT: now with readers
UpdateAttribute: set record.sink.name to kafka and recordreader.name to json.
SampleRecord: sample a few of the records
PutRecord: Use reader and destination service
UpdateHiveTable: new sink

Consume from MQTT and read and write to/from records.

Some example attributes from a running flow:

Connection Pools for DatabaseRecordSinks can be JDBC, Hadoop and Hive.

FreeFormTextRecordSetWriter is great for writing any format.

RecordSinkService we will pick Kafka as our destination.

KafkaRecordSink from PutRecord

Reader will pick json in our example based on our UpdateAttribute, we can dynamically change this as data streams.

ReaderLookup - lets you pick a reader based on an attribute.

We have defined readers for Parquet, JSON, AVRO, XML and CSV so no matter the type I can automagically read it. Great for reusing code and great for cases like our new ListenFTP where you may get sent tons of different files to process. Use one FLOW!

RecordSinkService can help you make all our flows generic so you can drop in different sinks/destinations for your writers based on what the data coming in is. This is revolutionary for code reuse.

We can write our output in a custom format that could look like a document, HTML, fixed width, a form letter, weird delimiter or whatever you need.

Sample records using different methods.

We use the RecordSinkServiceLookup to allow us to change our sink location dynamically, we are passing in an attribute to choose Kafka.

We have pushed our data to Kafka via the KafkaRecordSink. We can see our data easily in Streams Messaging Manager (SMM).

With a RecordReaderFactory, you can pick readers like the new WindowsEventLogReader.

As another output, we can UpdateHiveTable from our data and change the table as needed.

Straight From Release Notes: New Feature

[NIFI-7386] - AzureStorageCredentialsControllerService should also connect to storage emulator
[NIFI-7429] - Add Status History capabilities for system level metrics
[NIFI-7549] - Adding Hazelcast based implementation for DistributedMapCacheClient
[NIFI-7624] - Build a ListenFTP processor
[NIFI-7745] - Add a SampleRecord processor
[NIFI-7796] - Add Prometheus metrics for total bytes received and bytes sent for components
[NIFI-7801] - Add acknowledgement check to Splunk
[NIFI-7821] - Create a Cassandra implementation of DistributedMapCacheClient
[NIFI-7879] - Create record path function for UUID v5
[NIFI-7906] - Add graph processor with flexibility to query graph database conditioned on flowfile content and attributes
[NIFI-7989] - Add Hive "data drift" processor
[NIFI-8136] - Allow State Management to be tied to Process Session
[NIFI-8142] - Add "on conflict do nothing" feature to PutDatabaseRecord
[NIFI-8146] - Allow RecordPath to be used for specifying operation type and data fields when using PutDatabaseRecord
[NIFI-8175] - Add a WindowsEventLogReader