Technical Preview - Cloudera DataFlow

Import Flows Built in CDP Datahub Flow Management

Deploy Flows

Import a Quick Flow


Top DataFlow Resources

Example Flows

How about Some Free Cloud Training?


CDP Private Cloud Fundamentals

Cloudera, IBM, and Red Hat

Our CDP Private Cloud Fundamentals OnDemand course provides a solid introduction to CDP Private Cloud. In addition to learning what CDP Private Cloud is and how it fits into the Enterprise Data Cloud vision, you'll find out about its architecture and how it uses cloud-native design elements such as containerization in order to overcome limitations of the traditional bare metal cluster architecture. Following a summary of the system requirements, the course concludes with a demonstration of a CDP Private Cloud installation.

Cloudera Essentials for CDP

Introduction to Cloudera Manager

Introduction to Cloudera Data Warehouse: Self-Service Analytics in the Cloud with CDP

Introduction to Cloudera Machine Learning

Introduction to Apache Impala

Enriching Data with Pyspark

Introduction to Hive

Demo Jam Build a Flow with Apache NiFi

Semantic Analysis

Apache Yunikorn

Introduction to Apache Ozone

Importing Data into the Cloud with Apache NiFi

Processing Fixed Width and Complex Files


The first decision you will have to make is whether the data is structured at all. If it is a known type like CSV, JSON, AVRO, XML or Parquet, then just use a record reader.

If it's semi-structured like a log file, GrokReader or ExtractGrok may work.

If it's like CSV, you may be able to tweak the CSV reader to work (say header or no header) or try one of the two CSV parsers NiFi has (Jackson or Apache Commons).

If it's a format like PDF, Word, Excel or RTF, I have a custom processor that uses Apache Tika that should be able to parse it into text. Once it is text you can probably work with it.
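
The decision steps above can be sketched as a small lookup. This is only an illustration in Python, not NiFi configuration: the extension sets and the `choose_approach` helper are hypothetical, and a real flow would usually decide from MIME type or attributes rather than the file name alone.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a NiFi record reader.
RECORD_READERS = {
    ".csv": "CSVReader",
    ".json": "JsonTreeReader",
    ".avro": "AvroReader",
    ".xml": "XMLReader",
    ".parquet": "ParquetReader",
}
# Assumed extension sets for the other two branches of the decision.
SEMI_STRUCTURED = {".log"}
DOCUMENTS = {".pdf", ".doc", ".docx", ".xls", ".xlsx", ".rtf"}


def choose_approach(filename: str) -> str:
    """Return a suggested parsing approach for a file name."""
    ext = Path(filename).suffix.lower()
    if ext in RECORD_READERS:
        return RECORD_READERS[ext]
    if ext in SEMI_STRUCTURED:
        return "GrokReader or ExtractGrok"
    if ext in DOCUMENTS:
        return "Apache Tika text extraction"
    return "inspect manually"


if __name__ == "__main__":
    for name in ("orders.csv", "server.log", "invoice.pdf", "mystery.bin"):
        print(name, "->", choose_approach(name))
```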



Processors To Use For File Manipulation

  • AttributesToCSV
  • AttributesToJSON
  • ConvertExcelToCSVProcessor 
  • ConvertRecord
  • ConvertText
  • CSVReader
  • EvaluateJSONPath
  • EvaluateXPath
  • EvaluateXQuery
  • ExecuteScript
  • ExecuteStreamCommand
  • ExtractGrok
  • ExtractText
  • FlattenJson
  • ForkRecord
  • GrokReader
  • JsonPathReader
  • JsonTreeReader
  • JoltTransformJSON
  • JoltTransformRecord
  • LookupAttribute
  • LookupRecord
  • MergeContent
  • MergeRecord
  • ModifyBytes
  • ParseSyslog*
  • PartitionRecord
  • QueryRecord
  • ReaderLookup
  • ReplaceText
  • ReplaceTextWithMapping
  • ScriptedReader
  • ScriptedRecordSink
  • ScriptedTransformRecord
  • SegmentContent
  • SplitContent
  • SplitJson
  • SplitRecord
  • SplitText
  • SplitXml
  • SyslogReader
  • TransformXml
  • UnpackContent
  • UpdateAttribute
  • UpdateRecord
  • ValidateCsv
  • ValidateRecord
  • ValidateXml

Custom Processors

Helper Projects, SDK, Libraries and Services

Price Comparisons Using Retail REST APIs with Apache NiFi, Kafka and Flink SQL

Part 1: NiFi REST
Part 2: Kafka - Flink SQL
Part 3: Cloudera Visual Apps
Part 4: Smart Shelf Updates - MiNiFi Agents

MiNiFi Agent Update March 2021

 Cloudera Agent Availability

Getting Started

MiNiFi (C++)

Version cpp-0.9.0

Release Date: March 2021

Highlights of 0.9.0 release include:

  • Added support for RocksDB-based content repository for better performance
  • Added SQL extension
  • Improved task scheduling
  • Various C2 improvements
  • Bug fixes and improvements to TailFile, ConsumeWindowsEventLog, MergeContent, CompressContent, PublishKafka, InvokeHTTP
  • Implemented RetryFlowFile and smart handling of loopback connections
  • Added a way to encrypt sensitive config properties and the flow configuration
  • Implemented full S3 support
  • Reduced memory footprint when working with many flow files

Build Notes:

It is advised that you use the when not building on windows.

Download Now As Source or Pre-Built for Your Platform

New Features of Apache NiFi 1.13.2

New Features
  • ListenFTP
  • UpdateHiveTable - applies Hive DDL changes to update the table schema, i.e. handles data drift and Hive schema migration!
  • SampleRecord - different sampling approaches to records (Interval Sampling, Probabilistic Sampling, Reservoir Sampling)
  • CDC Updates
  • Kudu updates
  • AMQP and MQTT Integration Upgrades
  • ConsumeMQTT - readers and writers added
  • HTTP access to NiFi by default is now configured to accept connections to only.  If you want to allow broader access for some reason for HTTP and you understand the security implications you can still control that as always by changing the '' property in as always. That said, please take the time to configure proper HTTPS.  We offer detailed instructions and tooling to assist.
  • The ability to run NiFi with no GUI as MiNiFi/NiFi combined code base continues.
  • Support for Kudu Dates
  • Updated GRPC versions
  • Apache Calcite update
  • PutDatabaseRecord update

Here is an example NiFi ETL Flow:

Example NiFi 1.13.2 Flow:
  • ConsumeMQTT:   now with readers
  • UpdateAttribute:   set to kafka and to json.
  • SampleRecord:   sample a few of the records
  • PutRecord:   Use reader and destination service
  • UpdateHiveTable:   new sink
Consume from MQTT and read and write to/from records.

Some example attributes from a running flow:

Connection Pools for DatabaseRecordSinks can be JDBC, Hadoop and Hive.

FreeFormTextRecordSetWriter is great for writing any format.

For the RecordSinkService, we will pick Kafka as our destination.

KafkaRecordSink from PutRecord

The reader will pick JSON in our example based on our UpdateAttribute. We can dynamically change this as data streams.

ReaderLookup  - lets you pick a reader based on an attribute.

We have defined readers for Parquet, JSON, AVRO, XML and CSV so no matter the type I can automagically read it.    Great for reusing code and great for cases like our new ListenFTP where you may get sent tons of different files to process.   Use one FLOW!
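
The idea behind picking a reader from an attribute can be sketched in plain Python. This is a conceptual illustration only, not how ReaderLookup is implemented: the attribute name `reader.format` and the helper functions are hypothetical, and only JSON and CSV are covered because the standard library can parse them.

```python
import csv
import io
import json


def read_json(text):
    """Parse JSON content into a list of record dicts."""
    data = json.loads(text)
    return data if isinstance(data, list) else [data]


def read_csv(text):
    """Parse CSV content with a header row into a list of record dicts."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


# One registry of readers, keyed by the value of a flow file attribute.
READERS = {"json": read_json, "csv": read_csv}


def read_records(content, attributes, key="reader.format"):
    """Pick a reader from the attribute named `key` (a hypothetical name)."""
    fmt = attributes.get(key)
    if fmt not in READERS:
        raise ValueError(f"no reader registered for format {fmt!r}")
    return READERS[fmt](content)


if __name__ == "__main__":
    print(read_records('{"id": 1}', {"reader.format": "json"}))
    print(read_records("id,name\n1,a\n", {"reader.format": "csv"}))
```

The same flow logic handles both inputs; only the attribute value changes.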

RecordSinkService can help you make all your flows generic, so you can drop in different sinks/destinations for your writers based on the data coming in. This is revolutionary for code reuse.

We can write our output in a custom format that could look like a document, HTML, fixed width, a form letter, weird delimiter or whatever you need.
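
As a rough analogy for free-form output, the Python sketch below renders records through a text template and as fixed-width lines. It assumes nothing about the writer's internals; the `render` and `fixed_width` helpers and the sample field names are hypothetical.

```python
from string import Template


def render(records, template_text):
    """Render each record through a ${field} style text template."""
    template = Template(template_text)
    return "\n".join(template.substitute(record) for record in records)


def fixed_width(record, layout):
    """Render one record as a fixed-width line.

    `layout` is a list of (field name, width) pairs; values are
    truncated or space-padded on the right to fit each width.
    """
    return "".join(str(record[field])[:width].ljust(width) for field, width in layout)


if __name__ == "__main__":
    rows = [{"item": "milk", "price": "3.49"}, {"item": "bread", "price": "2.25"}]
    print(render(rows, "Item ${item} costs ${price}"))
    for row in rows:
        print(repr(fixed_width(row, [("item", 6), ("price", 6)])))
```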

Sample records using different methods.
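
To make the three sampling approaches concrete, here is a minimal Python sketch of interval, probabilistic and reservoir sampling. It illustrates the general techniques only and is not taken from the SampleRecord source; the function names and parameters are assumptions.

```python
import random


def interval_sample(records, interval):
    """Keep every `interval`-th record (the 3rd, 6th, 9th... for interval=3)."""
    return [r for i, r in enumerate(records, start=1) if i % interval == 0]


def probabilistic_sample(records, probability, rng):
    """Keep each record independently with the given probability."""
    return [r for r in records if rng.random() < probability]


def reservoir_sample(records, k, rng):
    """Keep a uniform random sample of k records from a stream (Algorithm R)."""
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so runs are repeatable
    data = list(range(1, 101))
    print(interval_sample(data, 25))
    print(len(probabilistic_sample(data, 0.1, rng)))
    print(reservoir_sample(data, 5, rng))
```

Reservoir sampling is the one to reach for when the total number of records is not known up front, since it needs only a single pass and a buffer of size k.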

We use the RecordSinkServiceLookup to allow us to change our sink location dynamically. Here we are passing in an attribute to choose Kafka.

We have pushed our data to Kafka via the KafkaRecordSink.  We can see our data easily in Streams Messaging Manager (SMM).

With a RecordReaderFactory, you can pick readers like the new WindowsEventLogReader.

As another output, we can UpdateHiveTable from our data and change the table as needed.

Straight From Release Notes:  New Feature
  • [NIFI-7386] - AzureStorageCredentialsControllerService should also connect to storage emulator
  • [NIFI-7429] - Add Status History capabilities for system level metrics
  • [NIFI-7549] - Adding Hazelcast based implementation for DistributedMapCacheClient
  • [NIFI-7624] - Build a ListenFTP processor
  • [NIFI-7745] - Add a SampleRecord processor
  • [NIFI-7796] - Add Prometheus metrics for total bytes received and bytes sent for components
  • [NIFI-7801] - Add acknowledgement check to Splunk
  • [NIFI-7821] - Create a Cassandra implementation of DistributedMapCacheClient
  • [NIFI-7879] - Create record path function for UUID v5
  • [NIFI-7906] - Add graph processor with flexibility to query graph database conditioned on flowfile content and attributes
  • [NIFI-7989] - Add Hive "data drift" processor
  • [NIFI-8136] - Allow State Management to be tied to Process Session
  • [NIFI-8142] - Add "on conflict do nothing" feature to PutDatabaseRecord
  • [NIFI-8146] - Allow RecordPath to be used for specifying operation type and data fields when using PutDatabaseRecord
  • [NIFI-8175] - Add a WindowsEventLogReader
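
For readers unfamiliar with the "on conflict do nothing" behavior mentioned in NIFI-8142, the Python sketch below shows the general SQL idea using an in-memory SQLite database. It assumes SQLite 3.24 or later for the ON CONFLICT clause, uses a hypothetical `prices` table, and does not claim to show the SQL that PutDatabaseRecord generates.

```python
import sqlite3

# In-memory database with a hypothetical table keyed by SKU.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (sku TEXT PRIMARY KEY, price REAL)")

# The third row repeats the key "A1" and should be skipped, not fail.
rows = [("A1", 1.99), ("B2", 5.00), ("A1", 2.49)]
conn.executemany(
    "INSERT INTO prices (sku, price) VALUES (?, ?) ON CONFLICT (sku) DO NOTHING",
    rows,
)

result = conn.execute("SELECT sku, price FROM prices ORDER BY sku").fetchall()
print(result)
```

The first value written for each key wins, and the duplicate insert is silently ignored instead of raising a constraint error.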

An update to Cloudera Flow Management!

Cloudera Flow Management on DataHub Public Cloud 

This minor update has some Schema Registry and Atlas integration updates.  

If that wasn't enough, there is a new version of the MiNiFi C++ Agent!

Cloudera Edge Manager 1.2.2 Release

February 15, 2021

CEM MiNiFi C++ Agent - 1.21.01 release includes:
  • Support for JSON output in the Consume Windows Event Log processor
  • Full Expression Language support on Windows
  • Full S3 support (List, Fetch, Get, Put)

Remember when you are done.