How about Some Free Cloud Training?

CDP Private Cloud Fundamentals

Cloudera, IBM, and Red Hat

Our CDP Private Cloud Fundamentals OnDemand course provides a solid introduction to CDP Private Cloud. In addition to learning what CDP Private Cloud is and how it fits into the Enterprise Data Cloud vision, you'll find out about its architecture and how it uses cloud-native design elements such as containerization in order to overcome limitations of the traditional bare metal cluster architecture. Following a summary of the system requirements, the course concludes with a demonstration of a CDP Private Cloud installation.

Cloudera Essentials for CDP

Introduction to Cloudera Manager

Introduction to Cloudera Data Warehouse: Self-Service Analytics in the Cloud with CDP

Introduction to Cloudera Machine Learning

Introduction to Apache Impala

Enriching Data with PySpark

Introduction to Hive

Demo Jam Build a Flow with Apache NiFi

Semantic Analysis

Apache YuniKorn

Introduction to Apache Ozone

Importing Data into the Cloud with Apache NiFi

Processing Fixed Width and Complex Files

The first decision you will have to make is whether the data is structured at all. If it is a known type like CSV, JSON, Avro, XML or Parquet, then just use a record reader.
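As a rough illustration of that first decision, the sketch below maps a file extension to the name of a NiFi record reader. The mapping and the `pick_reader` helper are assumptions made for this example, not part of NiFi; in a real flow you would configure the reader on a record processor instead.

```python
from pathlib import Path

# Illustrative mapping (an assumption for this sketch): file extension
# to the NiFi record reader one might configure for that type.
READERS = {
    ".csv": "CSVReader",
    ".json": "JsonTreeReader",
    ".avro": "AvroReader",
    ".xml": "XMLReader",
    ".parquet": "ParquetReader",
}


def pick_reader(filename):
    """Return a reader name for a known structured type, else None."""
    return READERS.get(Path(filename).suffix.lower())
```

A `None` result means the file is not one of the known structured types, so one of the other options described here applies.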

If it's semi-structured, like a log file, GrokReader or ExtractGrok may work.
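Grok patterns are essentially named regular expressions, so the idea can be sketched in plain Python with named groups. The log layout below (timestamp, level, message) is a hypothetical example, and this is not the Grok syntax that GrokReader or ExtractGrok expects; it only shows what the extraction produces.

```python
import re

# Hypothetical log layout: "<date> <time> <LEVEL> <message>"
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<message>.*)"
)


def parse_log_line(line):
    """Return the named fields of a matching line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None
```

Lines that do not fit the pattern come back as `None`, which mirrors the need to route unmatched content somewhere else in a flow.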

If it's like CSV, you may be able to tweak the CSV reader to work (say, header or no header) or try one of the two CSV parsers NiFi has (Jackson or Apache Commons).
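The header versus no-header choice can be illustrated outside NiFi with Python's standard `csv` module. This is only a sketch of the concept; the `read_csv` helper and its parameters are made up for the example and do not correspond to CSVReader properties.

```python
import csv
import io


def read_csv(text, has_header=True, fieldnames=None):
    """Parse CSV text into a list of dicts, with or without a header row."""
    if not has_header and not fieldnames:
        raise ValueError("fieldnames are required when there is no header row")
    stream = io.StringIO(text)
    if has_header:
        # The first row supplies the column names.
        return list(csv.DictReader(stream))
    # No header row: the caller supplies the column names.
    return list(csv.DictReader(stream, fieldnames=fieldnames))
```

Both calls below yield the same records, one taking names from the file and the other from the caller: `read_csv("id,name\n1,apple\n")` and `read_csv("1,apple\n", has_header=False, fieldnames=["id", "name"])`.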

If it's a format like PDF, Word, Excel or RTF, I have a custom processor that uses Apache Tika and should be able to parse it into text. Once it is text, you can probably work with it.
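Once content is plain text, a fixed-width layout (the other half of this section's title) can be handled by slicing each line at known column offsets. The sketch below assumes a hypothetical three-column layout; the field names and offsets are invented for illustration, and wiring this logic into a scripted processor is not shown.

```python
# Hypothetical layout: (field name, start offset, end offset), end exclusive.
LAYOUT = [("id", 0, 5), ("name", 5, 15), ("price", 15, 22)]


def parse_fixed_width(line, layout=LAYOUT):
    """Slice one fixed-width line into a dict of stripped field values."""
    return {name: line[start:end].strip() for name, start, end in layout}
```

Each resulting dict can then be written out as JSON or CSV, at which point the record-based processors listed below become usable.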



Processors To Use For File Manipulation

  • AttributesToCSV
  • AttributesToJSON
  • ConvertExcelToCSVProcessor 
  • ConvertRecord
  • ConvertText
  • CSVReader
  • EvaluateJSONPath
  • EvaluateXPath
  • EvaluateXQuery
  • ExecuteScript
  • ExecuteStreamCommand
  • ExtractGrok
  • ExtractText
  • FlattenJson
  • ForkRecord
  • GrokReader
  • JsonPathReader
  • JsonTreeReader
  • JoltTransformJSON
  • JoltTransformRecord
  • LookupAttribute
  • LookupRecord
  • MergeContent
  • MergeRecord
  • ModifyBytes
  • ParseSyslog*
  • PartitionRecord
  • QueryRecord
  • ReaderLookup
  • ReplaceText
  • ReplaceTextWithMapping
  • ScriptedReader
  • ScriptedRecordSink
  • ScriptedTransformRecord
  • SegmentContent
  • SplitContent
  • SplitJson
  • SplitRecord
  • SplitText
  • SplitXml
  • SyslogReader
  • TransformXml
  • UnpackContent
  • UpdateAttribute
  • UpdateRecord
  • ValidateCsv
  • ValidateRecord
  • ValidateXml

Custom Processors

Helper Projects, SDK, Libraries and Services

Price Comparisons Using Retail REST APIs with Apache NiFi, Kafka and Flink SQL

Part 1: NiFi REST
Part 2: Kafka - Flink SQL
Part 3: Cloudera Visual Apps
Part 4: Smart Shelf Updates - MiNiFi Agents

MiNiFi Agent Update March 2021

Cloudera Agent Availability

Getting Started

MiNiFi (C++)

Version cpp-0.9.0

Release Date: March 2021

Highlights of 0.9.0 release include:

  • Added support for RocksDB-based content repository for better performance
  • Added SQL extension
  • Improved task scheduling
  • Various C2 improvements
  • Bug fixes and improvements to TailFile, ConsumeWindowsEventLog, MergeContent, CompressContent, PublishKafka, InvokeHTTP
  • Implemented RetryFlowFile and smart handling of loopback connections
  • Added a way to encrypt sensitive config properties and the flow configuration
  • Implemented full S3 support
  • Reduced memory footprint when working with many flow files

Build Notes:

It is advised that you use the when not building on Windows.

Download Now As Source or Pre-Built for Your Platform