
No More Spaghetti Flows

Spaghetti Flows




You may have heard of spaghetti code: https://en.wikipedia.org/wiki/Spaghetti_code. In Apache NiFi I have seen the same thing (and have built some myself in the past). I call them Spaghetti Flows.


Let's avoid them. When you are first building a flow, it often meanders and picks up lots of extra steps, extra UpdateAttribute processors and random routes. This applies whether you are running on-premises, in CDP or in other stateful NiFi clusters (or single nodes). The following video from Mark Payne is a must-watch before you write any NiFi flows.


Apache NiFi Anti-Patterns with Mark Payne


https://www.youtube.com/watch?v=RjWstt7nRVY

https://www.youtube.com/watch?v=v1CoQk730qs

https://www.youtube.com/watch?v=JbUjYr6Kd3I

https://github.com/tspannhw/EverythingApacheNiFi 



Do Not:

  • Do not put 1,000 flows on one workspace.

  • If your flow has hundreds of steps, this is a Flow Smell.   Investigate why.

  • Do not use ExecuteProcess, ExecuteScript or a lot of Groovy scripts as your default. Look for existing processors first.

  • Do not use random custom processors you find that have no documentation or come from an unknown source.

  • Do not forget to upgrade. If you are running anything before Apache NiFi 1.10, upgrade now!

  • Do not run on default 512M RAM.

  • Do not run one node and think you have a highly available cluster.

  • Do not split a file with millions of records into individual records in one shot without checking available space/memory and back pressure.

  • Do not use Split processors except as an absolute last resort. Many processors are designed to work on FlowFiles that contain many records or many lines of text. Keeping the FlowFiles together instead of splitting them apart can often improve performance by 1-2 orders of magnitude.
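
To make the split warning concrete, here is a small back-of-the-envelope sketch in Python. It is not NiFi code: the function names and the five-million-record file are made up for illustration, and the 10,000-object threshold is assumed to match the default back pressure object threshold on a NiFi connection, so check your own connection settings.

```python
import math

# Assumption for illustration: a connection's back pressure object threshold
# is 10,000 FlowFiles. Your connections may be configured differently.
BACK_PRESSURE_OBJECT_THRESHOLD = 10_000


def flowfiles_after_split(total_records: int, records_per_flowfile: int) -> int:
    """Count the FlowFiles produced when one file is split into chunks."""
    if records_per_flowfile < 1:
        raise ValueError("records_per_flowfile must be at least 1")
    return math.ceil(total_records / records_per_flowfile)


def exceeds_back_pressure(flowfile_count: int,
                          threshold: int = BACK_PRESSURE_OBJECT_THRESHOLD) -> bool:
    """Return True when a single split would overfill one connection's queue."""
    return flowfile_count > threshold


total = 5_000_000  # hypothetical file with five million records
for size in (1, 10_000, total):
    count = flowfiles_after_split(total, size)
    print(f"{size:>9} records per FlowFile -> {count:>9} FlowFiles, "
          f"over threshold: {exceeds_back_pressure(count)}")
```

Splitting to one record per FlowFile produces millions of FlowFiles from a single input, while keeping the file whole produces exactly one. That difference is the whole argument for record processors over Split processors.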


Do:

  • Reduce, Reuse, Recycle.    Use Parameters to reuse common modules.

  • Put flows and reusable chunks (write to Slack, a database, Kafka) into separate Process Groups.

  • Write custom processors if you need new or specialized features

  • Use Cloudera supported NiFi Processors

  • Use Record processors everywhere

  • Read the Docs!

  • Use the NiFi Registry for version control.

  • Use NiFi CLI and DevOps for Migrations.

  • Run a CDP NiFi Datahub or a CFM-managed cluster of three or more nodes.

  • Walk through your flow and make sure you understand every step and that it is easy to read and follow.   Is every processor used?   Are there dead ends?

  • Run ZooKeeper on different nodes from Apache NiFi.

  • For cloud-hosted Apache NiFi, go with the "high CPU" instances, such as 8 cores and 7 GB of RAM.

  • Use routing based on content and attributes so that one flow can handle multiple nearly identical cases. That is better than taking the same 'templatized' flow and deploying it many, many times with tweaked parameters in the same instance or cluster.

  • Use the correct driver for your database.   There are usually a couple of different JDBC drivers to choose from.

  • Make sure the NiFi Hive processor you use matches your Hive version.   There are processors out there for Hive 1 and Hive 3!   HiveStreaming needs Hive 3 with ACID and ORC.  https://community.cloudera.com/t5/Support-Questions/how-to-use-puthivestreaming/td-p/108430
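
The "walk through your flow" check above can be roughed out in a few lines of Python. This is only a sketch under stated assumptions: the data model (a set of processor names, a list of source/destination pairs and a set of intended sinks) is hypothetical and is not how NiFi stores or exposes a flow, and the `audit_flow` helper is invented for this post. You would have to fill these structures in from your own flow definition.

```python
from typing import Iterable, List, Set, Tuple


def audit_flow(processors: Set[str],
               connections: Iterable[Tuple[str, str]],
               sinks: Set[str]) -> Tuple[List[str], List[str]]:
    """Return (unused, dead_ends) for a simplified, hypothetical flow graph.

    unused    - processors with no incoming and no outgoing connection
    dead_ends - processors that receive data, send it nowhere,
                and are not listed as an intended sink
    """
    connections = list(connections)
    has_out = {src for src, _ in connections}
    has_in = {dst for _, dst in connections}
    unused = sorted(p for p in processors
                    if p not in has_in and p not in has_out)
    dead_ends = sorted(p for p in processors
                       if p in has_in and p not in has_out and p not in sinks)
    return unused, dead_ends


# Hypothetical flow: names are real processor types, the wiring is made up.
processors = {"ConsumeKafka", "UpdateRecord", "PutDatabaseRecord",
              "UpdateAttribute", "LogAttribute"}
connections = [("ConsumeKafka", "UpdateRecord"),
               ("UpdateRecord", "PutDatabaseRecord"),
               ("UpdateRecord", "UpdateAttribute")]
sinks = {"PutDatabaseRecord"}  # where data is meant to leave the flow

unused, dead_ends = audit_flow(processors, connections, sinks)
print("Unused processors:", unused)
print("Dead ends:", dead_ends)
```

In this made-up flow, LogAttribute is wired to nothing and UpdateAttribute receives data that never goes anywhere, which is exactly the kind of leftover that turns a flow into spaghetti.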


Let's revisit some Best Practices:


https://medium.com/@abdelkrim.hadjidj/best-practices-for-using-apache-nifi-in-real-world-projects-3-takeaways-1fe6912101db


Get your Apache NiFi for Dummies.   My own NiFi 101.


Here are a few things you should have read and tried before building your first Apache NiFi flow:

Also, when in doubt, use Records!  Use Record processors with pre-defined schemas. Your flows will be easier to develop, cleaner and more performant. Easier, Faster, Better!!!


There are record processors for Logs (Grok), JSON, AVRO, XML, CSV, Parquet and more.


Look for a processor that has “Record” in the name like PutDatabaseRecord or QueryRecord.
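
As a rough analogy for what a record processor does, the Python sketch below converts a whole CSV payload to JSON in one pass using a pre-defined schema, keeping every record together instead of splitting them apart. This is plain Python, not NiFi's record API: the schema, field names, sample data and the `convert_records` helper are all invented for illustration. In NiFi the comparable setup would be a processor such as ConvertRecord configured with a record reader and a record writer.

```python
import csv
import io
import json

# Hypothetical pre-defined schema: field name -> type used to cast each value.
SCHEMA = {"station": str, "temp_c": float, "humidity": int}


def convert_records(csv_text: str) -> str:
    """Read every CSV record, apply the schema, and write one JSON array."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = [
        {name: cast(row[name]) for name, cast in SCHEMA.items()}
        for row in reader
    ]
    return json.dumps(records)


# Made-up sample payload: one "FlowFile" holding several records.
sample = "station,temp_c,humidity\nKJFK,21.5,60\nKSFO,17.0,72\n"
print(convert_records(sample))
```

The input stays as one unit of work from start to finish, and the schema is stated once up front rather than rediscovered at every step.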


Use the best DevOps processes, testing and tools.

Some newer features in 1.8, 1.9, 1.10, 1.11 that you need to use.

Advanced Articles:

Spaghetti is for eating, not for real-time data streams.   Let's keep it that way.


If you are not sure what to do, check out the Cloudera Community, NiFi Slack or the NiFi docs.   I may also have a helpful article here. Join me and my NiFi friends at virtual meetups for more in-depth NiFi, Flink, Kafka and more. We keep it interactive, so feel free to ask questions.


Note:   In this picture I am in Italy doing spaghetti research.

