Skip to main content

No More Spaghetti Flows

Spaghetti Flows




You may have heard of:   https://en.wikipedia.org/wiki/Spaghetti_code.   For Apache NiFi, I have seen some (and have done some of them in the past), I call them Spaghetti Flows.


Let's avoid them.   When you are first building a flow it often meanders and has lots of extra steps and extra UpdateAttributes and random routes. This applies if you are running on-premise, in CDP or in other stateful NiFi clusters (or single nodes). The following video from Mark Payne is a must watch before you write any NiFi flows.


Apache NiFi Anti-Patterns with Mark Payne


https://www.youtube.com/watch?v=RjWstt7nRVY

https://www.youtube.com/watch?v=v1CoQk730qs

https://www.youtube.com/watch?v=JbUjYr6Kd3I

https://github.com/tspannhw/EverythingApacheNiFi 



Do Not:

  • Do not Put 1,000 Flows on one workspace.

  • If your flow has hundreds of steps, this is a Flow Smell.   Investigate why.

  • Do not Use ExecuteProcess, ExecuteScripts or a lot of Groovy scripts as a default, look for existing processors

  • Do not Use Random Custom Processors you find that have no documentation or are unknown.

  • Do not forget to upgrade, if you are running anything before Apache NiFi 1.10, upgrade now!

  • Do not run on default 512M RAM.

  • Do not run one node and think you have a highly available cluster.

  • Do not split a file with millions of records to individual records in one shot without checking available space/memory and back pressure.

  • Use Split processors only as an absolute last resort. Many processors are designed to work on FlowFiles that contain many records or many lines of text. Keeping the FlowFiles together instead of splitting them apart can often yield performance that is improved by 1-2 orders of magnitude.


Do:

  • Reduce, Reuse, Recycle.    Use Parameters to reuse common modules.

  • Put flows, reusable chunks (write to Slack, Database, Kafka) into separate Process Groups.

  • Write custom processors if you need new or specialized features

  • Use Cloudera supported NiFi Processors

  • Use RecordProcessors everywhere

  • Read the Docs!

  • Use the NiFi Registry for version control.

  • Use NiFi CLI and DevOps for Migrations.

  • Run a CDP NiFi Datahub or CFM managed 3 or more node cluster.

  • Walk through your flow and make sure you understand every step and it’s easy to read and follow.   Is every processor used?   Are there dead ends?

  • Do run Zookeeper on different nodes from Apache NiFi.

  • For Cloud Hosted Apache NiFi - go with the "high cpu" instances, such as 8 cores, 7 GB ram.

  • same flow 'templatized' and deployed many many times with different params in the same instance

  • Use routing based on content and attributes to allow one flow to handle multiple nearly identical flows is better than deploying the same flow many times with tweaks to parameters in same cluster.

  • Use the correct driver for your database.   There's usually a couple different JDBC drivers.

  • Make sure you match your Hive version to the NiFi processor for it.   There are ones out there for Hive 1 and Hive 3!   HiveStreaming needs Hive3 with ACID, ORC.  https://community.cloudera.com/t5/Support-Questions/how-to-use-puthivestreaming/td-p/108430


Let's revisit some Best Practices:


https://medium.com/@abdelkrim.hadjidj/best-practices-for-using-apache-nifi-in-real-world-projects-3-takeaways-1fe6912101db


Get your Apache NiFi for Dummies.   My own NiFi 101.


Here are a few things you should have read and tried before building your first Apache NiFi flow:

Also when in doubt, use Records!  Use Record Processors and use pre-defined schemas, this will be easier to develop, cleaner and more performant. Easier, Faster, Better!!!


There are record processors for Logs (Grok), JSON, AVRO, XML, CSV, Parquet and more.


Look for a processor that has “Record” in the name like PutDatabaseRecord or QueryRecord.


Use the best DevOps processes, testing and tools.

Some newer features in 1.8, 1.9, 1.10, 1.11 that you need to use.

Advanced Articles:

Spaghetti is for eating, not for real-time data streams.   Let's keep it that way.


If you are not sure what to do check out the Cloudera Community, NiFi Slack or the NiFi docs.   Also I may have a helpful article here. Join me and my NiFi friends at virtual meetups for more in-depth NiFi, Flink, Kafka and more. We keep it interactive so you can feel free to ask questions.


Note:   In this picture I am in Italy doing spaghetti research.


Popular posts from this blog

Ingesting Drone Data From DJII Ryze Tello Drones Part 1 - Setup and Practice

Ingesting Drone Data From DJII Ryze Tello Drones Part 1 - Setup and Practice In Part 1, we will setup our drone, our communication environment, capture the data and do initial analysis. We will eventually grab live video stream for object detection, real-time flight control and real-time data ingest of photos, videos and sensor readings. We will have Apache NiFi react to live situations facing the drone and have it issue flight commands via UDP. In this initial section, we will control the drone with Python which can be triggered by NiFi. Apache NiFi will ingest log data that is stored as CSV files on a NiFi node connected to the drone's WiFi. This will eventually move to a dedicated embedded device running MiniFi. This is a small personal drone with less than 13 minutes of flight time per battery. This is not a commercial drone, but gives you an idea of the what you can do with drones. Drone Live Communications for Sensor Readings and Drone Control You must connect t

NiFi on Cloudera Data Platform Upgrade - April 2021

CFM 2.1.1 on CDP 7.1.6 There is a new Cloudera release of Apache NiFi now with SAML support. Apache NiFi 1.13.2.2.1.1.0 Apache NiFi Registry 0.8.0.2.1.1.0 See:    https://blog.cloudera.com/the-new-releases-of-apache-nifi-in-public-cloud-and-private-cloud/ https://docs.cloudera.com/cfm/2.1.1/release-notes/topics/cfm-component-support.html https://docs.cloudera.com/cfm/2.1.1/release-notes/topics/cfm-whats-new.html https://docs.cloudera.com/cfm/2.1.1/upgrade-paths/topics/cfm-upgrade-paths.html   For changes:    https://www.datainmotion.dev/2021/02/new-features-of-apache-nifi-1130.html Get your download on:  https://docs.cloudera.com/cfm/2.1.1/download/topics/cfm-download-locations.html To start researching for the future, take a look at some of the technical preview features around Easy Rules engine and handlers. https://docs.cloudera.com/cfm/2.1.1/release-notes/topics/cfm-technical-preview.html Make sure you use the latest possible JDK 8 as there are some bugs out there.   Use a recent v

Using Apache NiFi in OpenShift and Anywhere Else to Act as Your Global Integration Gateway

Using Apache NiFi in OpenShift and Anywhere Else to Act as Your Global Integration Gateway What does it look like? Where Can I Run This Magic Engine: Private Cloud, Public Cloud, Hybrid Cloud, VM, Bare Metal, Single Node, Laptop, Raspberry Pi or anywhere you have a 1GB of RAM and some CPU is a good place to run a powerful graphical integration and dataflow engine.   You can also run MiNiFi C++ or Java agents if you want it even smaller. Sounds Too Powerful and Expensive: Apache NiFi is Open Source and can be run freely anywhere. For What Use Cases: Microservices, Images, Deep Learning and Machine Learning Models, Structured Data, Unstructured Data, NLP, Sentiment Analysis, Semistructured Data, Hive, Hadoop, MongoDB, ElasticSearch, SOLR, ETL/ELT, MySQL CDC, MySQL Insert/Update/Delete/Query, Hosting Unlimited REST Services, Interactive with Websockets, Ingesting Any REST API, Natively Converting JSON/XML/CSV/TSV/Logs/Avro/Parquet, Excel, PDF, Word Documents, Syslog, Kafka, JMS, MQTT, TCP