DataFlow Processors Cheat Sheet

 Processors Available in Cloudera DataFlow Designer

I put together a quick reference list of some of the processors available in the April 2023 version of Cloudera DataFlow Designer.

Documentation


Available ReadyFlows

Using a ReadyFlow to build your data flow allows you to get started with CDF quickly and easily. A ReadyFlow is a flow definition template optimized to work with a specific CDP source and destination. So instead of spending your time on building the data flow in NiFi, you can focus on deploying your flow and defining the right KPIs for easy monitoring.

The ReadyFlow Gallery is where you can find out-of-the-box flow definitions. To use a ReadyFlow, add it to the Catalog and then use it to create a Flow Deployment.

This ReadyFlow consumes JSON, CSV or Avro data from a source ADLS container and transforms the data into Avro files before writing it to another ADLS container.

You can use the Airtable to S3/ADLS ReadyFlow to consume objects from an Airtable table, filter them, and write them as JSON, CSV or Avro files to a destination in Amazon S3 or Azure Data Lake Storage (ADLS).

You can use the Azure Event Hub to ADLS ReadyFlow to ingest JSON, CSV or Avro files from an Azure Event Hub namespace, optionally parsing the schema using CDP Schema Registry or direct schema input. The flow then filters records based on a user-provided SQL query and writes them to a target Azure Data Lake Storage (ADLS) location in the specified output data format.

You can use the Box to S3/ADLS ReadyFlow to move data from a source Box location to a destination Amazon S3 bucket or Azure Data Lake Storage (ADLS) location.

This ReadyFlow consumes JSON, CSV or Avro data from a source Kafka topic in Confluent Cloud and parses the schema by looking up the schema name in the Confluent Schema Registry. The filtered events are then written to the destination S3 or ADLS location.

This ReadyFlow consumes JSON, CSV or Avro data from a source Kafka topic in Confluent Cloud and filters records based on a user-provided SQL query before writing them to a Snowflake table.

You can use the Dropbox to S3/ADLS ReadyFlow to ingest data from Dropbox and write it to a destination in Amazon S3 or Azure Data Lake Storage (ADLS).

You can use the Google Drive to S3/ADLS ReadyFlow to ingest data from a Google Drive location to a destination in Amazon S3 or Azure Data Lake Storage (ADLS).

This ReadyFlow consumes change data from the Wikipedia API and converts JSON events to Avro, filtering and merging them before writing a file to local disk.

This ReadyFlow retrieves objects from a Private HubSpot App, converting them to the specified output data format and writing them to the target S3 or ADLS destination.

This ReadyFlow moves data between database tables, filtering records by means of an SQL query.

This ReadyFlow consumes data from a source database table and filters events based on a user-provided SQL query before writing them to a destination Amazon S3 or Azure Data Lake Storage (ADLS) location in the specified output data format.

This ReadyFlow consumes JSON, CSV, or Avro data from a source Kafka topic and parses the schema by looking up the schema name in the CDP Schema Registry. You can filter events by specifying a SQL query in the Filter Rule parameter.
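To get a feel for what a SQL-based filter rule does to a batch of records, here is a minimal sketch that runs outside of NiFi. It is only an illustration, not how the ReadyFlow is implemented: it uses Python's built-in SQLite as a stand-in query engine, the table name FLOWFILE follows the convention of NiFi's QueryRecord processor, and the `filter_records` helper, the field names, and the sample events are all hypothetical.

```python
import sqlite3


def filter_records(records, filter_rule):
    """Load dict records into an in-memory table named FLOWFILE and
    return the rows selected by the SQL in filter_rule.

    Assumption: every record has the same keys as the first one.
    """
    if not records:
        return []
    columns = list(records[0].keys())
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    try:
        col_defs = ", ".join(f'"{c}"' for c in columns)
        conn.execute(f"CREATE TABLE FLOWFILE ({col_defs})")
        placeholders = ", ".join("?" for _ in columns)
        conn.executemany(
            f"INSERT INTO FLOWFILE VALUES ({placeholders})",
            [tuple(r[c] for c in columns) for r in records],
        )
        # The filter rule is plain SQL against the FLOWFILE table.
        return [dict(row) for row in conn.execute(filter_rule)]
    finally:
        conn.close()


# Hypothetical sample events, as they might look after schema parsing.
events = [
    {"sensor_id": "a1", "temperature": 71.5},
    {"sensor_id": "b2", "temperature": 98.2},
    {"sensor_id": "c3", "temperature": 101.0},
]

hot = filter_records(events, "SELECT * FROM FLOWFILE WHERE temperature > 90")
for row in hot:
    print(row)
```

A rule of `SELECT * FROM FLOWFILE` with no WHERE clause would pass every event through unchanged, which is a handy way to start before tightening the filter.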

This ReadyFlow consumes JSON, CSV or Avro data from a source Kafka topic and merges the events into Avro files before writing the data to ADLS. The flow writes out a file whenever the merged data reaches 100 MB or five minutes have passed, whichever comes first.
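The size-or-time rule for writing out a merged file can be sketched in a few lines of Python. This is a rough illustration of the idea only, not the ReadyFlow's actual merge logic: the `MergeBin` class and its method names are hypothetical, events are simply concatenated as bytes rather than written as Avro, and treating 100 MB as 100 * 1024 * 1024 bytes is an assumption.

```python
import time


class MergeBin:
    """Collect events and flush when the bin is big enough or old enough."""

    def __init__(self, max_bytes=100 * 1024 * 1024, max_age_seconds=300,
                 clock=time.monotonic):
        self.max_bytes = max_bytes              # assumed: 100 MB as MiB
        self.max_age_seconds = max_age_seconds  # five minutes
        self._clock = clock
        self._events = []
        self._size = 0
        self._opened_at = None

    def add(self, event: bytes):
        """Add one event; return the merged payload if the size limit
        was reached, otherwise None."""
        if not self._events:
            # The age of a bin is measured from its first event.
            self._opened_at = self._clock()
        self._events.append(event)
        self._size += len(event)
        if self._size >= self.max_bytes:
            return self.flush()
        return None

    def poll(self):
        """Call periodically; flush a non-empty bin that is too old."""
        if self._events and self._clock() - self._opened_at >= self.max_age_seconds:
            return self.flush()
        return None

    def flush(self):
        """Return everything collected so far and start a fresh bin."""
        merged = b"".join(self._events)
        self._events = []
        self._size = 0
        self._opened_at = None
        return merged
```

In use, a consumer loop would call `add()` for each incoming event and `poll()` on a timer, writing any non-None result out as one file.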

This ReadyFlow consumes JSON, CSV or Avro data from a source Kafka topic, parses the schema by looking up the schema name in the CDP Schema Registry and ingests it into an HBase table in COD.

This ReadyFlow consumes JSON, CSV, or Avro data from a source Kafka topic, parses the schema by looking up the schema name in the CDP Schema Registry, and ingests data into an Iceberg table in Hive.

This ReadyFlow consumes JSON, CSV, or Avro data from a source Kafka topic and parses the schema by looking up the schema name in the CDP Schema Registry.

This ReadyFlow consumes JSON, CSV or Avro data from a source Kafka topic, parses the schema by looking up the schema name in the CDP Schema Registry and ingests it into a Kudu table.

This ReadyFlow consumes JSON, CSV or Avro data from a source Kafka topic and merges the events into Avro files before writing the data to S3. The flow writes out a file whenever the merged data reaches 100 MB or five minutes have passed, whichever comes first.

This ReadyFlow listens to a JSON, CSV or Avro data stream on a specified port and parses the schema by looking up the schema name in the CDP Schema Registry. You can filter events by specifying a SQL query. The filtered events are then converted to the specified output data format and written to the destination CDP Kafka topic.

This ReadyFlow listens to a Syslog data stream on a specified port. You can filter events by specifying a SQL query. The filtered events are then converted to the specified output data format and written to the target S3 or ADLS destination.

This ReadyFlow listens to a JSON, CSV or Avro data stream and parses the data based on a specified Avro-formatted schema. You can filter events by specifying a SQL query. The filtered events are then converted to the specified output data format and written to the target S3 or ADLS destination.

This ReadyFlow consumes JSON, CSV or Avro data from a source MQTT topic. You can filter events by specifying a SQL query. The filtered events are then converted to the specified output data format and written to the destination Kafka topic.

This ReadyFlow moves data from a non-CDP-managed source ADLS location to a CDP-managed destination ADLS location.

This ReadyFlow moves data from a non-CDP-managed source S3 location to a CDP-managed destination S3 location.

This ReadyFlow consumes JSON, CSV or Avro data from a source S3 bucket and transforms the data into Avro files before writing it to another S3 bucket.

This ReadyFlow consumes JSON, CSV or Avro data from a source S3 bucket and transforms the data into Avro files before writing it to another S3 bucket. The ReadyFlow is configured with notifications about new files that arrive in the source AWS bucket.

You can use the Salesforce filter to S3/ADLS ReadyFlow to consume objects from a Salesforce database table, filter them, and write the data as JSON, CSV or Avro files to a destination in Amazon S3 or Azure Data Lake Storage (ADLS).

This ReadyFlow consumes objects from a Custom Shopify App, converts them to the specified output data format, and writes them to a CDP-managed destination S3 or ADLS location.


NiFi Introduces Python API

FLaNK Stack Weekly for 24 April 2023

24-April-2023

FLiPN-FLaNK Stack Weekly

Tim Spann @PaaSDev

Real-Time Analytics Summit!

StarTree #RTASummit #ApacheNiFi #ApacheFlink #ApacheKafka #ApachePulsar #ApachePinot #IoT All The Things Open Source.

Join me soon in San Francisco. Use code Spann30 for 30% off registration. Real-Time Analytics Summit!

https://rtasummit.com/

https://www.linkedin.com/feed/update/urn:li:activity:7052315567994605568?utm_source=share&utm_medium=member_desktop

April 25-26, 2023!

April 25, 2023!

https://www.meetup.com/futureofdata-sanfrancisco/events/292453316/

May 3, 2023! Join me and the NiFi creators! https://attend.cloudera.com/nificommitters0503?internal_keyplay=data-flow&internal_campaign=FY24-Q2_Webinar_Cloudera_AMER_NiFi_Meet_the_Committers&cid=7012H000001ZNXBQA4&internal_link=p07

CODE + COMMUNITY

Please join my meetup group NJ/NYC/Philly/Virtual.

http://www.meetup.com/futureofdata-princeton/

https://www.meetup.com/futureofdata-sanfrancisco/events/292453316/

https://www.meetup.com/futureofdata-newyork/

https://www.meetup.com/futureofdata-philadelphia/

This is Issue #80

https://github.com/tspannhw/FLiPStackWeekly

https://www.linkedin.com/pulse/schedule-2023-tim-spann-/

Downloads

CEM MiNiFi Java 1.23.04 with FIPS support, Access through Load Balancer / Proxy

https://docs.cloudera.com/cem/1.5.1/release-notes-minifi-java/topics/cem-minifi-java-download-locations.html

https://docs.cloudera.com/cem/1.5.1/release-notes/topics/cem-download-locations.html

https://docs.cloudera.com/cem/1.5.1/release-notes-minifi-java/topics/cem-minifi-java-agent-updates.html

https://docs.cloudera.com/cem/1.5.1/release-notes/topics/cem-whats-new.html

Videos

https://www.youtube.com/watch?v=Bt40Qx0d7qA

https://www.youtube.com/watch?v=mNyLx9tuJbw&t=1s

https://www.youtube.com/watch?v=Ws7YmAHE1O8

https://www.youtube.com/watch?v=Z3XrYeh-QMA

Articles

https://blog.cloudera.com/using-dead-letter-queues-with-sql-stream-builder/

https://hazelcast.com/blog/the-power-of-the-hazelcast-community-a-recap-of-the-real-time-stream-processing-unconference/

https://www.inrhythm.com/apache-integrations/

https://streamnative.io/podcasts/learning-streaming-toolset-apache-pulsar-flink-nifi-ep4-of-crossing-the-streams

https://www.cloudera.com/about/customers/major-airline.html?utm_medium=email&utm_source=nurture&keyplay=SEC&utm_campaign=FY22-Q1_NU_AMER_Major_Airline_CS_SPN_2021-02-15&cid=UNGATED&lid=ale326x7ls8d

https://www.bloomberg.com/company/stories/bloomberg-publishes-pystack-debugging-tool-python/

https://blogs.nvidia.com/blog/2023/04/25/ai-chatbot-guardrails-nemo/

Recent Talks

https://www.slideshare.net/bunkertor/warsawitdays-apachenifi202

Documentation

Events

https://www.youtube.com/watch?v=Ws7YmAHE1O8

https://www.cloudera.com/about/events/evolve.html

https://web.cvent.com/event/7598f981-2f7e-4915-b662-bd7be9b5f48d/summary?RefId=homepage_impact24

April 24-26, 2023: Real-Time Analytics Summit: San Francisco, CA. In-Person. https://rtasummit.com/

April 25, 2023: Future of Data Meetup: San Francisco, CA. In-Person. https://www.meetup.com/futureofdata-princeton/ https://www.meetup.com/futureofdata-sanfrancisco/events/292453316/

May 3, 2023: Meet the Committers. Virtual https://attend.cloudera.com/nificommitters0503

May 3-10, 2023: Special Once in a Lifetime Event. Virtual.

May 9, 2023: Garden State Java User Group. In-Person. New Jersey https://gsjug.org/. Modern Data Streaming Pipelines with Java, NiFi, Flink, Kafka. https://gsjug.org/meetings/2023/may2023.html

May 10-12, 2023: Open Source Summit North America. Virtual https://events.linuxfoundation.org/open-source-summit-north-america/

May 17-18, 2023: IBM Event. Raleigh, NC.

May 23, 2023: Pulsar Summit Europe. Virtual https://pulsar-summit.org/

May 24-25, 2023: Big Data Fest. Virtual. https://sessionize.com/big-data-fest-by-softserve/

June 26-28, 2023: NLIT Summit. Milwaukee. https://www.fbcinc.com/e/nlit/default.aspx

June 28, 2023: NiFi Meetup. Milwaukee and Hybrid. https://www.meetup.com/futureofdata-princeton/events/292976004/

Cloudera Now - Virtual

July 19, 2023: 2-Hours to Data Innovation: Data Flow https://www.cloudera.com/about/events/hands-on-lab-series-2-hours-to-data-innovation.html

October 18, 2023: 2-Hours to Data Innovation: Data Flow https://www.cloudera.com/about/events/hands-on-lab-series-2-hours-to-data-innovation.html

Cloudera Events https://www.cloudera.com/about/events.html

More Events: https://www.linkedin.com/pulse/schedule-2023-tim-spann-/

Code

My Apache Tika processor was added to mainstream NiFi. https://issues.apache.org/jira/browse/NIFI-9647

Tools

https://github.com/Vision-CAIR/MiniGPT-4

https://github.com/h2oai/h2ogpt

https://stability.ai/blog/stability-ai-launches-the-first-of-its-stablelm-suite-of-language-models

https://github.com/h2oai/h2o-llmstudio

https://github.com/suno-ai/bark

https://github.com/schibsted/jslt

https://www.garshol.priv.no/jslt-demo

https://github.com/ninja-ide/ninja-ide

© 2020-2023 Tim Spann

FLiPN-FLaNK Stack Weekly for 17 April 2023

 

17-April-2023

FLiPN-FLaNK Stack Weekly

Tim Spann @PaaSDev

Real-Time Analytics Summit!

StarTree #RTASummit #ApacheNiFi #ApacheFlink #ApacheKafka #ApachePulsar #ApachePinot #IoT All The Things Open Source.

Join me soon in San Francisco. Use code Spann30 for 30% off registration. Real-Time Analytics Summit!

https://rtasummit.com/

https://www.linkedin.com/feed/update/urn:li:activity:7052315567994605568?utm_source=share&utm_medium=member_desktop

April 25-26, 2023!

April 25, 2023!

https://www.meetup.com/futureofdata-sanfrancisco/events/292453316/

May 3, 2023! Join me and the NiFi creators! https://attend.cloudera.com/nificommitters0503?internal_keyplay=data-flow&internal_campaign=FY24-Q2_Webinar_Cloudera_AMER_NiFi_Meet_the_Committers&cid=7012H000001ZNXBQA4&internal_link=p07

CODE + COMMUNITY

Please join my meetup group NJ/NYC/Philly/Virtual.

http://www.meetup.com/futureofdata-princeton/

https://www.meetup.com/futureofdata-sanfrancisco/events/292453316/

https://www.meetup.com/futureofdata-newyork/

https://www.meetup.com/futureofdata-philadelphia/

This is Issue #79

https://github.com/tspannhw/FLiPStackWeekly

https://www.linkedin.com/pulse/schedule-2023-tim-spann-/

New Releases

Cloudera Flow Management (CFM) 2.1.5 SP1 for CDP 7.1.7 & CDP 7.1.8

  • PutIceberg is GA
  • ListenGRPC, ConvertProtobuf for gRPC with custom schemas
  • EncodeContent, DecryptContent, DecryptContentCompatibility
  • GetAsanaObject
  • GetJiraIssue
  • JSLTTransformJSON
  • PutBoxFile, PutDropbox, PutGoogleDrive
  • PutIoTDBRecord
  • PutRedisHashRecord (Technical Preview)
  • PutSalesforceObject
  • TriggerHiveMetaStoreEvent
  • UpdateDeltaLakeTable (Technical Preview)

https://docs.cloudera.com/cfm/2.1.5/release-notes/topics/cfm-whats-new.html

https://docs.cloudera.com/cfm/2.1.5/release-notes/topics/cfm-download-locations.html

Videos

https://www.youtube.com/watch?v=mNyLx9tuJbw&ab_channel=Cloudera%2CInc

https://www.youtube.com/watch?v=Ws7YmAHE1O8&t=15s

https://youtube.com/shorts/z1S0rs6Pa6s?feature=share

https://www.youtube.com/watch?v=YJHrTgPjX5M&ab_channel=SpringDeveloper

https://www.youtube.com/watch?v=GncQz8HZr68&ab_channel=Hazelcast

Articles

https://medium.com/geekculture/stop-doing-this-on-chatgpt-get-ahead-99-users-ai-artificial-intelligence-productivity-prompt-engineering-4-f3441bf7a25a

https://janethl.medium.com/how-to-build-a-sustainable-developer-community-a-5-phase-framework-f1f38a06e744

https://medium.com/illumination/7-secret-websites-you-should-know-but-no-one-told-you-about-aee59b8f5aeb

https://dev.to/tspannhw/one-minute-nifi-tip-calcite-sql-notes-561

Recent Talks

Build ML Enhanced Event Streaming Apps with Python Microservices | Tim Spann | Conf42 Python 2023 https://www.youtube.com/watch?v=ptjRobC1FSw&ab_channel=Conf42

https://www.slideshare.net/bunkertor/conf42-python-ml-enhanced-event-streaming-apps-with-python-microservices

Documentation

https://docs.cloudera.com/dataflow/cloud/aws-lambda-functions/topics/cdf-create-aws-lambda-function.html

Events

https://www.cloudera.com/about/events/evolve.html

https://web.cvent.com/event/7598f981-2f7e-4915-b662-bd7be9b5f48d/summary?RefId=homepage_impact24

April 24-26, 2023: Real-Time Analytics Summit: San Francisco, CA. In-Person. https://rtasummit.com/

April 25, 2023: Future of Data Meetup: San Francisco, CA. In-Person. https://www.meetup.com/futureofdata-princeton/ https://www.meetup.com/futureofdata-sanfrancisco/events/292453316/

May 3, 2023: Meet the Committers. Virtual https://attend.cloudera.com/nificommitters0503

May 3-10, 2023: Special Once in a Lifetime Event. Virtual.

May 9, 2023: Garden State Java User Group. In-Person. New Jersey https://gsjug.org/. Modern Data Streaming Pipelines with Java, NiFi, Flink, Kafka. https://gsjug.org/meetings/2023/may2023.html

May 10-12, 2023: Open Source Summit North America. Virtual https://events.linuxfoundation.org/open-source-summit-north-america/

May 23, 2023: Pulsar Summit Europe. Virtual https://pulsar-summit.org/

Cloudera Events https://www.cloudera.com/about/events.html

More Events: https://www.linkedin.com/pulse/schedule-2023-tim-spann-/

Cloudera Now

Code

Tools

https://github.com/facebookresearch/segment-anything

https://github.com/Torantulino/Auto-GPT

https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-chat

https://www.equalto.com/markup/#

https://github.com/ultralytics/ultralytics

https://github.com/mikel-brostrom/yolov8_tracking

https://www.stef.be/dpaint/

https://github.com/soulteary/docker-prompt-generator

https://github.com/nomic-ai/gpt4all

https://github.com/Priler/jarvis

https://github.com/facebookresearch/AnimatedDrawings

https://github.com/xtekky/openai-gpt4

© 2023 Tim Spann