Showing posts with label apache nifi. Show all posts

FLiP-FLaNK Stack Weekly for 20 February 2023

FLiP-FLaNK Stack Weekly 20-February-2023

#apachepulsar #apacheflink #apacenifi #apachekafka

20-February-2023

FLiPN-FLaNK Stack Weekly

Welcome to the seventh newsletter of 2023. Getting closer...

Tim Spann @PaaSDev

Happy President's Day.

The new stuff in NiFi 1.20 is incredible, what is coming is unreal. I got some secret early looks at upcoming stuff and I can't wait to show at some talks in March.

https://www.timswarmercabel.com/

Upcoming talk at Spring / VMWare / Tanzu

https://tanzu.vmware.com/developer/tv/golden-path/6/

https://www.youtube.com/@VMwareTanzu

type="application/x-shockwave-flash"
wmode="transparent" width="425" height="350" />

https://youtu.be/f2HsqchP2_A

PODCAST

New podcast coming.

CODE + COMMUNITY

Join my meetup group NJ/NYC/Philly/Virtual.

https://www.meetup.com/new-york-city-apache-pulsar-meetup/

https://www.meetup.com/futureofdata-princeton/

This is Issue #71!!

https://github.com/tspannhw/FLiPStackWeekly

https://www.linkedin.com/pulse/schedule-2023-tim-spann-/

Meetup

http://www.meetup.com/futureofdata-princeton/

https://www.meetup.com/phillyjug/events/291103971/

New

Articles

https://blog.cloudera.com/getting-started-with-cloudera-stream-processing-community-edition/

https://www.kschaul.com/post/2023/02/16/how-the-post-is-replacing-mapbox-with-open-source-solutions/

https://policylab.rutgers.edu/report-release-garden-state-open-data-index/

https://hubertdulay.substack.com/p/stream-processing-vs-real-time-olap

Events

Feb 17, 2023: Spring One: Virtual and VoD
https://tanzu.vmware.com/developer/tv/golden-path/6/

Feb 21, 2023: Summit for Java Dev: Virtual
https://geekle.us/schedule/java23

Feb 23, 2023: Pulsar Meetup - Rising Wave + Pulsar: Virtual
https://www.meetup.com/new-york-city-apache-pulsar-meetup/events/291048765/

March 3, 2023: Spring One: Virtual
https://tanzu.vmware.com/developer/tv/

March 8, 2023: Cloudera Now: Virtual
https://www.cloudera.com/about/events/cloudera-now-cdp.html

March 9, 2023: Hazelcast Unconference: Virtual
https://hazelcast.com/lp/unconference/

March 15, 2023: Philly JUG Meetup: Philly
[https://www.meetup.com/phillyjug/

March 16, 2023: Python Web Conference: Virtual
https://2023.pythonwebconf.com/

March 17, 2023: TCF Pro: Trenton, NJ
https://princetonacm.acm.org/tcfpro/profiles/timothy-spann.html

March 30, 2023: Pulsar Meetup - Flink: Virtual
https://www.meetup.com/new-york-city-apache-pulsar-meetup/events/290459862/

April 4-6, 2023: DevNexus: Atlanta, GA
https://devnexus.com/

April 24-26, 2023: Real-Time Analytics Summit: San Francisco, CA
https://rtasummit.com/

May 24-25, 2023: Infoshare: Gdansk, Poland
https://infoshare.pl/conference

More Events:
https://www.linkedin.com/pulse/schedule-2023-tim-spann-/

Tools

Devices

https://github.com/tspannhw/Flow-SGP30-MLX90640/blob/main/README.md

FLiP Stack Weekly for 13 February 2023

13-February-2023

FLiP Stack Weekly

Welcome to the sixth newsletter of 2023. Some exciting news is coming in a few weeks.

Tim Spann @PaaSDev

Happy Super Bowl!

News!

Apache NiFi updated to 1.20! Lots of new features and processors for Box, Google Drive and more.

https://cwiki.apache.org/confluence/display/NIFI/Release+Notes#ReleaseNotes-Version1.20.0

New demo will be coming around this site.

https://www.timswarmercabel.com/

Upcoming talk at Spring / VMWare / Tanzu

https://tanzu.vmware.com/developer/tv/golden-path/6/

https://www.youtube.com/@VMwareTanzu

type="application/x-shockwave-flash"
wmode="transparent" width="425" height="350" />

https://youtu.be/f2HsqchP2_A

PODCAST

New podcast coming.

CODE + COMMUNITY

Join my meetup group NJ/NYC/Philly/Virtual.

https://www.meetup.com/new-york-city-apache-pulsar-meetup/

https://www.meetup.com/futureofdata-princeton/

This is Issue #70!!

https://github.com/tspannhw/FLiPStackWeekly

https://www.linkedin.com/pulse/schedule-2023-tim-spann-/

Meetup

http://www.meetup.com/futureofdata-princeton/

https://www.meetup.com/phillyjug/events/291103971/

New

Articles

https://dzone.com/articles/migration-from-amazon-sqs-and-kinesis-to-apache-ka

https://dzone.com/articles/top-three-docker-alternatives-to-consider

https://asyncq.com/how-to-make-unique-id-generator-microservice-using-spring-boot

[https://pratikbarjatya.medium.com/building-data-ingestion-system-using-apache-nifi-76e90765ac43])(https://pratikbarjatya.medium.com/building-data-ingestion-system-using-apache-nifi-76e90765ac43_

https://clickhouse.com/docs/en/integrations/nifi-and-clickhouse/

https://medium.com/@wesleynitikromo/open-source-data-tools-will-provide-a-way-out-of-your-overpriced-cloud-tools-4041bc394eb4

https://aiven.io/blog/why-you-should-think-about-moving-analytics-from-batch-to-real-time

https://www.morling.dev/blog/kafka-where-art-thou/

https://yesidays.medium.com/why-parquet-files-are-the-key-to-unlocking-big-data-analytics-fa2cf37b8b82

Events

Feb 15, 2023: Scylla Summit. Virtual
https://www.scylladb.com/scylladb-summit-2023/

Feb 16, 2023: Spring One: Virtual
https://tanzu.vmware.com/developer/tv/golden-path/6/

Feb 21, 2023: Summit for Java Dev: Virtual
https://geekle.us/schedule/java23

Feb 23, 2023: Pulsar Meetup - Rising Wave + Pulsar: Virtual
https://www.meetup.com/new-york-city-apache-pulsar-meetup/events/291048765/

March 3, 2023: Spring One: Virtual
https://tanzu.vmware.com/developer/tv/

March 8, 2023: Cloudera Now: Virtual
https://www.cloudera.com/about/events/cloudera-now-cdp.html

March 13-17, 2023: Python Web Conference: Virtual
https://2023.pythonwebconf.com/

March 15, 2023: Philly JUG Meetup: Philly
[https://www.meetup.com/phillyjug/

March 30, 2023: Pulsar Meetup - Flink: Virtual
https://www.meetup.com/new-york-city-apache-pulsar-meetup/events/290459862/

April 4-6, 2023: DevNexus: Atlanta, GA
https://devnexus.com/

More Events:
https://www.linkedin.com/pulse/schedule-2023-tim-spann-/

Tools

Devices

https://github.com/tspannhw/Flow-SGP30-MLX90640/blob/main/README.md

Building Modern Data Streaming Apps - Pulsar Summit Asia 2022

Autoscaling Apache NiFi with Data Flow Experience on Kubernetes (K8) on AWS

https://www.clouddataops.dev/data-flow-experience

Technical Preview - Cloudera DataFlow

Import Flows Build in CDP Datahub Flow Management

https://docs.cloudera.com/dataflow/cloud/quick-start/topics/cdf-qs-definition.html

Deploy Flows

https://docs.cloudera.com/dataflow/cloud/quick-start/topics/cdf-qs-deploy.html

Import a Quick Flow

https://docs.cloudera.com/dataflow/cloud/qs-flow-definitions/topics/cdf-import-quick-flow.html

Monitoring

https://docs.cloudera.com/dataflow/cloud/kpi-overview/topics/cdf-introduction-to-kpi.html

Top DataFlow Resources

https://docs.cloudera.com/dataflow/cloud/index.html
https://docs.cloudera.com/dataflow/cloud/overview/topics/cdf-overview.html
https://docs.cloudera.com/dataflow/cloud/overview/topics/cdf-architecture.html
https://docs.cloudera.com/dataflow/cloud/enable-environment/topics/cdf-enable-your-environemnt.html
https://docs.cloudera.com/dataflow/cloud/manage-environment/topics/cdf-download-kube-config.html
https://docs.cloudera.com/dataflow/cloud/bp-flow-definition/topics/cdf-flow-isolation.html
https://docs.cloudera.com/dataflow/cloud/deploy-flows/topics/cdf-deploy-flow.html
https://blog.cloudera.com/top-three-requirements-for-data-flows/
https://www.cloudera.com/about/events/cloudera-now-cdp.html

Example Flows

https://github.com/tspannhw/DataFlowExperience

Using Cloudera Flow Management (Powered by Apache NiFi) to Track Cloud Services

See:

https://community.cloudera.com/t5/Community-Articles/Using-Cloudera-Flow-Management-To-Ingest-and-Process-RSS/ta-p/313074

FLaNK: Using Apache Kudu As a Cache For FDA Updates

InvokeHTTP: We invoke the RSS feed to get our data.
QueryRecord: Convert RSS to JSON
SplitJson: Split one file into individual records. (Should refactor to ForkRecord)
EvaluateJSONPath: Extract attributes you need.
ProcessGroup for SPL Processing

We call and check the RSS feed frequently and we parse out the records from the RSS(XML) feed and check against our cache. We use Cloudera's Real-Time Datahub with Apache Impala/Apache Kudu as our cache to see if we already received a record for that. If it hasn't arrived yet, it will be added to the Kudu cache and processed.

SPL Processing

We use the extracted SPL to grab a detailed record for that SPL using InvokeHTTP.

We have created an Impala/Kudu table to cache this information with Apache Hue.

We use the LookupRecord to read from our cache.

If we don't have that value yet, we send it to the table for an UPSERT.

We send our raw data as XML/RSS to HDFS for archival use and audits.

We can see the results of our flow in Apache Atlas to see full governance of our solution.

So with some simple NiFi flow, we can ingest all the new updates to DailyMed and not reprocess anything that we already have.

Resources

Apache NiFi 1.12 Released! 18-August-2020

Release Notes

https://cwiki.apache.org/confluence/display/NIFI/Release+Notes#ReleaseNotes-Version1.12.0

Issues

https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12316020&version=12346778

Major Feature List

https://twitter.com/pvillard31/status/1296469452180119553

Release Date: August 18, 2020.

Major Features:

New processor to write scripted record transforms live in the flow (ScriptedTransformRecord)
Expose a REST Endpoint for easy metric scraping by Prometheus
Ability to specify group level flow file concurrency - for instance run a single flow file end to end for traditional job handling
Improved several capabilities related to Azure service interaction including ADLS Gen2
Improved AMQP and MQTT support as well as JMS improvements
Support for latest Kafka 2.6 clients
Search UI Improvements

I will be posting a few demos and test drives soon.

ScriptedTransformRecord

https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-scripting-nar/1.12.0/org.apache.nifi.processors.script.ScriptedTransformRecord/index.html

Did the user really ask for Exactly Once? Fault Tolerance

Exactly Once Requirements

It is very tricky and can cause performance degradation, if your user could just use at least once, then always go with that. Having data sinks like Kudu where you can do an upsert makes exactly once less needed.

https://docs.cloudera.com/csa/1.2.0/datastream-connectors/topics/csa-kafka.html

Apache Flink, Apache NiFi Stateless and Apache Kafka can participate in that.

For CDF Stream Processing and Analytics with Apache Flink 1.10 Streaming:

Both Kafka sources and sinks can be used with exactly once processing guarantees when checkpointing is enabled.

End-to-End Guaranteed Exactly-Once Record Delivery

The Data Source and Data Sink to need to support exactly-once state semantics and take part in checkpointing.

Data Sources

Apache Kafka - must have Exactly-Once selected, transactions enabled and correct driver.

Select: Semantic.EXACTLY_ONCE

Data Sinks

HDFS BucketingSink
Apache Kafka

For Kafka, please check the timeouts sync up to checkpoints. https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/connectors/kafka.html#kafka-producers-and-fault-tolerance

Reference

Sizing Your Apache NiFi Cluster For Production Workloads

Cloudera Flow Management provides an enterprise edition of support Apache NiFi managed by Cloudera Manager. The official documentation provides a great guide for sizing your cluster.

https://docs.cloudera.com/cfm/2.0.1/nifi-sizing/topics/cfm-nifi-sizing.html

If the use case fits, NiFi Stateless Engine may fit and perform better utilizing no disk.

Check out that heap usage and utilization, you may need to increase. 24-32 Gigabytes of RAM is a nice sweet spot for most instances.

Check out how your nodes, threads and queues are doing. If queue is not processing fast or thread count is high, you may need more cores, RAM or nodes.

When you are managing your cluster in Cloudera Manager, make sure you increase the default JVM memory for Apache NiFi. 512MB is not going to cut it for anything but single user development.

Do this correctly and process a billion events!!! https://blog.cloudera.com/benchmarking-nifi-performance-and-scalability/. - Notice the hardware and performance sections of that article

General tips:

Make sure you use SSD for Provenance and other repositories. Faster disk, happier user. https://docs.cloudera.com/cfm/2.0.1/nifi-sizing/topics/cfm-sizing-disk-configuration.html

Monitor your flows to see how much resources you need: https://www.datainmotion.dev/2020/07/report-on-this-apache-nifi-1114-monitor.html.

Three or more nodes plus external Zookeepers. https://docs.cloudera.com/cfm/2.0.1/nifi-sizing/topics/cfm-sizing-cluster-layout.html

Use Records, if it's semistructured GrokReader can help. https://www.nifi.rocks/record-path-cheat-sheet/ If it's CSV, JSON, XML, Parquet, Logs then use Readers and writers. They are much faster, easier and cleaner.

Good clean code is better than spaghetti. https://docs.cloudera.com/cfm/2.0.1/nifi-tuning/topics/cfm-tuning-your-data-flow.html. Saw no to bad coding: https://www.datainmotion.dev/2020/06/no-more-spaghetti-flows.html.

Recommendations: https://docs.cloudera.com/cfm/2.0.1/nifi-sizing/topics/cfm-sizing-recommendations.html

Minimize use of CPU or Memory intensive processors (or make a not of them during sizing): https://docs.cloudera.com/cfm/2.0.1/nifi-sizing/topics/cfm-sizing-resource-intensive-processors.html

There are a few decisions to make on repositories, talk to your Cloudera friends. https://docs.cloudera.com/HDPDocuments/HDF3/HDF-3.5.1/nifi-configuration-best-practices/content/configuration-best-practices.html

Report on This: Apache NiFi 1.11.4 - Monitor All The Things

The easiest way to grab monitoring data is via the NiFi REST API. Also everything in the NiFi UI is done through REST calls which you can call programmatically. Please read the NiFi docs they are linked directly from your running NiFi application or on the web. They are very thorough and have all the information you could want: https://nifi.apache.org/docs/nifi-docs/. If you are not running NiFi 1.11.4, I recommend you please upgrade. This is supported by Cloudera on multiple platforms.

NiFi Rest API

https://nifi.apache.org/docs/nifi-docs/rest-api/

There's also an awesome Python wrapper for that REST API: https://pypi.org/project/nipyapi/

Also in NiFi flow programming, every time you produce data to Kafka you get metadata back in FlowFile Attributes. You can push those attributes directly to a kafka topic if you want.

So after your PublishKafkaRecord_2_0 1.11.4 so for success read the attributes on # of record and other data then AttributesToJson and push to another topic. you may want a mergerecord in there to aggregate a few of those together.

If you are interested in Kafka metrics/record counts/monitoring then you must use Cloudera Streams Messaging Manager, it provides a full Web UI, Monitoring Tool, Alerts, REST API and everything you need for monitoring every producer, consumer, broker, cluster, topic, message, offset and Kafka component.

The best way to get NiFi stats is to use the NiFi Reporting Tasks, I like the SQL Reporting task.

SQL Reporting Tasks are very powerful and use standard SELECT * FROM JVM_METRICS style reporting, see my article:

https://www.datainmotion.dev/2020/04/sql-reporting-task-for-cloudera-flow.html

Monitoring Articles

https://www.datainmotion.dev/2019/04/monitoring-number-of-of-flow-files.html

https://www.datainmotion.dev/2019/03/apache-nifi-operations-and-monitoring.html

Other Resources

https://www.datainmotion.dev/2019/10/migrating-apache-flume-flows-to-apache_9.html

https://www.datainmotion.dev/2019/08/using-cloudera-streams-messaging.html

https://dev.to/tspannhw/apache-nifi-and-nifi-registry-administration-3c92

https://dev.to/tspannhw/using-nifi-cli-to-restore-nifi-flows-from-backups-18p9

https://nifi.apache.org/docs/nifi-docs/html/toolkit-guide.html

https://www.datainmotion.dev/p/links.html

https://www.tutorialspoint.com/apache_nifi/apache_nifi_monitoring.htm

https://community.cloudera.com/t5/Community-Articles/Building-a-Custom-Apache-NiFi-Operations-Dashboard-Part-1/ta-p/249060

https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-metrics-reporting-nar/1.11.4/org.apache.nifi.metrics.reporting.task.MetricsReportingTask/

https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-scripting-nar/1.11.4/org.apache.nifi.reporting.script.ScriptedReportingTask/index.html

Ingesting All The Weather Data With Apache NiFi

Ingesting All The Weather Data With Apache NiFi

Step By Step NiFi Flow

GenerateFlowFile - build a schedule matching when NOAA updates weather
InvokeHTTP - download all weather ZIP
CompressContent - decompress ZIP
UnpackContent - extract files from ZIP
*RouteOnAttribute - just give us ones that are airports (${filename:startsWith('K')}). optional.
*QueryRecord - XMLReader to JsonRecordSetWriter. Query: SELECT * FROM FLOWFILE WHERE NOT location LIKE '%Unknown%'. This is to remove some locations that are not identified. optional.
Send it somewhere for storage. Could put PutKudu, PutORC, PutHDFS, PutHiveStreaming, PutHbaseRecord, PutDatabaseRecord, PublishKafkaRecord2* or others.

URL For All US Data

invokehttp.request.url

https://w1.weather.gov/xml/current_obs/all_xml.zip

Example Record As Converted JSON

[ {

"credit" : "NOAA's National Weather Service",

"credit_URL" : "http://weather.gov/",

"image" : {

"url" : "http://weather.gov/images/xml_logo.gif",

"title" : "NOAA's National Weather Service",

"link" : "http://weather.gov"

"suggested_pickup" : "15 minutes after the hour",

"suggested_pickup_period" : 60,

"location" : "Stanley Municipal Airport, ND",

"station_id" : "K08D",

"latitude" : 48.3008,

"longitude" : -102.4064,

"observation_time" : "Last Updated on Jul 10 2020, 9:55 am CDT",

"observation_time_rfc822" : "Fri, 10 Jul 2020 09:55:00 -0500",

"weather" : "Fair",

"temperature_string" : "66.0 F (19.0 C)",

"temp_f" : 66.0,

"temp_c" : 19.0,

"relative_humidity" : 83,

"wind_string" : "South at 6.9 MPH (6 KT)",

"wind_dir" : "South",

"wind_degrees" : 180,

"wind_mph" : 6.9,

"wind_kt" : 6,

"pressure_in" : 30.03,

"dewpoint_string" : "60.8 F (16.0 C)",

"dewpoint_f" : 60.8,

"dewpoint_c" : 16.0,

"visibility_mi" : 10.0,

"icon_url_base" : "http://forecast.weather.gov/images/wtf/small/",

"two_day_history_url" : "http://www.weather.gov/data/obhistory/K08D.html",

"icon_url_name" : "skc.png",

"ob_url" : "http://www.weather.gov/data/METAR/K08D.1.txt",

"disclaimer_url" : "http://weather.gov/disclaimer.html",

"copyright_url" : "http://weather.gov/disclaimer.html",

"privacy_policy_url" : "http://weather.gov/notice.html"

} ]

Source Code

https://github.com/tspannhw/ClouderaFlowManagementWorkshop/tree/main/flows

Resources