- https://github.com/linkedin/datahub
- https://github.com/amundsen-io/amundsen
- https://flyte.org/
- https://eng.uber.com/fiberdistributed/
- https://petastorm.readthedocs.io/en/latest/readme_include.html
- https://github.com/Netflix/metacat
- https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cm_dg_chart_time_series_data.html#cmug_topic_11
- https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cm_metrics_kafka_broker.html
- https://developers.redhat.com/blog/2020/08/24/java-development-on-top-of-kubernetes-using-eclipse-jkube/
- https://github.com/confluentinc/demo-scene/blob/master/community-components-only/docker-compose.yml#L22-L51
- https://javadoc.io/
Data and Streaming News September 2020
Apache NiFi 1.12 Released! 18-August-2020
Release Notes
https://cwiki.apache.org/confluence/display/NIFI/Release+Notes#ReleaseNotes-Version1.12.0
Issues
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12316020&version=12346778
Major Feature List
https://twitter.com/pvillard31/status/1296469452180119553
Release Date: August 18, 2020.
Major Features:
- New processor to write scripted record transforms live in the flow (ScriptedTransformRecord)
- Expose a REST Endpoint for easy metric scraping by Prometheus
- Ability to specify group level flow file concurrency - for instance run a single flow file end to end for traditional job handling
- Improved several capabilities related to Azure service interaction including ADLS Gen2
- Improved AMQP and MQTT support as well as JMS improvements
- Support for latest Kafka 2.6 clients
- Search UI Improvements
I will be posting a few demos and test drives soon.
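As a rough illustration of scraping that metrics endpoint without a Prometheus server, here is a minimal Python sketch using only the standard library. The path `/nifi-api/flow/metrics/prometheus` is my assumption of where NiFi 1.12 exposes the metrics, so check the REST API docs for your version. The base URL, the metric names in `SAMPLE`, and the helper names are hypothetical, and the parser assumes samples carry no trailing timestamp.

```python
from typing import Dict
from urllib.request import urlopen

# Assumed path for the NiFi 1.12 Prometheus endpoint; verify against your REST API docs.
METRICS_PATH = "/nifi-api/flow/metrics/prometheus"


def fetch_metrics_text(base_url: str, timeout: float = 10.0) -> str:
    """Download the raw Prometheus text from an unsecured NiFi node (hypothetical base_url)."""
    with urlopen(base_url.rstrip("/") + METRICS_PATH, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_prometheus_text(body: str) -> Dict[str, float]:
    """Turn Prometheus text exposition lines into {series: value}."""
    samples: Dict[str, float] = {}
    for line in body.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated token (no timestamps assumed).
        series, _, value = line.rpartition(" ")
        samples[series.strip()] = float(value)
    return samples


# Hypothetical sample payload; real NiFi metric names will differ.
SAMPLE = """\
# HELP nifi_example_flowfiles_queued Hypothetical metric name for illustration
# TYPE nifi_example_flowfiles_queued gauge
nifi_example_flowfiles_queued{component="root"} 42.0
nifi_example_bytes_queued{component="root"} 1024.0
"""

if __name__ == "__main__":
    for series, value in parse_prometheus_text(SAMPLE).items():
        print(series, value)
```

Against a live node you would call `parse_prometheus_text(fetch_metrics_text("http://localhost:8080"))`, assuming NiFi is running unsecured on that port.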
ScriptedTransformRecord
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-scripting-nar/1.12.0/org.apache.nifi.processors.script.ScriptedTransformRecord/index.html
Deleting Schemas From Cloudera Schema Registry
Did the user really ask for Exactly Once? Fault Tolerance
Exactly Once Requirements
- Apache Kafka - must have Exactly-Once selected, transactions enabled and correct driver.
- HDFS BucketingSink
- Apache Kafka
Reference
FLaNK in the Cloud!!!! Huge Cloudera Data Platform Public Cloud Updates - July 2020 - Data Flow Releases
- Data source reading from Kafka
- Data sinks writing to Kafka, HBase and Kudu
- Apache Atlas integration
- SQL/Table API and SQL Client
- Table connectors
  - Kafka
  - Kudu
  - Hive (through catalog)
Phoenix / HBase / NiFi Resources
Tutorial
Excellent tutorial, step-by-step
Videos
- Collections-Data Hub library of videos
- Collections-Data Flow / Streaming library of videos
Blogs
- Building a Scalable Process Using NiFi, Kafka and HBase on CDP
- Overview of the Operational Database performance in CDP
Meetup
Other
- Have a question? Join Cloudera Community
- Cloudera Data Hub documentation
- Operational Database product information
Sizing Your Apache NiFi Cluster For Production Workloads
Report on This: Apache NiFi 1.11.4 - Monitor All The Things
The easiest way to grab monitoring data is via the NiFi REST API. Everything in the NiFi UI is done through REST calls, which you can also call programmatically. Please read the NiFi docs; they are linked directly from your running NiFi application and are also on the web. They are very thorough and have all the information you could want: https://nifi.apache.org/docs/nifi-docs/. If you are not running NiFi 1.11.4, I recommend you upgrade. This is supported by Cloudera on multiple platforms.
NiFi Rest API
https://nifi.apache.org/docs/nifi-docs/rest-api/
There's also an awesome Python wrapper for that REST API: https://pypi.org/project/nipyapi/
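If you just want a quick look before installing the wrapper, here is a minimal Python sketch that calls the REST API with the standard library only. It assumes an unsecured NiFi at a hypothetical base URL and reads the `controllerStatus` object returned by `GET /nifi-api/flow/status`. The helper names are my own and the summary keeps only a few of the fields.

```python
import json
from typing import Any, Dict
from urllib.request import urlopen


def fetch_flow_status(base_url: str, timeout: float = 10.0) -> Dict[str, Any]:
    """GET /nifi-api/flow/status from an unsecured NiFi (base_url is hypothetical)."""
    with urlopen(base_url.rstrip("/") + "/nifi-api/flow/status", timeout=timeout) as resp:
        return json.load(resp)


def summarize_flow_status(payload: Dict[str, Any]) -> Dict[str, int]:
    """Pick a few headline numbers out of the controllerStatus object."""
    status = payload.get("controllerStatus", {})
    return {
        "active_threads": status.get("activeThreadCount", 0),
        "flowfiles_queued": status.get("flowFilesQueued", 0),
        "bytes_queued": status.get("bytesQueued", 0),
    }


if __name__ == "__main__":
    # Hand-written payload in the expected shape, so this runs without a NiFi instance.
    example = {"controllerStatus": {"activeThreadCount": 2,
                                    "flowFilesQueued": 5,
                                    "bytesQueued": 2048}}
    print(summarize_flow_status(example))
```

With a running instance, `summarize_flow_status(fetch_flow_status("http://localhost:8080"))` would return the same shape; a secured cluster needs a token or client certificate, which this sketch does not handle.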
Also, in NiFi flow programming, every time you produce data to Kafka you get metadata back in FlowFile attributes. You can push those attributes directly to a Kafka topic if you want.
So after your PublishKafkaRecord_2_0 processor (NiFi 1.11.4), on the success relationship, read the attributes for the number of records and other data, then use AttributesToJSON and push the result to another topic. You may want a MergeRecord in there to aggregate a few of those together.
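To make the idea concrete outside of NiFi, here is a small Python sketch of what that attributes-to-JSON-then-merge step amounts to. It is not NiFi code: the FlowFile attributes are plain dicts, the function names are hypothetical, and I am assuming `msg.count` is the attribute the publish processor sets on success, so confirm the name in the processor docs.

```python
import json
from typing import Dict, Iterable, List


def attributes_to_json(attributes: Dict[str, str], keep: Iterable[str]) -> str:
    """Roughly what AttributesToJSON does: serialize selected attributes as one JSON object."""
    return json.dumps({k: attributes[k] for k in keep if k in attributes}, sort_keys=True)


def merge_counts(json_docs: List[str]) -> Dict[str, int]:
    """Roughly what a merge step buys you: one summary for several FlowFiles."""
    total = 0
    for doc in json_docs:
        # Attributes are strings in NiFi, so convert before summing.
        total += int(json.loads(doc).get("msg.count", 0))
    return {"flowfiles": len(json_docs), "records_published": total}


if __name__ == "__main__":
    # Hypothetical attribute maps from two FlowFiles routed to success.
    flowfiles = [
        {"filename": "a.json", "msg.count": "10", "other": "x"},
        {"filename": "b.json", "msg.count": "15"},
    ]
    docs = [attributes_to_json(ff, ("filename", "msg.count")) for ff in flowfiles]
    print(docs)
    print(merge_counts(docs))
```

The merged summary is what you would publish to the second, monitoring topic.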
If you are interested in Kafka metrics, record counts or monitoring, then you should use Cloudera Streams Messaging Manager. It provides a full Web UI, monitoring tool, alerts, REST API and everything you need for monitoring every producer, consumer, broker, cluster, topic, message, offset and Kafka component.
The best way to get NiFi stats is to use the NiFi Reporting Tasks. I like the SQL Reporting Task.
SQL Reporting Tasks are very powerful and use standard SQL queries such as SELECT * FROM JVM_METRICS. See my article:
https://www.datainmotion.dev/2020/04/sql-reporting-task-for-cloudera-flow.html
Monitoring Articles
https://www.datainmotion.dev/2019/04/monitoring-number-of-of-flow-files.html
https://www.datainmotion.dev/2019/03/apache-nifi-operations-and-monitoring.html
Other Resources
https://www.datainmotion.dev/2019/10/migrating-apache-flume-flows-to-apache_9.html
https://www.datainmotion.dev/2019/08/using-cloudera-streams-messaging.html
https://dev.to/tspannhw/apache-nifi-and-nifi-registry-administration-3c92
https://dev.to/tspannhw/using-nifi-cli-to-restore-nifi-flows-from-backups-18p9
https://nifi.apache.org/docs/nifi-docs/html/toolkit-guide.html
https://www.datainmotion.dev/p/links.html
https://www.tutorialspoint.com/apache_nifi/apache_nifi_monitoring.htm
Ingesting All The Weather Data With Apache NiFi
- GenerateFlowFile - build a schedule matching when NOAA updates weather
- InvokeHTTP - download all weather ZIP
- CompressContent - decompress ZIP
- UnpackContent - extract files from ZIP
- *RouteOnAttribute - keep only the files that are airports (${filename:startsWith('K')}). Optional.
- *QueryRecord - XMLReader to JsonRecordSetWriter. Query: SELECT * FROM FLOWFILE WHERE NOT location LIKE '%Unknown%'. This removes locations that are not identified. Optional.
- Send it somewhere for storage. Could be PutKudu, PutORC, PutHDFS, PutHiveStreaming, PutHBaseRecord, PutDatabaseRecord, PublishKafkaRecord_2_* or others.
- https://www.datainmotion.dev/2020/05/cloudera-flow-management-101-lets-build.html
- https://www.datainmotion.dev/2019/03/advanced-xml-processing-with-apache.html
- https://www.datainmotion.dev/2020/01/analyzing-wood-burning-stoves-with_23.html
- https://www.datainmotion.dev/2020/01/cloudera-edge2ai-minifi-java-agent-with.html
- https://community.cloudera.com/t5/Community-Articles/Tracking-Air-Quality-with-HDP-and-HDF-Part-1-Apache-NiFi/ta-p/248265
- https://community.cloudera.com/t5/Community-Articles/Part-2-IoT-Augmenting-GPS-Data-with-Weather/ta-p/245685
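The two optional filtering steps in the weather flow above (RouteOnAttribute on the filename and the QueryRecord WHERE clause) can be sketched in plain Python to show what they keep and drop. This is only an illustration of the logic, not NiFi code. The function names and the sample filenames and records are hypothetical.

```python
from typing import Dict, List, Optional


def is_airport_file(filename: str) -> bool:
    """Mirror of ${filename:startsWith('K')}: keep files whose name starts with K."""
    return filename.startswith("K")


def known_locations(records: List[Dict[str, Optional[str]]]) -> List[Dict[str, Optional[str]]]:
    """Mirror of WHERE NOT location LIKE '%Unknown%'.

    In SQL a NULL location makes the predicate NULL, so the row is dropped;
    records with a missing or None location are dropped here for the same reason.
    """
    kept = []
    for record in records:
        location = record.get("location")
        if location is not None and "Unknown" not in location:
            kept.append(record)
    return kept


if __name__ == "__main__":
    # Hypothetical filenames as they might come out of the unpacked ZIP.
    files = ["KEWR.xml", "PHNL.xml", "KJFK.xml"]
    print([f for f in files if is_airport_file(f)])

    # Hypothetical records after the XML has been read.
    records = [
        {"station_id": "KEWR", "location": "Newark Liberty International Airport, NJ"},
        {"station_id": "KXYZ", "location": "Unknown Station"},
        {"station_id": "KABC", "location": None},
    ]
    print(known_locations(records))
```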
Using Cloudera Data Platform with Flow Management and Streams on Azure
Apache NiFi on Azure CDP Data Hub
- Streams Messaging Heavy Duty for AWS
- Streams Messaging Heavy Duty for Azure
- Flow Management Heavy Duty for AWS
- Flow Management Heavy Duty for Azure
- Apache Kafka 2.4.1
- Cloudera Schema Registry 0.8.1
- Cloudera Streams Messaging Manager 2.1.0
- Apache NiFi 1.11.4
- Apache NiFi Registry 0.5.0
NiFi and Kafka are autoconfigured to work with Apache Atlas under our environment's Data Lake SDX. We can browse through the lineage for all the Kafka topics we use.
- https://www.cloudera.com/about/enterprise-data-cloud.html
https://docs.cloudera.com/cdf-datahub/7.2.0/release-notes/topics/cdf-datahub-whats-new.html
https://dzone.com/articles/lets-build-a-simple-ingest-to-cloud-data-warehouse
https://www.datainmotion.dev/2020/06/no-more-spaghetti-flows.html
The Rise of the Mega Edge (FLaNK)
Explore Enterprise Apache Flink with Cloudera Streaming Analytics - CSA 1.2
- Kafka
- Kudu
- Hive (through catalog)
- JSON
- Avro
- CSV
mvn archetype:generate \
-DarchetypeGroupId=org.apache.flink \
-DarchetypeArtifactId=flink-quickstart-java \
-DarchetypeVersion=1.10.0
Using Apache Kafka Using Cloudera Data Platform Data Center 7.1.1
Unboxing the Most Amazing Edge AI Device Part 1 of 3 - NVIDIA Jetson Xavier NX
- https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-xavier-nx/
- https://elinux.org/Jetson_Zoo
- https://ngc.nvidia.com/catalog/containers/nvidia:l4t-ml
- https://www.nvidia.com/en-us/deep-learning-ai/education/?ncid=so-dis-dldlwsd1-72342
- https://developer.nvidia.com/embedded/jetpack
- https://elinux.org/Jetson_Nano#Cameras
- https://developer.nvidia.com/embedded/community/jetson-projects
- https://github.com/neuralet/neuralet/tree/master/applications/smart-distancing
- https://developer.nvidia.com/embedded/downloads
- https://www.jetsonhacks.com/
- https://docs.nvidia.com/jetson/jetpack/introduction/
- https://devblogs.nvidia.com/bringing-cloud-native-agility-to-edge-ai-with-jetson-xavier-nx/
- https://github.com/tspannhw/minifi-jetson-nano
- https://community.cloudera.com/t5/Community-Articles/Edge-Data-Processing-with-Jetson-Nano-Part-3-AI-Integration/ta-p/93642
- https://www.slideshare.net/bunkertor/iot-edge-data-processing-with-nvidia-jetson-nano-oct-3-2019
- https://dzone.com/articles/edge-data-processing-with-jetson-nano
- https://github.com/dusty-nv/jetson-inference/blob/master/python/examples/segnet-camera.py
- https://github.com/tspannhw/nvidiajetsontx1-mxnet/blob/master/classify.py
- https://github.com/tspannhw/ApacheDeepLearning202
- https://github.com/tspannhw/OpenSourceComputerVision
sudo /usr/sbin/nvpmodel -q