
FLaNK Stack For 19 February 2024

 

19-February-2024

Monday Feb 19, 2024 is Presidents Day

FLaNK Stack Weekly

Tim Spann @PaaSDev

https://pebble.is/PaaSDev

https://vimeo.com/flankstack

https://www.youtube.com/@FLaNK-Stack

https://www.threads.net/@tspannhw

https://medium.com/@tspann/subscribe

Get your new Apache NiFi for Dummies!

https://www.cloudera.com/campaign/apache-nifi-for-dummies.html

https://ossinsight.io/analyze/tspannhw

Trial: https://console.us-west-1.cdp.cloudera.com/trial/register.html#/

Building Realtime AI Applications with Apache Flink


CODE + COMMUNITY

Please join my meetup group NJ/NYC/Philly/Virtual.

http://www.meetup.com/futureofdata-princeton/

https://www.meetup.com/futureofdata-newyork/

https://www.meetup.com/futureofdata-philadelphia/


**This is Issue #125**

https://github.com/tspannhw/FLiPStackWeekly

https://www.cloudera.com/solutions/dim-developer.html

Articles

NYC Traffic?? (NiFi, Kafka, Flink) https://medium.com/@tspann/nyc-traffic-are-you-kidding-me-6d3fa853903b

Subways and Transit Updates in Real-Time https://medium.com/@tspann/subways-and-transit-updates-in-real-time-30c104c359ef

Open Source Data Infrastructure Meetup - Feb 2024 https://medium.com/@tspann/open-source-data-infrastructure-meetup-feb-2024-9e8048666828

Catalogs in Flink SQL: A Primer https://www.decodable.co/blog/catalogs-in-flink-sql-a-primer

https://www.wired.com/story/goody-2-worlds-most-responsible-ai-chatbot/

https://lilianweng.github.io/posts/2024-02-05-human-data-quality

https://eugeneyan.com/writing/synthetic/

https://www.alexmolas.com/2024/02/05/a-search-engine-in-80-lines.html?

https://vectorize.io/2024/01/25/openai-text-embedding-3-embedding-models-first-look/

https://txt.cohere.com/aya/

https://www.infoq.com/presentations/virtual-threads-lightweight-concurrency/

https://medium.com/@james.li/how-to-visualise-real-time-order-book-data-and-host-your-own-dashboard-part-1-2-c77aa0fc5f59

https://docs.coinapi.io/how-to-guides/real-time-data-visualization-with-javascript

https://evidentinsights.com/ai-index/

https://fmirkes.github.io/articles/20190827.html

https://blog.dagworks.io/p/using-ipython-jupyter-magic-commands

https://medium.com/practice-in-public/these-words-make-it-obvious-that-your-text-is-written-by-ai-9b04f399d88c

https://technology.amis.nl/big-data-database/apache-nifi-forwarding-http-headers/

https://medium.com/@masreis/text-extraction-and-ocr-with-apache-tika-302464895e5f

Videos

Unlocking Financial Data with Real-Time Pipelines (OSACon 2023) https://www.youtube.com/watch?v=Q7gF7m4yFi4&ab_channel=OSACon

The Never Landing Stream https://www.youtube.com/watch?v=M8Bp0tRGvV0

Tips

https://community.cloudera.com/t5/Support-Questions/Apache-NiFi-to-split-incoming-data-from-a-file-based-on/m-p/220283

February 8, 2024 Meetup

https://www.slideshare.net/slideshows/ny-open-source-data-meetup-feb-8-2024-building-realtime-pipelines-with-flank-a-case-study-with-transit-data/266227433

Events

Feb 2024: Webinar https://www.cloudera.com/about/events/webinars/stay-ahead-of-cyber-threats-by-utilizing-data-in-motion.html?utm_medium=virtual-event&utm_source=resources-module&keyplay=ALL&utm_campaign=FY25-Q1-CorporateWebinar-AMER-cyber-threats&cid=701Hr000001pXCQIA2

Feb 20, 2024: 12-1PM EST. Virtual. Azure Data Tech Groups: DBA Fundamentals Group https://www.meetup.com/dba-fundamentals-group/events/296855261/

Feb 22, 2024: NYC. AI Camp Meetup. https://www.aicamp.ai/event/eventdetails/W2024022214

Feb 28, 2024: NYC. Cloudera Meetup. Flink https://www.meetup.com/futureofdata-princeton/events/298661947/

Feb 29, 2024: Virtual. Conf42 Python. https://www.conf42.com/Python_2024_Tim_Spann_apache_nifi_2_processors

https://www.conf42.com/Python_2024_Karin_Wolok_nifi__kafka_risingwave_iceberg_llm

Soon, 2024: Princeton. TigerLabs New Location. Meetup. GenAI. https://www.meetup.com/applied-generative-artificial-intelligence-applications/

March 15, 2024: TCF Pro. Princeton, NJ. IEEE Information Technology Professional Conference at the Trenton Computer Festival. https://princetonacm.acm.org/tcfpro/

April 2024: XtremeJ 2024. Virtual. https://xtremej.dev/2023/schedule/

April 11, 2024: Conf42 LLM. Virtual. https://www.conf42.com/llms2024

May 8-9, 2024: Data Summit 2024. Boston, MA. https://www.dbta.com/DataSummit/2024/default.aspx

Cloudera Events https://www.cloudera.com/about/events.html

More Events: https://www.linkedin.com/pulse/schedule-2024-tim-spann--y4coe

Code

Models

Tools

© 2020-2024 Tim Spann

Using Cloudera Data Platform with Flow Management and Streams on Azure



Today I'll walk you through using Cloudera Data Platform (CDP) with Flow Management and Streams on Azure. To see a streaming demo video, please join my webinar (or watch it on demand), Streaming Data Pipelines with CDF in Azure. I'll share some additional how-to videos on using Apache NiFi and Apache Kafka in Azure very soon.



Apache NiFi on Azure CDP Data Hub
Sensors to ADLS/HDFS and Kafka




In the process group above, we use QueryRecord to filter the JSON records and keep only those where the temperature in Fahrenheit is over 80 degrees. We then pick out a few attributes from each record and send them to a Slack channel.
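QueryRecord runs SQL against the incoming records, which it exposes as a table named FLOWFILE. Here is a minimal sketch of that kind of filter, using Python's built-in sqlite3 as a stand-in for the processor. The field names (sensor_id, temperature_f, humidity) and the sample readings are assumptions for illustration, not the actual schema from the demo.

```python
import json
import sqlite3

# Hypothetical sensor readings; field names are assumptions for illustration.
records = [
    {"sensor_id": "s1", "temperature_f": 85.2, "humidity": 40},
    {"sensor_id": "s2", "temperature_f": 72.0, "humidity": 55},
    {"sensor_id": "s3", "temperature_f": 91.4, "humidity": 38},
]

# The kind of query you might put in a QueryRecord property:
# keep only hot readings and only the columns we want to display.
QUERY = "SELECT sensor_id, temperature_f FROM FLOWFILE WHERE temperature_f > 80"


def run_query(rows, query):
    """Load rows into an in-memory table named FLOWFILE and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.execute(
        "CREATE TABLE FLOWFILE (sensor_id TEXT, temperature_f REAL, humidity REAL)"
    )
    conn.executemany(
        "INSERT INTO FLOWFILE VALUES (:sensor_id, :temperature_f, :humidity)", rows
    )
    result = [dict(row) for row in conn.execute(query)]
    conn.close()
    return result


hot = run_query(records, QUERY)
print(json.dumps(hot))
```

In NiFi, each query property you add to QueryRecord becomes its own relationship, so the matching records can be routed straight to the Slack step.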

To become a Kafka producer, you set a Record Reader for the incoming type, which is JSON in my case, and then set a Record Writer for the type to send to the sensors topic. In this case we kept it as JSON, but we could convert it to Avro. I usually do that if I am going to be reading it with Cloudera Kafka Connect.
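To make the reader/writer idea concrete, here is a rough Python sketch of what the pair does conceptually: the reader parses the incoming JSON into records, and the writer serializes each record into one message value for the sensors topic. The function names and the sample payload are hypothetical, and nothing is actually sent to Kafka here.

```python
import json

TOPIC = "sensors"  # topic name used in this post


def read_json_records(payload: str):
    """Record Reader stand-in: parse a JSON array or single object into a list of dicts."""
    data = json.loads(payload)
    return data if isinstance(data, list) else [data]


def write_json_messages(records):
    """Record Writer stand-in: produce one UTF-8 JSON message value per record."""
    return [json.dumps(r, separators=(",", ":")).encode("utf-8") for r in records]


# Hypothetical incoming flow file content.
payload = (
    '[{"sensor_id": "s1", "temperature_f": 85.2},'
    ' {"sensor_id": "s2", "temperature_f": 72.0}]'
)

messages = write_json_messages(read_json_records(payload))
for value in messages:
    print(TOPIC, value)
```

Swapping the writer for an Avro one changes only the serialization step; the records themselves stay the same.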



Our security is automagic and requires little from you in NiFi. I put in my username and password from CDP. The SSL context is set up for me when I create my Data Hub.
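If you connect to the same Kafka cluster from outside NiFi, you have to supply those settings yourself. The sketch below builds a client configuration using standard librdkafka property names. It assumes the cluster accepts SASL_SSL with the PLAIN mechanism; the broker hostnames, port, and credentials are placeholders, so check your own Data Hub for the real values.

```python
def kafka_client_config(brokers, username, password):
    """Build a client config dict; SASL_SSL with PLAIN is an assumption about the cluster."""
    if not username or not password:
        raise ValueError("username and password are required")
    return {
        "bootstrap.servers": ",".join(brokers),
        "security.protocol": "SASL_SSL",  # TLS on the wire plus SASL authentication
        "sasl.mechanism": "PLAIN",        # username/password authentication
        "sasl.username": username,
        "sasl.password": password,
    }


# Placeholder values for illustration only.
config = kafka_client_config(
    ["broker0.example.com:9093", "broker1.example.com:9093"],
    "my-cdp-user",
    "my-cdp-password",
)

# Print the config with the password masked.
print({k: ("***" if k == "sasl.password" else v) for k, v in config.items()})
```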


When I am writing to our Real-Time Data Mart (Apache Kudu), I enter the Kudu servers that I copied from the Kudu Data Mart Hardware page, then put in my table name and login info. I recommend using UPSERT with your JSON Record Reader.
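One reason to prefer UPSERT: if a record with the same primary key arrives twice (for example, on a replay), a plain insert fails on the duplicate key, while an upsert simply overwrites the row. Here is a toy sketch of that difference using a dict as the table. The sensor_id key and the sample values are assumptions, and this only models the semantics, not the Kudu client API.

```python
def insert(table, record, key="sensor_id"):
    """Plain INSERT stand-in: rejects a row whose primary key already exists."""
    if record[key] in table:
        raise KeyError(f"duplicate primary key: {record[key]}")
    table[record[key]] = dict(record)


def upsert(table, record, key="sensor_id"):
    """UPSERT stand-in: insert the row, or overwrite the row with the same key."""
    table[record[key]] = dict(record)


table = {}
upsert(table, {"sensor_id": "s1", "temperature_f": 85.2})
upsert(table, {"sensor_id": "s1", "temperature_f": 88.0})  # same key arrives again
print(table)
```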


For real use cases, you will need to spin up:

Public Cloud Data Hubs:
  • Streams Messaging Heavy Duty for AWS
  • Streams Messaging Heavy Duty for Azure
  • Flow Management Heavy Duty for AWS
  • Flow Management Heavy Duty for Azure
Software:
  • Apache Kafka 2.4.1
  • Cloudera Schema Registry 0.8.1
  • Cloudera Streams Messaging Manager 2.1.0
  • Apache NiFi 1.11.4
  • Apache NiFi Registry 0.5.0
Demo Source Code:


Let's configure our Data Hubs in CDP in an Azure environment. It takes a few clicks and some naming, and then it builds.












Under the Azure Portal


In Azure, we can examine the files we uploaded to the Azure object store.





Under the Data Lake SDX


NiFi and Kafka are autoconfigured to work with Apache Atlas under our environment's Data Lake SDX. We can browse through the lineage for all the Kafka topics we use.






We can also see the flow for NiFi, HDFS and Kudu.

SMM

We can examine all of our Kafka infrastructure: brokers, topics, consumers, producers, latency, and messages. We can also create and update topics.




Cloudera Manager

We still have access to all of our traditional items like Cloudera Manager to manage configuration of servers.



Under Real-Time Data Mart

We can view tables, create tables, and query our tables. Hue is a great tool for accessing data in my Real-Time Data Mart in a Data Hub.



We can also look at table details in the Impala UI.


©2020 Timothy Spann