FLaNK Stack Weekly for 5 February 2024

 

05-February-2024

FLaNK Stack Weekly

Tim Spann @PaaSDev

https://pebble.is/PaaSDev

https://vimeo.com/flankstack

https://www.youtube.com/@FLaNK-Stack

https://www.threads.net/@tspannhw

https://medium.com/@tspann/subscribe

Get your new Apache NiFi for Dummies!

https://www.cloudera.com/campaign/apache-nifi-for-dummies.html

https://ossinsight.io/analyze/tspannhw

Trial: https://console.us-west-1.cdp.cloudera.com/trial/register.html#/

Building Realtime AI Applications with Apache Flink

CODE + COMMUNITY

Please join my meetup group NJ/NYC/Philly/Virtual.

http://www.meetup.com/futureofdata-princeton/

https://www.meetup.com/futureofdata-newyork/

https://www.meetup.com/futureofdata-philadelphia/


**This is Issue #123**

https://github.com/tspannhw/FLiPStackWeekly

https://www.cloudera.com/solutions/dim-developer.html

Qualified Developers

https://www.linkedin.com/in/satya-n99999/

Articles

NiFi 2.0.0-M2 is Out! https://medium.com/@tspann/apache-nifi-2-0-0-m2-out-314a1d4c8b20

Apache NiFi and Amazon Textract for Machine Learning https://medium.com/@tspann/apache-nifi-and-amazon-textract-for-machine-learning-e45f4af12e68

Apache Kafka: Streams Replication Manager Replication https://blog.cloudera.com/streams-replication-manager-prefixless-replication-part-1/

Doom on Bacteria https://www.rockpapershotgun.com/you-can-play-doom-using-gut-bacteria-but-the-framerate-is-atrocious

Enterprises using Open Source LLM https://venturebeat.com/ai/how-enterprises-are-using-open-source-llms-16-examples/

Flink Deep Dive https://www.waitingforcode.com/apache-flink/apache-flink-cluster-components-deep-dive/read

A Cheat Sheet for RAG https://blog.llamaindex.ai/a-cheat-sheet-and-some-recipes-for-building-advanced-rag-803a9d94c41b

Prompt Engineering Guides https://github.com/dair-ai/Prompt-Engineering-Guide

https://platform.openai.com/docs/guides/prompt-engineering/six-strategies-for-getting-better-results

Hikari Connection Pool https://medium.com/@guptadiksha88/hikari-cp-efficient-database-connection-pooling-d458c0bdf7df

LLM Prompting https://www.infoq.com/articles/large-language-models-llms-prompting/

Incremental Iceberg https://netflixtechblog.com/incremental-processing-using-netflix-maestro-and-apache-iceberg-b8ba072ddeeb

Gen AI Images https://rmoff.net/2023/12/07/productivity-tools-ai-image-generators/

Java Links https://graciano.dev/2023/08/03/weekend-reading-list-187/

IoT with MQTT & NiFi https://www.baeldung.com/iot-data-pipeline-mqtt-nifi

CDC with NiFi and Snowflake https://www.clearpeaks.com/change-data-capture-cdc-with-nifi-and-snowflake/

Host Apache NiFi with Docker https://medium.com/geekculture/host-a-fully-persisted-apache-nifi-service-with-docker-ffaa6a5f54a3

Videos

Seven Videos on Real-Time Streaming https://medium.com/@tspann/seven-videos-on-real-time-streaming-02711320afa8

Unlocking Financial Data with Real-Time Pipelines (OSACon 2023) https://www.youtube.com/watch?v=Q7gF7m4yFi4&ab_channel=OSACon

Processing Cisco ASA Logs with CFM https://medium.com/cloudera-inc/processing-cisco-asa-logs-with-cloudera-flow-management-f09cdf7382c3

Collecting NetFlow Records with Cloudera DataFlow https://medium.com/cloudera-inc/collecting-netflow-records-with-cloudera-dataflow-f47d9f57c98

Events

Feb 8, 2024: NYC. https://www.meetup.com/new-york-open-source-data-infrastructure-meetup/events/297484047/

  • 18:00 - 18:30 Welcome: Networking & snacks
  • 18:30 - 18:35 Kickoff: Welcome Aiven
  • 18:35 - 19:00 A Guide to Product Experimentation (Erin Mikail Staples, LaunchDarkly)
  • 19:00 - 19:30 Building Real-time Pipelines: A Case Study with Transit Data (Tim Spann, Cloudera)
  • 19:30 - 21:00 Food & networking

Feb 2024: Webinar https://www.cloudera.com/about/events/webinars/stay-ahead-of-cyber-threats-by-utilizing-data-in-motion.html?utm_medium=virtual-event&utm_source=resources-module&keyplay=ALL&utm_campaign=FY25-Q1-CorporateWebinar-AMER-cyber-threats&cid=701Hr000001pXCQIA2

Feb 20, 2024: 12-1PM EST. Virtual. Azure Data Tech Groups: DBA Fundamentals Group https://www.meetup.com/dba-fundamentals-group/events/296855261/

Feb 28, 2024: NYC. Cloudera Meetup. Flink https://www.meetup.com/futureofdata-princeton/events/298661947/

Feb 29, 2024: Virtual. Conf42 Python. https://www.conf42.com/Python_2024_Tim_Spann_apache_nifi_2_processors

https://www.conf42.com/Python_2024_Karin_Wolok_nifi__kafka_risingwave_iceberg_llm

March 5, 2024: Princeton. Meetup. GenAI. https://www.meetup.com/applied-generative-artificial-intelligence-applications/

March 15, 2024: TCF Pro. Princeton, NJ. IEEE Information Technology Professional Conference at Trenton Computer Festival. https://princetonacm.acm.org/tcfpro/

April 2024: XtremeJ 2024. Virtual. https://xtremej.dev/2023/schedule/

May 8-9, 2024: Data Summit 2024. Boston, MA. https://www.dbta.com/DataSummit/2024/default.aspx

Cloudera Events https://www.cloudera.com/about/events.html

More Events: https://www.linkedin.com/pulse/schedule-2024-tim-spann--y4coe

Code

Models

Data

Tools

© 2020-2024 Tim Spann

Apache NiFi 2.0.0-M2 Out!

 https://medium.com/@tspann/apache-nifi-2-0-0-m2-out-314a1d4c8b20


New NiFi Features and Updates

More NiFi and faster.

So where’s the beef in this new upgrade?

New Features and Changes

New Schema Registry Options Added

  • DatabaseTableSchemaRegistryService
  • StandardJsonSchemaRegistry

New Components

  • StandardKustoIngestService
  • ZendeskRecordSink

New Processors

  • CalculateParquetOffsets
  • CalculateParquetRowGroupOffsets
  • FilterAttribute
  • PublishSlack
  • PutMongoBulk
  • PutAzureDataExplorer
  • PutZendeskTicket

Upgraded Libraries and APIs

  • Spring Framework 6
  • Jetty 12
  • Jakarta Servlet API 6
  • Jakarta XML Binding 4
  • Swagger 2 annotations
  • OpenAPI 3.0 REST API specification

These removals and relocations may hurt

  • Removed MiNiFi C2 Server modules
  • Removed Docker image configuration
  • Relocated JoltTransformJSON and JoltTransformRecord from nifi-standard-nar to nifi-jolt-nar
  • Removed InfluxDB Processors (use processors from an older version if needed)
  • Removed Bootstrap Notification Services

ListenSlack (WebSockets API)

With ListenSlack we can receive the live stream of Slack messages over the WebSockets API, without leaving a port open for Slack to call us. It is fast, and it saves me from writing the WebSocket handling myself.

{
  "clientMsgId" : "3434ad0d-0afe-4563-8c21-91bda87cf41c",
  "type" : "message",
  "team" : "E2TE1MAG",
  "channel" : "C1SD6N197",
  "user" : "ULMRENSE4",
  "botId" : null,
  "botProfile" : null,
  "text" : "Q: What is the weather at Newark?",
  "blocks" : [ {
    "type" : "rich_text",
    "elements" : [ {
      "type" : "rich_text_section",
      "elements" : [ {
        "type" : "text",
        "text" : "Q: What is the weather at Newark?",
        "style" : null
      } ]
    } ],
    "blockId" : "z/mKt"
  } ],
  "attachments" : null,
  "files" : null,
  "ts" : "1706648003.547529",
  "parentUserId" : null,
  "threadTs" : null,
  "eventTs" : "1706648003.547529",
  "channelType" : "channel",
  "edited" : null,
  "subtype" : null
}
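
A minimal Python sketch of how a downstream step might pull the useful fields out of a ListenSlack message event like the sample above. The helper names `block_text` and `summarize_slack_event`, the trimmed `SAMPLE_EVENT`, and the idea that a leading "Q:" marks a question are all assumptions made for illustration; none of this is part of NiFi itself, and only field names visible in the sample are used.

```python
import json
from datetime import datetime, timezone

# Trimmed copy of the ListenSlack sample above; null-valued fields are omitted.
SAMPLE_EVENT = """
{
  "type": "message",
  "channel": "C1SD6N197",
  "user": "ULMRENSE4",
  "text": "Q: What is the weather at Newark?",
  "blocks": [{
    "type": "rich_text",
    "elements": [{
      "type": "rich_text_section",
      "elements": [{"type": "text", "text": "Q: What is the weather at Newark?", "style": null}]
    }],
    "blockId": "z/mKt"
  }],
  "ts": "1706648003.547529"
}
"""


def block_text(event: dict) -> str:
    """Hypothetical helper: join the plain-text leaves of the rich_text blocks."""
    parts = []
    for block in event.get("blocks") or []:
        for section in block.get("elements") or []:
            for leaf in section.get("elements") or []:
                if leaf.get("type") == "text":
                    parts.append(leaf.get("text") or "")
    return "".join(parts)


def summarize_slack_event(raw: str) -> dict:
    """Hypothetical helper: keep the fields a chat-style flow usually needs."""
    event = json.loads(raw)
    # "ts" is a string of epoch seconds with a fractional suffix.
    sent_at = datetime.fromtimestamp(float(event["ts"]), tz=timezone.utc)
    # Prefer the top-level text; fall back to the text found in the blocks.
    text = event.get("text") or block_text(event)
    return {
        "channel": event["channel"],
        "user": event["user"],
        "text": text,
        # Assumption: questions are prefixed with "Q:" as in the sample.
        "is_question": text.startswith("Q:"),
        "sent_at": sent_at.isoformat(),
    }


if __name__ == "__main__":
    print(summarize_slack_event(SAMPLE_EVENT))
```

The same `block_text` walk should apply to ConsumeSlack records, since the history sample uses the same `blocks` layout.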

ConsumeSlack

With ConsumeSlack we can pull the message history from Slack, which is great.

[{"type":"message","subtype":null,"team":"A7TE32HJKA",
"channel":"C1SD6N197","user":"ULMS1759T","username":null,
"text":"Q: When did Emirates Airlines start?",
"blocks":[{"type":"rich_text",
"elements":[{"type":"rich_text_section",
"elements":[{"type":"text","text":
"Q: When did Emirates Airlines start?","style":null}]}],
"blockId":"bTFip"}],"attachments":null, ...

PublishSlack

slack.channel.id: C05QAAVEC0H
slack.ts: 1706642875.023669

This new processor is nice because you just pass in a FlowFile. Be warned, though: PutSlack is gone! Incoming webhooks are no longer used, so you need an Access Token from Slack.
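
To make the webhook-versus-token difference concrete, here is a rough Python sketch of a token-authenticated call to Slack's `chat.postMessage` Web API method. It is not a description of how PublishSlack is implemented. The helper name `build_post_message_request` and the placeholder token are assumptions, and the request is only constructed, never sent.

```python
import json
import urllib.request

SLACK_POST_MESSAGE_URL = "https://slack.com/api/chat.postMessage"


def build_post_message_request(token: str, channel_id: str, text: str) -> urllib.request.Request:
    """Hypothetical helper: build (but do not send) a chat.postMessage call."""
    body = json.dumps({"channel": channel_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        SLACK_POST_MESSAGE_URL,
        data=body,
        headers={
            # The access token goes in a bearer header, unlike a webhook URL.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder token: a real one comes from your Slack app configuration.
    request = build_post_message_request("xoxb-placeholder", "C05QAAVEC0H", "Hello from a flow")
    print(request.get_method(), request.full_url)
    # Sending would be urllib.request.urlopen(request), omitted on purpose.
```

A successful `chat.postMessage` response carries `channel` and `ts` fields, which look like the likely source of the `slack.channel.id` and `slack.ts` attributes shown above, though that mapping is an inference rather than something stated here.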

DatabaseTableSchemaRegistry

Table Schema Used For Registry
There are lots of schema registry options: Amazon Glue, Apicurio, the built-in Avro Schema Registry, Confluent Schema Registry, and the Database Table Schema Registry.

FilterAttribute

NiFi 1.25

For a production branch, NiFi 1.x has also been upgraded and has some goodies. It has the Slack processors, FilterAttribute, and DatabaseTableSchemaRegistry. So this is where you should be running your main production flows (or preferably in Cloudera DataFlow with full support).

See: