FLaNK AI - 01 April 2024

 

01-April-2024




FLaNK / KNIFe AI Weekly

https://knifeai.blogspot.com/

Tim Spann @PaaSDev

https://pebble.is/PaaSDev

https://vimeo.com/flankstack

https://www.youtube.com/@FLaNK-Stack

https://www.threads.net/@tspannhw

https://medium.com/@tspann/subscribe

https://www.cloudera.com/campaign/apache-nifi-for-dummies.html

https://ossinsight.io/analyze/tspannhw


COOL CHARITY by KIDS!

https://www.unveilx.org/

CODE + COMMUNITY

Please join my meetup group NJ/NYC/Philly/Virtual.

http://www.meetup.com/futureofdata-princeton/

https://www.meetup.com/futureofdata-newyork/

https://www.meetup.com/futureofdata-philadelphia/


**This is Issue #131**

https://github.com/tspannhw/FLiPStackWeekly

https://www.cloudera.com/solutions/dim-developer.html

New Releases

Apache Hive 4.0.0 https://hub.docker.com/r/apache/hive

Articles

Meetup Report https://medium.com/@tspann/march-2024-meetup-report-61e82b00cf57

Real-Time Irish Transit Analytics https://medium.com/@tspann/real-time-irish-transit-analytics-ea76164c9595

Adding Generative AI Results to SQL Streams https://medium.com/@tspann/adding-generative-ai-results-to-sql-streams-513e1fd2a6af

Image Processing with Custom Python and Apache NiFi 2.0 https://medium.com/@tspann/image-processing-with-custom-python-and-nifi-2-0-06eadc62c03c

Cloudera + GenAI + NVIDIA NIM Microservices https://menews247.com/cloudera-to-enhance-genai-with-nvidia-nim-microservices/

https://blog.cloudera.com/data-architecture-and-strategy-in-the-ai-era/

https://blog.cloudera.com/clouderas-rhel-volution-powering-the-cloud-with-red-hat/

https://developer.nvidia.com/blog/translate-your-enterprise-data-into-actionable-insights-with-nvidia-nemo-retriever/

https://drive.google.com/file/d/11lCJAB272ruBa7AAVwYxaN2E2xooWizG/view

https://jack-vanlightly.com/blog/2024/3/26/the-sisyphean-struggle-and-the-new-era-of-data-infrastructure

https://pypi.org/project/streaming-jupyter-integrations/

https://thenewstack.io/how-nvidia-gpu-acceleration-supercharged-milvus-vector-database/

NiFi 2.0 Python https://medium.com/@sudeep.singh99/a-beginners-guide-to-nifi-2-0-custom-python-processor-ac6d8c7bda7b

Make sure you are on the right macOS version for new Java https://blogs.oracle.com/java/post/java-on-macos-14-4

https://www.datanami.com/2024/03/22/zilliz-unveils-game-changing-features-for-vector-search

https://towardsdatascience.com/automated-detection-of-data-quality-issues-54a3cb283a91

https://mlops.community/7-methods-to-secure-llm-apps-from-prompt-injections-and-jailbreaks/?

https://www.startdataengineering.com/post/change-data-capture-using-debezium-kafka-and-pg/

https://medium.com/@hubert.dulay/stream-processing-vs-real-time-olap-vs-streaming-database-339c75ca6772

https://www.cloudera.com/about/news-and-blogs/press-releases/2024-03-28-global-survey-reveals-90-of-it-leaders-believe-that-unifying-the-data-lifecycle-on-a-single-platform-is-critical-for-analytics-and-ai.html

https://nvidianews.nvidia.com/news/nvidia-blackwell-platform-arrives-to-power-a-new-era-of-computing

https://netflixtechblog.com/bending-pause-times-to-your-will-with-generational-zgc-256629c9386b

https://www.uber.com/en-GB/blog/balancing-hdfs-datanodes-in-the-uber-datalake/

https://techcrunch.com/2024/03/31/why-aws-google-and-oracle-are-backing-the-valkey-redis-fork/

Videos

Meetup Talk NYC https://youtu.be/u8XNNEPEnKQ?si=VWe6n8OKOF7qk6Fl

Irish Rail Preview https://youtu.be/EIpH7RPO2Yo

TCF Pro 2024 https://www.youtube.com/watch?v=tLbdrOxg5Rs

Streaming Traffic Cameras https://www.youtube.com/watch?v=85ECRGJBEQU&ab_channel=DatainMotion-HowToBeaStreamingEngineer

NiFi 101 https://www.youtube.com/watch?v=8cZJ9CyLYyI&t=3114s

March 11, 2024 Princeton 23 Orchard Event

https://www.slideshare.net/slideshows/2024-build-generative-ai-for-nonprofits/266748822

March 15, 2024 Trenton TCF

https://www.slideshare.net/slideshows/tcfpro24-building-realtime-generative-ai-pipelines/266807785

Events

April 2, 2024: XtremeJ 2024. Virtual. https://xtremej.dev/2023/schedule/

April 8-11, 2024: NLIT Summit. Seattle. https://www.fbcinc.com/e/nlit/default.aspx

April 11, 2024: Conf42 LLM. Virtual. https://www.conf42.com/llms2024

April 12, 2024: AI Max Conference. 23 Orchard Princeton https://www.startupgrind.com/events/details/startup-grind-princeton-presents-startup-grind-hosts-ai-max-summit/

April 2024: AI Meetup NJ https://www.meetup.com/nj-gai/

April 24, 2024 (EMEA | APAC): 9:30 AM CEST | 1:00 PM IST. April 25, 2024 (AMER): 9:00 AM PDT | 12:00 PM EDT. Register now: http://spr.ly/6047Z3AjN

May 8-9, 2024: Data Summit 2024. Boston, MA. https://www.dbta.com/DataSummit/2024/default.aspx https://www.dbta.com/DataSummit/2024/Timothy-Spann.aspx

May 21, 2024: Gen AI and Beyond with NiFi 2.0. Virtual.

June 12, 2024: Budapest Data + ML Forum. Virtual. https://budapestdata.hu/2024/en/

Cloudera Events https://www.cloudera.com/about/events.html

https://www.cloudera.com/events/cloudera-now-cdp.html?internal_keyplay=ALL&internal_campaign=FY25-Q1-AMER-WS-Cloudera-Now-Events-Page-P06&cid=701Hr000000tW6qIAE&internal_link=p06

More Events: https://www.linkedin.com/pulse/schedule-2024-tim-spann--y4coe

Code

Models

Tools

New

Vector DB built on ClickHouse https://github.com/myscale/myscaledb

Cool Tool - LLM Synthetic Data Generators

https://github.com/geraldyong/OpenAI_Synthetic/tree/main

https://github.com/quentinlintz/synthetic-data-generator

https://medium.com/@n-demia/how-to-prepare-test-data-via-openai-api-in-postman-7e378dde1f53

https://github.com/datadreamer-dev/DataDreamer

https://huggingface.co/collections/rbiswasfc/synthetic-data-generation-65ee68e821ddaff47073ed02

Flink Connectors (scroll down)

https://flink.apache.org/downloads/

Avro

Can't handle numbers bigger than 19 decimals

Throwback Article

https://community.cloudera.com/t5/Community-Articles/Ingesting-RDBMS-Data-As-New-Tables-Arrive-Automagically-into/ta-p/246214

https://docs.cloudera.com/csp-ce/latest/ce-overview/topics/csp-ce-overview.html

Discount

Discount access to DataSummit 2024 https://secure.infotoday.com/RegForms/DataSummit/?Priority=24SPKR

© 2020-2024 Tim Spann

March 28, 2024 Hybrid Meetup

Real-Time Irish Transit Analytics

 



Apache NiFi, PostgreSQL, GenAI, Apache Kafka, Apache Flink, JSON, GTFS

Let’s hop on a bus in Ireland!

We need to load static (rarely changing) lookup data. We can do this very easily with NiFi. We build new PostgreSQL tables and insert the data into them.

See me here:

ChatGPT Authored Introduction:

Unlocking the Future of Transportation: Real-Time Irish Transit Analytics

In the bustling landscape of modern transportation, the ability to harness real-time data is not just a competitive advantage; it’s a necessity. In Ireland, where efficient transit systems are the lifeblood of daily commutes and city connectivity, the fusion of cutting-edge technologies is revolutionizing how we understand and optimize public transportation. This article delves into the world of Real-Time Irish Transit Analytics, where Apache NiFi, PostgreSQL, GenAI, Apache Kafka, Apache Flink, JSON, and GTFS converge to create a dynamic and responsive ecosystem.

Every day, thousands of passengers rely on Ireland’s public transit systems to navigate cities, reach work, or simply explore the beauty of the countryside. Yet, behind the scenes of this seemingly seamless operation lies a complex network of data streams, from vehicle locations to passenger counts, schedules to service updates. Here, Apache NiFi emerges as a pivotal tool, seamlessly orchestrating the flow of data from various sources into a unified pipeline.

PostgreSQL steps in as the reliable database backbone, providing a robust foundation for storing and querying vast amounts of transit data. With the power of GenAI, machine learning algorithms sift through this data trove, uncovering valuable insights into passenger behaviors, traffic patterns, and optimal routes.

But data is only as valuable as its timeliness, and this is where Apache Kafka and Apache Flink shine. Kafka acts as the real-time messaging hub, ensuring that updates from buses, trains, and stations are instantly propagated through the system. Flink’s stream processing capabilities then come into play, analyzing incoming data on the fly to generate actionable intelligence.

In the realm of data interchange, JSON (JavaScript Object Notation) emerges as the lingua franca, facilitating seamless communication between different components of the analytics ecosystem. And anchoring it all is the General Transit Feed Specification (GTFS), a standardized format for public transit schedules and geographic information, ensuring interoperability and accuracy across the board.

Join us on a journey through the intricacies of Real-Time Irish Transit Analytics, where these technologies converge to enhance efficiency, improve passenger experiences, and pave the way for the future of smart transportation.

An important source of data is the set of static GTFS lookup tables, provided as a zip file of CSV files. We can download and parse this automagically in NiFi. There is no need to know the schema or precreate tables; NiFi will determine the fields for you.

https://www.transportforireland.ie/transitData/Data/GTFS_Realtime.zip
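For readers who want to poke at the static feed outside of NiFi, here is a minimal Python sketch of the same idea: open the zip, read each CSV file, and take the field names from the header row rather than from a hand-written schema. This is not the NiFi flow itself. The function names and the tiny in-memory sample zip are made up for illustration, and skipping shapes.txt mirrors the step noted in this article.

```python
import csv
import io
import zipfile


def read_gtfs_tables(zip_bytes, skip=("shapes.txt",)):
    """Return {table_name: (field_names, rows)} for each CSV file in a GTFS zip."""
    tables = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.endswith(".txt") or name in skip:
                continue
            # utf-8-sig drops a byte-order mark if the file starts with one
            text = archive.read(name).decode("utf-8-sig")
            reader = csv.DictReader(io.StringIO(text))
            rows = list(reader)
            # Field names come from the CSV header, not from a predefined schema
            tables[name[:-4]] = (reader.fieldnames or [], rows)
    return tables


def build_sample_zip():
    """Build a tiny in-memory stand-in for the real feed (made-up rows)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("routes.txt", "route_id,route_long_name\nR1,Sample Route\n")
        archive.writestr("shapes.txt", "shape_id,shape_pt_lat\nS1,53.0\n")
    return buffer.getvalue()


for table, (fields, rows) in read_gtfs_tables(build_sample_zip()).items():
    print(table, fields, len(rows))
```

To try it on the real feed, download the zip from the URL above and pass its bytes to read_gtfs_tables.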

GTFS Static Data Load

Skip shapes.txt as we aren’t loading those

Set a Default Primary Key

Setting All the Correct Primary Keys for all the Static Files/Tables

Split Up Tables into 1,000-Row Chunks to Make it Easier for PostgreSQL

We converted CSV to JSON and split it up in one step.
Loaded Results
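The chunking and CSV-to-JSON conversion described above happen inside NiFi, but the idea is easy to sketch in plain Python. The function name is hypothetical and the chunk size of 1,000 simply mirrors the number used in the flow.

```python
import csv
import io
import json


def csv_to_json_chunks(csv_text, chunk_size=1000):
    """Yield JSON array strings, each holding at most chunk_size records."""
    chunk = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield json.dumps(chunk)
            chunk = []
    if chunk:
        # Emit the final partial chunk
        yield json.dumps(chunk)
```

Smaller batches keep each database write short, which is the same reason the flow splits the tables before sending them to PostgreSQL.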

Update the SQL Automagically

We do not set field names manually, so there is no SQL injection here.

Send this SQL to the Database
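To illustrate what generating the SQL from the record can look like, here is a hedged Python sketch. It is an assumption about the general technique, not the SQL that NiFi produces: column names are taken from the record keys and checked against a strict identifier pattern, while values travel as bound parameters. The %s placeholder style is the one used by common PostgreSQL drivers such as psycopg, and the upsert form (ON CONFLICT) is my choice for the example.

```python
import re

# Only plain identifiers are allowed as table or column names
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_upsert(table, record, key_columns):
    """Return (sql, params) for a PostgreSQL upsert built from the record's keys."""
    columns = list(record)
    for name in [table, *columns, *key_columns]:
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    placeholders = ", ".join(["%s"] * len(columns))
    updates = ", ".join(
        f"{c} = EXCLUDED.{c}" for c in columns if c not in key_columns
    )
    sql = (
        f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders}) "
        f"ON CONFLICT ({', '.join(key_columns)}) "
        + (f"DO UPDATE SET {updates}" if updates else "DO NOTHING")
    )
    # Values are never concatenated into the SQL text
    return sql, [record[c] for c in columns]
```

The returned pair can be handed to a driver's execute(sql, params) call, so the data itself never becomes part of the statement text.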

A list of Ireland Lookup Trips loaded from trips.txt

Let’s parse the real time transit information for Ireland.

GTFS Real-Time

Vehicle Positions is the primary API to get where the buses are.

API REST TEST

GET https://api.nationaltransport.ie/gtfsr/v2/gtfsr?format=json HTTP/1.1

Cache-Control: no-cache

x-api-key: dddddd

Unlike most transit systems we have seen with GTFS and GTFS-R feeds, this one does not have three feed types, just two. Alerts are missing.

[ Trip Updates, Vehicle Positions]

The GTFS-R API contains real-time updates for services provided by Dublin Bus, Bus Éireann, and Go-Ahead Ireland.

You have to sign up and subscribe to the API to use this.

x-api-key is the header for our private key
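If you want to run the REST test above from Python instead of an HTTP client, a minimal sketch with only the standard library could look like this. The function names are hypothetical, "dddddd" is the same placeholder key as in the test request, and fetch_feed needs a real subscription key and network access to return anything.

```python
import json
import urllib.request

GTFSR_URL = "https://api.nationaltransport.ie/gtfsr/v2/gtfsr?format=json"


def build_gtfsr_request(api_key, url=GTFSR_URL):
    """Build the GET request with the same headers as the REST test."""
    return urllib.request.Request(
        url,
        headers={"x-api-key": api_key, "Cache-Control": "no-cache"},
        method="GET",
    )


def fetch_feed(api_key, timeout=30):
    """Call the API and decode the JSON body (needs a real key and network access)."""
    request = build_gtfsr_request(api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Building the request separately from sending it makes it easy to inspect the headers before spending an API call.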

Example Vehicle Position as JSON

[ {
"recordid" : "V56",
"route_id" : "3924_62692",
"directionid" : "0",
"latitude" : "53.3537788",
"tripid" : "3924_16321",
"starttime" : "22:50:00",
"vehicleid" : "274",
"startdate" : "20240322",
"uuid" : "8a50c084-0aea-496e-b4c3-dbed373e812e",
"longitude" : "-6.40118694",
"timestamp" : "1711150967",
"ts" : "1711167213555"
} ]
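Every value in the flattened record above is a string, so downstream code usually wants typed values. Here is a small Python sketch of that conversion. It assumes "timestamp" is epoch seconds and "ts" is epoch milliseconds, which is what the digit counts suggest; the output key names are my own.

```python
import json
from datetime import datetime, timezone

SAMPLE = """[ {
"recordid" : "V56",
"route_id" : "3924_62692",
"directionid" : "0",
"latitude" : "53.3537788",
"tripid" : "3924_16321",
"starttime" : "22:50:00",
"vehicleid" : "274",
"startdate" : "20240322",
"uuid" : "8a50c084-0aea-496e-b4c3-dbed373e812e",
"longitude" : "-6.40118694",
"timestamp" : "1711150967",
"ts" : "1711167213555"
} ]"""


def parse_vehicle(record):
    """Convert the string fields of one vehicle position into typed values."""
    return {
        "vehicleid": record["vehicleid"],
        "route_id": record["route_id"],
        "tripid": record["tripid"],
        "latitude": float(record["latitude"]),
        "longitude": float(record["longitude"]),
        # Assumed epoch seconds
        "observed_at": datetime.fromtimestamp(int(record["timestamp"]), tz=timezone.utc),
        # Assumed epoch milliseconds
        "pipeline_ts": datetime.fromtimestamp(int(record["ts"]) / 1000, tz=timezone.utc),
    }


vehicles = [parse_vehicle(r) for r in json.loads(SAMPLE)]
print(vehicles[0]["observed_at"].isoformat())
```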

Vehicle Position Slack Message

Irish Transit Tracking
Direction ${directionid}
Request ${invokehttp.request.url} ${invokehttp.status.message} ${invokehttp.tx.id}
Lat/Long ${latitude}/${longitude}
Vehicle ${vehicleid}
Route ${route_id}
Scheduled? ${scheduled}
Start Date/Time/TS ${startdate} / ${starttime} / ${timestamp}
IDs ${uuid} ${recordid} TripID ${tripid}
Scheduled: ${scheduled}
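The ${...} placeholders in the message above are NiFi Expression Language attribute references. To preview how such a message fills in, here is a rough Python stand-in that handles only plain attribute names (no Expression Language functions). Rendering unknown attributes as empty text is an assumption of this sketch.

```python
import re

# Matches ${name}, including dotted names such as invokehttp.request.url
_PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def render(template, attributes):
    """Replace each ${name} with attributes[name]; unknown names become empty text."""
    return _PLACEHOLDER.sub(
        lambda match: str(attributes.get(match.group(1), "")), template
    )


message = render(
    "Lat/Long ${latitude}/${longitude}\nVehicle ${vehicleid}\nRoute ${route_id}",
    {
        "latitude": "53.3537788",
        "longitude": "-6.40118694",
        "vehicleid": "274",
        "route_id": "3924_62692",
    },
)
print(message)
```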

Trip Updates

Example Trip Update as JSON

{
"triptimestamp" : "1711415067",
"stopsequence" : "10",
"schedulerelationship" : "SCHEDULED",
"tripstarttime" : "21:30:00",
"stopid" : "8530B1520901",
"departuredelay" : "-104",
"tripid" : "3950_45558",
"tripschedulerelationship" : "SCHEDULED",
"tripstartdate" : "20240325",
"uuid" : "46595e37-4fdd-48db-8431-216bcabe4dd7",
"departuretime" : "",
"tripdirectionid" : "0",
"arrivaltime" : "",
"arrivaldelay" : "-104",
"triprouteid" : "3950_62756",
"ts" : "1711476673867",
"route_long_name" : "Dublin - Airport - Cavan - Donegal",
"stop_name" : "Topaz Belleek"
}
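In GTFS Realtime, delays are expressed in seconds, with positive values meaning late and negative values meaning ahead of schedule. So the -104 above is 1 minute 44 seconds early. A small helper, with a hypothetical name, makes that readable and tolerates the empty strings that show up in this feed:

```python
def describe_delay(delay_seconds):
    """Turn a GTFS-R delay in seconds (as text or int) into a short description."""
    if delay_seconds in ("", None):
        return "unknown"
    seconds = int(delay_seconds)
    if seconds == 0:
        return "on time"
    minutes, remainder = divmod(abs(seconds), 60)
    direction = "late" if seconds > 0 else "early"
    return f"{minutes}m {remainder}s {direction}"


print(describe_delay("-104"))
```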

Trip Update Slack Message

Irish Transit Tracking Trip Updates
Request ${invokehttp.request.url} ${invokehttp.status.message} ${invokehttp.tx.id}
IDs ${uuid}
Arrival Delay / Time: ${arrivaldelay} / ${arrivaltime}
Departure Delay / Time: ${departuredelay} / ${departuretime}
Schedule: ${schedulerelationship} ${tripschedulerelationship}
Stop ID/Sequence: ${stopid} / ${stopsequence}
Trip Direction: ${tripdirectionid} ${tripid}
Trip Route: ${triprouteid}
Trip Start Date / Time / TS: ${tripstartdate} / ${tripstarttime} / ${triptimestamp}

Create Table in Flink

Query Kafka Topic — Flink SQL Table in SSB

Send Messages

Lookups From PostgreSQL Table

Finally Send Messages to Slack
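The lookup step in the list above is what adds fields like route_long_name and stop_name to a trip update. As a rough illustration only, here is that enrichment in Python with an in-memory SQLite database standing in for PostgreSQL. The table layout and function names are assumptions; the sample values come from the trip update shown earlier.

```python
import sqlite3


def build_lookup_db():
    """Create a tiny stand-in for the static GTFS lookup tables."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE routes (route_id TEXT PRIMARY KEY, route_long_name TEXT)")
    db.execute("CREATE TABLE stops (stop_id TEXT PRIMARY KEY, stop_name TEXT)")
    db.execute(
        "INSERT INTO routes VALUES (?, ?)",
        ("3950_62756", "Dublin - Airport - Cavan - Donegal"),
    )
    db.execute("INSERT INTO stops VALUES (?, ?)", ("8530B1520901", "Topaz Belleek"))
    return db


def enrich(db, update):
    """Return a copy of the trip update with route and stop names attached."""
    route = db.execute(
        "SELECT route_long_name FROM routes WHERE route_id = ?",
        (update["triprouteid"],),
    ).fetchone()
    stop = db.execute(
        "SELECT stop_name FROM stops WHERE stop_id = ?", (update["stopid"],)
    ).fetchone()
    enriched = dict(update)
    # Missing lookups become None instead of failing the whole record
    enriched["route_long_name"] = route[0] if route else None
    enriched["stop_name"] = stop[0] if stop else None
    return enriched
```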

NATIONAL ROADS WEATHER STATION

PUBLIC TRANSPORT DATA

LOOKUP DATA FROM GTFS

stop_id: Unique ID

Primary key (trip_id, stop_sequence)
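One way to keep track of which key goes with which static table is a simple mapping. The sketch below follows the GTFS reference for the common files; I have not checked which of them the Irish feed actually ships, and the ("id",) fallback is a hypothetical default, not the one used in the flow.

```python
# Key columns per GTFS static file, per the GTFS reference
GTFS_PRIMARY_KEYS = {
    "agency": ("agency_id",),
    "stops": ("stop_id",),
    "routes": ("route_id",),
    "trips": ("trip_id",),
    "stop_times": ("trip_id", "stop_sequence"),
    "calendar": ("service_id",),
    "calendar_dates": ("service_id", "date"),
}


def primary_key_for(table, default=("id",)):
    """Return the key columns for a table, or a (hypothetical) default."""
    return GTFS_PRIMARY_KEYS.get(table, default)


def key_values(table, row):
    """Extract the primary key values of one row as a tuple."""
    return tuple(row[column] for column in primary_key_for(table))
```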

DUBLIN BIKES

RAILROAD

IRISH STATIONS

{"StationDesc":"Millstreet","StationAlias":null,
"StationLatitude":52.0776,"StationLongitude":-9.06973,
"StationCode":"MLSRT","StationId":24,"ts":"1711496919762",
"uuid":"f6e71a76-41cc-4a8e-8795-323c3b43d62f"}

IRISH TRAIN RECORD

{
"TrainStatus":"R","TrainLatitude":53.4169,
"TrainLongitude":-6.1512,"TrainCode":"P617",
"TrainDate":"27 Mar 2024",
"PublicMessage":"P617\\n16:02 - Drogheda to Dublin Pearse (1 mins late)\\nDeparted Portmarnock next stop Dublin Connolly",
"Direction":"Southbound","ts":"1711557932947",
"uuid":"b485cefb-67e8-482d-86ba-1ca43e0b523a"
}
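The PublicMessage field packs several lines into one string. Here is a hedged Python sketch for pulling it apart. It accepts either a real newline or the literal backslash-n shown in the record above, and the "(N mins late)" pattern is inferred from this single sample, so other message shapes may not match.

```python
import json
import re

SAMPLE = r'''{
"TrainStatus":"R","TrainLatitude":53.4169,
"TrainLongitude":-6.1512,"TrainCode":"P617",
"TrainDate":"27 Mar 2024",
"PublicMessage":"P617\\n16:02 - Drogheda to Dublin Pearse (1 mins late)\\nDeparted Portmarnock next stop Dublin Connolly",
"Direction":"Southbound","ts":"1711557932947",
"uuid":"b485cefb-67e8-482d-86ba-1ca43e0b523a"
}'''


def parse_public_message(message):
    """Split the message into lines and extract the lateness in minutes, if present."""
    # Split on a literal backslash followed by "n", or on a real newline
    lines = re.split(r"\\n|\n", message)
    match = re.search(r"\((-?\d+) mins late\)", message)
    return {
        "lines": lines,
        "minutes_late": int(match.group(1)) if match else None,
    }


train = json.loads(SAMPLE)
print(parse_public_message(train["PublicMessage"]))
```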

IRISH STATION RECORD

{
"StationDesc":"Midleton",
"StationAlias":null,
"StationLatitude":51.9212,
"StationLongitude":-8.17579,
"StationCode":"MDLTN",
"StationId":68,
"ts":"1711558009615",
"uuid":"1f5ae394-4726-4f3e-8e53-7f50f95ae05e"
}

SOURCE CODE

FLINK SQL KAFKA TABLE

CREATE TABLE `ssb`.`Meetups`.`irelandvehicle` (
`recordid` VARCHAR(2147483647),
`route_id` VARCHAR(2147483647),
`directionid` VARCHAR(2147483647),
`latitude` VARCHAR(2147483647),
`tripid` VARCHAR(2147483647),
`starttime` VARCHAR(2147483647),
`vehicleid` VARCHAR(2147483647),
`startdate` VARCHAR(2147483647),
`uuid` VARCHAR(2147483647),
`longitude` VARCHAR(2147483647),
`timestamp` VARCHAR(2147483647),
`ts` VARCHAR(2147483647),
`route_long_name` VARCHAR(2147483647),
`trip_short_name` VARCHAR(2147483647),
`trip_headsign` VARCHAR(2147483647),
`eventTimeStamp` TIMESTAMP(3) WITH LOCAL TIME ZONE METADATA FROM 'timestamp',
WATERMARK FOR `eventTimeStamp` AS `eventTimeStamp` - INTERVAL '3' SECOND
) WITH (
'scan.startup.mode' = 'group-offsets',
'deserialization.failure.policy' = 'ignore_and_log',
'properties.request.timeout.ms' = '120000',
'properties.auto.offset.reset' = 'earliest',
'format' = 'json',
'properties.bootstrap.servers' = 'kafka:9092',
'connector' = 'kafka',
'properties.transaction.timeout.ms' = '900000',
'topic' = 'irelandvehicle',
'properties.group.id' = 'irelandconsumersbb1'
)
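The WATERMARK clause in the table above tells Flink to tolerate events arriving up to 3 seconds out of order, based on the Kafka record timestamp. The Python sketch below is a simplified model of that idea, not Flink's implementation: the watermark trails the largest event time seen by 3 seconds, and anything at or behind the watermark counts as late. The class name is made up for illustration.

```python
from datetime import datetime, timedelta, timezone


class WatermarkTracker:
    """Simplified model of a bounded out-of-orderness watermark."""

    def __init__(self, delay=timedelta(seconds=3)):
        self.delay = delay
        self.watermark = None

    def observe(self, event_time):
        """Return True if the event is on time, False if it is behind the watermark."""
        on_time = self.watermark is None or event_time > self.watermark
        candidate = event_time - self.delay
        # The watermark only ever moves forward
        if self.watermark is None or candidate > self.watermark:
            self.watermark = candidate
        return on_time


base = datetime(2024, 3, 22, 23, 42, 47, tzinfo=timezone.utc)
tracker = WatermarkTracker()
for offset in (0, 10, 5, 8):
    print(offset, tracker.observe(base + timedelta(seconds=offset)))
```

In this run, the event at +5 seconds arrives after the watermark has already advanced to +7 seconds, so it is the only one flagged as late.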

RESOURCES