Upcoming Apache Pulsar and Apache Flink Talks - ApacheCon Asia and ApacheCon 2021

ApacheCon Asia 2021


#messaging

#streaming

StreamNative - David Kjerrumgaard's Talk

ENGLISH SESSION 2021-08-08 15:30 GMT+8

In this talk, I will present a technique for deploying machine learning models to provide real-time predictions using Apache Pulsar Functions. To provide a prediction in real time, the model usually receives a single data point from the caller and is expected to return an accurate prediction within a few milliseconds.

Throughout this talk, I will demonstrate the steps required to productionize a fully-trained ML model that predicts the delivery time for a food delivery service based upon real-time traffic information, the customer's location, and the restaurant that will be fulfilling the order.
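For illustration, Pulsar's Python Functions runtime can run a plain `process(input, context)` function; a minimal sketch of serving a model that way (the feature names and scoring logic here are hypothetical stand-ins for a real trained model):

```python
import json

# Hypothetical scoring logic standing in for a real fully-trained model;
# in practice you would load a serialized model once at module import time
# so each per-message call stays within a few milliseconds.
def predict_delivery_minutes(features):
    base = 20.0
    return base + 2.0 * features.get("traffic_level", 0)

# Pulsar Functions invokes process() once per message on the input topic.
def process(input, context):
    features = json.loads(input)                 # a single data point
    eta = predict_delivery_minutes(features)
    return json.dumps({"eta_minutes": eta})
```

The function would then be deployed with `pulsar-admin functions create`, with predictions written back to an output topic for the caller.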

Speaker:

David Kjerrumgaard: David is the author of “Pulsar in Action”


StreamNative - Tim Spann's Talk

ENGLISH SESSION 2021-08-08 14:50 GMT+8

Today, data is being generated from devices and containers living at the edge of networks, clouds and data centers. We need to run business logic, analytics and deep learning at the edge before we start our real-time streaming flows. Fortunately, using all the FLiP and FLaNK stacks we can do this with ease! Streaming AI-powered analytics from the edge to the data center is now a simple use case. With MiNiFi we can ingest the data, do data checks, cleansing, run machine learning and deep learning models, and route our data in real time to Apache NiFi and Apache Pulsar for further transformations and processing. Apache Flink will provide our advanced streaming capabilities, fed in real time via Apache Pulsar topics. Apache MXNet models will run both at the edge and in our data centers via Apache NiFi and MiNiFi.

Tools: Apache Flink, Apache Pulsar, Apache NiFi, MiNiFi, Apache MXNet

Speaker:

Timothy Spann: Tim Spann is a Developer Advocate at StreamNative where he works with Apache NiFi, MiNiFi, Kafka, Apache Flink, Apache MXNet, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a senior solutions architect at AirisData and a senior field engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as IoT Fusion, Strata, ApacheCon, DataWorks Summit Berlin, DataWorks Summit Sydney, and Oracle Code NYC. He holds a BS and MS in computer science.

https://www.datainmotion.dev/p/about-me.html 

https://dzone.com/users/297029/bunkertor.html

https://dev.to/tspannhw




ApacheCon Global 2021






StreamNative Talks





Tuesday 17:10 UTC - Apache NiFi Deep Dive 300 - Tim Spann
Tuesday 18:00 UTC - Apache Deep Learning 302 - Tim Spann
Wednesday 15:00 UTC - Smart Transit: Real-Time Transit Information with FLaNK- Tim Spann 
Wednesday 17:10 UTC - Cracking the Nut, Solving Edge AI with Apache Tools and Frameworks - Tim Spann
Thursday 14:10 UTC - Apache NiFi 101: Introduction and Best Practices - Tim Spann


Apache Flink and Apache Pulsar










A New FLiP!




As some have noticed, I have left Cloudera. It has been an incredible journey. I joined Hortonworks in April of 2016 and then we merged with Cloudera in 2019. This was my first article on Apache NiFi: https://lnkd.in/e4pxg43. I got to grow alongside Apache NiFi as it went from 1.0 to 1.14 during my time! A lot of things changed and evolved, and the tech grew so much.

I got to do my first major conference talks at DataWorks Summit, which will always be one of my favorite event series ever. I am excited to be involved with Pulsar Summit (https://pulsar-summit.org/) and many other conferences now.

My Final Tallies at Hortonworks/Cloudera:
11 videos on my YouTube channel https://lnkd.in/eeRRCJv
Grew the Future of Data Meetup Princeton from 0 to 1,719 members
Over 48 Meetup events around the world
Over 230K Blog Views
Over 192 Blog Articles
344 DZone Articles with 3 Million Views https://lnkd.in/ejdbXte
Over 41 Conferences Spoken at
Hosted one Mardi Gras event at a client; it was awesome
60 SlideShares https://lnkd.in/eUgtpxY
266 GitHub Repos https://lnkd.in/eM9JGks

I got to work with some of the best technologists in the world, who were also some of the best people. I really enjoyed the community and the teamwork.

Reports from 2017, 2018, 2019, 2020
https://lnkd.in/exxZVJc 
https://lnkd.in/ehX6RE6
https://lnkd.in/enhJgQs 
https://lnkd.in/eFGzHYV

I am really excited about what we are doing at StreamNative with Apache Pulsar. I still get to work with the amazing ASF open source community and all the great streaming friends with Apache Flink and Apache NiFi. I am working on a FLiP Stack to demonstrate some cool apps you can build with Flink, Pulsar and friends. Stay tuned. I will remain involved in the Apache NiFi community, and I have a talk on Apache NiFi at ApacheCon later this year.

Please join me for new streaming adventures with Apache Pulsar, Apache Flink and the FLiP(N) Stack!


Reference:   https://www.linkedin.com/feed/update/urn:li:activity:6825792846759563264/


Upcoming Events 2021



ApacheCon Asia - 06-August-2021




ApacheCon 2021 - 21-September-2021 to 23-September-2021


Tuesday 17:10 UTC - Apache NiFi Deep Dive 300

Tuesday 18:00 UTC - Apache Deep Learning 302

Wednesday 15:00 UTC - Smart Transit: Real-Time Transit Information with FLaNK

Wednesday 17:10 UTC - Cracking the Nut, Solving Edge AI with Apache Tools and Frameworks

Thursday 14:10 UTC - Apache NiFi 101: Introduction and Best Practices


Scenic City Summit - 24-September-2021


Big Data Conference EU - 28-September-2021 to 29-September-2021

https://bigdataconference.eu/Timothy-J-Spann/


API World - 26-October-2021 to 28-October-2021

https://apiworld.co/conference/




NiFi on Cloudera Data Platform Upgrade - April 2021

CFM 2.1.1 on CDP 7.1.6

There is a new Cloudera release of Apache NiFi now with SAML support.

Apache NiFi 1.13.2.2.1.1.0
Apache NiFi Registry 0.8.0.2.1.1.0





https://docs.cloudera.com/cfm/2.1.1/release-notes/topics/cfm-whats-new.html

https://docs.cloudera.com/cfm/2.1.1/upgrade-paths/topics/cfm-upgrade-paths.html 

For changes:   https://www.datainmotion.dev/2021/02/new-features-of-apache-nifi-1130.html

Get your download on:  https://docs.cloudera.com/cfm/2.1.1/download/topics/cfm-download-locations.html

To start researching for the future, take a look at some of the technical preview features around Easy Rules engine and handlers.

https://docs.cloudera.com/cfm/2.1.1/release-notes/topics/cfm-technical-preview.html

Make sure you use a recent JDK 8 release, such as 8u282 or newer, as there are some bugs in older builds.

Size your cluster correctly! https://docs.cloudera.com/cfm/2.1.1/nifi-sizing/topics/cfm-sizing-recommendations.html. Make sure you have at least 3 nodes.





References




Populating Your Secure Cloud Data Estates


Hydrating Your Clean Cloud Data Lake


I am hard pressed to keep up with the data store + query terminology du jour. Was it Data Lakehouse? All these giant bodies of water, mostly stored in buckets (S3)? I agree there are lots of nuances and many different query engines on top of those various ways of storing that data. I don't think that every time we add a twist we need to pile increasingly silly terms on top. Is it to confuse users? Developers? Data engineers? Companies? Executives? Perhaps if we change our data warehouse's name again we can get them to buy the same thing again.

Clearly it can't be one size fits all for all these different things. I know a lot of companies of various types and sizes, and most don't approach the data volumes that companies like Netflix and LinkedIn have. I really like their innovation, but often those projects get released and then wither in obscurity.

A few projects look really good:


For me, if I can do the basic CRUD operations that applications, reports, dashboards and queries require, then it works for me. With Apache NiFi, Apache Kafka, Apache Spark and Apache Flink supporting a data store, it should be good.

The one thing I have to be wary of is that data stores like Apache Kudu, Apache HBase and HDFS have been around for a long time and have had many of the production-killing bugs flushed out of them, with support from multiple companies and robust open source Apache communities around them. If a new project doesn't have that, it won't survive, won't get traction, or will just sit out there orphaned. Let's build on what we have and try not to have a million half-supported projects that are often abandoned or of unknown status.

Apache Parquet and Apache ORC have shown themselves to be really solid, and having engines like Apache Hive and Apache Impala to query them is really important. Apache Ozone is looking very interesting for when object stores are not available. http://ozone.apache.org/







Cloudera SQL Stream Builder (SSB) - Update Your FLaNK Stack

Cloudera SQL Stream Builder (SSB) Released!

CSA 1.3.0 is now available with Apache Flink 1.12 and SQL Stream Builder!   Check out this white paper for some details.    You can get full details on the Stream Processing and Analytics available from Cloudera here.


This is an awesome way to query Kafka topics with continuous SQL that is deployed to scalable Flink nodes on YARN or K8s. We can also easily define functions in JavaScript to enhance, enrich and augment our data streams. No Java to write, no heavy deploys or build scripts; we can build, test and deploy these advanced streaming applications all from a secure browser interface.


References:



Example Queries:


SELECT location, max(temp_f) as max_temp_f, avg(temp_f) as avg_temp_f,
                 min(temp_f) as min_temp_f
FROM weather2 
GROUP BY location


SELECT HOP_END(eventTimestamp, INTERVAL '1' SECOND, INTERVAL '30' SECOND) as windowEnd,
       count(`close`) as closeCount,
       sum(cast(`close` as float)) as closeSum,
       avg(cast(`close` as float)) as closeAverage,
       min(`close`) as closeMin,
       max(`close`) as closeMax,
       sum(case when `close` > 14 then 1 else 0 end) as stockGreaterThan14
FROM stocksraw
WHERE symbol = 'CLDR'
GROUP BY HOP(eventTimestamp, INTERVAL '1' SECOND, INTERVAL '30' SECOND)

SELECT scada2.uuid, scada2.systemtime, scada2.temperaturef, scada2.pressure,
       scada2.humidity, scada2.lux, scada2.proximity, scada2.oxidising,
       scada2.reducing, scada2.nh3, scada2.gasko, energy2.`current`,
       energy2.voltage, energy2.`power`, energy2.`total`, energy2.fanstatus
FROM energy2 JOIN scada2 ON energy2.systemtime = scada2.systemtime


SELECT symbol, uuid, ts, dt, `open`, `close`, high, volume, `low`, `datetime`, 'new-high' message, 
'nh' alertcode, CAST(CURRENT_TIMESTAMP AS BIGINT) alerttime 
FROM stocksraw st 
WHERE symbol is not null 
AND symbol <> 'null' 
AND trim(symbol) <> '' and 
CAST(`close` as DOUBLE) > 
(SELECT MAX(CAST(`close` as DOUBLE))
FROM stocksraw s 
WHERE s.symbol = st.symbol);
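The correlated subquery above flags a close that beats the historical maximum for its symbol. In a streaming setting, the same idea can be sketched as a running per-symbol maximum (the sample ticks are illustrative):

```python
# Track the running max close per symbol; yield an alert when it is beaten,
# mirroring the 'new-high'/'nh' alert rows produced by the SQL above.
def new_high_alerts(ticks):
    """ticks: iterable of (symbol, close); yields (symbol, close) on new highs."""
    highs = {}
    for symbol, close in ticks:
        if symbol in highs and close > highs[symbol]:
            yield symbol, close
        highs[symbol] = max(highs.get(symbol, close), close)

alerts = list(new_high_alerts([("CLDR", 14.0), ("CLDR", 14.5), ("CLDR", 14.1)]))
# alerts holds the single new high: [("CLDR", 14.5)]
```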



SELECT  * 
FROM statusevents
WHERE lower(description) like '%fail%'



SELECT
  sensor_id as device_id,
  HOP_END(sensor_ts, INTERVAL '1' SECOND, INTERVAL '30' SECOND) as windowEnd,
  count(*) as sensorCount,
  sum(sensor_6) as sensorSum,
  avg(cast(sensor_6 as float)) as sensorAverage,
  min(sensor_6) as sensorMin,
  max(sensor_6) as sensorMax,
  sum(case when sensor_6 > 70 then 1 else 0 end) as sensorGreaterThan60
FROM iot_enriched_source
GROUP BY
  sensor_id,
  HOP(sensor_ts, INTERVAL '1' SECOND, INTERVAL '30' SECOND)
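The HOP windows in these queries slide every 1 second over a 30-second span, so each event lands in up to 30 overlapping windows. A toy batch sketch of that windowing (window sizes and sample events are illustrative; Flink of course computes this incrementally over the stream):

```python
# Toy hopping-window aggregation: window size 30s, slide 1s, mirroring
# HOP(sensor_ts, INTERVAL '1' SECOND, INTERVAL '30' SECOND) above.
def hop_windows(events, size=30, slide=1):
    """events: list of (ts_seconds, value); yields (window_end, values)."""
    if not events:
        return
    first = min(t for t, _ in events)
    last = max(t for t, _ in events)
    w = first + slide
    while w <= last + size:
        values = [v for t, v in events if w - size <= t < w]
        if values:
            yield w, values
        w += slide

windows = dict(hop_windows([(0, 68), (10, 72), (35, 75)]))
# the window ending at t=30 contains both early readings: [68, 72]
```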



SELECT title, description, pubDate, `point`, `uuid`, `ts`, eventTimestamp
FROM transcomevents


Source Code:



Example SQL Stream Builder Run

We log in, then build our Kafka data source(s), unless they were predefined.

Next we build a few virtual table sources for the Kafka topics we are going to read from. If they are JSON, we can let SSB determine the schema for us. Or we can connect to the Cloudera Schema Registry to determine the schema for Avro data.
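Conceptually, detecting a schema from a JSON sample is a walk over the value types. A toy sketch of the idea (this is not SSB's actual implementation, just an illustration of the concept):

```python
import json

# Toy JSON-to-SQL-type inference, illustrating what "let SSB determine
# the schema" does conceptually; SSB's real detection is far more complete.
def infer_schema(sample_json):
    doc = json.loads(sample_json)

    def sql_type(v):
        if isinstance(v, bool):      # check bool before int: bool is an int subclass
            return "BOOLEAN"
        if isinstance(v, int):
            return "BIGINT"
        if isinstance(v, float):
            return "DOUBLE"
        return "STRING"

    return {k: sql_type(v) for k, v in doc.items()}

schema = infer_schema('{"symbol": "CLDR", "close": 14.5, "volume": 1200}')
# schema == {'symbol': 'STRING', 'close': 'DOUBLE', 'volume': 'BIGINT'}
```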

We can then define virtual table sinks to Kafka or webhooks.

We then run a SQL query with some easy-to-determine parameters, and if we like the results we can create a materialized view.