CFM 2.1.1 on CDP 7.1.6

There is a new Cloudera release of Apache NiFi now with SAML support.

Apache NiFi 1.13.2.2.1.1.0

Apache NiFi Registry 0.8.0.2.1.1.0

See: https://blog.cloudera.com/the-new-releases-of-apache-nifi-in-public-cloud-and-private-cloud/

https://docs.cloudera.com/cfm/2.1.1/release-notes/topics/cfm-component-support.html

https://docs.cloudera.com/cfm/2.1.1/release-notes/topics/cfm-whats-new.html

https://docs.cloudera.com/cfm/2.1.1/upgrade-paths/topics/cfm-upgrade-paths.html

For changes: https://www.datainmotion.dev/2021/02/new-features-of-apache-nifi-1130.html

Get your download on: https://docs.cloudera.com/cfm/2.1.1/download/topics/cfm-download-locations.html

To start researching for the future, take a look at some of the technical preview features around Easy Rules engine and handlers.

https://docs.cloudera.com/cfm/2.1.1/release-notes/topics/cfm-technical-preview.html

Make sure you use the latest possible JDK 8 as there are some bugs out there. Use a recent version of the JDK like 8u282 or newer.

Size your cluster correctly! https://docs.cloudera.com/cfm/2.1.1/nifi-sizing/topics/cfm-sizing-recommendations.html. Make sure you have at least 3 nodes.

References

https://github.com/tspannhw/EverythingApacheNiFi

https://www.datainmotion.dev/2020/06/no-more-spaghetti-flows.html

https://www.datainmotion.dev/2019/10/apache-nifi-load-balancing-via-load.html

Populating Your Secure Cloud Data Estates

Hydrating Your Clean Cloud Data Lake

I am hard pressed to keep up with Data Store + Query terminology du jour. Was it Data Lake House? All these giant bodies of water mostly stored in buckets (S3)? I agree there are lots of nuances and many different query engines on top of those various means for storing that data. I don't think everytime we add a twist we need to add increasingly silly terms on top. Is it to confuse users? developers? data engineers? companies? executives? Perhaps if we change our data warehouse name again we can get them to buy the same thing again.

Clearly it can't be one size fit all for all this different things? I know a lot of companies of various types and sizes and most don't approach the size of the data that companies like Netflix and LinkedIn have. I really like their innovation, but often those projects get released and then wither in obscurity.

A few projects look really good:

Apache Iceberg - I have a good feeling on this one. https://conferences.oreilly.com/strata/strata-ny-2018/public/schedule/detail/69503.html https://thenewstack.io/apache-iceberg-a-different-table-design-for-big-data/
Apache Hudi https://hudi.apache.org/

For me, if I can do the basic CRUD operations that applications, reports, dashboards and queries require then it works for me. With Apache NiFi, Apache Kafka, Apache Spark and Apache Flink supporting a data store then it is should be good. The one thing I have to be wary of is that datastores like Apache Kudu, Apache HBase and HDFS have been around for a long time and have many of the production killing bugs flushed out of it, multiple company support and robust Open Source Apache communities around them. If a new project doesn't it won't survive, get traction or will just sit out there orphaned. Let's build on what we have and try not to have a million half supported projects that are often abandoned or of unknown status. Apache Parquet and Apache ORC have shown themselves as really solid and having engines like Apache Hive and Apache Impala to query them is really important. Apache Ozone is looking very interesting for when Object Stores are not available. http://ozone.apache.org/

Cloudera SQL Stream Builder (SSB) - Update Your FLaNK Stack

Cloudera SQL Stream Builder (SSB) Released!

CSA 1.3.0 is now available with Apache Flink 1.12 and SQL Stream Builder! Check out this white paper for some details. You can get full details on the Stream Processing and Analytics available from Cloudera here.

This is awesome way to query Kafka topics with continuous SQL that is deployed to scalable Flink nodes in YARN or K8. We can also easily define functions in JavaScript to enhance, enrich and augment our data streams. No Java to write, no heavy deploys or build scripts, we can build, test and deploy these advanced streaming applications all from your secure browser interface.

References:

Example Queries:

SELECT location, max(temp_f) as max_temp_f, avg(temp_f) as avg_temp_f,
min(temp_f) as min_temp_f

FROM weather2

GROUP BY location

SELECT HOP_END(eventTimestamp, INTERVAL '1' SECOND, INTERVAL '30' SECOND) as windowEnd,
count(`close`) as closeCount,

sum(cast(`close` as float)) as closeSum, avg(cast(`close` as float)) as closeAverage,
min(`close`) as closeMin,

max(`close`) as closeMax,

sum(case when `close` > 14 then 1 else 0 end) as stockGreaterThan14

FROM stocksraw
WHERE symbol = 'CLDR'
GROUP BY HOP(eventTimestamp, INTERVAL '1' SECOND, INTERVAL '30' SECOND)

SELECT scada2.uuid, scada2.systemtime, scada2.temperaturef, scada2.pressure, scada2.humidity, scada2.lux, scada2.proximity,

scada2.oxidising,scada2.reducing , scada2.nh3, scada2.gasko,energy2.`current`,

energy2.voltage,energy2.`power`,energy2.`total`,energy2.fanstatus

FROM energy2 JOIN scada2 ON energy2.systemtime = scada2.systemtime

SELECT symbol, uuid, ts, dt, `open`, `close`, high, volume, `low`, `datetime`, 'new-high' message,

'nh' alertcode, CAST(CURRENT_TIMESTAMP AS BIGINT) alerttime

FROM stocksraw st

WHERE symbol is not null

AND symbol <> 'null'

AND trim(symbol) <> '' and

CAST(`close` as DOUBLE) >

(SELECT MAX(CAST(`close` as DOUBLE))
FROM stocksraw s

WHERE s.symbol = st.symbol);

SELECT *

FROM statusevents

WHERE lower(description) like '%fail%'

SELECT

sensor_id as device_id,

HOP_END(sensor_ts, INTERVAL '1' SECOND, INTERVAL '30' SECOND) as windowEnd,

count(*) as sensorCount,

sum(sensor_6) as sensorSum,

avg(cast(sensor_6 as float)) as sensorAverage,

min(sensor_6) as sensorMin,

max(sensor_6) as sensorMax,

sum(case when sensor_6 > 70 then 1 else 0 end) as sensorGreaterThan60

FROM iot_enriched_source

GROUP BY

sensor_id,

HOP(sensor_ts, INTERVAL '1' SECOND, INTERVAL '30' SECOND)

SELECT title, description, pubDate, `point`, `uuid`, `ts`, eventTimestamp

FROM transcomevents

Source Code:

Example SQL Stream Builder Run

We login then build our Kafka data source(s), unless they were predefined.

Next we build a few virtual table sources for Kafka topics we are going to read from. If they are JSON we can let SSB determine the schema for us. Or we can connect to the Cloudera Schema Registry for it to determine the schema for AVRO data.

We can then define virtual table syncs to Kafka or webhooks.

We then run a SQL query with some easy to determine parameters and if we like the results we can create a materialized view.