
Posts

Migrating from Apache Storm to Apache Flink

The first thing to do is not just pick up an existing topology and dump it into a new system, but to look at what can be reconfigured, refactored or reimagined.

For some routing, transformation or simple ingest style applications (or parts of a solution) you may want to use Apache NiFi. For others, Spark or Spark Streaming can quickly meet your needs. For simple Thing-to-Kafka or Kafka-to-Thing flows, Kafka Connect is appropriate. For things that need to run on individual devices, containers or pods, you may want to move a small application to NiFi Stateless. Sometimes a simple Kafka Streams application will meet your needs. For many use cases you can replace a compiled application with some solid Flink SQL code (see the sketch below).

For some discussions, check this out. For some really good information on how to migrate Storm solutions to Flink, Cloudera has a well documented solution for you: https://docs.cloudera.com/csa/1.2.0/stormflink-migration/to
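As a hedged illustration of that last point, here is a minimal sketch of what a small compiled filter-and-forward topology might shrink down to in Flink SQL. The topic names, columns and connector settings are hypothetical placeholders, not taken from the post.

```sql
-- Minimal Flink SQL sketch: replace a compiled filter-and-forward streaming app.
-- Topic names, columns and connector settings are hypothetical examples.
CREATE TABLE transit_events (
  event_id STRING,
  event_type STRING,
  severity INT,
  event_time TIMESTAMP(3),
  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'transit-events',
  'properties.bootstrap.servers' = 'broker:9092',
  'scan.startup.mode' = 'latest-offset',
  'format' = 'json'
);

CREATE TABLE severe_events (
  event_id STRING,
  event_type STRING,
  severity INT,
  event_time TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'severe-transit-events',
  'properties.bootstrap.servers' = 'broker:9092',
  'format' = 'json'
);

-- The whole "application": continuously filter and forward events.
INSERT INTO severe_events
SELECT event_id, event_type, severity, event_time
FROM transit_events
WHERE severity >= 3;
```

The point is not this particular query, but that the source, sink and transformation all become declarative SQL instead of a compiled topology you have to build and deploy.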

Upcoming Demo Jam and The Latest Articles - 13 January 2020

Upcoming Demo Jam (Ask Live Questions of Pierre) - 21 Jan
https://www.cloudera.com/about/events/webinars/demo-jam-live-returns-build-data-flow-with-apache-nifi.html?utm_medium=tspann

Your Top 5 Apache NiFi Questions Answered By Experts
https://blog.cloudera.com/top-5-questions-about-apache-nifi/

Apache NiFi - The Data Movement Enabler in Hybrid Cloud
https://blog.cloudera.com/apache-nifi-the-data-movement-enabler-in-a-hybrid-cloud-environment/

Real-Time Transit Information Ingest
https://www.datainmotion.dev/2021/01/flank-real-time-transit-information-for.html

Using Kudu as a Cache for REST Streams
https://www.datainmotion.dev/2021/01/flank-using-apache-kudu-as-cache-for.html

Simple Change Data Capture with SQL
https://www.datainmotion.dev/2020/12/simple-change-data-capture-cdc-with-sql.html

Ingesting Websocket Data Live
https://www.datainmotion.dev/2020/12/ingesting-websocket-data-for-live-stock.html

Smart Stock Processing w

FLaNK: Real-Time Transit Information For NY/NJ/CT (TRANSCOM)

SOURCE: XML/RSS REST endpoint
FREQUENCY: Every minute
DESTINATIONS: HDFS, Kudu/Impala, Cloud, Kafka

The main source of these real-time transit updates for New Jersey, New York and Connecticut is TRANSCOM. I will read from this data source every minute to learn about real-time traffic events that are occurring on the roads and transportation systems near me. We will be reading the feed, which is in XML/RSS format, and parsing out the hundreds of events that come with each minute's update.

I want to store the raw XML/RSS file in S3/ADLS2/HDFS or GCS; that's an easy step. I will also parse and enhance this data for easier querying and tracking, adding a unique ID and a timestamp to every event as the data streams by. I will store my data in Impala/Kudu for fast queries and upserts (a hedged table sketch follows below). I can then build some graphs, charts and tables with Apache Hue and Cloudera Visual Applications. I will also publish my data
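A minimal sketch of the kind of Kudu-backed Impala table a flow like this could land in, plus an example upsert. The table name, columns and partitioning are hypothetical, chosen only to illustrate the unique ID plus timestamp pattern described above.

```sql
-- Hypothetical Impala DDL for a Kudu table keyed on the generated unique ID.
CREATE TABLE transcom_events (
  uuid STRING,
  event_ts BIGINT,          -- ingest timestamp added as the data streams by
  event_type STRING,
  latitude DOUBLE,
  longitude DOUBLE,
  description STRING,
  PRIMARY KEY (uuid)
)
PARTITION BY HASH (uuid) PARTITIONS 4
STORED AS KUDU;

-- Kudu supports UPSERT, so replaying the same event ID stays idempotent.
UPSERT INTO transcom_events
VALUES ('example-uuid-0001', 1610000000000, 'accident', 40.73, -74.17, 'Lane closed on I-95 NB');
```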

FLaNK: Using Apache Kudu As a Cache For FDA Updates

InvokeHTTP: Invoke the RSS feed to get our data.
QueryRecord: Convert RSS to JSON.
SplitJson: Split one file into individual records. (Should refactor to ForkRecord.)
EvaluateJSONPath: Extract the attributes you need.
ProcessGroup: SPL processing.

We call and check the RSS feed frequently, parse the records out of the RSS (XML) feed and check them against our cache. We use Cloudera's Real-Time Datahub with Apache Impala/Apache Kudu as our cache to see if we have already received a record for that SPL. If it hasn't arrived yet, it is added to the Kudu cache and processed.

SPL Processing

We use the extracted SPL to grab a detailed record for that SPL using InvokeHTTP. We have created an Impala/Kudu table to cache this information with Apache Hue. We use LookupRecord to read from our cache; if we don't have that value yet, we send it to the table for an UPSERT (a hedged sketch of the cache table and lookup is below). We send our raw data as XML/RSS to HDFS fo
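A hedged sketch of what the Kudu cache table and the check behind a LookupRecord-style step might look like. The table name spl_cache and its columns are hypothetical; the post does not give the actual schema.

```sql
-- Hypothetical cache table: one row per SPL set id we have already processed.
CREATE TABLE spl_cache (
  set_id STRING,
  first_seen_ts BIGINT,
  title STRING,
  PRIMARY KEY (set_id)
)
PARTITION BY HASH (set_id) PARTITIONS 2
STORED AS KUDU;

-- The cache check a LookupRecord-style step effectively performs:
-- a hit means we have seen this SPL before and can skip reprocessing it.
SELECT set_id FROM spl_cache WHERE set_id = 'example-set-id';

-- On a miss, record it so the next poll treats it as already processed.
UPSERT INTO spl_cache VALUES ('example-set-id', 1610000000000, 'Example drug label update');
```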

My Year in Review (2020)

Last year, I thought a few things would be coming in 2020.

What's Coming in 2020
Cloud Enterprise Data Platforms
Hybrid Cloud
Streaming with Flink, Kafka, NiFi
AI at the Edge with Microcontrollers and Small Devices
Voice Data In Queries
Event Handler as a Service (Automatic Kafka Message Reading)
More Powerful Parameter Based Modular Streaming
Cloud First For Big Data
Log Handling Moves to MiNiFi
Full AI At The Edge with Deployable Models
More Powerful Edge TPU/GPU/VPU
Kafka is everywhere
Open Source UI Driven Event Engines
FLaNK Stack gains popularity
FLINK Everywhere

Some of this was deferred due to the global pandemic, and I had a big miss with Voice Data, with some advances in streaming getting delayed. Here's a bit of new news for 2020: SRM was added to the Kafka DataHub in the Public Cloud: https://docs.cloudera.com/runtime/7.2.6/srm-overview/topics/srm-replication-overview.html I also just did a best-of-2020 video you can check out: Articles

Simple Change Data Capture (CDC) with SQL Selects via Apache NiFi (FLaNK)

Sometimes you need real CDC: you have access to transaction change logs, you use a tool like QLIK REPLICATE or GoldenGate to pump records out to Kafka, and then Flink SQL or NiFi can read and process them. Other times you need something easier, for just basic changes and inserts to some tables you are interested in receiving new data from as events. Apache NiFi can do this easily for you with QueryDatabaseTableRecord: you don't need to know anything but the database connection information, the table name and which field may change. NiFi will query, watch state and give you only the new records (a hedged sketch of the underlying query pattern is below). Nothing is hardcoded; parameterize those values and you have a generic Any-RDBMS-to-Any-Other-Store data pipeline. We are reading as records, which means each FlowFile in NiFi can carry thousands of records for which we know all the fields, types and schema-related information. This can be ones that NiFi infers the s
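A hedged sketch of the incremental query pattern that QueryDatabaseTableRecord effectively runs for you. NiFi builds and tracks this internally; the table name orders and the maximum-value column updated_at here are hypothetical examples, not from the post.

```sql
-- Hypothetical source table with a column NiFi can use as its "maximum value" column.
CREATE TABLE orders (
  order_id   BIGINT PRIMARY KEY,
  customer   VARCHAR(100),
  amount     DECIMAL(10,2),
  updated_at TIMESTAMP
);

-- First run: no state yet, so everything is fetched and max(updated_at) is remembered.
SELECT order_id, customer, amount, updated_at
FROM orders;

-- Subsequent runs: only rows newer than the stored state come back,
-- which is what turns plain SELECTs into simple change capture.
SELECT order_id, customer, amount, updated_at
FROM orders
WHERE updated_at > '2020-12-01 00:00:00'
ORDER BY updated_at;
```

The only thing you supply is the connection, the table and the column to watch; the incremental WHERE clause and the stored high-water mark are handled for you.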