Implementing Streaming Use Case From REST to Hive with Apache NiFi and Apache Kafka
Part 1
With Apache Kafka 2.0 and Apache NiFi 1.8 out, along with many new features and abilities, it's time to put them to the test.
To plan out what we are going to do, I have a high-level architecture diagram. We are going to ingest a number of sources, including REST feeds, social feeds, messages, images, documents and relational data.
We will ingest with NiFi, then filter, process and segment the data into Kafka topics. Kafka data will be in Apache Avro format, with schemas specified in the Hortonworks Schema Registry. Kafka Streams, Spark and NiFi will do additional event processing along with machine learning and deep learning. The data will be stored in Druid for real-time analytics and summaries, while Hive, HDFS and S3 will provide permanent storage. We will build dashboards with Superset and with Spark SQL + Zeppelin, and we will integrate machine learning with Spark ML, TensorFlow and Apache MXNet.
We will also push cleaned and aggregated data back to subscribers via Kafka and NiFi, delivering to Dockerized applications, message listeners, web clients, Slack channels and email mailing lists.
To be useful in our enterprise, we will have full authorization, authentication, auditing, data encryption and data lineage via Apache Ranger, Apache Atlas and Apache NiFi. NiFi Registry and GitHub will be used for source code control.
We will have administration capabilities via Apache Ambari.
An example server layout:
NiFi Flows
Real-time free stock data is available from IEX with no license key. The data streams in very fast; thankfully, that's no issue for Apache NiFi and Kafka.
We consume the different records from topics and store them to HDFS in separate directories and tables.

Let's split one big REST file into individual records of interest. Our REST feed has quote, chart and news arrays.
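As a rough sketch of what that split does (in the actual flow, NiFi does this with JsonPath splits such as $.*.quote), here is a Python version over a hypothetical sample shaped like the IEX market batch response:

```python
import json

# Hypothetical sample shaped like the IEX market batch response:
# one entry per symbol, each holding "quote", "chart" and "news" sections.
batch = json.loads("""
{
  "HDP": {
    "quote": {"symbol": "HDP", "latestPrice": 15.76},
    "chart": [{"date": "2018-11-12", "close": 15.76}],
    "news":  [{"headline": "Example headline"}]
  }
}
""")

# Equivalent of the JsonPath splits $.*.quote, $.*.chart and $.*.news:
quotes = [entry["quote"] for entry in batch.values()]
charts = [row for entry in batch.values() for row in entry["chart"]]
news = [item for entry in batch.values() for item in entry["news"]]
```

Each of the three lists then maps naturally onto its own Kafka topic.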
Let's Push Some Messages to Slack
We can easily consume from multiple topics in Apache NiFi.
Since we have schemas, querying the data is easy even while it's in motion.
We create schemas for each of our Kafka topics.
We can monitor all these messages going through Kafka in Ambari (and also in much better detail in Hortonworks SMM).
I read in data and can then push it to Kafka 1.0 and 2.0 brokers.
Once data is sent, NiFi lets us know.
Projects Used
  • Apache Kafka
  • Apache Kafka Streams
  • Apache MXNet
  • NLTK
  • Stanford CoreNLP
  • Apache OpenNLP
  • TextBlob
  • SpaCy
  • Apache NiFi
  • Apache Druid
  • Apache Hive on Kafka
  • Apache Hive on Druid
  • Apache Hive on JDBC
  • Apache Zeppelin
  • Hortonworks Schema Registry
  • NiFi Registry
  • Apache Ambari
  • Log Search
  • Hortonworks SMM
  • Hortonworks Data Plane Services (DPS)
Sources
  • REST
  • Twitter
  • JDBC
  • Sensors
  • MQTT
  • Documents

Sinks
  • Apache Hadoop HDFS
  • Apache Kafka
  • Apache Hive
  • Slack
  • S3
  • Apache Druid
  • Apache HBase
Topics
  • iextradingnews
  • iextradingquote
  • iextradingchart
  • stocks
  • cyber
HDFS Directories

hdfs dfs -mkdir -p /iextradingnews
hdfs dfs -mkdir -p /iextradingquote
hdfs dfs -mkdir -p /iextradingchart
hdfs dfs -mkdir -p /stocks
hdfs dfs -mkdir -p /cyber
hdfs dfs -chmod -R 777 /

PutHDFS
  • /${kafka.topic}
  • /iextradingchart/859496561256574.orc
  • /iextradingnews/855935960267509.orc
  • /iextradingquote/859143934804532.orc
Hive Tables

CREATE EXTERNAL TABLE IF NOT EXISTS iextradingchart (`date` STRING, open DOUBLE, high DOUBLE, low DOUBLE, close DOUBLE, volume INT, unadjustedVolume INT, change DOUBLE, changePercent DOUBLE, vwap DOUBLE, label STRING, changeOverTime INT)
STORED AS ORC
LOCATION '/iextradingchart';

CREATE EXTERNAL TABLE IF NOT EXISTS iextradingquote (symbol STRING, companyName STRING, primaryExchange STRING, sector STRING, calculationPrice STRING, open DOUBLE, openTime BIGINT, close DOUBLE, closeTime BIGINT, high DOUBLE, low DOUBLE, latestPrice DOUBLE, latestSource STRING, latestTime STRING, latestUpdate BIGINT, latestVolume INT, iexRealtimePrice DOUBLE, iexRealtimeSize INT, iexLastUpdated BIGINT, delayedPrice DOUBLE, delayedPriceTime BIGINT, extendedPrice DOUBLE, extendedChange DOUBLE, extendedChangePercent DOUBLE, extendedPriceTime BIGINT, previousClose DOUBLE, change DOUBLE, changePercent DOUBLE, iexMarketPercent DOUBLE, iexVolume INT, avgTotalVolume INT, iexBidPrice INT, iexBidSize INT, iexAskPrice INT, iexAskSize INT, marketCap INT, peRatio DOUBLE, week52High DOUBLE, week52Low DOUBLE, ytdChange DOUBLE)
STORED AS ORC
LOCATION '/iextradingquote';

CREATE EXTERNAL TABLE IF NOT EXISTS iextradingnews (`datetime` STRING, headline STRING, source STRING, url STRING, summary STRING, related STRING, image STRING)
STORED AS ORC
LOCATION '/iextradingnews';


Schemas

iextradingchart:

{ "type": "record", "name": "iextradingchart", "fields": [ { "name": "date", "type": [ "string", "null" ] }, { "name": "open", "type": [ "double", "null" ] }, { "name": "high", "type": [ "double", "null" ] }, { "name": "low", "type": [ "double", "null" ] }, { "name": "close", "type": [ "double", "null" ] }, { "name": "volume", "type": [ "int", "null" ] }, { "name": "unadjustedVolume", "type": [ "int", "null" ] }, { "name": "change", "type": [ "double", "null" ] }, { "name": "changePercent", "type": [ "double", "null" ] }, { "name": "vwap", "type": [ "double", "null" ] }, { "name": "label", "type": [ "string", "null" ] }, { "name": "changeOverTime", "type": [ "int", "null" ] } ]}

iextradingquote:

{ "type": "record", "name": "iextradingquote", "fields": [ { "name": "symbol", "type": [ "string", "null" ], "doc": "Type inferred from '\"HDP\"'" }, { "name": "companyName", "type": [ "string", "null" ], "doc": "Type inferred from '\"Hortonworks Inc.\"'" }, { "name": "primaryExchange", "type": [ "string", "null" ], "doc": "Type inferred from '\"Nasdaq Global Select\"'" }, { "name": "sector", "type": [ "string", "null" ], "doc": "Type inferred from '\"Technology\"'" }, { "name": "calculationPrice", "type": [ "string", "null" ], "doc": "Type inferred from '\"close\"'" }, { "name": "open", "type": [ "double", "null" ], "doc": "Type inferred from '16.3'" }, { "name": "openTime", "type": [ "long", "null" ], "doc": "Type inferred from '1542033000568'" }, { "name": "close", "type": [ "double", "null" ], "doc": "Type inferred from '15.76'" }, { "name": "closeTime", "type": [ "long", "null" ], "doc": "Type inferred from '1542056400520'" }, { "name": "high", "type": [ "double", "null" ], "doc": "Type inferred from '16.37'" }, { "name": "low", "type": [ "double", "null" ], "doc": "Type inferred from '15.2'" }, { "name": "latestPrice", "type": [ "double", "null" ], "doc": "Type inferred from '15.76'" }, { "name": "latestSource", "type": [ "string", "null" ], "doc": "Type inferred from '\"Close\"'" }, { "name": "latestTime", "type": [ "string", "null" ], "doc": "Type inferred from '\"November 12, 2018\"'" }, { "name": "latestUpdate", "type": [ "long", "null" ], "doc": "Type inferred from '1542056400520'" }, { "name": "latestVolume", "type": [ "int", "null" ], "doc": "Type inferred from '4012339'" }, { "name": "iexRealtimePrice", "type": [ "double", "null" ], "doc": "Type inferred from '15.74'" }, { "name": "iexRealtimeSize", "type": [ "int", "null" ], "doc": "Type inferred from '43'" }, { "name": "iexLastUpdated", "type": [ "long", "null" ], "doc": "Type inferred from '1542056397411'" }, { "name": "delayedPrice", "type": [ "double", "null" ], "doc": "Type inferred from '15.76'" }, { "name": "delayedPriceTime", "type": [ "long", "null" ], "doc": "Type inferred from '1542056400520'" }, { "name": "extendedPrice", "type": [ "double", "null" ], "doc": "Type inferred from '15.85'" }, { "name": "extendedChange", "type": [ "double", "null" ], "doc": "Type inferred from '0.09'" }, { "name": "extendedChangePercent", "type": [ "double", "null" ], "doc": "Type inferred from '0.00571'" }, { "name": "extendedPriceTime", "type": [ "long", "null" ], "doc": "Type inferred from '1542059622726'" }, { "name": "previousClose", "type": [ "double", "null" ], "doc": "Type inferred from '16.24'" }, { "name": "change", "type": [ "double", "null" ], "doc": "Type inferred from '-0.48'" }, { "name": "changePercent", "type": [ "double", "null" ], "doc": "Type inferred from '-0.02956'" }, { "name": "iexMarketPercent", "type": [ "double", "null" ], "doc": "Type inferred from '0.03258'" }, { "name": "iexVolume", "type": [ "int", "null" ], "doc": "Type inferred from '130722'" }, { "name": "avgTotalVolume", "type": [ "int", "null" ], "doc": "Type inferred from '2042809'" }, { "name": "iexBidPrice", "type": [ "int", "null" ], "doc": "Type inferred from '0'" }, { "name": "iexBidSize", "type": [ "int", "null" ], "doc": "Type inferred from '0'" }, { "name": "iexAskPrice", "type": [ "int", "null" ], "doc": "Type inferred from '0'" }, { "name": "iexAskSize", "type": [ "int", "null" ], "doc": "Type inferred from '0'" }, { "name": "marketCap", "type": [ "int", "null" ], "doc": "Type inferred from '1317308142'" }, { "name": "peRatio", "type": [ "double", "null" ], "doc": "Type inferred from '-7.43'" }, { "name": "week52High", "type": [ "double", "null" ], "doc": "Type inferred from '26.22'" }, { "name": "week52Low", "type": [ "double", "null" ], "doc": "Type inferred from '15.2'" }, { "name": "ytdChange", "type": [ "double", "null" ], "doc": "Type inferred from '-0.25696247383444343'" } ]}
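To make the value of these schemas concrete, here is a minimal, stdlib-only Python sketch of the kind of type checking they enable. In the real flow, Avro plus the Hortonworks Schema Registry handles this; the field map below is abbreviated from the iextradingchart schema and the record values are hypothetical:

```python
# Abbreviated field/type map from the iextradingchart schema;
# every field is a union with "null", so None is always acceptable.
CHART_FIELDS = {
    "date": str,
    "open": float,
    "high": float,
    "low": float,
    "close": float,
    "volume": int,
}

def validates(record):
    """Check each known field against its expected type (None allowed)."""
    for name, expected in CHART_FIELDS.items():
        value = record.get(name)
        if value is not None and not isinstance(value, expected):
            return False
    return True

good = validates({"date": "2018-11-12", "open": 16.3, "close": 15.76})
bad = validates({"date": "2018-11-12", "volume": "not-a-number"})
```

With schemas enforced at the topic boundary, bad records are caught before they reach Hive or Druid.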


Messages to Slack
File: ${'filename'}
Offset: ${'kafka.offset'}
Partition: ${'kafka.partition'}
Topic: ${'kafka.topic'}
UUID: ${'uuid'}
Record Count: ${'record.count'}
File Size: ${fileSize:divide(1024)}K
Splits
  • $.*.quote
  • $.*.chart
  • $.*.news

Array to Single
$.*
GetHTTP
URL
FileName
marketbatch.hdp.${'hdp':append(${now():format('yyyyMMddHHmmss'):append(${md5}):append('.json')})}
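For illustration, a Python sketch of a timestamped, hashed filename like the one that expression builds. The payload bytes are hypothetical, and in NiFi the ${md5} attribute is assumed to be set by an upstream content-hashing processor:

```python
import hashlib
from datetime import datetime

content = b"example flowfile content"   # hypothetical flowfile payload
md5 = hashlib.md5(content).hexdigest()  # stands in for NiFi's ${md5} attribute
timestamp = datetime.now().strftime("%Y%m%d%H%M%S")  # year-month-day-hour-minute-second
filename = f"marketbatch.hdp.hdp{timestamp}{md5}.json"
```

The hash suffix keeps filenames unique even when two fetches land in the same second.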
Data provided for free by IEX. View IEX’s Terms of Use.
IEX Real-Time Price https://iextrading.com/developer/
Queries
SELECT * FROM FLOWFILE
WHERE latestPrice > week52Low
SELECT * FROM FLOWFILE
WHERE latestPrice <= week52Low
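The same two predicates, sketched in Python over hypothetical quote records (in the flow itself, the SQL above runs directly against the flowfile records, presumably via NiFi's QueryRecord processor):

```python
# Hypothetical records shaped like the iextradingquote schema.
records = [
    {"symbol": "HDP", "latestPrice": 15.76, "week52Low": 15.2},
    {"symbol": "XYZ", "latestPrice": 10.0, "week52Low": 12.0},
]

# WHERE latestPrice > week52Low
above_low = [r for r in records if r["latestPrice"] > r["week52Low"]]

# WHERE latestPrice <= week52Low
at_or_below = [r for r in records if r["latestPrice"] <= r["week52Low"]]
```

This is how we route stocks trading above their 52-week low differently from those at or below it.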

Example Output
File: 855957937589894
Offset: 22460
Partition: 0
Topic: iextradingquote
UUID: b2a8e797-2249-4689-9a78-4339ddb5ecb4
Record Count:
File Size: 3K
Data Visualization in Apache Zeppelin with Hive and Spark SQL
Creating tables on top of Apache ORC files in HDFS is easy.
Push Some Messages to Slack
Resources

Other Data Sources
Source

Apache NiFi Processor for Apache MXNet SSD: Single Shot MultiBox Object Detector (Deep Learning)

The news is out: Apache MXNet has added a Java API. As soon as I could, I got my hands on the Maven repo and an example program and got to work writing a new Apache NiFi processor for it.
I have run this on standalone Apache NiFi 1.8.0 and on HDF 3.3 (Apache NiFi 1.8.0), and both work. So anyone who wants to be an alpha tester, please download it and give it a try.
Apache MXNet SSD is a good example of a pretrained deep learning model that works pretty well for general images, especially in use cases involving people and cars. You can fine-tune it with more images and training runs: https://mxnet.incubator.apache.org/faq/finetune.html
The nice thing is that we can now start including Apache MXNet in Java applications such as Kafka Streams, Apache Storm, Apache Spark, Spring Boot and other Java use cases. I could potentially inject this into a Hive UDF (https://community.hortonworks.com/articles/39980/creating-a-hive-udf-in-java.html#comment-40026) or a Pig UDF; the performance may be fast enough. We now have four Java options for deep learning: DL4J, H2O, TensorFlow and Apache MXNet. Unfortunately, both the TensorFlow and MXNet Java APIs are not quite production ready.
I may do some further research on running MXNet as a Hive UDF, it would be cool to have in a query.
For those who don't want to setup a development environment with JDK 8+, Maven 3.3+ and git, you can download a pre-built nar file here: https://github.com/tspannhw/nifi-mxnetinference-processor/releases/tag/v1.0.
As part of the recent release of HDF 3.3, I have upgraded my OpenStack Centos 7 cluster.
Important Caveats
Notice: the Java API is in preview, and so is this processor. Do not use this in production! This is in development and I am the only one working on it. The Java API from Apache MXNet is in flux and will be changing. See the POM, as it is tied to the OSX/Mac version of the library; you will need to change that. You will also need to download the pre-built MXNet model and place it in a directory accessible to the Apache NiFi server/cluster. I am still cleaning up the rectangle code for identifying objects in the pictures.
As you will notice, my rectangle drawing is a bit off; I need to work on that.
Once you drop your built nar file and models in the nifi/lib directory and restart Apache NiFi, you can add it to your canvas.
We need to feed it some images. You can use my web cam processor, an image URL feed or local files.
To grab images from an HTTPS site, you need an SSL Context Service like this StandardSSLContextService below. You will need to point to the cacerts used by the JRE/JDK running your Apache NiFi node. The default password in Java is changeme. Hopefully you have changed it.
To configure my new processor, just put in the full path to the model directory and then "/resnet50_ssd_model" as that is the prefix for the model.
Our example flow with new processor being fed by traffic cameras, webcams, local files and local webcam.
Some output of our flow:
Our top 5 probabilities and labels
Example Data:

{
  "ymin_1" : "456.01",
  "ymin_5" : "159.29",
  "ymin_4" : "235.83",
  "ymin_3" : "206.64",
  "ymin_2" : "383.84",
  "label_5" : "person",
  "xmax_5" : "121.14",
  "label_4" : "bicycle",
  "xmax_4" : "137.89",
  "label_3" : "dog",
  "xmax_3" : "179.14",
  "ymax_1" : "150.66",
  "ymax_2" : "418.95",
  "ymax_3" : "476.79",
  "label_2" : "bicycle",
  "label_1" : "car",
  "probability_4" : "0.22",
  "probability_5" : "0.13",
  "probability_2" : "0.90",
  "xmin_5" : "88.93",
  "probability_3" : "0.82",
  "ymax_4" : "413.43",
  "probability_1" : "1.00",
  "ymax_5" : "190.04",
  "xmax_2" : "149.96",
  "xmax_1" : "72.03",
  "xmin_3" : "83.82",
  "xmin_4" : "93.05",
  "xmin_1" : "312.21",
  "xmin_2" : "155.96"
}
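The processor writes one flat attribute per field (label_N, probability_N and the four box corners). Downstream consumers will usually want per-detection records; a quick Python sketch of that regrouping, using values copied from the example output above (only two detections shown):

```python
# Flat attribute layout emitted by the processor, abbreviated to two detections.
flat = {
    "label_1": "car", "probability_1": "1.00",
    "xmin_1": "312.21", "ymin_1": "456.01", "xmax_1": "72.03", "ymax_1": "150.66",
    "label_2": "bicycle", "probability_2": "0.90",
    "xmin_2": "155.96", "ymin_2": "383.84", "xmax_2": "149.96", "ymax_2": "418.95",
}

# Regroup into one record per detection, with numeric fields parsed.
detections = []
for i in range(1, 3):
    detections.append({
        "label": flat[f"label_{i}"],
        "probability": float(flat[f"probability_{i}"]),
        "box": tuple(float(flat[f"{corner}_{i}"])
                     for corner in ("xmin", "ymin", "xmax", "ymax")),
    })
```

Grouped like this, the detections are ready for drawing rectangles or for routing on a probability threshold.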


Resources:
Source:
Maven POM (I used Java 8 and Maven 3.3.9)

<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <parent>
        <groupId>com.dataflowdeveloper.mxnet</groupId>
        <artifactId>inference</artifactId>
        <version>1.0</version>
    </parent>

    <artifactId>nifi-mxnetinference-processors</artifactId>
    <packaging>jar</packaging>

    <dependencies>
        <dependency>
            <groupId>org.apache.nifi</groupId>
            <artifactId>nifi-api</artifactId>
        </dependency>
        <dependency>
            <groupId>org.apache.nifi</groupId>
            <artifactId>nifi-utils</artifactId>
            <version>1.8.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.nifi</groupId>
            <artifactId>nifi-mock</artifactId>
            <version>1.8.0</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-simple</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.mxnet</groupId>
            <artifactId>mxnet-full_2.11-osx-x86_64-cpu</artifactId>
            <version>1.3.1-SNAPSHOT</version>
        </dependency>
    </dependencies>
</project>


I have moved from Eclipse to IntelliJ for my builds. I am looking at Apache NetBeans as well.

Data In Motion