Using GrovePi with Raspberry Pi and MiNiFi Agents for Data Ingest to Parquet, Kudu, ORC, Kafka, Hive and Impala



Source Code:  https://github.com/tspannhw/minifi-grove-sensors

Acquiring sensor data from Grove sensors is easy using a GrovePi Hat and some compatible sensors.


Just before my talk at the Future of Data Meetup @ Bell Works in Holmdel, NJ, I thought I should ingest some data from a Grove sensor interface.

It's so easy a sleeping cat could do it.




So what does this device look like?  



I have a temperature and humidity sensor on there.




The ultrasonic distance sensor is on there too; that's for the next article.




Let's do this with minimal RAM.




That's a 64GB hard drive underneath in the white case with the RPI.





I need more data and BACON.



We design our MiNiFi agent flow in CEM/EFM: run the sensors and grab the JSON data stream.


Apache NiFi 1.9.2 / CFM 1.0 Received HTTPS S2S Events From MiNiFi Agent




A simple flow to query and convert our JSON data, then store it to Kudu and HDFS (ORC) as well as push it to Kafka with a schema.
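The "query" step in a flow like this is usually a QueryRecord processor, which runs Apache Calcite SQL against the incoming records (the flowfile is exposed as a table named FLOWFILE). A statement along these lines, with field names from our sensor JSON, would drop empty readings before the data is stored:

-- illustrative QueryRecord SQL; FLOWFILE is the incoming record set
SELECT *
FROM FLOWFILE
WHERE temperature IS NOT NULL
  AND humidity IS NOT NULL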




Let's read that Kafka message and store it as Parquet; we will push to MQTT and JMS in the next article.   This is our universal proxy/gateway.



We could infer a schema and not save it, but saving the schema to the Schema Registry makes SMM, Kafka, NiFi and other tools schema aware, and lets us automagically query and convert between CSV/JSON/XML/AVRO/Parquet and more.
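A sketch of what that saved schema could look like in Avro (field names and types mirror the table DDL later in this article; adjust nullability and defaults to taste):

{
  "type": "record",
  "name": "grovesensor",
  "doc": "Sketch only - fields mirror the Grove sensor table DDL in this article",
  "fields": [
    {"name": "uuid", "type": "string"},
    {"name": "systemtime", "type": "string"},
    {"name": "end", "type": "string"},
    {"name": "temperature", "type": "string"},
    {"name": "humidity", "type": "string"},
    {"name": "cpu", "type": "double"},
    {"name": "memory", "type": "double"},
    {"name": "diskusage", "type": "string"},
    {"name": "runtime", "type": "string"},
    {"name": "te", "type": "string"},
    {"name": "id", "type": "string"},
    {"name": "host", "type": "string"},
    {"name": "host_name", "type": "string"},
    {"name": "macaddress", "type": "string"},
    {"name": "ipaddress", "type": "string"}
  ]
}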

Let's store the data in Parquet files on HDFS with an Impala table (the DDL is below).   In Apache NiFi 1.10 there is a ParquetWriter.



Before we push to Kafka, let's create a topic for it with Cloudera SMM.



Let's build an Impala table for that Kudu data (see the Table DDL section below).



We can query our tables with ease as data is rapidly added.
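For example, a quick look at some readings in the Kudu-backed grovesensors table (defined in the Table DDL section below):

-- sample Impala query against the Kudu table; pick whichever columns you need
SELECT uuid, systemtime, temperature, humidity, cpu, memory
FROM grovesensors
LIMIT 10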





Let's Examine the Parquet Files that NiFi Generated





Let's query that Parquet data with Impala in Hue
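A sketch of the kind of aggregate you might run over the grove_parquet table (DDL below):

-- average resource usage per host across the Grove sensor readings
SELECT host_name,
       AVG(cpu)    AS avg_cpu,
       AVG(memory) AS avg_memory,
       COUNT(*)    AS readings
FROM grove_parquet
GROUP BY host_name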



 Let's monitor that data in Kafka with Cloudera SMM






That was easy: from device to enterprise cloud data store(s) with enterprise messaging, security, governance, lineage, data catalog, SDX, monitoring and more.   How easily can you ingest IoT data, query it mid stream and store it in multiple data stores?   It took longer to write the article than to do the project and code.   All graphical, Single Sign On, multiple schemas/versions/data types/engines, multiple OSs, edge, cloud and laptop.   Easy.

Table DDL


CREATE EXTERNAL TABLE IF NOT EXISTS grovesensors2 
(humidity STRING, uuid STRING, systemtime STRING, runtime STRING, cpu DOUBLE, id STRING, te STRING, host STRING, `end` STRING, 
macaddress STRING, temperature STRING, diskusage STRING, memory DOUBLE, ipaddress STRING, host_name STRING) 
STORED AS ORC
LOCATION '/tmp/grovesensors'

CREATE TABLE grovesensors ( uuid STRING,  `end` STRING,humidity STRING, systemtime STRING, runtime STRING, cpu DOUBLE, id STRING, te STRING, 
host STRING,
macaddress STRING, temperature STRING, diskusage STRING, memory DOUBLE, ipaddress STRING, host_name STRING,
PRIMARY KEY (uuid, `end`)
)
PARTITION BY HASH PARTITIONS 16
STORED AS KUDU
TBLPROPERTIES ('kudu.num_tablet_replicas' = '1')

hdfs dfs -mkdir -p /tmp/grovesensors
hdfs dfs -mkdir -p /tmp/groveparquet

CREATE EXTERNAL TABLE grove_parquet 
(
  diskusage STRING,
  memory DOUBLE,
  host_name STRING,
  systemtime STRING,
  macaddress STRING,
  temperature STRING,
  humidity STRING,
  cpu DOUBLE,
  uuid STRING,
  ipaddress STRING,
  host STRING,
  `end` STRING,
  te STRING,
  runtime STRING,
  id STRING
)
STORED AS PARQUET
LOCATION '/tmp/groveparquet/'

Parquet Format



message org.apache.nifi.grove {
  optional binary diskusage (STRING);
  optional double memory;
  optional binary host_name (STRING);
  optional binary systemtime (STRING);
  optional binary macaddress (STRING);
  optional binary temperature (STRING);
  optional binary humidity (STRING);
  optional double cpu;
  optional binary uuid (STRING);
  optional binary ipaddress (STRING);
  optional binary host (STRING);
  optional binary end (STRING);
  optional binary te (STRING);
  optional binary runtime (STRING);
  optional binary id (STRING);
}

References







Migrating Apache Flume Flows to Apache NiFi: Kafka Source to HTTP REST Sink and HTTP REST Source to Kafka Sink



This is a simple use case of NiFi being a gateway between a REST API and Kafka.   We can do a lot more than that in NiFi.  We can be a Kafka consumer and producer, as well as POST REST calls and receive REST calls on configurable ports.  All with No Code.

NiFi can act as a listener for HTTP requests and provide HTTP responses as a scriptable full web server mechanism with Jetty.   Or it can listen for HTTP REST calls on a port and route those files anywhere: https://community.cloudera.com/t5/Community-Articles/Parsing-Web-Pages-for-Images-with-Apache-NiFi/ta-p/248415.  We can also do WebSockets: https://community.cloudera.com/t5/Community-Articles/An-Example-WebSocket-Application-in-Apache-NiFi-1-1/ta-p/248598 and https://community.cloudera.com/t5/Community-Articles/Accessing-Feeds-from-EtherDelta-on-Trades-Funds-Buys-and/ta-p/248316

It is extremely easy to do this in NiFi.




Kafka Consumer to REST POST



HTTP REST to Kafka Producer




Full Monitoring on Apache NiFi






A Very Common Use Case:  Ingesting Stock Feeds From REST to Kafka



References

Migrating Apache Flume Flows to Apache NiFi: SYSLOG to KAFKA





This is a simple use case of NiFi being a smart gateway/proxy between syslog and Kafka.   We can do a lot more than that in NiFi.  We can be a Kafka consumer and producer, as well as read and parse all types of logs including syslog.  We have a GrokReader for converting semi-structured logs into manageable, tabular-style data with schemas:   Log -> JSON/CSV/AVRO/PARQUET.  All with No Code.

It is extremely easy to do this in NiFi.

See:  Log Parsing:   https://www.datainmotion.dev/2019/08/migrating-apache-flume-flows-to-apache.html

Detailed Tutorials

https://blog.davidvassallo.me/2018/09/19/apache-nifi-from-syslog-to-elasticsearch/
https://community.cloudera.com/t5/Community-Articles/NiFi-Send-to-syslog/ta-p/248638
http://www.youritgoeslinux.com/impl/bigdata/nifi/syslog

Processors

https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.9.2/org.apache.nifi.processors.standard.ListenSyslog/

https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.9.2/org.apache.nifi.processors.standard.ParseSyslog/index.html



References

Migrating Apache Flume Flows to Apache NiFi: Twitter Source to Kafka Sink



Article 4 - this article.

This is a simple use case of pushing tweets to Kafka.   We can do a lot more than that in NiFi.   As the referenced article shows, I can do deep learning on tweet images, run sentiment analysis, query the tweets in stream, send messages to email/Slack based on certain criteria and retweet automagically.  All with No Code.

It is extremely easy to do this in NiFi.





Create a Kafka Topic To Send Tweets To


Example Tweet



Configure Your Connection to Twitter



Configure Kafka Producer



NiFi Flow to Send Two Twitter Sources To Kafka


Additional Apache NiFi Flow Details For Other Processing







Kafka Messages in Topic Sent From NiFi Producer 



Source Code


References

Migrating Apache Flume Flows to Apache NiFi: Kafka Source to Apache Parquet on HDFS



Article 3 - this article.

This is one possible simple, fast replacement for "Flafka".   I can read any or all Kafka topics, route and transform them with SQL, and store them as Apache ORC, Apache Avro, Apache Parquet, Apache Kudu, Apache HBase, JSON, CSV, XML or compressed files of many types in S3, Apache HDFS, file systems or anywhere else you want to stream this data in real time.   Also with a fast, easy-to-use web UI.   Everything you liked doing in Flume, but now easier and with more source and sink options.
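The "route and transform with SQL" piece is typically a QueryRecord processor. A statement like the one below (the level field is purely hypothetical - use whatever attributes your topic's records actually carry) would send only error events down one relationship:

-- hypothetical field name; substitute fields from your own Kafka records
SELECT *
FROM FLOWFILE
WHERE level = 'ERROR'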







Consume Kafka And Store to Apache Parquet


Kafka to Kudu, ORC, AVRO and Parquet 


With Apache NiFi 1.10 I can send those Parquet files anywhere, not only HDFS.


JSON (or CSV or AVRO or ...) and Parquet Out

In Apache NiFi 1.10, Parquet has a dedicated reader and writer.


Or I can use PutParquet



Create A Parquet Table and Query It
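As a sketch (the table and column names here are hypothetical - match them to the records your Kafka topic actually carries), the Impala side could look like this:

-- hypothetical table over the Parquet files NiFi lands on HDFS
CREATE EXTERNAL TABLE IF NOT EXISTS kafka_events_parquet
(event_id STRING, event_time STRING, payload STRING)
STORED AS PARQUET
LOCATION '/tmp/kafkaparquet'

SELECT COUNT(*) AS events FROM kafka_events_parquet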









References