
Using GrovePi with Raspberry Pi and MiNiFi Agents for Data Ingest to Parquet, Kudu, ORC, Kafka, Hive and Impala



Source Code:  https://github.com/tspannhw/minifi-grove-sensors

Acquiring sensor data from Grove sensors is easy using a GrovePi Hat and some compatible sensors.


Just before my talk at the Future of Data Meetup @ Bell Works in Holmdel, NJ, I thought I should ingest some data from a Grove sensor interface.

It's so easy a sleeping cat could do it.




So what does this device look like?  



I have a temperature and humidity sensor on there.




The ultrasonic distance sensor is on there too, but that's for the next article.




Let's do this with minimal RAM.




That's a 64GB hard drive underneath in the white case with the RPI.





I need more data and BACON.



We design our MiNiFi agent flow in CEM/EFM. The agent runs the sensors and grabs the resulting JSON data stream.
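As a rough idea of what "run sensors and grab JSON" can look like on the Pi, here is a minimal Python sketch. It is not the script from the repository: the port number, sensor type, timestamp format and the subset of fields are assumptions for illustration, and only the field names are taken from the Table DDL further down. The sensor reader is passed in as a parameter so the record-building part can be tried without the hardware.

```python
import json
import socket
import time
import uuid
from datetime import datetime, timezone


def read_dht_from_grovepi(port=4, module_type=0):
    """Read (temperature, humidity) from a Grove DHT sensor.

    Assumes the GrovePi Python library is installed and the sensor is
    plugged into digital port D4 (hypothetical wiring); module_type 0
    selects the blue DHT11 module.
    """
    import grovepi  # imported lazily so the rest of the file runs off-device

    temperature, humidity = grovepi.dht(port, module_type)
    return temperature, humidity


def build_record(read_sensor=read_dht_from_grovepi):
    """Build one reading as a dict, using column names from the DDL."""
    start = time.time()
    temperature, humidity = read_sensor()
    return {
        "uuid": str(uuid.uuid4()),
        # ISO-8601 timestamp is an assumption; the original format is not shown
        "systemtime": datetime.now(timezone.utc).isoformat(),
        "host": socket.gethostname(),
        # the tables store these two readings as STRING columns
        "temperature": str(temperature),
        "humidity": str(humidity),
        "runtime": str(round(time.time() - start, 3)),
    }


def record_as_json_line(record):
    """Serialize a record to a single JSON line for the agent to pick up."""
    return json.dumps(record)
```

On the device, printing `record_as_json_line(build_record())` would emit one JSON line per run, which a MiNiFi flow could capture from the script's standard output.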


Apache NiFi 1.9.2 / CFM 1.0 Received HTTPS S2S Events From MiNiFi Agent




A simple flow to query and convert our JSON data, then store it to Kudu and HDFS (ORC) as well as push it to Kafka with a schema.




Let's read that Kafka message and store it to Parquet. We will push to MQTT and JMS in the next article. This is our universal proxy/gateway.



We could infer a schema and not save it. But saving a schema to the schema registry makes SMM, Kafka, NiFi and others schema-aware, so it is easy to automagically query and convert between CSV/JSON/XML/AVRO/Parquet and more.
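For reference, a schema for this data might look like the sketch below, which builds an Avro record schema in Python and prints it as JSON. The field names, types and the `org.apache.nifi` / `grove` naming are read off the Parquet schema listed under Parquet Format. Making every field nullable with a null default is an assumption that mirrors the `optional` fields there, and the schema actually saved in the registry may differ.

```python
import json

# (name, Avro type) pairs in the order they appear in the Parquet schema
GROVE_FIELDS = [
    ("diskusage", "string"),
    ("memory", "double"),
    ("host_name", "string"),
    ("systemtime", "string"),
    ("macaddress", "string"),
    ("temperature", "string"),
    ("humidity", "string"),
    ("cpu", "double"),
    ("uuid", "string"),
    ("ipaddress", "string"),
    ("host", "string"),
    ("end", "string"),
    ("te", "string"),
    ("runtime", "string"),
    ("id", "string"),
]


def grove_avro_schema():
    """Return an Avro record schema (as a dict) with all fields nullable."""
    return {
        "type": "record",
        "name": "grove",
        "namespace": "org.apache.nifi",
        "fields": [
            {"name": name, "type": ["null", avro_type], "default": None}
            for name, avro_type in GROVE_FIELDS
        ],
    }


if __name__ == "__main__":
    # Pretty-print the schema text that could be pasted into a registry
    print(json.dumps(grove_avro_schema(), indent=2))
```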

Let's store the data in Parquet files on HDFS with an Impala table. In Apache NiFi 1.10 there is a Parquet record writer (ParquetRecordSetWriter).



Before we push to Kafka, let's create a topic for it with Cloudera SMM.



Let's build an Impala table for that Kudu data.



We can query our tables with ease as data is rapidly added.





Let's Examine the Parquet Files that NiFi Generated





Let's query that Parquet data with Impala in Hue.



Let's monitor that data in Kafka with Cloudera SMM.






That was easy: from device to enterprise cloud data store(s) with enterprise messaging, security, governance, lineage, data catalog, SDX, monitoring and more. How easily can you ingest IoT data, query it mid-stream and store it in multiple data stores? It took longer to write the article than to do the project and code. All graphical, Single Sign-On, multiple schemas/versions/data types/engines, multiple OSs, edge, cloud and laptop. Easy.

Table DDL


-- External table over the ORC files that NiFi writes to HDFS
CREATE EXTERNAL TABLE IF NOT EXISTS grovesensors2
(humidity STRING, uuid STRING, systemtime STRING, runtime STRING, cpu DOUBLE, id STRING, te STRING, host STRING, `end` STRING,
macaddress STRING, temperature STRING, diskusage STRING, memory DOUBLE, ipaddress STRING, host_name STRING)
STORED AS ORC
LOCATION '/tmp/grovesensors';

-- Kudu table created through Impala; the primary key columns come first
CREATE TABLE grovesensors (
uuid STRING, `end` STRING, humidity STRING, systemtime STRING, runtime STRING, cpu DOUBLE, id STRING, te STRING,
host STRING,
macaddress STRING, temperature STRING, diskusage STRING, memory DOUBLE, ipaddress STRING, host_name STRING,
PRIMARY KEY (uuid, `end`)
)
PARTITION BY HASH PARTITIONS 16
STORED AS KUDU
TBLPROPERTIES ('kudu.num_tablet_replicas' = '1');

hdfs dfs -mkdir -p /tmp/grovesensors
hdfs dfs -mkdir -p /tmp/groveparquet

-- External Impala table over the Parquet files that NiFi writes to HDFS
CREATE EXTERNAL TABLE grove_parquet
(
  diskusage STRING,
  memory DOUBLE,
  host_name STRING,
  systemtime STRING,
  macaddress STRING,
  temperature STRING,
  humidity STRING,
  cpu DOUBLE,
  uuid STRING,
  ipaddress STRING,
  host STRING,
  `end` STRING,
  te STRING,
  runtime STRING,
  id STRING
)
STORED AS PARQUET
LOCATION '/tmp/groveparquet/';

Parquet Format



message org.apache.nifi.grove {
  optional binary diskusage (STRING);
  optional double memory;
  optional binary host_name (STRING);
  optional binary systemtime (STRING);
  optional binary macaddress (STRING);
  optional binary temperature (STRING);
  optional binary humidity (STRING);
  optional double cpu;
  optional binary uuid (STRING);
  optional binary ipaddress (STRING);
  optional binary host (STRING);
  optional binary end (STRING);
  optional binary te (STRING);
  optional binary runtime (STRING);
  optional binary id (STRING);
}
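
The Parquet schema above lines up one-to-one with the columns of the `grove_parquet` table: `binary ... (STRING)` fields become `STRING` columns and `double` fields become `DOUBLE` columns. As an illustration only, the Python sketch below derives column definitions from a short excerpt of that schema text. The function name and regular expression are hypothetical helpers, it handles only the two types that appear in this schema, and it backtick-quotes every name so that reserved words such as `end` are safe.

```python
import re

# A three-field excerpt of the schema printed above
SCHEMA_EXCERPT = """
message org.apache.nifi.grove {
  optional binary diskusage (STRING);
  optional double memory;
  optional binary end (STRING);
}
"""

# physical type, field name, optional logical type in parentheses
FIELD_PATTERN = re.compile(r"optional\s+(\w+)\s+(\w+)(?:\s+\((\w+)\))?;")


def impala_columns(schema_text):
    """Map Parquet field declarations to backtick-quoted column definitions."""
    columns = []
    for physical, name, logical in FIELD_PATTERN.findall(schema_text):
        if physical == "binary" and logical == "STRING":
            sql_type = "STRING"
        elif physical == "double":
            sql_type = "DOUBLE"
        else:
            # Only the two types used in this article are handled
            raise ValueError(f"unhandled Parquet type for field {name!r}")
        columns.append(f"`{name}` {sql_type}")
    return columns


if __name__ == "__main__":
    print(",\n".join(impala_columns(SCHEMA_EXCERPT)))
```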







