Philadelphia Open Crime Data on Phoenix / HBase

This is an update to a previous article on accessing Philadelphia Open Crime Data and storing it in Apache Phoenix on HBase.

Updates to Spring Boot, Phoenix and Zeppelin make for a cleaner experience.

I also added a way to grab years of historical policing data.

All NiFi, Zeppelin and Source is here:   https://github.com/tspannhw/phillycrime-springboot-phoenix




Part 1: https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html

We convert JSON records into Phoenix upserts.
We push JSON records to HBase with PutHBaseRecord.


Query Phoenix at the Command Line, Super Fast SQL




Example Data
"dc_dist":"18",
"dc_key":"200918067518",
"dispatch_date":"2009-10-02",
"dispatch_date_time":"2009-10-02T14:24:00.000",
"dispatch_time":"14:24:00",
"hour":"14",
"location_block":"S 38TH ST / MARKETUT ST",
"psa":"3",
"text_general_code":"Other Assaults",
"ucr_general":"800"}


Create a Phoenix Table

/usr/hdp/current/phoenix-client/bin/sqlline.py localhost:2181:/hbase-unsecure

CREATE TABLE phillycrime (
  dc_dist varchar,
  dc_key varchar not null primary key,
  dispatch_date varchar,
  dispatch_date_time varchar,
  dispatch_time varchar,
  hour varchar,
  location_block varchar,
  psa varchar,
  text_general_code varchar,
  ucr_general varchar
);


Add NiFi / Spring Boot Connectivity to Phoenix
Driver class: org.apache.phoenix.jdbc.PhoenixDriver
JDBC URL: jdbc:phoenix:localhost:2181:/hbase-unsecure
Phoenix client JAR: /usr/hdp/3.1/phoenix/phoenix-client.jar
HBase client JAR: /usr/hdp/3.1/hbase/lib/hbase-client.jar
HBase configuration: /etc/hbase/conf/hbase-site.xml



Run the Spring Boot application …

Powering Edge AI with the Powerful Jetson Nano

NVIDIA Jetson Nano Deep Learning Edge Device


Nano The Cat





Hardware:
The Jetson Nano Developer Kit is built around a 128-core Maxwell GPU and a quad-core ARM A57 CPU running at 1.43 GHz, coupled with 4 GB of LPDDR4 memory! This is power at the edge. I have a new favorite device.

You need to add some kind of USB WiFi adapter if you are not hardwired to Ethernet. This is cheap and easy: I added a tiny $15 WiFi adapter and was off to the races.


Operating System:
Ubuntu 18.04

Library Setup:


sudo apt-get update -y

# Build tools and math libraries
sudo apt-get install git cmake -y
sudo apt-get install libatlas-base-dev gfortran -y
sudo apt-get install libhdf5-serial-dev hdf5-tools -y

# Python, OpenCV and webcam support
sudo apt-get install python3-dev -y
sudo apt-get install libcv-dev libopencv-dev -y
sudo apt-get install fswebcam -y
sudo apt-get install libv4l-dev -y
sudo apt-get install python-opencv -y

# Python libraries
pip3 install psutil
pip2 install psutil
pip3.6 install easydict -U
pip3.6 install scikit-learn -U
pip3.6 install opencv-python -U --user
pip3.6 install numpy -U
pip3.6 install mxnet -U
pip3.6 install mxnet-mkl -U
pip3.6 install gluoncv --upgrade

# TensorFlow for Jetson (JetPack 4.2 wheel from NVIDIA)
sudo apt-get install libhdf5-serial-dev hdf5-tools libhdf5-dev zlib1g-dev zip libjpeg8-dev -y
sudo apt-get install python3-pip
sudo pip3 install -U pip
sudo pip3 install --pre --extra-index-url https://developer.download.nvidia.com/compute/redist/jp/v42 tensorflow-gpu
sudo nvpmodel -q --verbose    # show the current power mode
pip3 install numpy
pip3 install keras

# Build NVIDIA's jetson-inference examples
git clone https://github.com/dusty-nv/jetson-inference
cd jetson-inference
git submodule update --init

# Monitoring utilities
tegrastats                    # live RAM/CPU/GPU stats (Ctrl-C to exit)
pip3 install -U jetson-stats
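Once the tools above are installed, tegrastats gives a live view of device health. As a rough illustration, this Python sketch parses the RAM and CPU figures out of a single tegrastats line; the sample line and exact field layout are assumptions and vary by JetPack release.

```python
import re

# Illustrative tegrastats line; the real layout varies by JetPack release
sample = ("RAM 1954/3956MB (lfb 87x4MB) SWAP 0/1978MB (cached 0MB) "
          "CPU [14%@1428,10%@1428,7%@1428,11%@1428]")

def parse_tegrastats(line: str) -> dict:
    """Pull RAM usage and per-core CPU load out of one tegrastats line."""
    ram = re.search(r"RAM (\d+)/(\d+)MB", line)
    cpus = re.findall(r"(\d+)%@(\d+)", line)
    return {
        "ram_used_mb": int(ram.group(1)),
        "ram_total_mb": int(ram.group(2)),
        "cpu_loads_pct": [int(pct) for pct, _freq in cpus],
    }

print(parse_tegrastats(sample))
```

A parser like this is handy for shipping device stats to MiNiFi as JSON instead of raw text.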

Source:
https://github.com/tspannhw/iot-device-install
https://github.com/tspannhw/minifi-jetson-nano

IoT Setup

Download the MiNiFi 0.6.0 source from Cloudera and build it.
Download the MiNiFi Java Agent (binary) and unzip it.

Follow these instructions.

On a Server

We want to hook up to Edge Flow Manager (EFM) to make flow development, deployment, management and monitoring of MiNiFi agents trivial.   Download NiFi Registry.    You will also need Apache NiFi.

For a good walkthrough and hands-on demonstration see this workshop.

See these cool Jetson Nano Projects:  https://developer.nvidia.com/embedded/community/jetson-projects

Monitor Status
https://github.com/rbonghi/jetson_stats

Example Flow

It's easy to add MiNiFi Java or C++ agents to the Jetson Nano.   I did a custom MiNiFi C++ 0.6.0 build for Jetson, then built a quick flow that runs the jetson-inference imagenet-console C++ binary on an image captured from a compatible Logitech USB webcam with fswebcam.   I store the images in /opt/demo/images and pass the path on the command line to the console binary as a proof of concept.

#!/bin/bash

# Timestamp used to name both the capture and the classified output image
DATE=$(date +"%Y-%m-%d_%H%M")

# Grab a 1280x720 frame from the USB webcam
fswebcam -q -r 1280x720 --no-banner /opt/demo/images/$DATE.jpg

# Classify the frame with jetson-inference's imagenet-console
/opt/demo/jetson-inference/build/aarch64/bin/imagenet-console  /opt/demo/images/$DATE.jpg  /opt/demo/images/out_$DATE.jpg
Example output:
imagenet-console
  args (3):  0 [/opt/demo/jetson-inference/build/aarch64/bin/imagenet-console]  1 [/opt/demo/images/2019-07-01_1405.jpg]  2 [/opt/demo/images/out_2019-07-01_1405.jpg]


imageNet -- loading classification network model from:
         -- prototxt     networks/googlenet.prototxt
         -- model        networks/bvlc_googlenet.caffemodel
         -- class_labels networks/ilsvrc12_synset_words.txt
         -- input_blob   'data'
         -- output_blob  'prob'
         -- batch_size   2

[TRT]  TensorRT version 5.0.6
[TRT]  detected model format - caffe  (extension '.caffemodel')
[TRT]  desired precision specified for GPU: FASTEST
[TRT]  requested fasted precision for device GPU without providing valid calibrator, disabling INT8
[TRT]  native precisions detected for GPU:  FP32, FP16
[TRT]  selecting fastest native precision for GPU:  FP16
[TRT]  attempting to open engine cache file /opt/demo/jetson-inference/build/aarch64/bin/networks/bvlc_googlenet.caffemodel.2.1.GPU.FP16.engine
[TRT]  loading network profile from engine cache... /opt/demo/jetson-inference/build/aarch64/bin/networks/bvlc_googlenet.caffemodel.2.1.GPU.FP16.engine
[TRT]  device GPU, /opt/demo/jetson-inference/build/aarch64/bin/networks/bvlc_googlenet.caffemodel loaded
[TRT]  device GPU, CUDA engine context initialized with 2 bindings
[TRT]  binding -- index   0
               -- name    'data'
               -- type    FP32
               -- in/out  INPUT
               -- # dims  3
               -- dim #0  3 (CHANNEL)
               -- dim #1  224 (SPATIAL)
               -- dim #2  224 (SPATIAL)
[TRT]  binding -- index   1
               -- name    'prob'
               -- type    FP32
               -- in/out  OUTPUT
               -- # dims  3
               -- dim #0  1000 (CHANNEL)
               -- dim #1  1 (SPATIAL)
               -- dim #2  1 (SPATIAL)
[TRT]  binding to input 0 data  binding index:  0
[TRT]  binding to input 0 data  dims (b=2 c=3 h=224 w=224) size=1204224
[cuda]  cudaAllocMapped 1204224 bytes, CPU 0x100e30000 GPU 0x100e30000
[TRT]  binding to output 0 prob  binding index:  1
[TRT]  binding to output 0 prob  dims (b=2 c=1000 h=1 w=1) size=8000
[cuda]  cudaAllocMapped 8000 bytes, CPU 0x100f60000 GPU 0x100f60000
device GPU, /opt/demo/jetson-inference/build/aarch64/bin/networks/bvlc_googlenet.caffemodel initialized.
[TRT]  networks/bvlc_googlenet.caffemodel loaded
imageNet -- loaded 1000 class info entries
networks/bvlc_googlenet.caffemodel initialized.
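The buffer sizes in that log are easy to sanity-check from the binding dimensions (FP32 is 4 bytes per element):

```python
# Verify the cudaAllocMapped sizes from the TensorRT binding dimensions
batch = 2
input_bytes = batch * 3 * 224 * 224 * 4    # 'data' binding: b=2, c=3, h=224, w=224
output_bytes = batch * 1000 * 1 * 1 * 4    # 'prob' binding: b=2, c=1000, h=1, w=1

print(input_bytes, output_bytes)  # 1204224 8000, matching the log
```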


Performance Testing Apache NiFi - Part 1 - Loading Directories of CSV


I am running a lot of different flows on different Apache NiFi configurations to get some performance numbers in different situations.

One situation I thought of was accessing directories of CSV files over HTTP.  Fortunately there's some really nice data available from NOAA (https://www.ncei.noaa.gov/data/global-hourly/access/2019/).

Example Flow:  NOAA





In this example performance-testing flow I use my LinkProcessor to grab all of the links to CSV files on the HTTP download site.  I then split this JSON list into individual records and pull out the URL.   If it's a valid URL ending in .csv, I call InvokeHTTP to download the file.   I then query the CSV for all the records (SELECT *) and for a count (SELECT COUNT(*)).   As part of this, the records are written out as JSON.
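As a rough sketch of the link-filtering step (the flow uses NiFi processors for this; the sample links below are illustrative, not actual LinkProcessor output):

```python
from urllib.parse import urlparse

# Illustrative link list; the real one comes from the LinkProcessor
links = [
    "https://www.ncei.noaa.gov/data/global-hourly/access/2019/16541099999.csv",
    "https://www.ncei.noaa.gov/data/global-hourly/access/2019/",
    "not-a-url",
]

def is_csv_url(link: str) -> bool:
    """Keep only well-formed HTTP(S) URLs whose path ends in .csv."""
    parsed = urlparse(link)
    return parsed.scheme in ("http", "https") and parsed.path.lower().endswith(".csv")

csv_links = [l for l in links if is_csv_url(l)]
print(csv_links)
```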



In this example we grab a specific CSV file and get 739 records.


This CSVReader uses Jackson to parse the CSV files and infers the fields from the header.



I pull out the URL returned from the Link Processor.



This is my JsonRecordSetWriter; it doesn't include a schema, since I never built one.



I am looking at some performance stats for my NiFi instance, which has 31 GB of JVM heap.  Going to 32 GB causes issues because the JVM loses compressed object pointers (compressed oops) at heaps of 32 GB and beyond.









In this flow I generate unique JSON files in mass quantities at about 250 bytes each, merge them together, compress them, and then push them to a file system.   This is to see how many records I can push.
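The generate/merge/compress steps can be sketched in a few lines of Python (the record shape and count here are illustrative, not the actual flow settings):

```python
import gzip
import json
import uuid

# Generate many small unique JSON records (illustrative shape and count)
records = [json.dumps({"id": str(uuid.uuid4()), "value": i}) for i in range(1000)]

# MergeContent-style concatenation into one payload, then gzip compression
merged = "\n".join(records).encode()
compressed = gzip.compress(merged)

print(len(merged), len(compressed), len(compressed) < len(merged))
```

Merging many tiny flow files into larger ones before compressing is what keeps the per-file overhead down at high record rates.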




QueryRecord is easy to use on CSV files, even with no known schema.



The results of the recordCount query:


I can also test with really fast multithreaded calls to the popular btc.com Bitcoin REST API.


Even encrypting and compressing won't slow me down.






Example Translated Data Segment
[{"STATION":"16541099999","DATE":"2019-01-07T05:55:00","SOURCE":"4","LATITUDE":"39.6666667","LONGITUDE":"9.4333333","ELEVATION":"645.0","NAME":"PERDASDEFOGU, IT","REPORT_TYPE":"FM-15","CALL_SIGN":"99999","QUALITY_CONTROL":"V020","WND":"330,1,N,0010,1","CIG":"99999,9,9,Y","VIS":"999999,9,9,9","TMP":"+0030,1","DEW":"+0020,1","SLP":"99999,9","MA1":null,"MD1":null,"REM":null},{"STATION":"16541099999","DATE":"2019-01-07T06:55:00","SOURCE":"4","LATITUDE":"39.6666667","LONGITUDE":"9.4333333","ELEVATION":"645.0","NAME":"PERDASDEFOGU, IT","REPORT_TYPE":"FM-15","CALL_SIGN":"99999","QUALITY_CONTROL":"V020","WND":"330,1,N,0010,1","CIG":"99999,9,9,Y","VIS":"999999,9,9,9","TMP":"+0030,1","DEW":"+0030,1","SLP":"99999,9","MA1":null,"MD1":null,"REM":null},{"STATION":"16541099999","DATE":"2019-01-07T07:55:00","SOURCE":"4","LATITUDE":"39.6666667","LONGITUDE":"9.4333333","ELEVATION":"645.0","NAME":"PERDASDEFOGU, IT","REPORT_TYPE":"FM-15","CALL_SIGN":"99999","QUALITY_CONTROL":"V020","WND":"300,1,N,0010,1","CIG":"99999,9,9,Y","VIS":"999999,9,9,9","TMP":"+0030,1","DEW":"+0020,1","SLP":"99999,9","MA1":null,"MD1":null,"REM":null},{"STATION":"16541099999","DATE":"2019-01-07T09:55:00","SOURCE":"4","LATITUDE":"39.6666667","LONGITUDE":"9.4333333","ELEVATION":"645.0","NAME":"PERDASDEFOGU, IT","REPORT_TYPE":"FM-15","CALL_SIGN":"99999","QUALITY_CONTROL":"V020","WND":"280,1,N,0026,1","CIG":"99999,9,9,Y","VIS":"999999,9,9,9","TMP":"+0070,1","DEW":"+0050,1","SLP":"99999,9","MA1":null,"MD1":null,"REM":null},{"STATION":"16541099999","DATE":"2019-01-07T10:55:00","SOURCE":"4","LATITUDE":"39.6666667","LONGITUDE":"9.4333333","ELEVATION":"645.0","NAME":"PERDASDEFOGU, IT","REPORT_TYPE":"FM-15","CALL_SIGN":"99999","QUALITY_CONTROL":"V020","WND":"260,1,N,0046,1","CIG":"99999,9,9,Y","VIS":"999999,9,9,9","TMP":"+0080,1
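Fields like WND above pack several values into one comma-separated string. Per the NOAA ISD format documentation these are wind direction (degrees), direction quality, observation type, speed (m/s scaled by 10) and speed quality; treat that field order as an assumption to confirm against the dataset's documentation. A small parsing sketch:

```python
def parse_wnd(wnd: str) -> dict:
    """Unpack an ISD-style WND field (assumed order: dir, dir quality,
    type, speed*10, speed quality)."""
    direction, dir_q, obs_type, speed, speed_q = wnd.split(",")
    return {
        "direction_deg": int(direction),
        "type": obs_type,
        "speed_ms": int(speed) / 10.0,
    }

print(parse_wnd("330,1,N,0010,1"))  # WND value from the first record above
```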

DataWorks Summit DC 2019 Report


While some lucky people were in DataWorks Summit training, the rest of us were at NoSQL Day.





After NoSQL Day's closing party, it was time for meetups, including Apache NiFi and Apache Kafka sessions!    The Apache NiFi meetup was packed and had most of the Apache NiFi team on-site.


Tuesday May 21, 2019


NoSQL Day

Tracking Crime ... Phoenix/HBase/NiFi


Wednesday May 22, 2019

1:35 pm - Expo Theatre 20 minute talk - Apache Deep Learning 202


Thursday May 23, 2019

Cold Supply Chain Logistics using Sensors, Apache NiFi and the Hyperledger Fabric Blockchain Platform



1:35 - Expo Theatre 20 minute talk - Introduction to Apache NiFi



Edge to AI:  Analytics from Edge to Cloud with Efficient Movement of Machine Data



Reading OpenData JSON and Storing into Apache HBase / Phoenix Tables - Part 1

JSON Batch to Single Row Phoenix
I grabbed open data on crime from Philly's Open Data portal (https://www.opendataphilly.org/dataset/crime-incidents); after a free sign-up you get access to JSON crime data (https://data.phila.gov/resource/sspu-uyfa.json). You can grab individual dates or ranges for thousands of records. I wanted to store each JSON record as a separate HBase row. With the flexibility of Apache NiFi 1.0.0, I can specify run times via cron or other familiar scheduling. This is my master flow.
First I use GetHTTP to retrieve the JSON messages over SSL. I split the records up, store them as raw JSON in HDFS, send some of them via email, and format them for Phoenix SQL and store them in Phoenix/HBase. All with no coding and in a simple flow. For extra output, I can send them to a Riemann server for monitoring.
Setting up SSL for accessing HTTPS data like the Philly crime feed requires a little configuration and knowing which Java JRE you are using to run NiFi. You can run service nifi status to quickly see which JRE that is.
Split the Records
The open data set has many rows of data; let's split them up and pull out the attributes we want from the JSON.
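A minimal sketch of what the split-and-extract step does (NiFi handles this with processors in the flow; the two-record batch and the `wanted` attribute list here are illustrative):

```python
import json

# Illustrative two-record batch from the API; real batches have thousands of rows
batch = '''[
 {"dc_key":"200918067518","text_general_code":"Other Assaults","extra":"ignored"},
 {"dc_key":"200918067519","text_general_code":"Thefts","extra":"ignored"}
]'''

# Split the array into one record per row, keeping only the attributes we want
wanted = ("dc_key", "text_general_code")
records = [{k: rec[k] for k in wanted} for rec in json.loads(batch)]

for rec in records:
    print(json.dumps(rec))
```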
Phoenix
Another part that requires specific formatting is setting up the Phoenix connection. Make sure you point to the correct driver, and if you have security enabled, make sure that is configured as well.
Load the Data (Upsert)
Once your data is loaded you can check quickly with /usr/hdp/current/phoenix-client/bin/sqlline.py localhost:2181:/hbase-unsecure
The SQL for this data set is pretty straight forward.
CREATE TABLE phillycrime (dc_dist varchar,
dc_key varchar not null primary key, dispatch_date varchar, dispatch_date_time varchar, dispatch_time varchar, hour varchar, location_block varchar, psa varchar,
text_general_code varchar, ucr_general varchar);

{"dc_dist":"18","dc_key":"200918067518","dispatch_date":"2009-10-02","dispatch_date_time":"2009-10-02T14:24:00.000","dispatch_time":"14:24:00","hour":"14","location_block":"S 38TH ST / MARKETUT ST","psa":"3","text_general_code":"Other Assaults","ucr_general":"800"}

upsert into phillycrime values ('18', '200918067518', '2009-10-02','2009-10-02T14:24:00.000','14:24:00','14', 'S 38TH ST / MARKETUT ST','3','Other Assaults','800');
!tables
!describe phillycrime
The DC_KEY is unique, so I used that as the Phoenix primary key. Now all the data I get will be added, and any repeats will safely update. Sometimes during loading we may re-fetch some of the same records; that's okay, the upsert will just overwrite them with the same values.
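A tiny model of why re-fetching is harmless: with dc_key as the primary key, an upsert of a repeated record just overwrites the existing row, much like writing into a dict keyed by dc_key.

```python
# Model Phoenix upsert semantics with a dict keyed by the primary key
table = {}

def upsert(record: dict) -> None:
    table[record["dc_key"]] = record  # insert or overwrite, never duplicate

upsert({"dc_key": "200918067518", "text_general_code": "Other Assaults"})
upsert({"dc_key": "200918067518", "text_general_code": "Other Assaults"})  # repeat

print(len(table))  # still one row
```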