Ingest Salesforce Data into Hive Using Apache NiFi

TOOLS AND ACCOUNT USED: 

Salesforce JDBC driver : https://www.progress.com/jdbc/salesforce
Java : 1.8.0_221 
OS: macOS Mojave 10.14.5
SFDC Developer account: https://developer.salesforce.com/signup - 15-day trial


Overall Flow:

Install Hive - Thrift metastore service and a Hive table schema that matches the SFDC Opportunity table (a quick install sketch follows this list).
Install NiFi - add the PutHive3Streaming NAR.
Install and configure drivers and controller services: SalesforceConnect (DBCPConnectionPool) and AvroReader.
Add processors: QueryDatabaseTable and PutHive3Streaming.
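
A minimal install sketch for the Hive and NiFi steps, assuming Homebrew on macOS (the Cellar paths match the ones used later in this post); exact versions, paths, and Hadoop configuration may differ on your machine:

# Install Hive and NiFi with Homebrew (assumption: Homebrew-based install, as the Cellar paths suggest)
brew install hive nifi

# Start the Hive metastore (Thrift, port 9083) and HiveServer2 (port 10000) in the background
hive --service metastore &
hive --service hiveserver2 &

# Start NiFi from its Homebrew libexec directory
/usr/local/Cellar/nifi/1.9.2/libexec/bin/nifi.sh start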


Install Salesforce JDBC and Hive drivers

  • Download the DataDirect Salesforce JDBC driver (link above).
  • Download the Cloudera Hive driver.
  • To install a driver, execute the .jar package by running the following command in the terminal, or just double-click the jar package.


java -jar PROGRESS_DATADIRECT_JDBC_SF_ALL.jar
java -jar PROGRESS_DATADIRECT_JDBC_INSTALL.jar




Add Drivers to the Apache NiFi Classpath

Copy drivers

From: /home/user/Progress/DataDirect/Connect_for_JDBC_51/{sforce.jar,hive.jar}
To: /path/to/nifi/lib (here: /usr/local/Cellar/nifi/1.9.2/libexec/lib/)

Restart Apache NiFi
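
For example, a hedged sketch of the copy and restart steps (adjust the driver and NiFi paths to your installation):

# Copy the Salesforce and Hive JDBC drivers into the NiFi lib directory
cp /home/user/Progress/DataDirect/Connect_for_JDBC_51/{sforce.jar,hive.jar} /usr/local/Cellar/nifi/1.9.2/libexec/lib/

# Restart NiFi so the new drivers are picked up
/usr/local/Cellar/nifi/1.9.2/libexec/bin/nifi.sh restart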



Build the Apache NiFi Flow

Launch Apache NiFi: http://localhost:8080/nifi

Add and Configure the Controller Services:

DBCPConnectionPool
AvroReader

Add DBCPConnectionPool

Choose DBCPConnectionPool as your controller service

Add the properties below:

Database Connection URL = jdbc:datadirect:sforce://login.salesforce.com;SecurityToken=****

(see the note on the SFDC Security Token below)

  • Database Driver Class Name: com.ddtek.jdbc.sforce.SForceDriver
  • Database User: the Developer account username
  • Password: the password for the Developer account
  • Name: SalesforceConnect


Add the AvroReader Controller Service:
This controller is used by the downstream processor (PutHive3Streaming) to read the Avro-format data
coming from the upstream processor (QueryDatabaseTable).



Add and Configure Processors


QueryDatabaseTable
PutHive3Streaming


QueryDatabaseTable

To read the data from Salesforce, use the QueryDatabaseTable processor, which supports
incremental data pulls.

Choose QueryDatabaseTable and add it to the canvas as shown below.





Configure QueryDatabaseTable processor:

Right-click QueryDatabaseTable and choose a scheduling strategy (timer driven or CRON driven).

Under the Properties tab, add the following properties:


Database Connection Pooling Service : SalesforceConnect

Database Type : Generic

Table Name : Opportunity 

Apply and save the processor configuration.


PutHive3Streaming 


NOTE: PutHiveStreaming was used at first, but because the downstream Hive version is 3.x,
@Matthew Burgess recommended using the PutHive3Streaming processor instead.
After adding the NAR, restart NiFi and PutHive3Streaming should be available for use. PutHiveStreaming kept throwing
an error that it was not able to connect to the Hive metastore URI, like the one below:




Configure PutHive3Streaming Processor 

Drag another processor from the menu and choose PutHive3Streaming as your processor from the list.
Under the Properties tab, add the following properties:

Record Reader = AvroReader
Hive Metastore URI = thrift://localhost:9083
Hive Configuration Resources = /usr/local/Cellar/hive/3.1.2/libexec/conf/hive-site.xml
Database Name = default 
Table Name = opportunity


             Properties: 


              Relationship termination settings:



Connect the QueryDatabaseTable processor to PutHive3Streaming.

Make sure to check the success relationship, then click Add/Apply.
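
Before starting the flow, it can also help to confirm that the Hive metastore Thrift service is actually listening on the URI configured above; the PutHiveStreaming errors mentioned earlier were connection failures of exactly this kind. A quick check (a sketch, assuming the metastore runs locally on the default port 9083):

# Verify the metastore Thrift port is open
nc -z localhost 9083 && echo "metastore reachable" || echo "metastore NOT reachable"

# If it is not running, start it in the background
hive --service metastore &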




Opportunity table schema in Hive:


create table opportunity2
(ID string,
ISDELETED boolean,
ACCOUNTID string,
ISPRIVATE boolean,
NAME string,
DESCRIPTION string,
STAGENAME string,
AMOUNT string,
PROBABILITY string,
EXPECTEDREVENUE string,
TOTALOPPORTUNITYQUANTITY double,
CLOSEDATE string,
TYPE string,
NEXTSTEP string,
LEADSOURCE string,
ISCLOSED boolean,
ISWON boolean,
FORECASTCATEGORY string,
FORECASTCATEGORYNAME string,
CAMPAIGNID string,
HASOPPORTUNITYLINEITEM boolean,
PRICEBOOK2ID string,
OWNERID string,
CREATEDDATE string,
CREATEDBYID string,
LASTMODIFIEDDATE string,
LASTMODIFIEDBYID string,
SYSTEMMODSTAMP string,
LASTACTIVITYDATE string,
FISCALQUARTER int,
FISCALYEAR int,
FISCAL string,
LASTVIEWEDDATE string,
LASTREFERENCEDDATE string,
HASOPENACTIVITY boolean,
HASOVERDUETASK boolean,
DELIVERYINSTALLATIONSTATUS__C string,
TRACKINGNUMBER__C string,
ORDERNUMBER__C string,
CURRENTGENERATORS__C string,
MAINCOMPETITORS__C string )
STORED AS ORC TBLPROPERTIES ('transactional'='true');

NOTES: 


SFDC Security Token:


This token is needed to connect to SFDC via JDBC.
Log in to the SFDC developer environment and reset the security token; it will be emailed to the address attached to the account.



Error when connecting to SFDC without the security token:


Hive 3 ACID table issues:


The Hive version used was 3.1.2. By default, the table created was not transactional, which led to the error below.


Error
2019-11-07 13:25:39,730 ERROR [Timer-Driven Process Thread-3] o.a.h.streaming.HiveStreamingConnection HiveEndPoint { metaStoreUri: thrift://localhost:9083, database: default, table: opportunity2 } must use an acid table. 
2019-11-07 13:25:39,731 ERROR [Timer-Driven Process Thread-3] o.a.n.processors.hive.PutHive3Streaming PutHive3Streaming[id=46a1d083-016e-1000-280f-5a026f1ceef7] PutHive3Streaming[id=46a1d083-016e-1000-280f-5a026f1ceef7] failed to process session due to java.lang.NullPointerException; Processor Administratively Yielded for 1 sec: java.lang.NullPointerException 


Note:{ metaStoreUri: thrift://localhost:9083, database: default, table: opportunity2 } must use an acid table.


Hive Command line properties set:

set hive.support.concurrency = true;
set hive.enforce.bucketing = true;
set hive.exec.dynamic.partition.mode = nonstrict;
set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.compactor.initiator.on = true;
set hive.compactor.worker.threads = 1;


Only after setting these was creating the table with TBLPROPERTIES ('transactional'='true') successful.
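
To confirm the table really was created as an ACID table, you can check its table parameters from Beeline (a sketch, assuming HiveServer2 on the default port 10000):

# Look for 'transactional = true' among the Table Parameters
beeline -u jdbc:hive2://localhost:10000 -e "DESCRIBE FORMATTED opportunity2;" | grep -i transactional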

END PRODUCT


Apache NiFi Flow:



Hive Result: 

Beeline >  !connect jdbc:hive2://localhost:10000;
0: jdbc:hive2://localhost:10000> select opportunity2.id, opportunity2.accountid,
opportunity2.name, opportunity2.stagename, opportunity2.amount 
from opportunity2 limit 10;






Exploring Apache NiFi 1.10: Stateless Engine and Parameters


Apache NiFi 1.10 is now available!


https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12316020&version=12344993

You can now use JDK 8 or JDK 11! I am running on JDK 11, and it seems a bit faster.

A huge feature is the addition of Parameters!   And you can use these to pass parameters to Apache NiFi Stateless!

A few lesser-used processors have been moved out of the main download; see here for migration hints:
https://cwiki.apache.org/confluence/display/NIFI/Migration+Guidance

Release Notes:   https://cwiki.apache.org/confluence/display/NIFI/Release+Notes#ReleaseNotes-Version1.10.0

Example Source Code:   https://github.com/tspannhw/stateless-examples

More New Features:

  • ParquetReader/Writer (See:  https://www.datainmotion.dev/2019/10/migrating-apache-flume-flows-to-apache_7.html)
  • Prometheus Reporting Task.   Expect more Prometheus stuff coming.
  • Experimental Encrypted content repository.   People asked me for this one before.
  • Parameters!!    Time to replace Variables/Variable Registry.   Parameters are better in every way.
  • Toolkit module to generate and build Swagger API library for NiFi
  • PostSlack processor
  • PublishKafka Partition Support
  • GeoEnrichIPRecord Processor
  • Remote Input Port in a Process Group
  • Command Line Diagnostics
  • RocksDB FlowFile Repository
  • PutBigQueryStreaming Processor
  • nifi.analytics.predict.enabled - Turn on Back Pressure Prediction
  • More Lookup Services for ETL/ELT: DatabaseRecordLookupService, KuduLookupService, and HBase_2_ListLookupService
Stateless

First we will run from the command line, straight from the NiFi Registry. This is easiest. Then we will run from YARN! Yes, you can now run your Apache NiFi flows on your giant Cloudera CDH/HDP/CDP YARN clusters! Let's make use of those hundreds of Hadoop nodes.



Stateless Examples


Let's Build A Stateless Flow

The first thing to keep in mind is that we want anything that might change to be a parameter we can pass in with our JSON file. It's very easy to set parameters, even for drop-downs! You even get prompted to pick a parameter from a selection list. Before parameters are available, you will need to add them to a Parameter Context and assign that Parameter Context to your Process Group.


A Parameter in a Processor Configuration is shown as #{broker}

Parameter Context Connected to a Process Group, Controller Service, ...


Apply those parameters



Param(eter) is now an option for properties


Pop-up Hint for Using Parameters


Edit a Parameter in a Parameter Context



We can configure parameters in Controller Services as well.




So easy to choose an existing one.

Use them for anything that can change or anything you don't want to hardcode.




Apache Kafka Consumer to Sink

This is a simple two-step Apache NiFi flow that reads from Kafka and sends the data to a sink, for example a file.



Let's make sure we use that Parameter Context 




To build your JSON configuration file, you will need the bucket ID and flow ID from your Apache NiFi Registry, as well as the URL for that registry. You can browse the registry at a URL similar to http://tspann-mbp15-hw14277:18080.
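
If you prefer not to click through the Registry UI, the same IDs can be pulled from the NiFi Registry REST API (a sketch; substitute your own registry host, and note the bucket ID below is the one used in the configuration file that follows):

# List buckets - each entry includes its "identifier" (the bucket ID)
curl -s http://tspann-mbp15-hw14277:18080/nifi-registry-api/buckets

# List flows in a bucket - each entry includes its flow "identifier" (the flow ID)
curl -s http://tspann-mbp15-hw14277:18080/nifi-registry-api/buckets/140b30f0-5a47-4747-9021-19d4fde7f993/flows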





My Command Line Runner


/Users/tspann/Documents/nifi-1.10.0-SNAPSHOT/bin/nifi.sh stateless RunFromRegistry Continuous --file /Users/tspann/Documents/nifi-1.10.0-SNAPSHOT/logs/kafkaconsumer.json


RunFromRegistry [Once|Continuous] --file <File Name>

This is the basic use case: running from the command line using a file. The flow must exist in the referenced Apache NiFi Registry.

JSON Configuration File (kafkaconsumer.json)

{
  "registryUrl": "http://tspann-mbp15-hw14277:18080",
  "bucketId": "140b30f0-5a47-4747-9021-19d4fde7f993",
  "flowId": "0540e1fd-c7ca-46fb-9296-e37632021945",
  "ssl": {
    "keystoreFile": "",
    "keystorePass": "",
    "keyPass": "",
    "keystoreType": "",
    "truststoreFile": "/Library/Java/JavaVirtualMachines/amazon-corretto-11.jdk/Contents/Home/lib/security/cacerts",
    "truststorePass": "changeit",
    "truststoreType": "JKS"
  },
  "parameters": {
    "broker" : "4.317.852.100:9092",
    "topic" : "iot",
    "group_id" : "nifi-stateless-kafka-consumer",
    "DestinationDirectory" : "/tmp/nifistateless/output2/",
    "output_dir": "/Users/tspann/Documents/nifi-1.10.0-SNAPSHOT/logs/output"
  }
}

Example Run


12:25:38.725 [main] DEBUG org.apache.nifi.processors.kafka.pubsub.ConsumeKafka_2_0 - ConsumeKafka_2_0[id=e405df7f-87ca-305a-95a9-d25e3c5dbb56] Running ConsumeKafka_2_0.onTrigger with 0 FlowFiles
12:25:38.728 [main] DEBUG org.apache.kafka.clients.FetchSessionHandler - [Consumer clientId=consumer-1, groupId=nifi-stateless-kafka-consumer] Node 8 sent an incremental fetch response for session 1943199939 with 0 response partition(s), 10 implied partition(s)
12:25:38.728 [main] DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - [Consumer clientId=consumer-1, groupId=nifi-stateless-kafka-consumer] Added READ_UNCOMMITTED fetch request for partition iot-8 at offset 15 to node ip-10-0-1-244.ec2.internal:9092 (id: 8 rack: null)
12:25:38.728 [main] DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - [Consumer clientId=consumer-1, groupId=nifi-stateless-kafka-consumer] Added READ_UNCOMMITTED fetch request for partition iot-9 at offset 16 to node ip-10-0-1-244.ec2.internal:9092 (id: 8 rack: null)
12:25:38.728 [main] DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - [Consumer clientId=consumer-1, groupId=nifi-stateless-kafka-consumer] Added READ_UNCOMMITTED fetch request for partition iot-6 at offset 17 to node ip-10-0-1-244.ec2.internal:9092 (id: 8 rack: null)
12:25:38.728 [main] DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - [Consumer clientId=consumer-1, groupId=nifi-stateless-kafka-consumer] Added READ_UNCOMMITTED fetch request for partition iot-7 at offset 17 to node ip-10-0-1-244.ec2.internal:9092 (id: 8 rack: null)
12:25:38.728 [main] DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - [Consumer clientId=consumer-1, groupId=nifi-stateless-kafka-consumer] Added READ_UNCOMMITTED fetch request for partition iot-4 at offset 18 to node ip-10-0-1-244.ec2.internal:9092 (id: 8 rack: null)
12:25:38.728 [main] DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - [Consumer clientId=consumer-1, groupId=nifi-stateless-kafka-consumer] Added READ_UNCOMMITTED fetch request for partition iot-5 at offset 16 to node ip-10-0-1-244.ec2.internal:9092 (id: 8 rack: null)
12:25:38.728 [main] DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - [Consumer clientId=consumer-1, groupId=nifi-stateless-kafka-consumer] Added READ_UNCOMMITTED fetch request for partition iot-2 at offset 17 to node ip-10-0-1-244.ec2.internal:9092 (id: 8 rack: null)
12:25:38.728 [main] DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - [Consumer clientId=consumer-1, groupId=nifi-stateless-kafka-consumer] Added READ_UNCOMMITTED fetch request for partition iot-3 at offset 19 to node ip-10-0-1-244.ec2.internal:9092 (id: 8 rack: null)
12:25:38.728 [main] DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - [Consumer clientId=consumer-1, groupId=nifi-stateless-kafka-consumer] Added READ_UNCOMMITTED fetch request for partition iot-0 at offset 16 to node ip-10-0-1-244.ec2.internal:9092 (id: 8 rack: null)
12:25:38.728 [main] DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - [Consumer clientId=consumer-1, groupId=nifi-stateless-kafka-consumer] Added READ_UNCOMMITTED fetch request for partition iot-1 at offset 20 to node ip-10-0-1-244.ec2.internal:9092 (id: 8 rack: null)
12:25:38.728 [main] DEBUG org.apache.kafka.clients.FetchSessionHandler - [Consumer clientId=consumer-1, groupId=nifi-stateless-kafka-consumer] Built incremental fetch (sessionId=1943199939, epoch=5) for node 8. Added 0 partition(s), altered 0 partition(s), removed 0 partition(s) out of 10 partition(s)
12:25:38.729 [main] DEBUG org.apache.kafka.clients.consumer.internals.Fetcher - [Consumer clientId=consumer-1, groupId=nifi-stateless-kafka-consumer] Sending READ_UNCOMMITTED IncrementalFetchRequest(toSend=(), toForget=(), implied=(iot-8, iot-9, iot-6, iot-7, iot-4, iot-5, iot-2, iot-3, iot-0, iot-1)) to broker ip-10-0-1-244.ec2.internal:9092 (id: 8 rack: null)
12:25:38.737 [main] DEBUG org.apache.nifi.processors.kafka.pubsub.ConsumeKafka_2_0 - ConsumeKafka_2_0[id=e405df7f-87ca-305a-95a9-d25e3c5dbb56] Running ConsumeKafka_2_0.onTrigger with 0 FlowFiles


Example Output


cat output/247361879273711.statelessFlowFile
{"id":"20191105113853_350b493f-9308-4eb2-b71f-6bcdbaf5d6c1_Timer-Driven Process Thread-13","te":"0.5343","diskusage":"0.2647115097153814.3 MB","memory":57,"cpu":132.87,"host":"192.168.1.249/tspann-MBP15-HW14277","temperature":"72","macaddress":"dd73eadf-1ac1-4f76-aecb-14be86ce46ce","end":"48400221819907","systemtime":"11/05/2019 11:38:53"}



We can also run Once in this example to send one Kafka message.

Generator to Apache Kafka Producer


My Command Line Runner

/Users/tspann/Documents/nifi-1.10.0-SNAPSHOT/bin/nifi.sh stateless RunFromRegistry Once --file /Users/tspann/Documents/nifi-1.10.0-SNAPSHOT/logs/kafka.json


JSON Configuration File (kafka.json)

{
  "registryUrl": "http://tspann-mbp15-hw14277:18080",
  "bucketId": "140b30f0-5a47-4747-9021-19d4fde7f993",
  "flowId": "402814a2-fb7a-4b19-a641-9f4bb191ed67",
  "flowVersion": "1",
  "ssl": {
    "keystoreFile": "",
    "keystorePass": "",
    "keyPass": "",
    "keystoreType": "",
    "truststoreFile": "/Library/Java/JavaVirtualMachines/amazon-corretto-11.jdk/Contents/Home/lib/security/cacerts",
    "truststorePass": "changeit",
    "truststoreType": "JKS"
  },
  "parameters": {
    "broker" : "3.218.152.236:9092"
  }
}

Example Output


12:32:37.717 [kafka-producer-network-thread | producer-1] DEBUG org.apache.kafka.common.network.Selector - [Producer clientId=producer-1] Created socket with SO_RCVBUF = 33304, SO_SNDBUF = 131768, SO_TIMEOUT = 0 to node 8
12:32:37.717 [kafka-producer-network-thread | producer-1] DEBUG org.apache.kafka.clients.NetworkClient - [Producer clientId=producer-1] Completed connection to node 8. Fetching API versions.
12:32:37.717 [kafka-producer-network-thread | producer-1] DEBUG org.apache.kafka.clients.NetworkClient - [Producer clientId=producer-1] Initiating API versions fetch from node 8.
12:32:37.732 [kafka-producer-network-thread | producer-1] DEBUG org.apache.kafka.clients.NetworkClient - [Producer clientId=producer-1] Recorded API versions for node 8: (Produce(0): 0 to 7 [usable: 6], Fetch(1): 0 to 10 [usable: 8], ListOffsets(2): 0 to 5 [usable: 3], Metadata(3): 0 to 7 [usable: 6], LeaderAndIsr(4): 0 to 2 [usable: 1], StopReplica(5): 0 to 1 [usable: 0], UpdateMetadata(6): 0 to 5 [usable: 4], ControlledShutdown(7): 0 to 2 [usable: 1], OffsetCommit(8): 0 to 6 [usable: 4], OffsetFetch(9): 0 to 5 [usable: 4], FindCoordinator(10): 0 to 2 [usable: 2], JoinGroup(11): 0 to 4 [usable: 3], Heartbeat(12): 0 to 2 [usable: 2], LeaveGroup(13): 0 to 2 [usable: 2], SyncGroup(14): 0 to 2 [usable: 2], DescribeGroups(15): 0 to 2 [usable: 2], ListGroups(16): 0 to 2 [usable: 2], SaslHandshake(17): 0 to 1 [usable: 1], ApiVersions(18): 0 to 2 [usable: 2], CreateTopics(19): 0 to 3 [usable: 3], DeleteTopics(20): 0 to 3 [usable: 2], DeleteRecords(21): 0 to 1 [usable: 1], InitProducerId(22): 0 to 1 [usable: 1], OffsetForLeaderEpoch(23): 0 to 2 [usable: 1], AddPartitionsToTxn(24): 0 to 1 [usable: 1], AddOffsetsToTxn(25): 0 to 1 [usable: 1], EndTxn(26): 0 to 1 [usable: 1], WriteTxnMarkers(27): 0 [usable: 0], TxnOffsetCommit(28): 0 to 2 [usable: 1], DescribeAcls(29): 0 to 1 [usable: 1], CreateAcls(30): 0 to 1 [usable: 1], DeleteAcls(31): 0 to 1 [usable: 1], DescribeConfigs(32): 0 to 2 [usable: 2], AlterConfigs(33): 0 to 1 [usable: 1], AlterReplicaLogDirs(34): 0 to 1 [usable: 1], DescribeLogDirs(35): 0 to 1 [usable: 1], SaslAuthenticate(36): 0 to 1 [usable: 0], CreatePartitions(37): 0 to 1 [usable: 1], CreateDelegationToken(38): 0 to 1 [usable: 1], RenewDelegationToken(39): 0 to 1 [usable: 1], ExpireDelegationToken(40): 0 to 1 [usable: 1], DescribeDelegationToken(41): 0 to 1 [usable: 1], DeleteGroups(42): 0 to 1 [usable: 1], UNKNOWN(43): 0)
12:32:37.739 [kafka-producer-network-thread | producer-1] DEBUG org.apache.kafka.common.metrics.Metrics - Added sensor with name topic.iot.records-per-batch
12:32:37.739 [kafka-producer-network-thread | producer-1] DEBUG org.apache.kafka.common.metrics.Metrics - Added sensor with name topic.iot.bytes
12:32:37.739 [kafka-producer-network-thread | producer-1] DEBUG org.apache.kafka.common.metrics.Metrics - Added sensor with name topic.iot.compression-rate
12:32:37.739 [kafka-producer-network-thread | producer-1] DEBUG org.apache.kafka.common.metrics.Metrics - Added sensor with name topic.iot.record-retries
12:32:37.740 [kafka-producer-network-thread | producer-1] DEBUG org.apache.kafka.common.metrics.Metrics - Added sensor with name topic.iot.record-errors
12:32:37.745 [main] DEBUG org.apache.nifi.parameter.ExpressionLanguageAwareParameterParser - For input iot found 0 Parameter references: []
12:32:37.745 [main] DEBUG org.apache.nifi.parameter.ExpressionLanguageAwareParameterParser - For input iot found 0 Parameter references: []
Flow Succeeded


Other Runtime Options:


RunYARNServiceFromRegistry        <YARN RM URL> <Docker Image Name> <Service Name> <# of Containers> --file <File Name>


RunOpenwhiskActionServer          <Port>


© 2019 Timothy Spann

Learning Flink and Analyzing Flink Metrics via REST API














Running a Demo Apache Flink Application With Apache NiFi and Apache Kafka on CDH 6.3















As a simple first Flink example program, I built the bonus exercise from the QuickStart for Apache Flink + Apache Kafka.

https://ci.apache.org/projects/flink/flink-docs-release-1.9/getting-started/tutorials/datastream_api.html