FLaNK: Using Apache Kudu As a Cache For FDA Updates
- InvokeHTTP: We invoke the RSS feed to get our data.
- QueryRecord: Convert RSS to JSON
- SplitJson: Split one file into individual records. (Should refactor to ForkRecord)
- EvaluateJSONPath: Extract the attributes we need.
- ProcessGroup for SPL Processing
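The ingest steps above can be sketched outside NiFi as well. Here is a minimal Python sketch of the same pipeline shape: fetch RSS, convert it to JSON-style records, split into individual records, and extract the SPL set id. The sample RSS body and its fields are hypothetical stand-ins for the DailyMed feed, not its actual schema.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample of the DailyMed RSS shape, for illustration only.
RSS_SAMPLE = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>DailyMed Updates</title>
    <item>
      <title>EXAMPLE DRUG (example substance) tablet</title>
      <link>https://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?setid=abc-123</link>
      <pubDate>Mon, 05 Oct 2020 00:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""

def rss_to_records(rss_xml):
    """QueryRecord + SplitJson equivalent: one dict per RSS item."""
    root = ET.fromstring(rss_xml)
    records = []
    for item in root.iter("item"):
        rec = {child.tag: (child.text or "").strip() for child in item}
        # EvaluateJSONPath equivalent: pull the SPL set id out of the link.
        rec["setid"] = rec.get("link", "").rsplit("setid=", 1)[-1]
        records.append(rec)
    return records

records = rss_to_records(RSS_SAMPLE)
print(json.dumps(records, indent=2))
```

In the real flow, InvokeHTTP fetches the live feed and record-oriented processors do this work declaratively; the sketch just makes the transformation steps concrete.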
SPL Processing
We use the extracted SPL id to grab the detailed record for that SPL via InvokeHTTP.
We use LookupRecord to check our Kudu cache for that record.
If we don't have that value yet, we send the record to the Kudu table for an UPSERT.
We send our raw data as XML/RSS to HDFS for archival use and audits.
We can view the lineage of our flow in Apache Atlas for full governance of the solution.
So with a simple NiFi flow, we can ingest all the new updates to DailyMed without reprocessing anything we already have.
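The cache-then-UPSERT logic can be sketched as follows. This is a minimal in-memory stand-in for the LookupRecord check against Kudu: the table is keyed by the SPL set id (as a Kudu primary key would be), unchanged records are skipped, and new or changed records are upserted. The class and field names are illustrative, not from the actual flow.

```python
class SplCache:
    """In-memory sketch of the LookupRecord-then-UPSERT pattern."""

    def __init__(self):
        # Stand-in for the Kudu table, keyed by SPL set id (primary key).
        self.table = {}

    def upsert_if_new_or_changed(self, record):
        setid = record["setid"]
        cached = self.table.get(setid)
        if cached == record:
            return False  # already cached: skip reprocessing
        # UPSERT semantics: insert if absent, overwrite if changed.
        self.table[setid] = record
        return True

cache = SplCache()
rec = {"setid": "abc-123", "title": "EXAMPLE DRUG"}
print(cache.upsert_if_new_or_changed(rec))  # new record
print(cache.upsert_if_new_or_changed(rec))  # duplicate, skipped
```

Against a real Kudu table the same effect comes from an Impala `UPSERT INTO` statement, which is what makes Kudu a convenient cache here: repeated feeds converge on one row per SPL.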
Resources
- https://dailymed.nlm.nih.gov/dailymed/
- https://github.com/tspannhw/EverythingApacheNiFi
- https://dzone.com/articles/smart-stocks-with-nifi-kafka-flink-sql
- https://github.com/tspannhw/ApacheConAtHome2020/tree/main/flows/DailyMed