FLaNK: Using Apache Kudu As a Cache For FDA Updates
- InvokeHTTP: We invoke the RSS feed to get our data.
- QueryRecord: Convert RSS to JSON
- SplitJson: Split one file into individual records. (Should refactor to ForkRecord)
- EvaluateJSONPath: Extract the attributes we need.
- ProcessGroup for SPL Processing
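The ingest steps above can be sketched outside NiFi as well. Here is a minimal Python sketch of the same pipeline shape: fetch RSS, convert it to JSON-style records, split into individual records, and extract the SPL set id. The sample RSS body and its fields are hypothetical stand-ins for the DailyMed feed, not its actual schema.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample of the DailyMed RSS shape, for illustration only.
RSS_SAMPLE = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>DailyMed Updates</title>
    <item>
      <title>EXAMPLE DRUG (example substance) tablet</title>
      <link>https://dailymed.nlm.nih.gov/dailymed/drugInfo.cfm?setid=abc-123</link>
      <pubDate>Mon, 05 Oct 2020 00:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""

def rss_to_records(rss_xml):
    """QueryRecord + SplitJson equivalent: one dict per RSS item."""
    root = ET.fromstring(rss_xml)
    records = []
    for item in root.iter("item"):
        rec = {child.tag: (child.text or "").strip() for child in item}
        # EvaluateJSONPath equivalent: pull the SPL set id out of the link.
        rec["setid"] = rec.get("link", "").rsplit("setid=", 1)[-1]
        records.append(rec)
    return records

records = rss_to_records(RSS_SAMPLE)
print(json.dumps(records, indent=2))
```

In the real flow, InvokeHTTP fetches the live feed and record-oriented processors do this work declaratively; the sketch just makes the transformation steps concrete.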
SPL Processing
We use the extracted SPL id to grab the detailed record for that SPL via InvokeHTTP.
We use LookupRecord to check our Kudu cache for that record.
If we don't have that value yet, we send the record to the Kudu table for an UPSERT.
We send our raw data as XML/RSS to HDFS for archival use and audits.
We can view the lineage of our flow in Apache Atlas for full governance of the solution.
So with a simple NiFi flow, we can ingest all the new updates to DailyMed without reprocessing anything we already have.
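The cache-then-UPSERT logic can be sketched as follows. This is a minimal in-memory stand-in for the LookupRecord check against Kudu: the table is keyed by the SPL set id (as a Kudu primary key would be), unchanged records are skipped, and new or changed records are upserted. The class and field names are illustrative, not from the actual flow.

```python
class SplCache:
    """In-memory sketch of the LookupRecord-then-UPSERT pattern."""

    def __init__(self):
        # Stand-in for the Kudu table, keyed by SPL set id (primary key).
        self.table = {}

    def upsert_if_new_or_changed(self, record):
        setid = record["setid"]
        cached = self.table.get(setid)
        if cached == record:
            return False  # already cached: skip reprocessing
        # UPSERT semantics: insert if absent, overwrite if changed.
        self.table[setid] = record
        return True

cache = SplCache()
rec = {"setid": "abc-123", "title": "EXAMPLE DRUG"}
print(cache.upsert_if_new_or_changed(rec))  # new record
print(cache.upsert_if_new_or_changed(rec))  # duplicate, skipped
```

Against a real Kudu table the same effect comes from an Impala `UPSERT INTO` statement, which is what makes Kudu a convenient cache here: repeated feeds converge on one row per SPL.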
Resources
- https://dailymed.nlm.nih.gov/dailymed/
- https://github.com/tspannhw/EverythingApacheNiFi
- https://dzone.com/articles/smart-stocks-with-nifi-kafka-flink-sql
- https://github.com/tspannhw/ApacheConAtHome2020/tree/main/flows/DailyMed