Best in Flow Competition Tutorials Part 2 - Tutorial 1

  1. Reading and filtering a stream of syslog data

You have been tasked with filtering a noisy stream of syslog events that is available in a Kafka topic. The goal is to identify critical events and write them to the Kafka topic you just created.


Related documentation is here.


1.1 Open ReadyFlow & start Test Session


  1. Navigate to DataFlow from the Home Page 




  2. Navigate to the ReadyFlow Gallery

  3. Explore the ReadyFlow Gallery


Info:
The ReadyFlow Gallery is where you can find out-of-the-box templates for common data movement use cases. You can create deployments directly from a ReadyFlow, or create new drafts and modify the processing logic according to your needs before deploying.


  4. Select the “Kafka filter to Kafka” ReadyFlow.

  5. Get your user ID from your profile. It is usually the first part of your email address: for example, if the email is tim@sparkdeveloper.com, the user ID is tim. This is your “Workload User Name”, which you will need for several things, so remember it.


  6. You already created a new topic to receive data in the setup section: <<replace_with_userid>>_syslog_critical (Ex: tim_syslog_critical).


  7. Click on “Create New Draft” to open the ReadyFlow in the Designer, giving it the name youruserid_kafkafilterkafka (for example tim_kafkafilterkafka). If your user ID has periods, underscores or other non-alphanumeric characters, just leave them out. Select a workspace from the available options in the dropdown; you should only have one available.
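The naming rule above is purely mechanical: keep only the letters and digits of your user ID, then append the flow name. If it helps, here is a tiny illustrative Python sketch (the helper name and sample inputs are made up, not part of the tutorial):

```python
import re

def draft_name(user_id: str, flow: str = "kafkafilterkafka") -> str:
    """Build a Designer draft name like tim_kafkafilterkafka from a workload user ID."""
    # Drop periods, underscores and any other non-alphanumeric characters,
    # per the naming guidance in the step above.
    clean = re.sub(r"[^a-z0-9]", "", user_id.lower())
    return f"{clean}_{flow}"

print(draft_name("tim"))           # -> tim_kafkafilterkafka
print(draft_name("t.im_jones"))    # -> timjones_kafkafilterkafka
```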


  8. Start a Test Session by either clicking on the start a test session link in the banner or going to Flow Options and selecting Start in the Test Session section.

  9. In the Test Session creation wizard, select the latest NiFi version and click Start Test Session. Leave the other options at their default values. Notice how the status at the top now says “Initializing Test Session”.


Info:
Test Sessions provision infrastructure on the fly and allow you to start and stop individual processors and send data through your flow. By running data through processors step by step and using the data viewer as needed, you are able to validate your processing logic during development in an iterative way, without having to treat your entire data flow as one deployable unit.


1.2 Modifying the flow to read syslog data

The flow consists of three processors and looks very promising for our use case. The first processor reads data from a Kafka topic, the second processor allows us to filter the events before the third processor writes the filtered events to another Kafka topic.
All we have to do now to reach our goal is to customize its configuration to our use case. 



  1. Provide values for predefined parameters

    1. Navigate to Flow Options→ Parameters

    2. Some settings are already defined as parameters; others are not and can be set manually. Make sure you create a parameter for the Group ID.

    3. Configure the following parameters:



| Name | Description | Value |
| --- | --- | --- |
| CDP Workload User | CDP Workload User | <Your own workload user ID that you saved when you configured your workload password> |
| CDP Workload User Password | CDP Workload User Password | <Your own workload user password you configured in the earlier step> |
| Filter Rule | Filter Rule | SELECT * FROM FLOWFILE WHERE severity <= 2 |
| Data Input Format | | AVRO |
| Data Output Format | | JSON |
| Kafka Consumer Group ID | ConsumeFromKafka | <<replace_with_userid>>_cdf (Ex: tim_cdf) |
| Group ID | ConsumeFromKafka | <<replace_with_userid>>_cdf (Ex: tim_cdf) |
| Kafka Broker Endpoint | Comma-separated list of Kafka Broker addresses | oss-kafka-demo-corebroker2.oss-demo.qsm5-opic.cloudera.site:9093, oss-kafka-demo-corebroker1.oss-demo.qsm5-opic.cloudera.site:9093, oss-kafka-demo-corebroker0.oss-demo.qsm5-opic.cloudera.site:9093 |
| Kafka Destination Topic | Must be unique | <<replace_with_userid>>_syslog_critical (Ex: tim_syslog_critical) |
| Kafka Producer ID | Must be unique | <<replace_with_userid>>_cdf_producer1 (Ex: tim_cdf_producer1) |
| Kafka Source Topic | | syslog_avro |
| Schema Name | | syslog |
| Schema Registry Hostname | Hostname from Kafka cluster | oss-kafka-demo-master0.oss-demo.qsm5-opic.cloudera.site |

  2. Click Apply Changes to save the parameter values.

  3. If confirmation is requested, click OK.


  4. Start Controller Services

    1. Navigate to Flow Options → Services

    2. Select the CDP_Schema_Registry service and click the Enable Service and Referencing Components action. If the service does not enable, the cause is often an error or an extra space in one of the parameters; for example, AVRO must not contain a newline or blank spaces. If you run into an issue, the first thing to try is to stop the test session in the Designer and then start a new one. Check the Tips guide for more help or contact us on bestinflow.slack.com.

    3. Start from the top of the list and enable all remaining Controller services

    4. Make sure all services have been enabled.   You may need to reload the page or try it in a new tab.

  5. If your processors all started when you enabled the controller services, it is best to stop them all by right-clicking each one and selecting Stop, and then start them one at a time so you can follow the process more easily. Start the ConsumeFromKafka processor using the right-click action menu or the Start button in the configuration drawer.


After starting the processor, you should see events starting to queue up in the success_ConsumeFromKafka-FilterEvents connection.
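Optionally, you can also confirm outside of NiFi that the source topic is receiving data. The sketch below is an illustrative example rather than part of the tutorial: it uses the kafka-python package and assumes the brokers accept your workload user and password over SASL_SSL with the PLAIN mechanism (verify this against your cluster configuration). It only prints message sizes, because the records on the source topic are AVRO-encoded.

```python
# Optional sanity check: confirm records are arriving on the syslog_avro topic.
# Assumes SASL_SSL + PLAIN with your CDP workload credentials (an assumption,
# not something this tutorial configures). Requires: pip install kafka-python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "syslog_avro",
    bootstrap_servers=[
        "oss-kafka-demo-corebroker0.oss-demo.qsm5-opic.cloudera.site:9093",
        "oss-kafka-demo-corebroker1.oss-demo.qsm5-opic.cloudera.site:9093",
        "oss-kafka-demo-corebroker2.oss-demo.qsm5-opic.cloudera.site:9093",
    ],
    security_protocol="SASL_SSL",
    sasl_mechanism="PLAIN",
    sasl_plain_username="your_workload_user",    # e.g. tim
    sasl_plain_password="your_workload_password",
    group_id="your_userid_check",                # throwaway consumer group for this check
    auto_offset_reset="earliest",
    consumer_timeout_ms=10000,                   # stop iterating after 10 s of silence
)

for i, record in enumerate(consumer):
    # The payload is AVRO-encoded, so we only confirm that bytes are flowing.
    print(f"partition={record.partition} offset={record.offset} size={len(record.value)} bytes")
    if i >= 4:
        break

consumer.close()
```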

  6. Verify data being consumed from Kafka

    1. Right-click on the success_ConsumeFromKafka-FilterEvents connection and select List Queue


Info:
The List Queue interface shows you all flow files that are being queued in this connection. Click on a flow file to see its metadata in the form of attributes. In our use case, the attributes tell us a lot about the Kafka source from which we are consuming the data. Attributes change depending on the source you’re working with and can also be used to store additional metadata that you generate in your flow. 



  7. Select any flow file in the queue and click the book icon to open it in the Data Viewer


Info: The Data Viewer displays the content of the selected flow file and shows you the events that we have received from Kafka. It automatically detects the data format - in this case JSON - and presents it in human readable format. 


  8. Scroll through the content and note how we are receiving syslog events with varying severity.
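Info: Syslog severities follow the standard 0-7 scale (0 = Emergency through 7 = Debug), so the Filter Rule SELECT * FROM FLOWFILE WHERE severity <= 2 keeps only Emergency, Alert and Critical events. The following Python sketch, using made-up sample records, illustrates the same test the flow will apply:

```python
# Illustration only: the same "severity <= 2" condition the filter rule applies.
SEVERITY_NAMES = {
    0: "Emergency", 1: "Alert", 2: "Critical", 3: "Error",
    4: "Warning", 5: "Notice", 6: "Informational", 7: "Debug",
}

sample_events = [  # made-up records; real ones come from the syslog_avro topic
    {"severity": 1, "body": "disk failure imminent"},
    {"severity": 6, "body": "user login"},
    {"severity": 2, "body": "service crashed"},
]

critical = [e for e in sample_events if e["severity"] <= 2]
for event in critical:
    print(SEVERITY_NAMES[event["severity"]], "-", event["body"])
```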



  9. Define filter rule to filter out low severity events

    1. Return to the Flow Designer by closing the Data Viewer tab and clicking Back To Flow Designer in the List Queue screen.

    2. Select the Filter Events processor on the canvas. We are using a QueryRecord processor to filter out low severity events. The QueryRecord processor is very flexible and can run several filtering or routing rules at once.

    3. In the configuration drawer, scroll down until you see the filtered_events property. We are going to use this property to filter out the events. Click on the menu at the end of the row and select Go To Parameter.

    4. If you wish to change the filter rule, you can change the parameter value here.

    5. Click Apply Changes to update the parameter value. Return to the Flow Designer.

    6. Start the Filter Events processor using the right-click menu or the Start icon in the configuration drawer.

  10. Verify that the filter rule works

    1. After starting the Filter Events processor, flow files will start queueing up in the filtered_events-FilterEvents-WriteToKafka connection

    2. Right-click the filtered_events-FilterEvents-WriteToKafka connection and select List Queue.

    3. Select a few random flow files and open them in the Data Viewer to verify that only events with severity <=2 are present.

    4. Navigate back to the Flow Designer canvas.

  11. Write the filtered events to the Kafka alerts topic
    Now all that is left is to start the WriteToKafka processor to write our filtered high-severity events to your <<replace_with_userid>>_syslog_critical Kafka topic.

    1. Select the WriteToKafka processor and explore its properties in the configuration drawer. 

    2. Note how we are plugging in many of our parameters to configure this processor. Values like Kafka Brokers, Topic Name, Username, Password and the Record Writer have all been parameterized and use the values that we provided in the very beginning.

    3. Start the WriteToKafka processor using the right-click menu or the Start icon in the configuration drawer.


Congratulations! You have successfully customized this ReadyFlow and achieved your goal of sending critical alerts to a dedicated topic! Now that you are done with developing your flow, it is time to deploy it in production!
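If you would like an end-to-end check outside of NiFi, you can read a few records back from your destination topic and confirm that every event has severity <= 2. As with the earlier sketch, this is illustrative only: it uses the kafka-python package, assumes SASL_SSL with the PLAIN mechanism for your workload credentials, and you must substitute your own topic name and credentials.

```python
# Illustrative end-to-end check on the destination topic (JSON records).
# Assumes SASL_SSL + PLAIN with your workload credentials; adjust as needed.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "tim_syslog_critical",                       # your <<replace_with_userid>>_syslog_critical topic
    bootstrap_servers=[
        "oss-kafka-demo-corebroker0.oss-demo.qsm5-opic.cloudera.site:9093",
        "oss-kafka-demo-corebroker1.oss-demo.qsm5-opic.cloudera.site:9093",
        "oss-kafka-demo-corebroker2.oss-demo.qsm5-opic.cloudera.site:9093",
    ],
    security_protocol="SASL_SSL",
    sasl_mechanism="PLAIN",
    sasl_plain_username="your_workload_user",
    sasl_plain_password="your_workload_password",
    group_id="your_userid_verify",               # throwaway consumer group for this check
    auto_offset_reset="earliest",
    consumer_timeout_ms=10000,                   # stop after 10 s with no new records
)

for record in consumer:
    event = json.loads(record.value)             # the flow writes JSON to this topic
    assert int(event["severity"]) <= 2, f"unexpected low-priority event: {event}"
    print("ok, severity =", event["severity"])

consumer.close()
```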




1.3 Publishing your flow to the catalog


  1.  Stop the Test Session

    1. Click the toggle next to Active Test Session to stop your Test Session

    2. Click “End” in the dialog to confirm. The Test Session is now stopping and allocated resources are being released


  2. Publish your modified flow to the Catalog

    1. Open the “Flow Options” menu at the top

    2. Click “Publish” to make your modified flow available in the Catalog

    3. Prefix your username to the Flow Name and provide a Flow Description. Click Publish



    4. You are now redirected to your published flow definition in the Catalog. 



Info: The Catalog is the central repository for all your deployable flow definitions. From here you can create auto-scaling deployments from any version or create new drafts and update your flow processing logic to create new versions of your flow.



1.4 Creating an auto-scaling flow deployment


  1. As soon as you publish your flow, it should take you to the Catalog. If it does not, locate your flow definition in the Catalog:

    1. Make sure you have navigated to the Catalog

    2. If you have closed the sidebar, search for your published flow by typing <<yourid>> into the search bar in the Catalog. Click on the flow definition that matches the name you gave it earlier.

    3. After opening the side panel, click Deploy, select the available environment from the drop down menu and click Continue to start the Deployment Wizard.




If you have any issues, log out, close and restart your browser, try an incognito window, and log in again. Also see the “Best Practices Guide”.


  2. Complete the Deployment Wizard
    The Deployment Wizard guides you through a six-step process to create a flow deployment. Throughout the six steps you will choose the NiFi configuration of your flow, provide parameters and define KPIs. At the end of the process, you are able to generate a CLI command to automate future deployments.


Note: The Deployment name has a cap of 27 characters, which needs to be considered as you choose the prod name.

  3. Provide a name such as <<your_username>>_kafkatokafka_prod to indicate the use case and that you are deploying a production flow. Click Next.



  4. The NiFi Configuration screen allows you to customize the runtime that will execute your flow. You have the opportunity to pick from various released NiFi versions.

    Select the Latest Version and make sure Automatically start flow upon successful deployment is checked.

    Click Next.

  5. The Parameters step is where you provide values for all the parameters that you defined in your flow. In this example, you should recognize many of the prefilled values from the previous exercise - including the Filter Rule and our Kafka Source and Kafka Destination Topics.

    To advance, you have to provide values for all parameters. Select the No Value option to only display parameters without default values.


    You should now only see one parameter - the CDP Workload User Password parameter which is sensitive. Sensitive parameter values are removed when you publish a flow to the catalog to make sure passwords don’t leak.

    Provide your CDP Workload User Password and click Next to continue.


  6. The Sizing & Scaling step lets you choose the resources that you want to allocate for this deployment. You can choose from several node configurations and turn on Auto-Scaling.

    Let’s choose the Extra Small Node Size and turn on Auto-Scaling from 1-3 nodes. Click Next to advance.



  7. The Key Performance Indicators (KPI) step allows you to monitor flow performance. You can create KPIs for overall flow performance metrics or in-depth processor or connection metrics.

    Add the following KPI

  • KPI Scope: Entire Flow

  • Metric to Track: Data Out

  • Alerts:

    • Trigger alert when metric is less than: 1 MB/sec

    • Alert will be triggered when the metric is outside the boundary(s) for: 1 Minute


Add the following KPI

  • KPI Scope: Processor

  • Processor Name: ConsumeFromKafka

  • Metric to Track: Bytes Received

  • Alerts:

    • Trigger alert when metric is less than: 512 KBytes/sec

    • Alert will be triggered when the metric is outside the boundary(s) for: 30 seconds



Review the KPIs and click Next.


  8. In the Review page, review your deployment details.

    Notice that this page has a >_ View CLI Command link. You will use the information on this page in the next section to deploy a flow using the CLI. For now, you just need to save the script and dependencies provided there:

  1. Click on the >_ View CLI Command link and familiarize yourself with the content.

  2. Download the 2 JSON dependency files by clicking on the download button:

    1. Flow Deployment Parameters JSON

    2. Flow Deployment KPIs JSON

  3. Copy the command at the end of this page and save it in a file called deploy.sh

  4. Close the Equivalent CDP CLI Command tab.


  9. Click Deploy to initiate the flow deployment!

  10. You are redirected to the Deployment Dashboard where you can monitor the progress of your deployment. Creating the deployment should only take a few minutes.



  11. Congratulations! Your flow deployment has been created and is already processing Syslog events!


Please wait until your deployment has finished the Deploying and Importing Flow stages. Wait for Good Health.



1.5 Monitoring your flow deployment



  1. Notice how the dashboard shows you the data rates at which a deployment currently receives and sends data. The data is also visualized in a graph that shows the two metrics over time. 

  2. Change the Metrics Window setting at the top right. You can visualize as much as 1 Day.

  3. Click on the yourid_kafkafilterkafka_prod deployment. The side panel opens and shows more detail about the deployment. On the KPIs tab it will show information about the KPIs that you created when deploying the flow.

    Using the two KPIs Bytes Received and Data Out we can observe that our flow is filtering out data as expected since it reads more than it sends out.


Wait a few minutes so that some data and metrics can be generated.


  4. Switch to the System Metrics tab where you can observe the current CPU utilization rate for the deployment. Our flow is not doing a lot of heavy transformation, so it should hover at around 10% CPU usage.

  5. Close the side panel by clicking anywhere on the Dashboard.

  6. Notice how your yourid_kafkafilterkafka_prod deployment shows a Concerning Health status. Hover over the warning icon and click View Details.

  7. You will be redirected to the Alerts tab of the deployment. Here you get an overview of active and past alerts and events. Expand the Active Alert to learn more about its cause.


    After expanding the alert, it is clear that it is caused by a KPI threshold breach for sending less than 1 MB/s to external systems, as defined earlier when you created the deployment.

1.6 Managing your flow deployment


  1. Click on the yourid_kafkafilterkafka_prod deployment in the Dashboard. In the side panel, click Manage Deployment at the top right.

  2. You are now redirected to the Deployment Manager. The Deployment Manager allows you to reconfigure the deployment: modify KPIs, change the number of NiFi nodes, turn auto-scaling on or off, or update parameter values.


  3. Explore the NiFi UI for the deployment. Click the Actions menu and click on View in NiFi.


  4. You are being redirected to the NiFi cluster running the flow deployment. You can use this view for in-depth troubleshooting. Users can have read-only or read/write permissions to the flow deployment.