Mark Dsouza

Getting started with Apache NiFi

If you're looking for a quick understanding of Apache NiFi, how to get started, some key concepts, tips and tricks, and other resources on the open source tool, you've come to the right place.

What is Apache NiFi and why should I use it?

"An easy to use, powerful, and reliable system to process and distribute data." - NiFi official website
NiFi is a great open source tool. With NiFi you can build a data flow from anywhere to anywhere - local file system, cloud, HDFS, NoSQL, RDBMS, Kafka. Literally anywhere to anywhere.
If you have a fixed flow of steps, with no heavy computation or complex business logic in between, NiFi makes a lot of sense.
A simple use case: you want to move your data at the end of each day from a DB to, say, HDFS. You can easily do this with NiFi, because it is a fixed set of steps that needs to be done.

NiFi is EASY to use!

You do not need to be a technical person to use NiFi. It has a robust, interactive browser-based UI, and all you need to do is drag and drop processors (actions to be performed on the data) and link them to each other. All you need to know is how to configure each processor.
For instance, if you are connecting to a database, you need to know the connection details for that database.
For certain use cases you might need to know a little of the NiFi Expression Language. After building a few flows it becomes pretty easy to understand.
Docs link: https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html

Getting NiFi running on your machine

Where do I download NiFi?

The team releases updates every few months, so get the latest version from the download page.
Make sure to download the binary: https://nifi.apache.org/download.html
There is also a Docker image if you have Docker installed on your machine:
https://hub.docker.com/r/apache/nifi/

Launching NiFi

Windows users: navigate to the bin folder and run the run-nifi.bat file. You need to have Java installed on your machine and added to the PATH variable.
On Linux or Mac, run the command:

bin/nifi.sh run

This will launch your NiFi server at https://localhost:8443/nifi/, where you will see a login screen.
To get your credentials, open logs/nifi-app.log and search for the generated username and password.
After a successful login, you will see the home screen, where you can start building.
For any issues: https://nifi.apache.org/docs/nifi-docs/html/getting-started.html
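
If you would rather not scroll through the log by hand, here is a minimal Python sketch that pulls the generated login out of the log text. It assumes the log contains lines of the form "Generated Username [...]" and "Generated Password [...]". The helper name and the sample values are made up for illustration.

```python
import re

# Hypothetical helper (not part of NiFi): extracts the auto-generated login
# from the text of the application log.
# Assumes lines shaped like "Generated Username [..]" / "Generated Password [..]".
def find_generated_credentials(log_text):
    username = re.search(r"Generated Username \[([^\]]+)\]", log_text)
    password = re.search(r"Generated Password \[([^\]]+)\]", log_text)
    return (
        username.group(1) if username else None,
        password.group(1) if password else None,
    )

# Placeholder text standing in for real log content.
sample_log = (
    "Generated Username [USERNAME]\n"
    "Generated Password [PASSWORD]\n"
)

print(find_generated_credentials(sample_log))
```

To use it for real, read the application log from the logs folder into a string and pass that in instead of sample_log.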

Key Features of NiFi

Flow Files

The data itself is referred to as a Flow File. It has 2 key parts: the content (the raw data of the file) and the attributes, which contain metadata about the flow file.
For example, when you send an HTTP request, the body of the request would be the Flow File content and the headers would be the attributes.
The flow data does not get copied at each step. It is temporarily stored on the NiFi server, and an automatic cleanup happens once the flow file has finished processing. The flow file works sort of like a pointer to that data as it moves through your entire flow.
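
To make the "pointer" idea concrete, here is a rough conceptual model in Python. It is only a sketch of the idea described above, not NiFi's actual API. The class, the content_store dictionary and the function names are all invented for this example.

```python
from dataclasses import dataclass, field

# Conceptual model only; nothing here is NiFi's real API.
content_store = {}  # stands in for wherever the server keeps the raw content


@dataclass
class FlowFile:
    content_id: str                                  # pointer to the content, not the content itself
    attributes: dict = field(default_factory=dict)   # metadata about the flow file


def from_http_request(request_id, headers, body):
    content_store[request_id] = body                 # the body is stored once
    return FlowFile(content_id=request_id, attributes=dict(headers))


def with_attribute(flow_file, key, value):
    # A step that only changes metadata produces a flow file that still
    # points at the same stored content; the content is not copied.
    new_attributes = {**flow_file.attributes, key: value}
    return FlowFile(flow_file.content_id, new_attributes)


ff = from_http_request("req-1", {"Content-Type": "application/json"}, b'{"id": 1}')
ff2 = with_attribute(ff, "route", "orders")
print(ff2.attributes, content_store[ff2.content_id])
```

Both ff and ff2 refer to the same entry in content_store, which is the point: steps pass the pointer along rather than duplicating the data.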

Processors

A processor is the task that you want to execute on the data/flow file.
Do you want to read from HDFS, or read a file from a location on your machine? Do an HTTP GET request? Generate a file? Insert into a NoSQL or RDBMS database? Each of these can be done with a processor. Overall there are 288 processors that you can use at the time of writing.
And if you don't find the EXACT processor you need? You can write your own with custom code. The NiFi team keeps adding new processors as well.
Once you drag in a processor, it will need to be configured. For example, if you are generating a file, you can provide the data for the file. Not all configuration is mandatory.
If you miss a mandatory configuration, you will see a warning on the processor with a description of what is missing.
Note: Any processor can use any flow file, which is what makes NiFi so powerful.

Linking processors with connectors

We can connect processors to each other using a connector. Depending on the processor, its action may have multiple possible outcomes.
In my example, I have a Consume Kafka processor with 2 possible outcomes - parse.failure or success. You can easily create different flow file paths for either outcome. Maybe you want to log the failure and not proceed further, while on success you might have 2 or 3 additional steps.
If a processor is the end of a flow, we need to explicitly tell NiFi that it is the end. Otherwise NiFi expects us to keep going and will show a warning message.
The moment you link 2 processors, a queue is created. If your 2nd processor is stopped or has an error in its configuration, the flow data will wait in the queue on this connector.
TIP: If you are linking 2 different relationships between processorA and processorB, draw 2 separate lines, creating 2 separate connectors. This especially helps in debugging.
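
A rough way to picture this in Python: each outcome of a processor gets its own queue, and flow files sit there until the next step picks them up. This is only an illustration of the idea. The parse function and the queue layout are made up, and only the two outcome names come from the example above.

```python
from collections import deque

# Illustrative only: one queue per outcome (relationship) of a processor.
queues = {"success": deque(), "parse.failure": deque()}


def parse(record):
    """Pretend processor: returns the outcome name plus its output."""
    try:
        return "success", int(record)
    except ValueError:
        return "parse.failure", record


for record in ["1", "2", "oops", "4"]:
    relationship, output = parse(record)
    queues[relationship].append(output)

# The downstream steps are "stopped" here, so nothing drains the queues
# and the data simply waits in them.
print({name: list(queue) for name, queue in queues.items()})
```

Keeping a separate queue per outcome is also why drawing separate connectors helps with debugging: you can look at the failures on their own.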

Controller Services

Controller Services in NiFi are centrally stored. If you have 10 different calls for various CRUD operations depending on some business logic, you do not need to configure 10 different connections to the DB. All you need to do is create the service once and then reference it from each of your processors.
You will also need to provide a JAR file for the driver (if required) so NiFi can make the appropriate connection.
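
The same idea in plain Python, as a loose analogy: the connection is configured once and every step just references it. This sketch uses SQLite from the standard library purely for illustration. The DbService class and the functions are invented and have nothing to do with how NiFi implements controller services.

```python
import sqlite3


# Illustrative only: one shared "service" holding the connection details,
# referenced by several steps instead of each step configuring its own.
class DbService:
    def __init__(self, database):
        self.connection = sqlite3.connect(database)


db_service = DbService(":memory:")  # configured once


def create_table(service):
    service.connection.execute("CREATE TABLE users (name TEXT)")


def insert_user(service, name):
    service.connection.execute("INSERT INTO users (name) VALUES (?)", (name,))


def count_users(service):
    return service.connection.execute("SELECT COUNT(*) FROM users").fetchone()[0]


create_table(db_service)
insert_user(db_service, "alice")
insert_user(db_service, "bob")
print(count_users(db_service))
```

If the database details change, only the one service needs to be updated, not every step that uses it.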

Other concepts

Process Groups - You can group part of a flow, which helps you segregate different parts of your flow logically.
Input Port/Output Port - Data enters a Process Group through an input port (not a server port). Similarly, within the group there can be multiple paths terminating in multiple output ports. One group can have multiple input ports.
Templates - You can create a template containing all or part of your flow. This can then be shared with other developers. You can save/import/delete templates.
Funnel - You can combine data from different sources with a funnel.
Label - For better visualization. Has no impact on the data.

Conditional Statements in NiFi

If Else Logic based on attributes/content

The most useful processors for implementing dynamic behavior are RouteOnAttribute and RouteOnContent. They basically help you implement IF ELSE statements while the data flows. Say NiFi is reading an HTTP request: you can route based on the type of request (a different route for POST and DELETE on the same HTTP endpoint).
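
Here is what that routing amounts to, written as ordinary Python. In NiFi itself you would express each condition in the Expression Language, with something along the lines of ${http.method:equals('POST')}. The attribute name http.method and the route names below are assumptions for this sketch, so check what your incoming flow files actually carry.

```python
# Illustrative only: attribute-based routing written as a plain if/else.
# "http.method" and the route names are assumptions for this sketch.
def route_on_attribute(attributes):
    method = attributes.get("http.method")
    if method == "POST":
        return "create"
    elif method == "DELETE":
        return "remove"
    return "unmatched"  # nothing matched, so the flow file takes the fallback route


print(route_on_attribute({"http.method": "POST"}))
print(route_on_attribute({"http.method": "DELETE"}))
print(route_on_attribute({"http.method": "GET"}))
```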

For Loops in NiFi

NiFi isn't great for working with loops, but there are ways to do it. You can split one file into many files based on some condition. Say you have a list of file names as an attribute: FileA,FileB,FileC. Now you want to create 3 files and place them in a target folder. You can split your first flow file into 3 flow files (or X flow files), each carrying one of the individual file names as an attribute.
But things that might be just 2 or 3 lines in JS, Java or any other language might take a little thinking to implement using NiFi processors. Example: I had an array of strings and wanted to remove 1 string based on some business logic. This was extremely hard to do and took way too many steps to implement, whereas it is a single array filter function in any programming language. (You can opt for custom coding in NiFi as well, which is ideal for these more complicated scenarios.)
Note: You can also re-merge your flow files after splitting them up.
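
For comparison, this is roughly what split, filter and merge look like in plain Python. The flow files are represented as simple dictionaries of attributes. The function names and the rule "drop FileB" are made up for the example.

```python
# Illustrative only: flow files modelled as plain dicts of attributes.
def split_by_file_names(attributes):
    # One flow file in, one flow file out per file name.
    names = attributes["fileNames"].split(",")
    return [{"fileName": name} for name in names]


def merge(flow_files):
    # Combine the individual flow files back into a single one.
    return {"fileNames": ",".join(ff["fileName"] for ff in flow_files)}


original = {"fileNames": "FileA,FileB,FileC"}
splits = split_by_file_names(original)

# The "array filter" step: a one-liner here, several processors in a flow.
kept = [ff for ff in splits if ff["fileName"] != "FileB"]

merged = merge(kept)
print(splits)
print(merged)
```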

Scheduling

There are 2 ways of scheduling - Timer driven or CRON driven.
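
For the CRON driven option, the expression has a seconds field at the front, so it is longer than the 5-field cron lines you may know from Linux. The small Python sketch below just labels the fields of an expression so you can check that you put the values in the right place. It assumes the layout seconds, minutes, hours, day of month, month, day of week and an optional year. The helper name is made up.

```python
# Assumed field order for a CRON driven schedule: 6 required fields plus an optional year.
CRON_FIELDS = ["seconds", "minutes", "hours", "day of month", "month", "day of week", "year"]


def describe_cron(expression):
    """Hypothetical helper: pairs each field of the expression with its name."""
    parts = expression.split()
    if not 6 <= len(parts) <= 7:
        raise ValueError("expected 6 or 7 space-separated fields")
    return dict(zip(CRON_FIELDS, parts))


# Intended as "every day at 23:00:00", e.g. for an end-of-day transfer.
nightly = describe_cron("0 0 23 * * ?")
print(nightly)
```

A Timer driven schedule is simpler: you just give an interval, and the processor runs once per interval.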

Data Provenance

Data provenance is the documentation of where a piece of data comes from and the processes and methodology by which it was produced. NiFi lets you easily track data at the application level.

Tips on debugging

Let’s say you have a simple flow, but you need to see the incoming data in order to build the rest of the flow. You can create your connections and stop the processor you haven’t configured yet.
I have a simple Generate log file processor connected to a Log Attribute processor. The connector between them displays a queue count.
Right click the connector and choose List Queue to view the queue.
Now you can see all flow files that are currently in this queue. You can view the attributes and metadata of each file and even see the raw content.
This helps you build the flow and see the data as it progresses through the entire flow.

When not to use NiFi

NiFi isn’t great for computation. Basic transformation? That’s fine. But if you want to do something like take the SUM of a column or other aggregations - NiFi isn’t meant for that.

Other Resources

Where can I get help?

I have to point you to the official docs first
https://nifi.apache.org/docs.html
https://nifi.apache.org/docs/nifi-docs/html/getting-started.html
https://nifi.apache.org/developer-guide.html
https://cwiki.apache.org/confluence/display/NIFI/FAQs
For a good quick read
https://www.guru99.com/apache-nifi-tutorial.html
Though these docs are pretty great, there aren't any real examples of how to use a processor. The 2 forums where you're most likely to find your answers are Cloudera and Stack Overflow. I found a lot of useful use cases in the Cloudera forums, so don't overlook those results in your Google searches.

Are there any good videos on NiFi?

When it comes to UI-driven tools, I find it a lot easier to grasp things by watching videos. Code can simply be read, but here there is UI interaction involved rather than static code.
A set of well curated videos touching many essential features of NiFi with plenty of hands on: https://www.youtube.com/watch?v=VVnFt54jUQ8&list=PL55symSEWBbMBSnNW_Aboh2TpYkNIFMgb
A long (bits can be skipped) overview with hands on:
https://www.youtube.com/watch?v=fblkgr1PJ0o

If you have any other suggestions/content for NiFi beginners, please share them in the comments section.
Happy learning!
