Using Apache NiFi with Redpanda for your Kafka workloads

#redpanda #tutorial #tooling #kafka

Apache NiFi is a tool for moving and processing data between systems. It started as an internal project at the National Security Agency (NSA), where it was known by the codename NiagaraFiles. Through a technology transfer program, the NSA released it as an open source project under the Apache Software Foundation (ASF). NiFi models streaming data as a series of flows through a directed graph. Data generally flows in one direction, downstream, much the way water runs through streams and rivers toward Niagara Falls and beyond.

A NiFi graph consists of processors and connectors, which correspond to the nodes and edges of the graph, respectively. Processors can do some lightweight processing such as data transformation, format conversion, and message routing, whereas connectors model the movement of data between processors. Flows start and end with special processors called sources and sinks. Sources and sinks are integration points to external systems like databases, message queues, and filesystems.
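
To make the processor/connector vocabulary concrete, here is a minimal sketch of a linear flow in plain Python. It is a toy model, not NiFi's API: the names `Processor`, `Connector`, and `run_flow` are hypothetical, and real NiFi flows can branch and run concurrently, which this sketch does not attempt.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Optional

# Hypothetical names throughout: a toy model of the graph, not NiFi's API.
Record = Dict[str, str]


class Connector:
    """An edge of the graph: a FIFO queue between two processors."""

    def __init__(self) -> None:
        self.queue: Deque[Record] = deque()


class Processor:
    """A node of the graph: maps one record to a new record, or None to drop it."""

    def __init__(self, name: str, fn: Callable[[Record], Optional[Record]]) -> None:
        self.name = name
        self.fn = fn


def run_flow(source_records: List[Record], processors: List[Processor]) -> List[Record]:
    """Push records from a source through a chain of processors and return the sink contents."""
    # One connector feeds each processor, plus a final one that feeds the sink.
    connectors = [Connector() for _ in range(len(processors) + 1)]
    connectors[0].queue.extend(source_records)
    for index, processor in enumerate(processors):
        inbound, outbound = connectors[index], connectors[index + 1]
        while inbound.queue:
            result = processor.fn(inbound.queue.popleft())
            if result is not None:
                outbound.queue.append(result)
    return list(connectors[-1].queue)


# Routing step: drop records with an empty city. Transformation step: upper-case it.
drop_empty = Processor("drop_empty", lambda r: r if r["city"] else None)
to_upper = Processor("to_upper", lambda r: {**r, "city": r["city"].upper()})

sink = run_flow([{"city": "buffalo"}, {"city": ""}], [drop_empty, to_upper])
print(sink)
```

Each processor only sees its inbound queue and writes to its outbound queue, which is the property that lets a tool like NiFi draw the whole pipeline as a graph.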

Compared with Redpanda

NiFi has functional overlaps with pub/sub messaging systems like Redpanda, Kafka, and Pulsar. All of these systems were designed to facilitate the movement of data from one system to another. However, there are key design differences to take into account when choosing one for your use case.

NiFi is flow-centric. It supports a no-code/low-code style of development in which the user draws the graph visually using a drag-and-drop interface. Because users define the dataflow as a graph, NiFi keeps an explicit lineage showing the provenance of data, a critical feature for auditability and data governance. It also comes with an extensive set of processors (200+) for connecting with the most commonly deployed data systems in the market. If all you need is to move data from A to B, and perhaps do some light data transformation along the way, then NiFi has your needs covered.

Pub/sub systems like Redpanda are topic-centric: they organize data into topics and decouple producers from consumers. Clients can publish and subscribe to any topic as they please. This gives architects the ability to design a system that can evolve over time. Using Redpanda as a sort of data interchange, architects can add or remove system components like databases, search engines, or other services as the business need arises without affecting other components that rely on the same data. The ability to store and retain historical event data also means that you can replay events to support audits or backtesting. The topic-centric approach can also be used to implement a message bus in support of microservices or event-driven architectures. It allows developers to concentrate on their individual component or service, publishing to or consuming from one or more topics without having to worry about the broader system.
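
The decoupling and replay ideas can be illustrated with a small in-memory sketch. This is a stand-in for a broker written for this article, not Redpanda's or Kafka's client API: `TopicLog`, `publish`, and `read` are hypothetical names, and a real broker adds partitions, persistence, and consumer groups.

```python
from collections import defaultdict
from typing import Dict, List


class TopicLog:
    """Toy in-memory stand-in for a broker: one append-only log per topic."""

    def __init__(self) -> None:
        self._topics: Dict[str, List[str]] = defaultdict(list)

    def publish(self, topic: str, message: str) -> int:
        """Append a message to the topic and return its offset."""
        log = self._topics[topic]
        log.append(message)
        return len(log) - 1

    def read(self, topic: str, offset: int = 0) -> List[str]:
        """Return every retained message from the given offset onward."""
        return list(self._topics[topic][offset:])


broker = TopicLog()
for event in ("order-1", "order-2", "order-3"):
    broker.publish("orders", event)

# Consumers track their own offsets, so they never interfere with each other
# or with the producer.
search_index = broker.read("orders", 0)   # a consumer that has read everything
audit_replay = broker.read("orders", 0)   # a later consumer replaying history
latest_only = broker.read("orders", 2)    # a consumer that starts near the tail
print(search_index, audit_replay, latest_only)
```

Because reading does not remove messages from the log, a new component can be added later and still see the full history, which is what makes replay for audits or backtesting possible.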

In general, NiFi is focused on the user experience of building flows to route and process data moving between known systems, whereas Redpanda is a streaming substrate that acts as a conduit for messages between loosely coupled systems, with certain guarantees around throughput, latency, and durability.

NiFi with Redpanda

All that said, it is common to see NiFi used with pub/sub systems in the wild. NiFi includes the PublishKafka and ConsumeKafka processors that allow authors to start or end a flow with Kafka as the endpoint. In a typical integration, Kafka is used as a central message bus and NiFi is used as a ‘last mile’ connector that shuttles data between various systems and the central message bus. This was especially necessary in the early years of Kafka, which was initially released in 2011, and NiFi, which was open sourced in 2014. Kafka Connect arrived in late 2015 with Kafka 0.9, and has since added a lot of functionality previously covered by NiFi, such as pluggable connectors and single message transforms. While newer Kafka deployments may favor Kafka Connect, the NiFi+Kafka combination gets you the best of both worlds: a loosely coupled event bus from Kafka, and a low-code graphical user interface for connecting to data sources and sinks via NiFi.

Given Redpanda’s compatibility with the Kafka ecosystem, you can use the existing PublishKafka and ConsumeKafka integration points to link it with NiFi. We’ve provided a docker-compose.yml file below so you can spin up your very own Redpanda-NiFi integration. There are existing examples out there that show you how to build a NiFi dataflow using the Kafka processors. Just remember to substitute "redpanda:29092" wherever you specify the Kafka broker address in the examples, since that is the address Redpanda advertises to other containers on the Compose network.

version: '3.7'
services:
  redpanda:
    image: docker.vectorized.io/vectorized/redpanda:v21.9.3
    command:
      - redpanda start
      - --smp 1
      - --memory 512M
      - --reserve-memory 0M
      - --overprovisioned
      - --node-id 0
      - --set redpanda.auto_create_topics_enabled=true
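      # PLAINTEXT serves containers on the Compose network (such as NiFi);
      # OUTSIDE serves clients running on the Docker host.
      - --kafka-addr PLAINTEXT://0.0.0.0:29092,OUTSIDE://0.0.0.0:9092
      - --advertise-kafka-addr PLAINTEXT://redpanda:29092,OUTSIDE://localhost:9092
    ports:
      # Kafka API (OUTSIDE listener) for clients on the host
      - "9092:9092"
      # Redpanda admin API
      - "9644:9644"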
    volumes:
      - /var/lib/redpanda/data
  nifi:
    image: apache/nifi:latest
    restart: unless-stopped
    ports:
      - "8080:8080"
    environment:
      # Serve the NiFi UI over plain HTTP on port 8080
      - NIFI_WEB_HTTP_PORT=8080
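
The two listeners in the compose file are easy to mix up, so here is a small sketch that parses the `--advertise-kafka-addr` value from above and picks the broker address a client should use. The helper names `parse_listeners` and `bootstrap_server` are hypothetical and exist only for illustration; the listener string itself is copied from the compose file.

```python
from typing import Dict, Tuple

# Copied from the --advertise-kafka-addr flag in the compose file above.
ADVERTISED = "PLAINTEXT://redpanda:29092,OUTSIDE://localhost:9092"


def parse_listeners(spec: str) -> Dict[str, Tuple[str, int]]:
    """Split 'NAME://host:port,...' into a mapping of listener name to (host, port)."""
    listeners: Dict[str, Tuple[str, int]] = {}
    for entry in spec.split(","):
        name, address = entry.split("://", 1)
        host, port = address.rsplit(":", 1)
        listeners[name] = (host, int(port))
    return listeners


def bootstrap_server(inside_compose_network: bool) -> str:
    """Return the broker address for a client inside or outside the Compose network."""
    listener = "PLAINTEXT" if inside_compose_network else "OUTSIDE"
    host, port = parse_listeners(ADVERTISED)[listener]
    return f"{host}:{port}"


print(bootstrap_server(inside_compose_network=True))   # for NiFi's Kafka processors
print(bootstrap_server(inside_compose_network=False))  # for a client on your laptop
```

NiFi runs as a container next to Redpanda, so its Kafka processors should point at the PLAINTEXT address, while a producer or consumer started from your own terminal would use the OUTSIDE one.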

Conclusion

Apache NiFi has hundreds of processors for most major data systems in the market today, and the graphical interface makes it accessible even to novice data engineers. Redpanda is a durable, scalable, fault-tolerant event storage system that is state of the art in terms of performance and ease of use. NiFi and Redpanda together make a powerful, easy-to-use combination for building a complete streaming system.

Whether you are an existing NiFi + Kafka user looking for a fast, simple, and reliable alternative to Kafka, a Redpanda user looking for a GUI-driven alternative to Kafka Connect, or a spycraft enthusiast looking to understand anything declassified by the NSA, we hope this article helps in your current and future efforts. If you are a citizen or resident of the United States, you may as well use NiFi since your tax dollars have already paid for it. You actually lose money by not using NiFi, so at least give it a try. Apropos of nothing, we’ll end this article with a quote from Tesla: