Timothy Spann. 🇺🇦

Posted on Jul 3, 2020 • Originally published at datainmotion.dev on Jul 3, 2020

Using Cloudera Data Platform with Flow Management and Streams on Azure

#2020 #apacheatlas #apachekafka #apachenifi

Today I am going to walk you through using Cloudera Data Platform (CDP) with Flow Management and Streams on Azure. To see a streaming demo video, please join my webinar (or watch it on demand): Streaming Data Pipelines with CDF in Azure. I'll share some additional how-to videos on using Apache NiFi and Apache Kafka in Azure very soon.

[Image: Apache NiFi on Azure CDP Data Hub]

[Image: Sensors to ADLS/HDFS and Kafka]

In the process group above, we use QueryRecord to segment the JSON records and keep only those where the temperature in Fahrenheit is over 80 degrees. We then pick out a few attributes from each record and send them to a Slack channel.
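To make the filtering step concrete, here is a minimal Python sketch of the same logic outside NiFi. It is not the QueryRecord configuration from the flow. The field names (`sensor_id`, `temperature_f`, `humidity`) and the helper names are assumptions, since the post does not show the sensor schema or the exact query.

```python
import json

# Hypothetical field names: the post does not show the sensor schema.
TEMPERATURE_FIELD = "temperature_f"
DISPLAY_FIELDS = ("sensor_id", "temperature_f", "humidity")


def select_hot_records(json_text, threshold=80.0):
    """Keep records above the threshold and trim them to the display fields."""
    selected = []
    for record in json.loads(json_text):
        temperature = record.get(TEMPERATURE_FIELD)
        if temperature is not None and float(temperature) > threshold:
            selected.append({name: record.get(name) for name in DISPLAY_FIELDS})
    return selected


def format_slack_text(record):
    """Render one trimmed record as a single line of message text."""
    return ", ".join(f"{name}={record[name]}" for name in DISPLAY_FIELDS)


if __name__ == "__main__":
    sample = json.dumps([
        {"sensor_id": "a1", "temperature_f": 84.2, "humidity": 40, "cpu": 3.1},
        {"sensor_id": "b2", "temperature_f": 71.5, "humidity": 55, "cpu": 2.0},
    ])
    for hot in select_hot_records(sample):
        print(format_slack_text(hot))
```

In NiFi itself, QueryRecord expresses this kind of filter as SQL over the incoming FlowFile rather than as code.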

To become a Kafka producer, you set a Record Reader for the incoming type, which is JSON in my case, and then set a Record Writer for the type to send to the sensors topic. In this case we kept it as JSON, but we could convert to Avro. I usually do that if I am going to be reading it with Cloudera Kafka Connect.
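As a rough illustration of what a reader/writer pair does when both sides are JSON, the sketch below splits a JSON array into one serialized message per record. It only prepares the bytes and does not talk to Kafka. The function name `to_kafka_messages` and the optional `key_field` parameter are hypothetical and not part of NiFi.

```python
import json


def to_kafka_messages(json_text, key_field=None):
    """Split a JSON array into one (key, value) byte pair per record."""
    messages = []
    for record in json.loads(json_text):
        key = None
        if key_field is not None and key_field in record:
            # Use the chosen field as the message key (hypothetical choice).
            key = str(record[key_field]).encode("utf-8")
        # Compact JSON encoding of the record becomes the message value.
        value = json.dumps(record, separators=(",", ":")).encode("utf-8")
        messages.append((key, value))
    return messages
```

Switching the writer to Avro would change only the value encoding step, which is why the reader and writer are configured separately.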

Our security is automagic and requires little from you in NiFi. I put in my username and password from CDP. The SSL context is set up for me when I create my Data Hub.
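For a client outside NiFi, the equivalent settings might look like the sketch below. It assumes the brokers accept SASL_SSL with the PLAIN mechanism and your CDP username and password, which the post does not state explicitly. The property keys are standard Kafka client settings, while the broker address and credentials are placeholders.

```python
def build_client_properties(bootstrap_servers, username, password):
    """Return Kafka client properties for SASL_SSL with the PLAIN mechanism."""
    # Assumption: username/password authentication over TLS.
    jaas = (
        "org.apache.kafka.common.security.plain.PlainLoginModule required "
        f'username="{username}" password="{password}";'
    )
    return {
        "bootstrap.servers": bootstrap_servers,
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": "PLAIN",
        "sasl.jaas.config": jaas,
    }


def render_properties(properties):
    """Render the properties as key=value lines, as in a .properties file."""
    return "\n".join(f"{key}={value}" for key, value in properties.items())


if __name__ == "__main__":
    # Placeholder broker address and credentials, not real values.
    props = build_client_properties("broker0.example.com:9093", "my_user", "my_password")
    print(render_properties(props))
```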

When I am writing to our Real-Time Data Mart (Apache Kudu), I enter the Kudu servers that I copied from the Kudu Data Mart Hardware page, then put in my table name and login info. I recommend using UPSERT with your JSON Record Reader.
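The reason UPSERT is a comfortable default for sensor data is that replayed or repeated records update the existing row instead of failing on a duplicate key. The sketch below illustrates that behavior with an in-memory dictionary. It is not the Kudu client API, and the `sensor_id` key is a hypothetical primary key.

```python
def upsert(table, rows, key_field):
    """Insert each row, or overwrite columns of the row with the same key."""
    for row in rows:
        key = row[key_field]
        existing = table.get(key, {})
        # New keys are inserted; existing keys take the incoming column values.
        table[key] = {**existing, **row}
    return table


if __name__ == "__main__":
    table = {}
    upsert(table, [{"sensor_id": "a1", "temperature_f": 84.2}], "sensor_id")
    # The second batch updates a1 and inserts b2.
    upsert(
        table,
        [
            {"sensor_id": "a1", "temperature_f": 79.0},
            {"sensor_id": "b2", "temperature_f": 71.5},
        ],
        "sensor_id",
    )
    print(table)
```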

For real use cases, you will need to spin up:

Public Cloud Data Hubs:

Streams Messaging Heavy Duty for AWS
Streams Messaging Heavy Duty for Azure
Flow Management Heavy Duty for AWS
Flow Management Heavy Duty for Azure

Software:

Apache Kafka 2.4.1
Cloudera Schema Registry 0.8.1
Cloudera Streams Messaging Manager 2.1.0
Apache NiFi 1.11.4

Demo Source Code:

https://github.com/tspannhw/cdp-datahub-azure-nifikafka

Let's configure our Data Hubs in CDP in an Azure environment. It takes a few clicks and some naming, and then it builds.

Under the Azure Portal

In Azure, we can examine the files we uploaded to the Azure object store.

Under the Data Lake SDX

NiFi and Kafka are autoconfigured to work with Apache Atlas under our environment's Data Lake SDX. We can browse through the lineage for all the Kafka topics we use.

We can also see the flow for NiFi, HDFS and Kudu.

SMM

We can examine all of our Kafka infrastructure: brokers, topics, consumers, producers, latency and messages. We can also create and update topics.

Cloudera Manager

We still have access to all of our traditional items like Cloudera Manager to manage configuration of servers.

Under Real-Time Data Mart

We can view tables, create tables and query them. Hue is a great tool for accessing data in my Real-Time Data Mart in a Data Hub.

We can also look at table details in the Impala UI.

References

- https://www.cloudera.com/about/enterprise-data-cloud.html
- https://docs.cloudera.com/cdf-datahub/7.1.0/nifi-azure-ingest/topics/cdf-datahub-fm-adls-ingest-overview.html
- https://docs.cloudera.com/cdf-datahub/7.1.0/nifi-kafka-ingest/topics/cdf-datahub-fm-kafka-ingest-buildflow.html
- https://docs.cloudera.com/cdf-datahub/7.1.0/nifi-kudu-ingest/topics/cdf-datahub-nifi-kudu-ingest.html
