Optimising Data Streaming Pipelines with Apache Flink and Apache Kafka: a panel recap

#bigdata #programming #opensource

Earlier this week we got to host a panel discussion at Flink Forward Global 2021, the conference on all things Apache Flink and stream processing. Being part of the conference for the first time (for me, as an attendee this time round 😉) was exhilarating and we really enjoyed interacting with the Flink community.

Our panellists, Olena Babenko (Aiven), Francesco Tisiot (Aiven), and Gyula Fora (Cloudera), sat down to discuss their favourite things about Flink SQL and how it’s a good fit in your data streaming pipeline with Kafka. We certainly got a lot of great questions from the audience and useful insights.

The panel recording is now available on the Flink Forward YouTube channel and we can’t wait for you to watch it! In the meantime, read on for a recap and some tips from our experts.

What makes Apache Kafka a great fit for streaming data and why does that fit together with Flink?

According to Francesco Tisiot, as we’re moving away from batch processing and going into streaming, we need a platform that is performant and scalable, and proven to be successful. Apache Kafka is the perfect match in this case with its beautiful features like Kafka Connect, allowing you to connect your Kafka instance to other systems. However, it often acts as a messenger, taking data from one place and pushing it to the other.

If you want to level up your game and perform analytics, Flink is a very powerful tool for that. It understands the architecture of Kafka (i.e. topics, partitions) and optimises the workload across those parameters.

Secondly, Flink supports a vast range of data platforms, both for input and output. This makes it a good choice for your data pipelines, no matter where your data sits. The combination of Flink and Kafka is truly powerful and takes your data streaming to the next level.

What are the common challenges of using both Flink & Kafka?

Working with Kafka and Flink services, we’re often faced with the issue of skewed data. For example, if you have 10 partitions in your Kafka service and only 1 partition has 5GB and the rest have 2MB. Usually this can happen because of a mismatch between the node keys. If there is a mismatch in requests, your Kafka and Flink service performance will suffer when they try to process the data.

According to Olena Babenko, you can avoid that by adjusting your metrics, such as maximum number of message bytes and patch sizes, to minimise the impact of the overload of data. Additionally, if you are the manager of the Kafka service, you can review your partitioning mechanism to ensure an even distribution of messages.

Where could Flink improve?

Our panellist had diverse opinions on this topic. As a Flink expert and long-term committer, Gyula Fóra mentioned that Flink has been greatly improving with each release. There is definitely no better time to join Flink than now, with a lot of new feature additions, great connectors, and cloud support. Kudos to the Flink community for working hard on adding new features at an incredible pace!

However, there are some things that could be improved on the SQL side of things. For example, good operational support for SQL queries could come in handy or a way of guaranteeing savepoint compatibility between SQL jobs. Additionally, as in Francesco’s case, telling a user exactly where errors in the SQL statements are could make it a better experience, especially for new Flink users.

The SQL client, although it has been improved a lot, is still a work in progress. Especially if we look at the SQL CLI, you want to make sure that, when you run real-world applications, logging and all other operations are configured correctly and set up for production. Most companies have a deployment stack around this and a regular Table API program is a better fit than the SQL CLI.

Another small issue with SQL applications is the difficulty to write a unit test on it, because it’s hard to make it isolated and fast. According to Olena, one tip would be to mock as many sources of your data as possible (e.g. data fakers) to try and perform in-memory testing. In the worst case scenario, you can also try using a file system or Dockerised tools as sources.

What are the alternatives to Flink?

According to our panelists, Flink is becoming more and more mainstream. In most cases, it has pretty much all you need for stream processing. In some cases, the developer experience in the company makes it impractical to get a new tool, e.g. when you already have a strong team with extensive Kafka experience. However, the downside of using platform-specific tools like kSQL is that you are forced to use Kafka as both source and sink. Flink, on the other hand, doesn’t have such a limitation and works over a broad tech ecosystem with its wide range of connectors.

If you’re new to Flink, what’s the most fun thing to try out?

Without a doubt, you should check out Flink documentation. According to our panellists, there is a lot of great content that will help you understand Flink’s capabilities. Secondly, we recommend trying the SQL client as a great way to get started with Flink. Thirdly, if you are looking for a streaming SQL interface, we just launched our Flink beta program and we can’t wait for you to try it out!

DEV Community

Optimising Data Streaming Pipelines with Apache Flink and Apache Kafka: a panel recap

What makes Apache Kafka a great fit for streaming data and why does that fit together with Flink?

What are the common challenges of using both Flink & Kafka?

Where could Flink improve?

What are the alternatives to Flink?

If you’re new to Flink, what’s the most fun thing to try out?

Top comments (0)

Read next

Designing and Implementing Ant Design Global App Tour for React Apps.

👻 Scary Ghost Cursor with Smoke Trail! 💀 code using html5,css3 and javascript

Comparison of the middleware implementation between Supabase Auth documentation and the nextjs-stripe-supabase.

Un senior destaca por su forma de pensar, no de codear