Ahead of my talk at the Bristech meetup group tomorrow, Nic (the host!) invited me for an informal interview; something I'd never really done before. It ended up being a nice roundtrip of how I landed in the Apache Flink community, from learning to let go of a batch brain to working in the open.
Sharing the ✨ polished ✨ transcript here in case you're curious to learn more about my journey into streaming so far. I'd also love to hear from you if any of this rings a bell!
[Jump to] Transitioning from batch- to streaming-first is not just about learning a new technology or new terminology, but also about getting into a different headspace.
[Jump to] Being a Developer Advocate is a lot about relating to the problems users face, but there is a disconnect from the "real" world that still feels weird, like not suffering the consequences of doing streaming wrong first-hand.
[Jump to] I'm big into projects that would've helped improve my work back in the day, like Debezium and Apache Pinot, as well as not-as-shiny-as-AI topics like data quality and discovery.
[Jump to] If there's one thing I (have to) regret, it's not getting involved in open source sooner.
To avoid confusion: Nic (N) is the one asking the questions!
[Yes and no!] In my previous job as a DWH Engineer [and before], I was very deep in the batch world. At some point, our team started consuming a service from this other team that was using Flink. And because "real-time" processing seemed to solve some of the problems that we were dealing with on a daily basis, I started getting very interested in it, and in Flink in particular, just because I ended up striking up a conversation with some people on that team, and they were always very helpful in explaining how the internals of Flink worked [thanks, Javier!]. And it basically came from there: we were consuming a Flink service and I got really interested in it.
I’ll be speaking at your meetup about exactly this, because I think a lot of people downplay this transition: the jump that you have to make when you have a "batch brain", and you suddenly want to move to stream processing. There are a lot of things that are different, right? And for me, I wasn’t even very well versed in distributed systems because I'd always worked in bare-metal, single-server kinds of environments. Then suddenly, I had to adapt to this new way of thinking about data as something that is flowing all the time, instead of something that is just sitting there in a database, and you write your queries, and you schedule some jobs, and then from time to time, you just grab that data. When something fails, you just re-run everything again, and you wait [a lot of] hours for the whole thing to fall back into place. It’s a big jump, mentally. And, of course, also technically. But that’s maybe not the hardest part to adapt to; it's really just making your brain understand different things. Like, suddenly you have a lot of different notions of time that you have to think about. Or, the way you do fault tolerance is completely different.
N: You call this the "streaming mindset". It sounds like it is a topic that really resonates with you?
Definitely. Like I said before, I never went through the phase of using Hadoop or Spark or any of that. So I jumped straight from e.g. Oracle and some orchestration or ETL frameworks to Flink. And Flink is also a pretty advanced stream processing system: there are a lot of moving parts, a lot of things that you have to take in. For me, it was a completely new world. And to this day, I’m still learning a lot, and my brain still gravitates towards the things that I was really used to in the batch world.
N: In the streaming world, you’re talking about data pipelines or windowing — the terminology is very markedly different. So, would you say that one of the challenges is to actually get a grip on the terminology and understand how these concepts are important?
Yeah, for sure that, but also the things that you didn’t necessarily have to think about before, not so much just the terminology. For example, when you do batch you don’t really need to think about late events or events that are out of order and such. Those are the kinds of intricacies that you suddenly have to think about and that were an afterthought before, just because they didn't affect anything that you were doing. These little changes are very important in streaming, in the end. In batch, everything is a bit more black and white, or a bit simpler.
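To make the point about late and out-of-order events a bit more concrete, here is a minimal sketch in plain Python. It is not Flink code and does not use any Flink API; the `process` function, the window size, the out-of-orderness bound and the sample events are all made up for illustration. It assumes the common approach of tracking a watermark (the highest event time seen so far, minus a tolerance) and treating an event as late once the watermark has passed the end of its window.

```python
from collections import defaultdict

WINDOW_SIZE = 10          # seconds per tumbling window (assumed value)
MAX_OUT_OF_ORDERNESS = 5  # how long we wait for stragglers (assumed value)


def process(events):
    """Count events per event-time window and set aside the late ones.

    `events` is an iterable of (event_time, payload) tuples in ARRIVAL
    order, which may differ from event-time order.
    """
    open_windows = defaultdict(int)  # window start -> running count
    closed = {}                      # window start -> final count
    late = []                        # events whose window already closed
    watermark = float("-inf")

    for event_time, payload in events:
        window_start = (event_time // WINDOW_SIZE) * WINDOW_SIZE

        # The watermark has passed the end of this window: too late.
        if window_start + WINDOW_SIZE <= watermark:
            late.append((event_time, payload))
            continue

        open_windows[window_start] += 1

        # The watermark only moves forward, trailing the highest event time.
        watermark = max(watermark, event_time - MAX_OUT_OF_ORDERNESS)

        # Finalize every window that the watermark has now passed.
        for start in sorted(open_windows):
            if start + WINDOW_SIZE <= watermark:
                closed[start] = open_windows.pop(start)

    return closed, dict(open_windows), late


# Arrival order is not event-time order: 4 and 2 show up after 12.
sample = [(1, "a"), (3, "b"), (12, "c"), (4, "d"), (17, "e"), (2, "f"), (21, "g")]
closed, still_open, late = process(sample)
print(closed)      # {0: 3}
print(still_open)  # {10: 2, 20: 1}
print(late)        # [(2, 'f')]
```

The event at time 4 arrives out of order but still counts, because the watermark (7 at that point) has not yet passed the end of its window. The event at time 2 arrives after the watermark has reached 12, so its window is already final and it is set aside as late. In a batch job neither case exists: all the data is already there when the query runs.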
N: As a Developer Advocate, are you more into the use cases and how people are going to use this technology or showcasing the art of the possible with the technology?
I think that the best people to talk about use cases are always users, because it’s one of...I wouldn’t call it a limitation, but it’s...a thing. I don’t know what to call it here. In reality, I never really went through the hardships of doing streaming wrong in a "real" work environment and experiencing the consequences first-hand. As a Developer Advocate, I have this playground to just break and experiment with things. I don’t have anyone paging me at 3AM to fix a pipeline anymore.
I [talk to and] watch a lot of talks from Flink users to at least relate to the problems they have. That’s more my role, to relate to the problems that people have with Flink and try to bring that back to the engineering team or the product team; and not so much about showing use cases I like. What I like to show is how you can use Flink with other technologies. That’s something that I really, really enjoy. For example, when there's something interesting like Debezium or Apache Pinot, I always just want to jump in and see "Okay, how can you use this with Flink? And how can I show people how they can use this with Flink?"
N: Flink is within the Apache stable: it’s open source and exists within that ecosystem. So, you’re forever interacting with all of the other technologies that Flink will integrate with?
Yeah, exactly. And that’s one thing that definitely keeps things interesting for me, because there’s a big ecosystem around Flink. And you can not only just interact, but collaborate with other communities. There’s always two sides to it, right? If you want to ensure that something integrates with Flink, you need something from the engineering side. But once that is done, you can jump in and "dogfood" it.
N: Are there any particular trends in the industry that you’re interested in tracking currently, or are you still trying to get to grips with Flink itself?
I try to keep an eye on the industry as a whole, but someone pointed out recently that I’m not attracted to the gold shiny things, like AI/ML, but to the "boring stuff" [I'll take that!]. What always makes me more excited are things that I can relate to, like CDC (Change Data Capture) and Debezium, or real-time OLAP and Apache Pinot: things that would have made my work so much easier and better back in the day. Also because of that, I'm really interested in data governance and quality (e.g. data lineage, data discovery)...it’s not that it got lost, but between batch and streaming, it feels like these things have been relegated to the background. And that’s probably why you're seeing all these projects around such topics now that streaming has "cooled off". Because, sure, you can do online model training, but if your data sucks, your results will suck and your predictions will suck. We’re entering the phase of developing all these tools that actually focus on the data or the metadata and not so much on the plumbing. That’s really interesting to me, because I was always very connected to the meaning of the data more than just moving it from A to B.
N: What would you say to your younger self, if you had the chance? When you were starting out as a newbie in the industry back in the day?
Wow. That’s a deeper question than I was expecting. 😅 But maybe one thing that I would have said to myself is to get involved in open source sooner? That could be it…
It really is a very different feeling than working in a big corp or just working in a team and using proprietary technologies. In open source, I really found a different way of working, a more collaborative way and more of a safe space to experiment. I also found that not everyone out there wants you to fail. There are also a lot of people who are willing to help you actually evolve or try new things. And so it doesn’t feel wrong to fail, or at least not as wrong.
[There are bad people everywhere.] I’m used to the values of the Flink community. And even within our company, the biggest focus is still working on open source Flink. So all these guys that kicked it off, like Stephan [Ewen]...he’s the creator of Flink, the whole thing started with his PhD thesis. He created this technology that is used nowadays by all the big tech companies in the world and is constantly evolving, with [hundreds of people] working on it on a daily basis. He could easily feel entitled, but he’s a guy who will sit with me and go through something I don’t understand about Flink, or he proactively reaches out to explain a thread that is being worked on. It’s a very different mindset as well from what I was used to before, especially since I started out in consulting, where it’s kind of a sink or swim situation.