David Przybilla

Flyte and Kubeflow

With this entry I want to give you insights on the following questions:

  • Problems with the existing orchestration frameworks

  • How are Kubeflow and Flyte different?

  • Why is Flyte a tool you might want to study?

  • Why is it different from existing alternatives?

  • Is Flyte getting traction?

  • How difficult is it to get started?

Context: The Machine learning toolbox

Orchestration Frameworks Reality

Defining a sequence of steps that is idempotent, resilient, cacheable and, most importantly, reproducible is a huge part of the ML stack.

There are a lot of tools out there. Picking one is difficult. As always there is no silver bullet. It depends on the size of your team, your existing infrastructure and the scale of your problems.

For small ML/DS teams, this usually means taking the tradeoff between:

  • Option 1: Using a simple orchestration tool, with the downsides of low scalability, difficulty provisioning infrastructure and difficulty gluing different services/tools together. Tools like Kedro sit here: they are good for data analysis and DataFrame manipulation, and teams can easily extend Kedro because it is Python, which most ML practitioners know very well.

  • Option 2: Using a tool that allows for better scalability and easier infrastructure provisioning, with the overhead of A LOT of complexity. The team has to run and manage a Kubernetes cluster, and on top of that it has to understand a complex orchestration tool like Kubeflow. Kubeflow works as a Kubernetes operator, and Kubernetes operators are usually written in Go. Any extension or problem the team faces requires a lot of K8s understanding, most likely in a Go stack that is not very familiar to DS/ML engineers.

I want to point out that in both of these options ML practitioners are severely exposed to Infrastructure:

In Option 1, ML practitioners will feel like they are in a straitjacket, dependent on some other team to provide the needed infrastructure, or they might be required to learn and manage the infrastructure themselves. I think we can agree that the ML space is already complicated: these kinds of setups ask for someone to be good at setting up security, permissions and infrastructure, to manage K8s, and to still stay up to date with ML papers and models.

Another side effect is that projects that fall outside the cookie-cutter pattern (e.g. a recommender system) might end up with two pipeline codebases: one for training the model and a different one for serving it. This is a time bomb in terms of bugs and consistency.

In Option 2, if we consider Kubeflow, we start seeing an interesting phenomenon: a fine line starts to divide those provisioning infrastructure from the ML practitioners building pipelines for their experiments.

This is good, as we can think of Kubeflow as a platform-as-a-service for ML practitioners. However, here is the catch: Kubeflow pipelines sit very close to infrastructure definitions. If you have coded a Kubeflow pipeline you will be aware that it is mostly a DSL for defining Pods.

ML practitioners exposed to Kubeflow can certainly provision the infrastructure they need easier, but they are still exposed to radioactive levels of Kubernetes details.

Kubeflow issues

Kubeflow is a fantastic tool that addressed a missing Lego piece in the ecosystem when it first emerged. However, let me rant about the things that reaaaally bother me about Kubeflow:

  1. Kubeflow is hard to run locally. The first time I touched Kubeflow, I had to manually patch YAML files to get it running on my cloud Kubernetes cluster. I was scared to even pitch this to my team, a small team of ML engineers.

  2. Kubeflow pipelines are basically “passing strings to containers’ entrypoints”. In Kubeflow’s case these strings are paths to cloud files (i.e. s3://mydata/myfile.csv), and these paths get fed into the pipeline steps (containers) as entrypoint arguments.

That is precisely the problem: Kubeflow DAGs carry no type knowledge about the entrypoints and their schemas. Everything is a string. Most of these steps are containers that live in another repo; a tiny refactor in how an entrypoint looks (maybe a bash file changed) and your pipeline is suddenly broken.

  3. A step in a Kubeflow pipeline (a container) needs to know how to transform data, because Kubeflow just passes paths as strings. This means that your containers need to know:

  • How to read the file from whatever cloud (or maybe a locally mounted volume)

  • How to load the file into whatever representation the actual task uses. For example, this means knowing how to transform parquet files to DataFrames back and forth, consistently, in multiple places.

  4. Kubeflow’s DSL is intended to define Pods. It sits so close to K8s that ML practitioners need to know many K8s details when writing their tasks (see the sketch after this list).

  5. Kubeflow tries to be a lot of things and has integrations with many projects, but a lot of these integrations feel rather brittle. When I tried to play with Seldon and Kubeflow I felt scared of shipping something to production.

  6. Many more details and fears regarding Kubeflow are condensed in the blog post “Is Kubeflow dead?” [3]
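
To make points 2 to 4 concrete, here is a minimal sketch of what a Kubeflow pipeline tends to look like with the v1 Python SDK. The image names, S3 paths and flags below are hypothetical; the point is that everything the DAG knows about each step is a handful of strings.

```python
from kfp import dsl


@dsl.pipeline(name="train-pipeline")
def train_pipeline(raw_data: str = "s3://mydata/myfile.csv"):
    # Step 1: preprocess. The DAG only knows an image name and a list of
    # string arguments; it has no idea what the entrypoint actually expects.
    preprocess = dsl.ContainerOp(
        name="preprocess",
        image="registry.example.com/preprocess:latest",  # hypothetical image
        arguments=["--input", raw_data,
                   "--output", "s3://mydata/features.parquet"],
    )

    # Step 2: train. Again just strings. Nothing checks that the entrypoint
    # still accepts --features or that the file at that path is really
    # parquet; a refactor in the other repo breaks this silently.
    dsl.ContainerOp(
        name="train",
        image="registry.example.com/train:latest",  # hypothetical image
        arguments=["--features", "s3://mydata/features.parquet"],
    ).after(preprocess)
```

Each container then has to carry its own code for fetching the file from S3 and turning it into a DataFrame, which is exactly the plumbing described in point 3.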

Flyte ( https://flyte.org/ )

So Flyte is a tool that at first sight might look very similar to Kubeflow: you define pipelines, and each step of the pipeline is a container/pod. However, its scope is very different.

  1. Flyte abstracts away the infrastructure for ML practitioners. You can still define your pipelines as an extended DSL for describing Pods, but Flyte provides and encourages primitives that abstract K8s resources away from ML practitioners.

  2. Type safety. Even though Flyte still runs one container per pipeline step, you get type safety: Flyte adds types so that you get early failures if your tasks' types do not compose.

  3. Plumbing. Remember how in Kubeflow you had to transform your parquet into a DataFrame? In Flyte lots of common types are already supported, so you don't need to specify how to load a DataFrame.

If your previous task outputs a DataFrame and your current task takes a DataFrame as input, all the plumbing for serialisation and deserialisation is done for you in the background (see the sketch after this list).

  4. Running a Flyte cluster locally is super easy, with a single command: flytectl sandbox start
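
To give a rough taste of points 1 to 3, here is a minimal sketch of a typed Flyte workflow, assuming flytekit is installed; the task names and resource numbers are made up for illustration.

```python
import pandas as pd
from flytekit import Resources, task, workflow


# Infrastructure is a task-level hint rather than a Pod spec.
@task(requests=Resources(cpu="1", mem="500Mi"))
def load_data() -> pd.DataFrame:
    # Flyte serialises the DataFrame between tasks for you; no manual
    # parquet <-> DataFrame code inside each container.
    return pd.DataFrame({"feature": [1.0, 2.0, 3.0], "label": [0, 1, 0]})


@task
def count_rows(df: pd.DataFrame) -> int:
    return len(df)


@workflow
def my_pipeline() -> int:
    # If count_rows expected, say, a str, composing the tasks would fail
    # early instead of deep inside a running container.
    return count_rows(df=load_data())


if __name__ == "__main__":
    # Workflows also run locally as plain Python, no cluster needed.
    print(my_pipeline())
```

The same code can then be registered on a local sandbox or a real cluster without changing the tasks themselves.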

When I tried Flyte for the first time I got a very good impression. I could easily run my pipelines locally (without a cluster), but also setting up a local cluster was super easy.

All of that being said, Flyte is a complex tool. Small teams still need to set up K8s, manage a cluster, manage security, integrate plugins, etc.

Flyte provides a lot of integrations with other tools, just like Kubeflow does, although not as many as Kubeflow offers (at least at the time of publishing this blog entry). However, the integrations in Flyte do not feel as brittle as the Kubeflow ones.

One of the main reasons is that Flyte started as an internal tool at Lyft, and by the time it got open sourced it was already battle tested.

Personally, since I got really excited, I went and took a look at the Flyte codebase, and it is really interesting. The engineer in me believes that Flyte might become a point of integration for a lot of tooling in the future, because its foundational layers make it easy to test integrations and get early failures.

Kubeflow is backed by Google

Indeed, Kubeflow is backed by Google, and big companies like Canonical are offering services like “Charmed Kubeflow” (a managed Kubeflow).

Flyte, however, does not fall behind: Lyft, Spotify [4], Gojek [5], Toyota [6] and others are using or switching to Flyte. This should give you enough confidence that the project won't just disappear.

But my team is small, this is still very complex...

I agree, it is complex. The creators of Flyte gathered and started Union.ai [2]. Union.ai is a startup; it is not clear to me what their product is yet, but it seems to be some sort of service on top of Flyte that makes the setup and maintenance much less complicated [1].

Union.ai might become an interesting player for small teams. Why? Because it might abstract the complex infrastructure gluing away from ML practitioners. With “Charmed Kubeflow”, practitioners are still exposed to an API that looks like a DSL for defining Pods; Flyte takes that layer away, and I suspect Union.ai might remove the complicated parts of running Flyte.

Final thoughts

  • It might be worth keeping an eye on Flyte and seeing how its ecosystem evolves. Right now Kubeflow has many integrations, and that is its strong point.

  • I think putting Kubeflow and Flyte in the same category is a mistake.

  • Flyte builds new primitives that abstract infrastructure and plumbing for ML practitioners. This is key for empowering ML teams.

  • If you are curious about Flyte, you can get a taste in a few minutes by following the getting started section on Flyte’s website [7].

[1] UnionML Introduction video
https://www.youtube.com/watch?v=GTPKc1_QSXo

[2] https://www.union.ai/

[3] https://medium.com/mlops-community/is-kubeflow-dead-d82aadba14c0

[4] “Why We Switched Our Data Orchestration Service” https://engineering.atspotify.com/2022/03/why-we-switched-our-data-orchestration-service/

[5] “Adopting Flyte at Gojek” https://www.youtube.com/watch?v=G1ZFeLewUOA

[6] Woven Planet’s Autonomous Vehicle — Data Processing & MLOps at Scale With Flyte https://www.youtube.com/watch?v=OVLZ6-uR_so

[7] Flyte’s getting started https://docs.flyte.org/en/latest/getting_started/index.html
