So, you have a great AI application idea, but now what? How do you train a model, build the application, and let the world use your wonderful creation?
In this article, we'll discuss just that.
One of the most popular and widely discussed tools in the machine learning ecosystem is TensorFlow, by Google.
"TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries and community resources that lets researchers push the state-of-the-art in ML and developers easily build and deploy ML powered applications."
Translation: TensorFlow has a variety of uses. Currently, one of the most popular is building deep learning models. While the library was originally created for large-scale numerical computation, Google open sourced it and it is now best known as a deep learning library.
It provides APIs for several languages, with Python being the most widely used and best supported.
As part of the TensorFlow ecosystem, Google also introduced TensorFlow Serving, which lets you take your model, exported in protobuf (SavedModel) format, and expose it through a gRPC or REST API.
Let's first dive into the architecture of TensorFlow Serving. These are the basic concepts you should have a good understanding of:
Servables are the smallest unit in the TF Serving ecosystem: the objects used to perform computations. They come in different sizes, levels of complexity, and types. For the purposes of this post, our servables are models in the SavedModel format.
Servables also come in versions, which is great because it lets us do A/B testing and try out different neural architectures and algorithms. Essentially, versioning opens us up to a world where we can be more methodical in how we build, test, and fine-tune our models.
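For instance, TF Serving can be told which versions to keep loaded side by side through a model config file. A minimal sketch (the model name and path here are made-up placeholders) that pins versions 2 and 3 for an A/B test:

```
model_config_list {
  config {
    name: "my_model"
    base_path: "/models/my_model"
    model_platform: "tensorflow"
    model_version_policy {
      specific {
        versions: 2
        versions: 3
      }
    }
  }
}
```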
In the world of TF Serving, a model is just a servable. It will usually contain the algorithm, the learned weights, and any lookup tables.
As the name suggests, loaders load and unload your servables; in other words, they manage a servable's entire lifecycle. Loaders are independent of algorithms, data, and use cases.
Sources are used to find servables. They work closely with loaders: a source provides one loader instance for each stream of servables, so that the loader can handle those servables.
Broadly speaking, the following steps happen:

1. The source creates a loader for a specific version of the model (servable).
2. The source alerts the manager that this version is available.
3. The manager determines whether it's safe to load (e.g. there are enough resources), gives the loader the resources it needs, and allows the loader to load that version.
4. The client asks the manager for a specific version of a servable, or requests the default (latest) version.
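To make the handshake concrete, here is a toy sketch of the source → loader → manager flow in plain Python. This mimics the lifecycle only; it is not TF Serving's actual API, and all the class and method names are my own invention:

```python
class Loader:
    """Knows how to load one version of a servable."""
    def __init__(self, name, version):
        self.name, self.version = name, version
        self.servable = None

    def load(self):
        # In TF Serving this would read a SavedModel from disk.
        self.servable = f"{self.name}-v{self.version} (loaded)"


class Manager:
    """Decides when it is safe to load, and answers client lookups."""
    def __init__(self):
        self.versions = {}  # (name, version) -> servable

    def on_new_version(self, loader, enough_resources=True):
        if enough_resources:  # resource check before granting the load
            loader.load()
            self.versions[(loader.name, loader.version)] = loader.servable

    def get(self, name, version=None):
        if version is None:  # default to the latest version
            version = max(v for (n, v) in self.versions if n == name)
        return self.versions[(name, version)]


class Source:
    """Watches for new versions and hands loaders to the manager."""
    def __init__(self, manager):
        self.manager = manager

    def emit(self, name, version):
        self.manager.on_new_version(Loader(name, version))


manager = Manager()
source = Source(manager)
source.emit("resnet", 1)
source.emit("resnet", 2)
print(manager.get("resnet"))  # latest loaded version of "resnet"
```

A client can ask `manager.get` for a pinned version (`manager.get("resnet", 1)`) or omit the version to get the latest, mirroring step 4 above.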
Machine learning, as a technique, has two major requirements: data and model building.
Scalable services and products need to be automatable, reproducible, and debuggable.
Essentially, there are different concerns in each stage of building a product that integrates AI.
To combine these principles and concerns, I decided to split the work into two main pipelines: building a model and serving a model.
Data is collected and stored in an S3 bucket, or any other data store of your choice. Data scientists and other model builders can then manually build and train a model using as much or as little compute as needed (e.g. a P2 instance), and the result is published to another S3 bucket.
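One convention worth knowing here: TF Serving expects each version of a model to live in its own numbered subdirectory under the model's base path. A minimal sketch of publishing a newly trained SavedModel under the next free version number (local directories stand in for the S3 bucket; in practice you would upload with your S3 client of choice, and the paths here are made up):

```python
import shutil
from pathlib import Path

def publish(model_dir: str, base_path: str) -> int:
    """Copy a trained SavedModel into base_path/<next version>/."""
    base = Path(base_path)
    base.mkdir(parents=True, exist_ok=True)
    # Existing numeric subdirectories are the already-published versions.
    versions = [int(p.name) for p in base.iterdir() if p.name.isdigit()]
    next_version = max(versions, default=0) + 1
    shutil.copytree(model_dir, base / str(next_version))
    return next_version

# Example: publish the same exported model twice -> versions 1 and 2.
export = Path("export/saved_model")
export.mkdir(parents=True, exist_ok=True)
(export / "saved_model.pb").write_bytes(b"")  # stand-in for a real export
print(publish("export/saved_model", "bucket/models/my_model"))  # 1
print(publish("export/saved_model", "bucket/models/my_model"))  # 2
```

Because versions are just numbered directories, "publishing" never touches old versions, which is what makes the producer side so decoupled from serving.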
The benefit of this approach is two-fold. First, while building the model we don't need to worry about infrastructure: we focus on just building the model, and the decisions around storage and compute are abstracted away. In other words, we can pretend we're just working on our own local machines.
Second, publishing models to a bucket means we can publish and update as many versions of a model as we want, without having to worry about the downstream clients. We become the producers, and let the serving pipeline worry about alerting clients to changes or serving a new version of a model as needed.
The next pipeline is the actual serving and use of a model. Here we make use of the TensorFlow Serving architecture, which manages the loading, unloading, and serving of models. We simply point it at the source (the S3 bucket) and wrap the resulting API in a Python web app. Clients can then make requests to this new application.
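Assuming a TF Serving build with S3 filesystem support, the model server can read the model base path straight from the bucket. A sketch of running the official Docker image this way (the bucket name, region, and credentials below are placeholders for your own values):

```shell
docker run -p 8501:8501 \
  -e AWS_ACCESS_KEY_ID=... \
  -e AWS_SECRET_ACCESS_KEY=... \
  -e AWS_REGION=us-east-1 \
  tensorflow/serving \
  --model_name=my_model \
  --model_base_path=s3://my-bucket/models/my_model
```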
One question you might have is: why wrap the API in another app? Well, my focus for this particular setup has been computer vision and image processing. When dealing with images, there are a few steps we have to take to preprocess the image and post-process the results. By wrapping it in a Python application, we get all the Python tools that work well for images and data science in general, while also ensuring that applications making use of our model aren't tied to Python. I.e. our client could be a Node app, a Go app, or any other tech of your choice.
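As a sketch of what that wrapper does: TF Serving's REST predict endpoint takes a `{"instances": [...]}` JSON body and returns `{"predictions": [...]}`. The wrapper preprocesses raw input into that payload, POSTs it to the model server, and post-processes the response. The model name, pixel scaling, and label set below are made-up examples, and the HTTP call itself is faked so the snippet stands alone:

```python
import json

MODEL_URL = "http://localhost:8501/v1/models/my_model:predict"  # hypothetical
LABELS = ["cat", "dog"]  # example label set

def preprocess(pixels):
    """Scale raw 0-255 pixel values to [0, 1] and wrap them in the
    request body TF Serving's REST predict endpoint expects."""
    scaled = [p / 255.0 for p in pixels]
    return json.dumps({"instances": [scaled]})

def postprocess(response_body):
    """Turn the raw {"predictions": ...} response into a label."""
    scores = json.loads(response_body)["predictions"][0]
    return LABELS[scores.index(max(scores))]

# The wrapper would POST preprocess(...) to MODEL_URL; here we fake
# the model server's reply just to show the post-processing step.
request_body = preprocess([0, 128, 255])
fake_reply = json.dumps({"predictions": [[0.1, 0.9]]})
print(postprocess(fake_reply))  # dog
```

Because clients only ever see the wrapper's own API, the pre/post-processing can change (or the model server can move) without any client noticing.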
This has been a very high-level explanation of a simple architecture I designed and implemented. As I continue to build this out, there are a few major things I'm currently working on adding:
- Running on Kubernetes (using Kubeflow)
- Monitoring (work in progress)
- Automated re-training of models