Using the stateless serving function design pattern, we can build a production ML system that synchronously handles millions of prediction requests per second.
| Stateless components | Stateful components |
| --- | --- |
| Output is determined purely by the inputs | Output depends on both the inputs and internal state |
| No state, so a single instance can be shared by multiple clients | Each client's state (e.g., conversational history) must be stored |
| Highly scalable | Initialized on the first request and destroyed when the client terminates or times out, which makes them expensive and difficult to manage |
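To make the distinction concrete, here is a small illustrative sketch (the function and class names are ours, not from the article): the stateless function can be shared by any number of clients, while the stateful session cannot.

```python
# Stateless: output depends only on the inputs, so one instance can be
# shared freely across clients and scaled horizontally.
def predict_fare(miles: float, rate_per_mile: float = 2.5) -> float:
    return miles * rate_per_mile

# Stateful: output depends on per-client internal state (the conversation
# history), so each client needs its own instance.
class ChatSession:
    def __init__(self):
        self.history = []

    def reply(self, message: str) -> str:
        self.history.append(message)
        return f"You have sent {len(self.history)} message(s) so far."
```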
Exporting a model as a stateless function means that stateful variables such as the epoch number and learning rate must be tracked separately and excluded from the exported file.
Drawbacks of carrying out inference on an in-memory model object (see the sketch below)
- The entire model, which can be large, has to be loaded into memory
- It is hard to meet latency constraints on predictions
- Clients are tied to the programming language the model was written in
- The model's inputs and outputs may not be user-friendly
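For context, the in-memory approach being criticized looks roughly like this sketch (the model path and input are illustrative): the client process loads the full Keras model and calls it directly, inheriting all of the drawbacks above.

```python
import tensorflow as tf

# The client loads the whole (potentially large) model into its own memory...
model = tf.keras.models.load_model('trained_model.keras')  # illustrative path

# ...and is now tied to Python/TensorFlow and to the model's raw tensor
# interface for every prediction it makes.
prediction = model.predict(tf.constant([[1.0, 2.0, 3.0, 4.0]]))
```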
Achieving statelessness
- Export the model into a programming-language-independent format
- Restore the model as a stateless function in production
- Make the stateless function available via a REST endpoint (see the sketch at the end of this section)
To save a model in Keras, call `model.save(export_path)`. This exports the model's computation graph as a `saved_model.pb` protocol buffer file and extracts the stateful variables into separate files under a `variables/` directory.
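As a minimal sketch, assuming TensorFlow 2.x (where saving to a directory path writes the SavedModel format) and an illustrative toy model and `export_path`:

```python
import tensorflow as tf

# A toy model purely for illustration; any Keras model exports the same way.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(4,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')

# Saving to a directory writes the SavedModel format:
# saved_model.pb plus a variables/ directory with the weights.
export_path = './export/model/1'
model.save(export_path)
```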
`model.save` also takes an optional `signatures` argument: a dictionary mapping signature names to serving functions. If it is not specified, the model's forward pass is exported as the default serving signature.
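Continuing the sketch above, a custom signature might look like this (the function name `serve`, the input shape, and the output key `prediction` are illustrative assumptions):

```python
import tensorflow as tf

# Wrap the forward pass in a tf.function with a fixed input signature so it
# can be exported under a chosen signature name.
@tf.function(input_signature=[tf.TensorSpec(shape=[None, 4], dtype=tf.float32)])
def serve(inputs):
    # Returning a dict gives the output tensor a stable, user-friendly name.
    return {'prediction': model(inputs)}

model.save(export_path, signatures={'serving_default': serve})
```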
To inspect the signature of the stateless function that we will use for serving (run from a notebook; the leading `!` shells out, and `{export_path}` is interpolated by the notebook):

```bash
!saved_model_cli show --dir {export_path} --tag_set serve --signature_def serving_default
```
Finally, we can restore the model as a stateless function and call it:

```python
import tensorflow as tf

restored = tf.keras.models.load_model(export_path)
infer = restored.signatures['serving_default']

# `inputs` must be a tensor matching the input spec shown by saved_model_cli;
# the call returns a dict of output tensors keyed by name.
outputs = infer(inputs)
```
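To cover the final step, making the stateless function available via a REST endpoint, here is a rough sketch using Flask; the framework choice, route name, and JSON payload format are illustrative assumptions, not from the article.

```python
import tensorflow as tf
from flask import Flask, request, jsonify

export_path = './export/model/1'  # illustrative path from the earlier sketch

app = Flask(__name__)
restored = tf.keras.models.load_model(export_path)
infer = restored.signatures['serving_default']

@app.route('/predict', methods=['POST'])
def predict():
    # Expect JSON like {"instances": [[...], [...]]}.
    instances = tf.constant(request.get_json()['instances'], dtype=tf.float32)
    outputs = infer(instances)
    # Signature outputs are a dict of tensors; convert them for JSON.
    return jsonify({name: t.numpy().tolist() for name, t in outputs.items()})

if __name__ == '__main__':
    app.run()
```

Because the endpoint is stateless, it can be replicated behind a load balancer to scale out; in production, a dedicated model server such as TensorFlow Serving plays this role.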