Many tutorials on deploying a model to production integrate the serialized model directly into the API. The drawback of this approach is that it tightly couples the API to the model. An alternative is to delegate the prediction work to workers through a queue. The schema below shows the solution architecture on AWS.
The machine learning model is stored in an S3 bucket. Workers, implemented as Lambda functions, load it whenever the client puts a message containing prediction data into the SQS queue through the API Gateway/Lambda REST endpoint.
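A minimal sketch of the submit endpoint could look like the following. The function names, the queue URL, and the job-id scheme are assumptions for illustration; the endpoint simply wraps the client payload in an SQS message and returns a job id the client can poll on later:

```python
import json
import uuid


def build_job_message(payload: dict) -> dict:
    """Wrap the prediction input with a job id the client can poll on."""
    return {"job_id": str(uuid.uuid4()), "data": payload}


def submit_handler(event, context):
    """API Gateway Lambda: enqueue the prediction request on SQS."""
    import boto3  # deferred import so the pure helper above stays unit-testable

    message = build_job_message(json.loads(event["body"]))
    sqs = boto3.client("sqs")
    sqs.send_message(
        # Hypothetical queue URL -- replace with your own.
        QueueUrl="https://sqs.eu-west-1.amazonaws.com/123456789012/predictions",
        MessageBody=json.dumps(message),
    )
    # 202 Accepted: the work is queued, not done yet.
    return {"statusCode": 202, "body": json.dumps({"job_id": message["job_id"]})}
```

Returning the job id immediately is what makes the API asynchronous: the client is never blocked on model loading or inference.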
When a worker has finished a prediction job, it writes the result to a DynamoDB table. Finally, the client requests the prediction result through an API endpoint that reads it from that DynamoDB table.
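The worker and the result endpoint can be sketched as below. The bucket, key, and table names are placeholders, and the model is assumed to be a pickled scikit-learn-style object exposing `predict`; adapt the deserialization and prediction call to your own model format:

```python
import json


def load_model(bucket: str, key: str):
    """Download and deserialize the model from S3."""
    import pickle
    import boto3

    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    return pickle.loads(body)


def to_dynamo_item(job_id: str, prediction) -> dict:
    """Shape the prediction result as a DynamoDB item keyed on the job id."""
    return {"job_id": job_id, "result": json.dumps(prediction)}


def worker_handler(event, context):
    """SQS-triggered Lambda: run the prediction and persist it in DynamoDB."""
    import boto3

    model = load_model("my-models-bucket", "model.pkl")  # hypothetical names
    table = boto3.resource("dynamodb").Table("predictions")  # hypothetical table
    for record in event["Records"]:  # SQS batches records per invocation
        message = json.loads(record["body"])
        prediction = model.predict([message["data"]]).tolist()
        table.put_item(Item=to_dynamo_item(message["job_id"], prediction))


def result_handler(event, context):
    """API Gateway Lambda: let the client fetch the stored result by job id."""
    import boto3

    table = boto3.resource("dynamodb").Table("predictions")
    job_id = event["pathParameters"]["job_id"]
    item = table.get_item(Key={"job_id": job_id}).get("Item")
    if item is None:
        # The worker has not written the result yet.
        return {"statusCode": 404, "body": json.dumps({"status": "pending"})}
    return {"statusCode": 200, "body": item["result"]}
```

In practice the model and the boto3 clients would be created once at module scope so warm Lambda containers reuse them across invocations; they are created inside the handlers here only to keep the sketch self-contained.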
As you can see, we delegate the loading and prediction work to a worker instead of integrating the model into the REST API. A model can take a long time to load and to run a prediction, so we handle both asynchronously thanks to the SQS queue and the DynamoDB table.
The complete article, including code, is available here.