Ming • Originally published at kemingy.github.io

Deep Learning Serving Benchmark

There is no black magic, everything follows the rules.

What do deep learning serving frameworks do?

  • respond to requests (RESTful HTTP or RPC)
  • model inference (with a runtime)
  • preprocessing & postprocessing (optional)
  • dynamic batching of queries (increases throughput)
  • monitoring metrics
  • service health checks
  • versioning
  • multiple instances

Actually, when deploying models with Kubernetes, we only need a subset of these features. But we do care about the performance of these frameworks, so let's run a benchmark.
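
The "Falcon + msgpack + Tensorflow" entry in the benchmark below is basically a hand-rolled service. As a rough sketch of what that looks like (not the exact gist code; the SavedModel path, route, and payload layout are made up for illustration):

```python
# Minimal sketch of the "Falcon + msgpack + Tensorflow" setup: Falcon handles
# HTTP, msgpack carries the tensors, TensorFlow runs the SavedModel.
# The model path, tensor layout, and route are illustrative, not the gist code.
import falcon
import msgpack
import numpy as np
import tensorflow as tf

model = tf.saved_model.load("resnet50_savedmodel")  # hypothetical path
infer = model.signatures["serving_default"]

class Predict:
    def on_post(self, req, resp):
        # the client sends a msgpack-encoded batch shaped [N, 224, 224, 3]
        payload = msgpack.unpackb(req.bounded_stream.read())
        batch = np.asarray(payload["images"], dtype=np.float32)
        outputs = infer(tf.constant(batch))
        probs = next(iter(outputs.values())).numpy()
        resp.data = msgpack.packb({"predictions": probs.tolist()})
        resp.content_type = "application/msgpack"

app = falcon.App()  # falcon.API() on older Falcon releases
app.add_route("/predict", Predict())
# run with a WSGI server, e.g. `gunicorn app:app`
```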

Benchmark

Environments:

  • CPU: Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz
  • GPU: NVIDIA V100
  • Memory: 251GiB
  • OS: Ubuntu 16.04.6 LTS (Xenial Xerus)

Docker Images:

  • tensorflow/tensorflow:latest-gpu
  • tensorflow/serving:latest-gpu
  • nvcr.io/nvidia/tensorrtserver:19.10-py3

Elapsed time is recorded after warmup. Dynamic batching is disabled.

All the code can be found in this gist.
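
The measurement itself is conceptually just "warm up, then time the whole run". A simplified sketch of that loop (not the gist code; `send_batch` stands in for whichever client is being tested):

```python
# Rough illustration of the methodology (not the gist code): warm up first,
# then time how long the full run takes. `send_batch` is whichever client
# call is being benchmarked (TF Serving, Triton, Falcon, ...).
import time

def benchmark(send_batch, batches, warmup_batches=10):
    for batch in batches[:warmup_batches]:  # warmup requests, not timed
        send_batch(batch)
    start = time.perf_counter()
    for batch in batches:  # e.g. 32000 images / 32 per batch = 1000 requests
        send_batch(batch)
    return time.perf_counter() - start
```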

| Framework | Model | Model Type | Images | Batch size | Time (s) |
| --- | --- | --- | --- | --- | --- |
| Tensorflow | ResNet50 | TF Savedmodel | 32000 | 32 | 83.189 |
| Tensorflow | ResNet50 | TF Savedmodel | 32000 | 10 | 86.897 |
| Tensorflow Serving | ResNet50 | TF Savedmodel | 32000 | 32 | 120.496 |
| Tensorflow Serving | ResNet50 | TF Savedmodel | 32000 | 10 | 116.887 |
| Triton (TensorRT Inference Server) | ResNet50 | TF Savedmodel | 32000 | 32 | 201.855 |
| Triton (TensorRT Inference Server) | ResNet50 | TF Savedmodel | 32000 | 10 | 171.056 |
| Falcon + msgpack + Tensorflow | ResNet50 | TF Savedmodel | 32000 | 32 | 115.686 |
| Falcon + msgpack + Tensorflow | ResNet50 | TF Savedmodel | 32000 | 10 | 115.572 |

According to the benchmark, Triton is not ready for production yet, TF Serving is a good option for TensorFlow models, and a self-hosted service is also quite good (though you may need to implement dynamic batching for production).
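
If you do self-host, dynamic batching boils down to queueing incoming requests and flushing them as one batch when the batch is full or a short timeout expires. A minimal sketch (illustrative only, no error handling or backpressure):

```python
# Minimal sketch of dynamic batching for a self-hosted service: collect
# requests in a queue and flush them as one batch when the batch is full
# or a short timeout expires. Illustrative only: no error handling.
import queue
import threading
import time

import numpy as np

class DynamicBatcher:
    def __init__(self, infer_fn, max_batch_size=32, max_wait=0.005):
        self.infer_fn = infer_fn  # runs inference on a stacked batch
        self.max_batch_size = max_batch_size
        self.max_wait = max_wait
        self.requests = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, image):
        """Called from the request handler; blocks until the result is ready."""
        item = {"input": image, "done": threading.Event()}
        self.requests.put(item)
        item["done"].wait()
        return item["output"]

    def _loop(self):
        while True:
            batch = [self.requests.get()]  # block until the first request
            deadline = time.monotonic() + self.max_wait
            while len(batch) < self.max_batch_size:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self.requests.get(timeout=remaining))
                except queue.Empty:
                    break
            outputs = self.infer_fn(np.stack([item["input"] for item in batch]))
            for item, output in zip(batch, outputs):
                item["output"] = output
                item["done"].set()
```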

Comparison

Tensorflow Serving

https://www.tensorflow.org/tfx/serving

  • coupled with the TensorFlow ecosystem (other formats are supported, but not out of the box)
  • A/B testing
  • provide both gRPC and HTTP RESTful API
  • prometheus integration
  • batching
  • multiple models
  • preprocessing & postprocessing can be implemented with signatures
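
For the last bullet, the idea is to bake preprocessing into the SavedModel's serving signature so clients can send raw image bytes. A sketch with TF 2.x (tensor names and the export path are just examples):

```python
# Illustrative only: export ResNet50 as a SavedModel whose serving signature
# does the JPEG decoding and preprocessing, so TF Serving clients can send
# raw image bytes. Assumes TF 2.x; names and the export path are examples.
import tensorflow as tf

model = tf.keras.applications.ResNet50(weights="imagenet")

@tf.function(input_signature=[tf.TensorSpec([None], tf.string, name="image_bytes")])
def serve(image_bytes):
    def decode(b):
        img = tf.io.decode_jpeg(b, channels=3)
        img = tf.image.resize(img, (224, 224))
        return tf.keras.applications.resnet50.preprocess_input(img)
    batch = tf.map_fn(decode, image_bytes, fn_output_signature=tf.float32)
    return {"predictions": model(batch)}

# TF Serving picks up the version directory, e.g. models/resnet50/1
tf.saved_model.save(model, "resnet50/1", signatures={"serving_default": serve})
```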

Triton Inference Server

https://github.com/NVIDIA/triton-inference-server/

  • support multiple backends: ONNX, PyTorch, TensorFlow, Caffe2, TensorRT
  • both gRPC and HTTP APIs, with client SDKs (see the sketch below)
  • internal health check and prometheus metrics
  • batching
  • concurrent model execution
  • preprocessing & postprocessing can be done with ensemble models
  • shm-size, memlock, stack configurations are not available for Kubernetes
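
As a reference for the HTTP side, newer Triton releases expose a KServe-style v2 REST API (the 19.10 image benchmarked above still used the older v1 API, so treat the paths and tensor names below as illustrative):

```python
# Illustrative sketch of Triton's HTTP/v2 inference API (newer releases; the
# 19.10 tensorrtserver image benchmarked above exposed an older v1 API).
# The model name and input/output tensor names must match the model config.
import numpy as np
import requests

batch = np.random.rand(1, 224, 224, 3).astype(np.float32)
payload = {
    "inputs": [{
        "name": "input",
        "shape": list(batch.shape),
        "datatype": "FP32",
        "data": batch.flatten().tolist(),
    }],
    "outputs": [{"name": "predictions"}],
}
resp = requests.post("http://localhost:8000/v2/models/resnet50/infer", json=payload)
print(resp.json()["outputs"][0]["shape"])
```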

Multi Model Server

https://github.com/awslabs/multi-model-server

  • require Java 8
  • provide HTTP
  • Java layer communicates with Python workers through Unix Domain Socket or TCP
  • batching (not mature)
  • multiple models
  • log4j
  • management API
  • you need to write the model loading and inference code yourself (which means you can use any runtime you want; see the sketch after this list)
  • easy to add preprocessing and postprocessing to the service
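
As a rough idea of what "write model loading and inference code" means here: MMS loads a custom service module and calls its `handle(data, context)` entry point. The sketch below assumes a TensorFlow SavedModel and a raw float32 payload; the payload layout and model directory lookup are simplifications, not the exact MMS contract.

```python
# Rough sketch of an MMS custom service module. MMS calls handle(data, context)
# with a batch of requests; the payload layout, model path lookup, and the
# TensorFlow runtime here are assumptions for illustration.
import numpy as np
import tensorflow as tf

class ResNetService:
    def __init__(self):
        self.infer = None

    def initialize(self, context):
        # MMS unpacks the model archive into the model directory
        model_dir = context.system_properties.get("model_dir")
        self.infer = tf.saved_model.load(model_dir).signatures["serving_default"]

    def inference(self, data):
        # assume each request body is a raw float32 image, 224x224x3
        batch = np.stack([
            np.frombuffer(row.get("body") or row.get("data"), dtype=np.float32)
              .reshape(224, 224, 3)
            for row in data
        ])
        outputs = self.infer(tf.constant(batch))
        probs = next(iter(outputs.values())).numpy()
        return [p.tolist() for p in probs]  # one response item per request

_service = ResNetService()

def handle(data, context):
    # MMS entry point, referenced when building the model archive
    if data is None:
        return None
    if _service.infer is None:
        _service.initialize(context)
    return _service.inference(data)
```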

GraphPipe

https://oracle.github.io/graphpipe

  • uses FlatBuffers, which is more efficient
  • the last update was about 2 years ago...
  • Oracle laid off the whole team

TorchServe

https://github.com/pytorch/serve

  • forked from Multi Model Server
  • still under development...

Top comments (2)

jeff-yajun-liu

Would you provide a bit more instruction on how to run the actual benchmark code?

Ming

The ResNet50 I used is downloaded through pytorch.org/docs/stable/torchvisio... . Save the model in the required format, then you can run the benchmark code.