This post was originally published here.
Humans are generating and collecting more data than ever. We have devices in our pockets that facilitate the creation of huge amounts of data, such as photos, gps coordinates, audio, and all kinds of personal information we consciously and unconsciously reveal.
Moreover, not only are we individuals generating data for personal reasons, but we’re also collecting data unbeknownst to us from traffic and mobility control systems, video surveillance units, satellites, smart cars, and an infinite array of smart devices.
This trend is here to stay and will continue to rise exponentially. In terms of data points, the International Data Corporation (IDC) predicts that the collective sum of the world’s data will grow from 33 zettabytes (ZB) in 2019 to 175 ZB by 2025, an annual growth rate of 61%.
While we’ve been processing data, first in data centers and then in the cloud, these solutions are not suitable for highly demanding tasks with large data volumes. Network capacity and speed are pushed to the limit and new solutions are required. This is the beginning of the era of edge computing and edge devices.
In this report, we'll benchmark five novel edge devices, using different frameworks and models, to see which combinations perform best. In particular, we'll focus on performance outcomes for machine learning on the edge.
Want to jump directly to the performance outcomes? See the results in the interactive dashboard in this interactive benchmark report!
Edge computing consists of delegating data processing tasks to devices on the edge of the network, as close as possible to the data sources. This enables real-time data processing at a very high speed, which is a must for complex IoT solutions with machine learning capabilities. On top of that, it mitigates network limitations, reduces energy consumption, increases security, and improves data privacy.
Under this new paradigm, the combination of specialized hardware and software libraries optimized for machine learning on the edge results in cutting-edge applications and products ready for mass deployment.
The biggest challenges to building these amazing applications are posed by audio, video, and image processing tasks. Deep learning techniques have proven to be highly successful in overcoming these difficulties.
As an example, let’s take self-driving cars. Here, you need to quickly and consistently analyze incoming data, in order to decipher the world around you and take action within a few milliseconds. Addressing that time constraint is why we cannot rely on the cloud to process the stream of data but instead must do it locally.
The downside of doing it locally is that the hardware is not as powerful as a super computer in the cloud, and we cannot compromise on accuracy or speed.
The solution to this is either stronger, more efficient hardware, or less complex deep neural networks. To obtain the best results, a balance of the two is essential.
Therefore, the real question is:
Which edge hardware and what type of network should we bring together in order to maximize the accuracy and speed of deep learning algorithms?
In our quest to identify the optimal combination of the two, we compared several state-of-the-art edge devices in combination with different deep neural network models.
Based on what we think is the most innovative use case, we set out to measure inference throughput in real-time via a one-at-a-time image classification task, so as to get an approximate frames-per-second score.
To accomplish this, we evaluated top-1 inference accuracy across all categories of a specific subset of ImagenetV2 comparing them to some ConvNets models and, when possible, using different frameworks and optimized versions.
While there has been much effort invested over the last few years to improve existing edge hardware, we chose to experiment with these edge devices:
- Nvidia Jetson Nano
- Google Coral Dev Board
- Intel Neural Compute Stick
- Raspberry Pi (upper bound reference)
- 2080ti NVIDIA GPU (lower bound reference)
We included the Raspberry Pi and the Nvidia 2080ti so as to be able to compare the tested hardware against well-known systems, one cloud-based and one edge-based.
The lower bound was a no-brainer. At Tryolabs, we design and train our own deep learning models. Because of this, we have a lot of computing power at our disposal. So, we used it. To set this lower bound on inference times, we ran the tests on a 2080ti NVIDIA GPU. However, because we were only going to use it as a reference point, we ran the tests using basic models, with no optimizations.
For the upper bound, we went with the defending champion, the most popular single-board computer: the Raspberry Pi 3B.
For all benchmarks, we used publicly available pre-trained models, which we run with different frameworks. With respect to the Nvidia Jetson, we tried the TensorRT optimization; for the Raspberry, we used Tensor Flow and PyTorch variants; while for Coral devices, we implemented the Edge TPU engine versions of the S, M, and L EfficientNets models; and finally, regarding Intel devices, we used the Resnet-50 compiled with OpenVINO Toolkit.
We ran the inference on each image once, saved the inference time, and then found the average. We calculated the top-1 accuracy from all tests, as well as the top-5 accuracy for certain models.
Top-1 accuracy: this is conventional accuracy, meaning that the model’s answer (the one with the highest probability) must equal the exact expected answer.
Top-5 accuracy: means that any one of the model’s top five highest-probability answers must match the expected answer.
Something to keep in mind when comparing the results: for fast device-model combinations, we ran the tests incorporating the entire dataset, whereas we only used parts of datasets for the slower combinations.
The interactive dashboards display the metrics obtained from the experiments. Due to the large difference in inference times across models and devices, the parameters are shown in logarithmic scale.
Level up every day