Training a Deep Learning model isn’t only a compute-intensive task: it also requires a lot of I/O. Let’s see why.
Large datasets are usually stored on network storage, such as Amazon S3. Thus, during the training process, data needs to be loaded from network storage to instance RAM. This data loading process needs to happen as fast and as steadily as possible to keep CPUs and GPUs busy. As they are blazingly fast, any delay or unexpected latency in loading data is likely to stall them and to waste valuable training time.
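To make the stalling problem concrete, here is a minimal sketch of one common remedy: loading the next batches in a background thread while the current one is being processed. Everything in it is illustrative. `load_batch` and `train_step` are hypothetical stand-ins that simply sleep to simulate network latency and compute time, and the 20 ms figures are made up.

```python
import queue
import threading
import time


def load_batch(index, latency=0.02):
    """Hypothetical stand-in for a slow read from network storage."""
    time.sleep(latency)   # simulated network latency
    return [index] * 4    # fake batch of 4 samples


def prefetching_loader(num_batches, depth=4):
    """Yield batches while a background thread loads the next ones."""
    buffer = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for i in range(num_batches):
            buffer.put(load_batch(i))  # blocks when the buffer is full
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = buffer.get()
        if batch is done:
            return
        yield batch


def train_step(batch, compute_time=0.02):
    """Hypothetical stand-in for a forward/backward pass."""
    time.sleep(compute_time)
    return sum(batch)


def run(loader):
    """Consume a loader and return the results and the elapsed time."""
    start = time.perf_counter()
    results = [train_step(batch) for batch in loader]
    return results, time.perf_counter() - start


if __name__ == "__main__":
    n = 20
    sequential = (load_batch(i) for i in range(n))  # load, then compute
    _, t_seq = run(sequential)
    _, t_pre = run(prefetching_loader(n))           # load while computing
    print(f"sequential: {t_seq:.2f}s, prefetching: {t_pre:.2f}s")
```

In the sequential case, every step pays for loading plus compute. With prefetching, loading overlaps with compute, so the slower of the two sets the pace. The bounded queue keeps memory usage under control if loading runs ahead of training.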
I/O speed and latency are also critical to inference performance. Although many applications predict one sample at a time, overall throughput is likely to suffer if I/O isn’t consistently fast.
The purpose of training a Deep Learning model is to gradually discover the optimal set of weights (aka parameters) for that model, i.e. the set of weights that minimizes a loss function on the training set, with the validation error used to check how well the model generalizes.
This involves running an optimization algorithm (SGD or one of its many variants). At each step, backpropagation computes the gradients of the loss, which measures the difference between ground truth and predictions, and the optimizer uses these gradients to update the weights. When training on a distributed cluster of nodes, each node receives a batch of data, forwards it through the model and computes the gradients for that batch. Then, each node pushes its gradients to a central server (often called a parameter server), where results from all nodes are consolidated. Before processing a new batch, a node first pulls the latest weights, which guarantees that all nodes share the same state.
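The push/pull cycle can be sketched in a few lines of plain Python. This is a toy, single-process simulation under several assumptions of mine: the model is a linear function y = w*x + b, the loss is mean squared error, the nodes are just list slices, and the `ParameterServer` class is a hypothetical name, not the API of any framework.

```python
class ParameterServer:
    """Holds the shared weights and applies averaged gradients (SGD)."""

    def __init__(self, weights, learning_rate):
        self.weights = list(weights)
        self.learning_rate = learning_rate
        self.pending = []  # gradients pushed during the current step

    def pull(self):
        """Return a copy of the latest weights."""
        return list(self.weights)

    def push(self, gradients):
        """Receive the gradients computed by one node."""
        self.pending.append(gradients)

    def apply(self):
        """Average the pushed gradients and take one SGD step."""
        n = len(self.pending)
        for j in range(len(self.weights)):
            mean_grad = sum(g[j] for g in self.pending) / n
            self.weights[j] -= self.learning_rate * mean_grad
        self.pending = []


def gradients(weights, batch):
    """Gradient of the mean squared error for the model y = w*x + b."""
    w, b = weights
    grad_w = grad_b = 0.0
    for x, y in batch:
        error = (w * x + b) - y
        grad_w += 2 * error * x
        grad_b += 2 * error
    return [grad_w / len(batch), grad_b / len(batch)]


def train(shards, steps=500, learning_rate=0.05):
    server = ParameterServer([0.0, 0.0], learning_rate)
    for _ in range(steps):
        for shard in shards:          # each shard plays the role of one node
            weights = server.pull()   # every node starts from the same state
            server.push(gradients(weights, shard))
        server.apply()                # consolidate results from all nodes
    return server.weights


if __name__ == "__main__":
    data = [(x, 3.0 * x + 1.0) for x in [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]]
    shards = [data[0::2], data[1::2]]  # two simulated nodes
    w, b = train(shards)
    print(f"w={w:.3f}, b={b:.3f}")
```

Note that every step moves one full set of gradients from each node to the server, and one full set of weights back. With two floats this is free. With tens of millions of parameters, it is not.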
Gradients for large models can be huge: 97 MB for ResNet-50. That’s a lot of data that each node has to push and pull again and again. This puts a lot of strain on network bandwidth and can become a serious performance bottleneck. A number of techniques have been designed to compress and quantize gradients, and they help reduce the amount of data that needs to be exchanged [1, 2]. Still, network performance remains a very important factor in speeding up large distributed training jobs.
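Here is a rough sketch of both ideas: where a figure like 97 MB comes from, and how quantization shrinks it. The parameter count (about 25.6 million 32-bit values for ResNet-50) is a commonly quoted figure that I am assuming, not something measured here. The `ThresholdQuantizer` class is a hypothetical, simplified take on threshold-based quantization with a locally stored residual, in the spirit of the techniques in [1, 2], not their actual implementation.

```python
def gradient_size_mib(num_parameters, bytes_per_value=4):
    """Size of one full gradient, with one value per parameter."""
    return num_parameters * bytes_per_value / 2**20


class ThresholdQuantizer:
    """Send only -t, 0 or +t per value and keep the rounding error locally."""

    def __init__(self, size, threshold):
        self.threshold = threshold
        self.residual = [0.0] * size  # error carried over to the next step

    def compress(self, gradient):
        codes = []
        for i, g in enumerate(gradient):
            total = self.residual[i] + g
            if total >= self.threshold:
                code = 1
            elif total <= -self.threshold:
                code = -1
            else:
                code = 0
            codes.append(code)  # three possible values fit in 2 bits
            # Whatever was not sent stays in the residual for later steps.
            self.residual[i] = total - code * self.threshold
        return codes

    def decompress(self, codes):
        return [c * self.threshold for c in codes]


if __name__ == "__main__":
    # Assumption: roughly 25.6 million FP32 parameters for ResNet-50.
    print(f"{gradient_size_mib(25_600_000):.1f} MiB per gradient exchange")

    quantizer = ThresholdQuantizer(size=3, threshold=0.5)
    for step in range(3):
        codes = quantizer.compress([0.3, -0.7, 0.1])
        print(step, codes, quantizer.decompress(codes))
```

Sending 2 bits instead of 32 per value cuts the payload by a factor of 16 before any further encoding. The residual is what keeps this from simply discarding small gradients: they accumulate locally until they cross the threshold and get sent.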
[1] “Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training”, Yujun Lin, Song Han, Huizi Mao, Yu Wang, William J. Dally, 2017.
[2] “Gradient Compression”, Apache MXNet.