The Frameworks that Google, DeepMind, Microsoft and Uber Use to Train Deep Learning Models at Scale

GPipe, Horovod, TF-Replicator and DeepSpeed combine cutting edge aspects of deep learning research and infrastructure to scale the training of deep learning models.

Jesus Rodriguez

Mar 10 . 7 min read

Large-scale training is one of the most challenging aspects of building deep learning solutions in the real world. As the old proverb says, your greatest strength can become your biggest weakness, and that certainly applies to deep learning models. The entire deep learning space was made possible in part by the ability of deep neural networks to scale across GPU topologies. However, that same ability to scale resulted in computationally intensive programs that are operationally challenging for most organizations. From training to optimization, the lifecycle of deep learning programs requires robust infrastructure building blocks to parallelize and scale computation workloads. While deep learning frameworks are evolving at a rapid pace, the corresponding infrastructure models remain relatively nascent. Over the last few years, technology giants Google, Microsoft, Uber, DeepMind and others have regularly unveiled separate efforts for enabling the parallelization of deep learning models across large GPU infrastructures.

The principle of distributing and parallelizing computations is relevant across almost any stage of the lifecycle of a deep learning program. Training a deep learning model can be an incredibly expensive exercise, and so is its execution. The obvious answer is to leverage large GPU networks to distribute the workloads of deep learning programs, but that’s far from being an easy endeavor. Concurrent and parallel programming is notoriously complex, and even more so when applied to large neural networks. Large technology companies face these challenges every day as they operate incredibly complex deep neural networks for mission-critical applications. Today, I would like to review some of the top architectures used at Google, DeepMind, Microsoft and Uber to parallelize the training of large-scale deep learning models. Specifically, I would like to discuss the following projects:

. Google GPipe

. Uber Horovod

. DeepMind TF-Replicator

. Microsoft DeepSpeed

GPipe focuses on scaling training workloads for deep learning programs. The complexity of training processes from an infrastructure standpoint is an often-overlooked aspect of deep learning models. Training datasets are getting larger and more complex. For instance, in the health care space, it is not uncommon to encounter models that need to be trained using millions of high-resolution images. As a result, training processes often take a long time to complete and are incredibly expensive in terms of memory and CPU consumption.

An effective way to think about the parallelism of deep learning models is to divide it between data and model parallelism. The data parallelism approach employs large clusters of machines to split the input data across them. Model parallelism attempts to move the model to accelerators, such as GPUs or TPUs, which have special hardware to accelerate model training. At a high level, almost all training datasets can be parallelized following certain logic but the same can’t be said about models. For instance, some deep learning models are composed of parallel branches which can be trained independently. In that case, a classic strategy is to divide the computation into partitions and assign different partitions to different branches. However, that strategy falls short in deep learning models that stack layers sequentially, presenting a challenge to parallelize computation efficiently.
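
To make the data parallelism half of that split concrete, here is a minimal sketch in plain Python. It is a toy, not code from any of the frameworks discussed here: the one-weight linear model, the function names and the equal-sized shards are all assumptions made for illustration. Each "worker" computes a gradient on its own shard of the batch, and averaging those gradients recovers the gradient of the full batch.

```python
# Toy illustration of data parallelism: every worker holds a full copy of the
# model (here a single weight w) and only sees its own shard of the batch.
# All names below are hypothetical and chosen for this sketch.


def mse_gradient(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w over one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def split_into_shards(items, num_workers):
    """Split a list into num_workers equally sized contiguous shards."""
    if len(items) % num_workers != 0:
        raise ValueError("batch size must be divisible by num_workers")
    size = len(items) // num_workers
    return [items[i * size:(i + 1) * size] for i in range(num_workers)]


def data_parallel_gradient(w, xs, ys, num_workers):
    """Average the per-shard gradients, as a synchronous all-reduce would."""
    x_shards = split_into_shards(xs, num_workers)
    y_shards = split_into_shards(ys, num_workers)
    local_grads = [
        mse_gradient(w, xs_k, ys_k) for xs_k, ys_k in zip(x_shards, y_shards)
    ]
    return sum(local_grads) / num_workers


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
full_batch_grad = mse_gradient(0.5, xs, ys)
sharded_grad = data_parallel_gradient(0.5, xs, ys, num_workers=2)
```

The two results match only because the shards have equal size; with uneven shards the average would need to be weighted by shard size.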

GPipe combines both data and model parallelism by leveraging pipeline parallelism. Conceptually, GPipe is a distributed machine learning library that uses synchronous stochastic gradient descent and pipeline parallelism for training, applicable to any DNN that consists of multiple sequential layers. GPipe partitions a model across different accelerators and automatically splits a mini-batch of training examples into smaller micro-batches. This model allows GPipe’s accelerators to operate in parallel, maximizing the scalability of the training process.

The following figure illustrates the GPipe model, in which a neural network with sequential layers is partitioned across four accelerators. Fk is the composite forward computation function of the kth partition. Bk is the corresponding backpropagation function. Bk depends on both Bk+1 from the upper layer and the intermediate activations of Fk. In the top model, we can see how the sequential nature of the network leads to the underutilization of resources. The bottom figure shows the GPipe approach, in which the input mini-batch is divided into smaller micro-batches that can be processed by the accelerators at the same time.
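
The effect of splitting a mini-batch into micro-batches can be sketched with a small scheduling simulation. This is not GPipe code: it assumes every partition takes one time step per micro-batch, it covers only the forward pass, and the function names are hypothetical. Under those assumptions, partition k can work on micro-batch m at time step k + m.

```python
# Simplified forward-pass schedule for pipeline parallelism.
# Assumption: each partition needs exactly one time step per micro-batch.


def forward_schedule(num_partitions, num_micro_batches):
    """Return, for each time step, the (partition, micro_batch) pairs that run.

    Partition k can start micro-batch m only after partition k - 1 has
    finished it, so the pair (k, m) runs at time step k + m.
    """
    total_steps = num_partitions + num_micro_batches - 1
    schedule = [[] for _ in range(total_steps)]
    for k in range(num_partitions):
        for m in range(num_micro_batches):
            schedule[k + m].append((k, m))
    return schedule


def idle_fraction(num_partitions, num_micro_batches):
    """Fraction of accelerator time slots left idle during the forward pass."""
    total_steps = num_partitions + num_micro_batches - 1
    total_slots = num_partitions * total_steps
    busy_slots = num_partitions * num_micro_batches
    return 1.0 - busy_slots / total_slots


if __name__ == "__main__":
    for step, work in enumerate(forward_schedule(4, 4)):
        print(step, work)
```

In this simplified model, four partitions processing a single undivided batch leave 75% of the time slots idle, while four micro-batches reduce that to 3/7 (about 43%). More micro-batches shrink the idle share further.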

Google has open sourced an implementation of GPipe as part of the TensorFlow project.

Horovod is one of the Uber ML stacks that has become extremely popular within the community and has been adopted by research teams at AI powerhouses like DeepMind or OpenAI. Conceptually, Horovod is a framework for running distributed deep learning training jobs at scale.

Horovod leverages Message Passing Interface (MPI) stacks to enable a training job to run on a highly parallel and distributed infrastructure without any modifications. Running a distributed TensorFlow training job in Horovod is accomplished in four simple steps:

  1. hvd.init() initializes Horovod.
  2. config.gpu_options.visible_device_list = str(hvd.local_rank()) assigns a GPU to each of the TensorFlow processes.
  3. opt = hvd.DistributedOptimizer(opt) wraps any regular TensorFlow optimizer with the Horovod optimizer, which takes care of averaging gradients using ring-allreduce.
  4. hvd.BroadcastGlobalVariablesHook(0) broadcasts variables from the first process to all other processes to ensure consistent initialization.

You can see these four steps in the following code sample that is a template of a basic TensorFlow training job:

import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used to process local rank (one GPU per process)
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Build model...
loss = ...
opt = tf.train.AdagradOptimizer(0.01)

# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)

# Add hook to broadcast variables from rank 0 to all other processes during
# initialization.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Make training operation
train_op = opt.minimize(loss)

# The MonitoredTrainingSession takes care of session initialization,
# restoring from a checkpoint, saving to a checkpoint, and closing when done
# or an error occurs.
with tf.train.MonitoredTrainingSession(checkpoint_dir="/tmp/train_logs",
                                       config=config,
                                       hooks=hooks) as mon_sess:
    while not mon_sess.should_stop():
        # Perform synchronous training.
        mon_sess.run(train_op)
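
Step 3 relies on ring-allreduce to average gradients across workers. The sketch below simulates that algorithm in plain Python to show the idea. It is not Horovod's implementation, and the function name, the list-based "workers" and the requirement that the gradient length be divisible by the number of workers are assumptions made for illustration. Each worker only ever sends one chunk to its right-hand neighbor per step: first to accumulate partial sums (scatter-reduce), then to circulate the finished chunks (all-gather).

```python
# Pure-Python simulation of ring-allreduce gradient averaging.
# Workers are modeled as lists; "sending" is copying a chunk to the next rank.


def ring_allreduce_average(worker_grads):
    """Return each worker's view of the element-wise average gradient."""
    n = len(worker_grads)
    length = len(worker_grads[0])
    if any(len(g) != length for g in worker_grads):
        raise ValueError("all workers must hold gradients of the same length")
    if length % n != 0:
        raise ValueError("gradient length must be divisible by worker count")
    chunk = length // n
    bufs = [list(g) for g in worker_grads]  # private buffer per worker

    def chunk_slice(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1, scatter-reduce: after n - 1 steps, worker r holds the complete
    # sum for chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot all outgoing messages first so sends are simultaneous.
        messages = [
            ((rank - step) % n, bufs[rank][chunk_slice((rank - step) % n)])
            for rank in range(n)
        ]
        for rank, (c, payload) in enumerate(messages):
            dst = (rank + 1) % n
            sl = chunk_slice(c)
            bufs[dst][sl] = [a + b for a, b in zip(bufs[dst][sl], payload)]

    # Phase 2, all-gather: pass the completed chunks around the ring.
    for step in range(n - 1):
        messages = [
            ((rank + 1 - step) % n,
             bufs[rank][chunk_slice((rank + 1 - step) % n)])
            for rank in range(n)
        ]
        for rank, (c, payload) in enumerate(messages):
            dst = (rank + 1) % n
            bufs[dst][chunk_slice(c)] = payload

    return [[value / n for value in buf] for buf in bufs]
```

Every worker sends and receives 2 × (n − 1) chunks in total, each 1/n of the gradient, so the data moved per worker stays roughly constant as workers are added.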

TF-Replicator focuses on a different aspect of scalability: how TensorFlow programs leverage Tensor Processing Units (TPUs). Considered one of the most advanced AI chips, TPUs provide native scalability for machine learning workloads. However, using TPUs in TensorFlow programs causes issues with portability and creates barriers to adoption for data scientists who are not familiar with the underlying hardware model. DeepMind’s TF-Replicator addresses this challenge by providing a simpler, developer-friendly programming model for leveraging TPUs in TensorFlow programs.

The magic of TF-Replicator relies on an “in-graph replication” pattern in which the computation for each device is replicated in the same TensorFlow graph. Communication between devices is achieved by connecting nodes from the devices’ corresponding sub-graphs. To achieve that level of parallelization, TF-Replicator rewrites the graph to insert native communication between devices. When presented with a TensorFlow graph, TF-Replicator first builds the computation for each device independently and leaves placeholders where cross-device computation has been specified by the user. Once the sub-graphs for all devices have been built, TF-Replicator connects them by replacing the placeholders with actual cross-device computation.
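
The two-pass idea (build each device's computation with placeholders, then connect them) can be mimicked with a toy in plain Python. This is only an analogy and does not reflect TF-Replicator's actual implementation. The Placeholder class, the dictionary-based "sub-graphs" and the single supported "mean" operation are all hypothetical.

```python
# Toy analogy for in-graph replication: per-device sub-graphs are built
# independently with placeholders, then wired together in a second pass.


class Placeholder:
    """Marks a value that needs data from every device and is not wired yet."""

    def __init__(self, op, source_key):
        self.op = op                  # cross-device operation, e.g. "mean"
        self.source_key = source_key  # per-device value the operation reads


def build_replica(device, shard):
    """Pass 1: build one device's computation without seeing other devices."""
    return {
        "device": device,
        "local_loss": sum(shard) / len(shard),
        "global_loss": Placeholder(op="mean", source_key="local_loss"),
    }


def connect_replicas(replicas):
    """Pass 2: replace every placeholder with the real cross-device value."""
    connected = [dict(replica) for replica in replicas]
    for key, node in replicas[0].items():
        if not isinstance(node, Placeholder):
            continue
        values = [replica[node.source_key] for replica in replicas]
        if node.op != "mean":
            raise ValueError("unsupported cross-device op: " + node.op)
        result = sum(values) / len(values)
        for replica in connected:
            replica[key] = result
    return connected


shards = [[1.0, 3.0], [5.0, 7.0]]
replicas = [build_replica("device:%d" % i, s) for i, s in enumerate(shards)]
connected = connect_replicas(replicas)
```

After the second pass, every replica sees the same global_loss, even though each was built knowing only its own shard.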

From the programming model standpoint, code written using TF-Replicator looks similar to native TensorFlow code authored for a single device. The user simply needs to define (1) an input function that exposes a Dataset, and (2) a step function that defines the logic of their model (e.g. a single step of gradient descent). The following code snippet shows a simple TF-Replicator program:

# Deploying a model with TpuReplicator.
repl = tf_replicator.TpuReplicator(
    num_workers=1, num_tpu_cores_per_worker=8
)
with repl.context():
    model = resnet_model()
    base_optimizer = tf.train.AdamOptimizer()
    optimizer = repl.wrap_optimizer(base_optimizer)

# ... code to define replica input_fn and step_fn.

per_replica_loss = repl.run(step_fn, input_fn)
train_op = tf.reduce_mean(per_replica_loss)

with tf.train.MonitoredSession() as session:
    for i in range(num_train_steps):
        session.run(train_op)

To optimize the communication between different devices, TF-Replicator leverages state-of-the-art MPI interfaces. In some experiments, DeepMind was able to train the famous model on batches of size 2048 across up to 512 cores of a TPUv3 pod. At the moment, TF-Replicator is the main programming interface for TPUs at DeepMind.

Microsoft’s DeepSpeed is a new open source framework focused on optimizing the training of massively large deep learning models. The current release includes the first implementation of ZeRO as well as other optimization methods. From a programming standpoint, DeepSpeed is built on top of PyTorch and provides a simple API that allows engineers to leverage training parallelization techniques with just a few lines of code. DeepSpeed abstracts all of the difficult aspects of large scale training, such as parallelization, mixed precision, gradient accumulation, and checkpoints allowing developers to focus on the construction of the models.

From the functional standpoint, DeepSpeed excels at four key aspects:

. Scale: DeepSpeed provides system support to run models up to 100 billion parameters which represents a 10x improvement compared to other training optimization frameworks.

. Speed: In the initial tests, DeepSpeed showed 4x-5x higher throughput than other libraries.

. Cost: Models could be trained using DeepSpeed at three times less cost than alternatives.

. Usability: DeepSpeed does not require refactoring PyTorch models and could be used with just a few lines of code.

The training of deep learning models is a very complex exercise that falls beyond the expertise of most machine learning teams. Leveraging the frameworks and architectures created by technology companies like Google, Microsoft, Uber or DeepMind can certainly streamline these efforts. In the near future, we expect to see some versions of these frameworks included in mainstream deep learning stacks to democratize access for the core deep learning community.
