Bagua: A new, efficient distributed training strategy available in PyTorch Lightning

We’ve added support for a new distributed training strategy in collaboration with the Bagua team.

Carlos Mocholí
PyTorch Lightning Developer Blog
3 min read · Apr 12, 2022


We also covered the benefits of this training strategy in a live stream.

What is Bagua?

BaguaSys/Bagua is a deep learning acceleration framework for PyTorch developed by the AI Platform team at Kuaishou Technology and the DS3 Lab at ETH Zürich.

Bagua supports multiple advanced distributed training algorithms built on state-of-the-art system relaxation techniques, including quantization, decentralization, and communication delay.

Training throughput for VGG16 on ImageNet with 128 GPUs under different network bandwidths per algorithm. Results show that Bagua can achieve multiple times the speedup compared to other systems.

Bagua generally achieves higher training throughput than vanilla PyTorch DistributedDataParallel (the de facto strategy), thanks to its custom communication techniques and its Rust implementation. You can check out this paper for more details on the Bagua system, as well as this book to learn more about the theoretical guarantees of the algorithms involved. Its effectiveness has been validated across several models and scenarios:

End-to-end training performance on three tasks. For each task, the best Bagua algorithm (according to training efficiency and accuracy) was selected and compared with other systems.

Algorithms

Bagua thrives on the diversity of its distributed communication algorithms. The system’s flexibility allows it to smoothly incorporate various state-of-the-art algorithms while automatically optimizing performance during execution. Each algorithm makes different tradeoffs in terms of inter-machine communication latency and bandwidth, network availability, and convergence rate:

  • Gradient AllReduce for centralized synchronous communication, where gradients are averaged among all workers.
  • Decentralized SGD for decentralized synchronous communication, where each worker exchanges data with one or a few specific workers.
  • ByteGrad and QAdam for low precision communication, where data is compressed into low precision before communication.
  • Asynchronous Model Average for asynchronous communication, where workers are not required to synchronize within the same iteration in a lock-step style.

Readers interested in specific algorithmic and runtime details can find more information here, including references to the published papers.

In addition to these distributed algorithms, Bagua offers a set of system optimizations such as load-balanced DataLoaders and fused optimizers that haven’t yet been integrated into PyTorch Lightning.

Code

Enabling Bagua with PyTorch Lightning is as simple as changing the Trainer’s strategy flag.
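Here is a minimal sketch of what this looks like, where MyModel is a placeholder for your own LightningModule and a single machine with 4 GPUs is assumed:

import pytorch_lightning as pl

# MyModel is a placeholder for your own LightningModule
model = MyModel()

# Selecting the strategy by name is enough; Lightning sets up the Bagua backend
trainer = pl.Trainer(strategy="bagua", accelerator="gpu", devices=4)
trainer.fit(model)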

You can also choose a specific communication algorithm.
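For example, a sketch selecting the decentralized algorithm through the BaguaStrategy class (same placeholder model and 4-GPU setup as above):

import pytorch_lightning as pl
from pytorch_lightning.strategies import BaguaStrategy

model = MyModel()  # placeholder for your own LightningModule

# Pass the algorithm name to BaguaStrategy instead of using the plain "bagua" string
trainer = pl.Trainer(
    strategy=BaguaStrategy(algorithm="decentralized"),
    accelerator="gpu",
    devices=4,
)
trainer.fit(model)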

For more information about strategies, check out our previous blog post.

By default, Bagua uses the gradient_allreduce algorithm, which is also the algorithm implemented in DistributedDataParallel and Horovod.

Other valid algorithm inputs are: bytegrad, decentralized, low_precision_decentralized, async, qadam.
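Algorithm-specific options can also be set. As a sketch, assuming BaguaStrategy forwards extra keyword arguments to the chosen algorithm (as with sync_interval_ms for async), this could look like:

import pytorch_lightning as pl
from pytorch_lightning.strategies import BaguaStrategy

# Extra keyword arguments are assumed to be forwarded to the Bagua algorithm;
# sync_interval_ms controls how often async workers average their models.
trainer = pl.Trainer(
    strategy=BaguaStrategy(algorithm="async", sync_interval_ms=100),
    accelerator="gpu",
    devices=4,
)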

Launching the distributed script is as easy as running python my_bagua_trainer.py. If you are interested in manually launching the processes or setting up a multi-node environment, check out our Bagua docs here.

Next steps

Finally, before we go, let us introduce you to Grid.ai. Built by the creators of PyTorch Lightning, this platform enables you to scale your model training without worrying about infrastructure, similar to how Lightning automates training.

Grid.ai enables you to scale training from your laptop to the cloud without modifying a single line of code. By leveraging Lightning features such as Early Stopping, Integrated Logging, Automatic Checkpointing, and the CLI, it makes the traditional MLOps behind model training nearly invisible.

About the author

Carlos Mocholí is a research engineer at Grid.ai and tech lead of PyTorch Lightning, the lightweight wrapper for boilerplate-free PyTorch research. Previously, Carlos worked as a Research Engineer on Handwritten Text Recognition. He holds an MSc in AI from the University of Edinburgh.
