Announcing the Stable Accelerator and Strategy API for PyTorch Lightning

As new AI-optimized accelerators have emerged, the Lightning team has re-engineered the internals of the Lightning Trainer to make integrations of new hardware and distributed algorithms as smooth as possible.

Adrian Wälchli
PyTorch Lightning Developer Blog
4 min read · Apr 5, 2022

Towards a Stable Accelerator API

When PyTorch Lightning was born three years ago, it gave researchers easy access to multi-node/multi-GPU training without code changes. Today, GPUs are still the most popular choice for training large neural networks, and that ease of access is a big part of why people love Lightning. As new AI-optimized accelerators such as the TPU, Graphcore’s IPU, or Habana’s Gaudi processor have since emerged, the Lightning team has re-engineered the internals of the Lightning Trainer to make integrating new hardware and distributed algorithms as smooth as possible. Today, we announce that we have converged on a stable API that will allow the community to extend Lightning in ways never before possible.

Convert Trainer Arguments

Previously, we announced a new syntax for selecting accelerators and devices via the Trainer. Before that, devices were specified with the {x}pus notation (gpus, tpu_cores, and so on). Those flags still work for backward compatibility, but we recommend switching to the new syntax:

Old trainer API — training flags.
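
For reference, the old {x}pus-style flags looked roughly like this (a sketch; the exact arguments depend on your hardware):

```python
from pytorch_lightning import Trainer

# Old syntax: one flag per accelerator type
trainer = Trainer(gpus=4)        # 4 GPUs
trainer = Trainer(tpu_cores=8)   # 8 TPU cores
trainer = Trainer(ipus=4)        # 4 IPUs
```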

Now, try the following:

New trainer API — training flags.
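
With the new syntax, the hardware type and the number of devices are selected independently. A sketch of the equivalent calls:

```python
from pytorch_lightning import Trainer

# New syntax: pick the hardware with `accelerator`, the amount with `devices`
trainer = Trainer(accelerator="gpu", devices=4)
trainer = Trainer(accelerator="tpu", devices=8)
trainer = Trainer(accelerator="ipu", devices=4)

# Or let Lightning choose the best available hardware automatically
trainer = Trainer(accelerator="auto", devices="auto")
```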

And, of course, if you are using the LightningCLI, it supports the new syntax as well!
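
For instance, a minimal CLI script (a sketch; MyModel is a hypothetical LightningModule, and the import path reflects the Lightning release at the time of writing) accepts the same flags from the command line:

```python
from pytorch_lightning.utilities.cli import LightningCLI

from my_project import MyModel  # hypothetical LightningModule

# Run with, e.g.: python train.py fit --trainer.accelerator=gpu --trainer.devices=4
cli = LightningCLI(MyModel)
```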

Rename Plugins to Strategy

Sometimes, it’s possible to squeeze out extra performance by tweaking the settings of the strategy. Previously, we instantiated a training-type Plugin with the extra parameters and passed it to the Trainer. These plugins have been renamed to Strategies and are now handled by the dedicated strategy Trainer argument:

Renamed Plugins to Strategy.
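
For example, a DDP setup with a tweaked setting moves from the plugins argument to the strategy argument roughly as follows:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy

# Before:
# from pytorch_lightning.plugins import DDPPlugin
# trainer = Trainer(gpus=4, accelerator="ddp", plugins=DDPPlugin(find_unused_parameters=False))

# Now: pass a configured Strategy instance ...
trainer = Trainer(accelerator="gpu", devices=4, strategy=DDPStrategy(find_unused_parameters=False))

# ... or just its name if the defaults are fine
trainer = Trainer(accelerator="gpu", devices=4, strategy="ddp")
```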

Accelerator vs. Strategy vs. Plugins

This new update marks a significant leap forward in our efforts to separate the responsibilities of Accelerator, Strategy, and Plugins:

  • Accelerator is responsible for translating calls to the specific hardware Lightning is running on (CPU, GPU, TPU, etc.).
  • Strategy is responsible for implementing communication between devices. When hardware access is needed, the Strategy calls into the Accelerator. Examples of strategies are DP, DDP, Bagua, …
  • Plugin is a customizable component that the Strategy can call to enable additional functionality depending on the Trainer settings. Examples of Plugins are Precision Plugins (mixed, bfloat16, double), CheckpointIO, …
Illustration showing how Strategy, Accelerator, and Plugins are composed in Lightning.

The abstractions of these three components are crucial in enabling the Lightning Loops to be fully hardware-agnostic. This way, the community can quickly develop experimental features, accelerators, and strategies without dealing with other parts of the Lightning code base. Learn more about Strategies from our official documentation.
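
As a rough sketch of how the three layers map onto Trainer arguments:

```python
from pytorch_lightning import Trainer

trainer = Trainer(
    accelerator="gpu",   # Accelerator: which hardware to talk to
    devices=4,
    strategy="ddp",      # Strategy: how the devices communicate
    precision=16,        # handled by a precision Plugin under the hood
)
```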

Create Custom Accelerators and Strategies

To illustrate how a new accelerator can be integrated with Lightning, let’s imagine we are working with a fictional XPU accelerator, and we have access to its hardware through a library xpulib.

We first implement the Accelerator class by overriding all abstract methods:

A sample implementation of a custom Accelerator.
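
A minimal sketch of what this could look like is below. The method names follow the stable Accelerator interface, and the xpulib calls (device_count, is_available, set_device) are assumed for illustration:

```python
import torch
import xpulib  # fictional library giving access to the XPU hardware

from pytorch_lightning.accelerators import Accelerator


class XPUAccelerator(Accelerator):
    """Support for the fictional XPU hardware."""

    @staticmethod
    def parse_devices(devices):
        # Interpret the `devices` Trainer argument as a list of device indices
        return list(range(devices)) if isinstance(devices, int) else devices

    @staticmethod
    def get_parallel_devices(devices):
        # Convert the device indices into torch.device objects
        return [torch.device("xpu", idx) for idx in devices]

    @staticmethod
    def auto_device_count():
        # How many devices to use when the user passes devices="auto"
        return xpulib.device_count()

    @staticmethod
    def is_available():
        return xpulib.is_available()

    def setup_environment(self, root_device):
        # Bind the current process to its XPU device
        xpulib.set_device(root_device.index)

    def get_device_stats(self, device):
        # Optionally report device statistics to loggers
        return {}

    def teardown(self):
        # Release any resources acquired in setup_environment()
        pass
```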

We want to take advantage of parallel execution, so we need a Strategy that enables communication between the XPU devices. Here is a sketch of such an XPU Strategy:

A sample implementation of a custom Strategy.
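
One way to build this (a sketch, assuming xpulib offers its own process-group and collective primitives) is to subclass the existing DDPStrategy and override the distributed setup:

```python
import xpulib  # fictional library (communication primitives assumed)

from pytorch_lightning.strategies import DDPStrategy


class XPUDDPStrategy(DDPStrategy):
    """Communication between multiple XPU devices using the DDP protocol."""

    strategy_name = "xpu_ddp"

    def setup_distributed(self):
        # Initialize the process group on xpulib's collective communication
        # backend instead of the default NCCL/Gloo backends
        xpulib.init_process_group(
            world_size=self.cluster_environment.world_size(),
            rank=self.cluster_environment.global_rank(),
        )
```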

Putting it all together, we can now use XPU with the Trainer:
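
Using the classes sketched above, this could look roughly like:

```python
from pytorch_lightning import Trainer

# `model` is any LightningModule; XPUAccelerator and XPUDDPStrategy are the
# sketches from above
trainer = Trainer(accelerator=XPUAccelerator(), strategy=XPUDDPStrategy(), devices=4)
trainer.fit(model)
```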

This is just a sneak peek. More features and APIs around accelerators will help members of the Lightning community leverage state-of-the-art AI hardware. You can learn more in our official documentation.

Next Steps

The Lightning team is more committed than ever to providing the best possible experience to anyone training models with PyTorch, and because the PyTorch Lightning API is already stable, breaking changes will be minimal.

If you’re interested in helping out with these efforts, find us on Slack!

Finally, before we go, let us introduce you to Grid.ai. Built by the creators of PyTorch Lightning, this platform enables you to scale your model training without worrying about infrastructure, similar to how Lightning automates training.

Grid.ai enables you to scale training from your laptop to the cloud without modifying a single line of code. Leveraging Lightning features such as Early Stopping, Integrated Logging, Automatic Checkpointing, and the CLI, it makes the traditional MLOps behind model training practically invisible.

About the Author

Adrian Wälchli is a research engineer at Grid.ai and maintainer of PyTorch Lightning, the lightweight wrapper for boilerplate-free PyTorch research. Before that, Adrian was a Ph.D. student at the University of Bern, Switzerland, with an MSc in Computer Science, focusing on Deep Learning for Computer Vision.
