Announcing the Stable Accelerator and Strategy API for PyTorch Lightning
As new AI-optimized accelerators have emerged, the Lightning team has re-engineered the internals of the Lightning Trainer to make integrations of new hardware and distributed algorithms as smooth as possible.
Towards a Stable Accelerator API
When PyTorch Lightning was born three years ago, it granted researchers easy access to multi-node/multi-GPU training without code changes. Today, GPUs are still the most popular choice for training large neural networks, and this ease of access is a big part of why people love Lightning. As new AI-optimized accelerators such as the TPU, Graphcore’s IPU, or Habana’s Gaudi® processor have since emerged, the Lightning team has re-engineered the internals of the Lightning Trainer to make integrating new hardware and distributed algorithms as smooth as possible. Today, we announce that we have converged on a stable API that will allow the community to extend Lightning in ways never before possible.
Convert Trainer Arguments
Previously, we announced a new syntax for selecting accelerators and devices via the Trainer. Before that, devices were specified with accelerator-specific arguments following the {x}pus notation, such as gpus. While the old notation is still backward compatible, we recommend switching to the new syntax.
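As a minimal sketch of the migration (assuming PyTorch Lightning 1.6 or newer and a machine with two GPUs):

```python
from pytorch_lightning import Trainer

# Old notation: one device-count argument per accelerator type
# (still works, but deprecated in favor of the new syntax)
trainer = Trainer(gpus=2)

# New notation: accelerator type and device count are separate arguments
trainer = Trainer(accelerator="gpu", devices=2)

# "auto" lets Lightning pick the best accelerator available on the machine
trainer = Trainer(accelerator="auto", devices="auto")
```

Decoupling the accelerator choice from the device count means the same pair of arguments works for every hardware type.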
And, of course, if you are using the LightningCLI, it supports the new syntax as well!
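For example, assuming a hypothetical training script trainer.py that instantiates a LightningCLI, the new arguments can be passed on the command line:

```shell
# select the GPU accelerator and two devices through the LightningCLI
python trainer.py fit --trainer.accelerator gpu --trainer.devices 2
```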
Rename Plugins to Strategy
Sometimes, it’s possible to squeeze out extra performance by tweaking the settings of a strategy. Previously, we instantiated a Plugin and passed in additional parameters. These plugins have been renamed to Strategy and are now configured through the strategy argument of the Trainer.
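As an illustration (a sketch assuming PyTorch Lightning 1.6 or newer and a multi-GPU machine), the DDP strategy can be selected by name with its defaults, or instantiated directly to tweak its parameters:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy

# Shorthand: select the strategy by name with default settings
trainer = Trainer(accelerator="gpu", devices=4, strategy="ddp")

# Or instantiate the strategy class to pass additional parameters
trainer = Trainer(
    accelerator="gpu",
    devices=4,
    strategy=DDPStrategy(find_unused_parameters=False),
)
```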
Accelerator vs. Strategy vs. Plugins
This new update marks a significant leap forward in our efforts to separate the responsibilities of Accelerator, Strategy, and Plugins:
- Accelerator is responsible for translating calls to the specific hardware Lightning is running on (CPU, GPU, TPU, etc.).
- Strategy is responsible for implementing communication between devices. When hardware access is needed, the Strategy calls into the Accelerator. Examples of strategies are DP, DDP, Bagua, and more.
- Plugin is a customizable component that the Strategy can call to enable additional functionality depending on the Trainer settings. Examples of Plugins are precision plugins (mixed, bfloat16, double), CheckpointIO, and more.
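To see how the three components compose, here is a sketch of a Trainer configuration that touches all of them (assuming PyTorch Lightning 1.6 or newer and available GPUs):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy

trainer = Trainer(
    accelerator="gpu",       # Accelerator: which hardware to run on
    devices=2,
    strategy=DDPStrategy(),  # Strategy: how the devices communicate
    precision=16,            # Lightning selects the matching precision Plugin
)
```

Each component can be swapped independently of the other two, which is exactly what makes the separation of responsibilities useful.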
The abstractions of these three components are crucial in enabling the Lightning Loops to be fully hardware-agnostic. This way, the community can quickly develop experimental features, accelerators, and strategies without dealing with other parts of the Lightning code base. Learn more about Strategies from our official documentation.
Create Custom Accelerators and Strategies
To illustrate how a new accelerator can be integrated with Lightning, let’s imagine we are working with a fictional XPU accelerator, and that we have access to its hardware through a library called xpulib.
We first implement the Accelerator class by overriding all of its abstract methods:
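A sketch of what such a class might look like is shown below. The method names follow the Accelerator interface as of PyTorch Lightning 1.6, and the xpulib calls (available_devices, is_available) are of course hypothetical:

```python
from typing import Any, Dict, List, Union

import torch
from pytorch_lightning.accelerators import Accelerator

import xpulib  # fictional vendor library


class XPUAccelerator(Accelerator):
    """Support for the fictional XPU hardware."""

    @staticmethod
    def parse_devices(devices: Any) -> Any:
        # Interpret the value the user passed to Trainer(devices=...)
        return devices

    @staticmethod
    def get_parallel_devices(devices: List[int]) -> List[torch.device]:
        # Convert device indices to torch.device objects
        return [torch.device("xpu", idx) for idx in devices]

    @staticmethod
    def auto_device_count() -> int:
        # Used when the user selects Trainer(devices="auto")
        return xpulib.available_devices()

    @staticmethod
    def is_available() -> bool:
        return xpulib.is_available()

    def get_device_stats(self, device: Union[str, torch.device]) -> Dict[str, Any]:
        # Optional device statistics, e.g. for logging memory usage
        return {}
```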
We want to take advantage of parallel execution, so we need a Strategy that enables communication between the XPU devices. Here is a sketch of such an XPU strategy:
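One way to sketch this is to subclass the existing DDPStrategy so that only the XPU-specific pieces need to be filled in; the setup details and the xpulib call are hypothetical:

```python
from pytorch_lightning.strategies import DDPStrategy

import xpulib  # fictional vendor library


class XPUDDPStrategy(DDPStrategy):
    """DDP-style communication between the fictional XPU devices."""

    strategy_name = "xpu_ddp"

    def setup_environment(self) -> None:
        # Initialize the XPU collective-communication backend
        # before the regular DDP setup runs
        xpulib.init_process_group()
        super().setup_environment()
```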
Putting it all together, we can now use XPU with the Trainer:
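Continuing the sketch above (with a hypothetical LightningModule called model):

```python
from pytorch_lightning import Trainer

trainer = Trainer(
    accelerator=XPUAccelerator(),
    strategy=XPUDDPStrategy(),
    devices=2,
)
trainer.fit(model)
```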
This is just a sneak peek. More features and APIs around accelerators will help members of the Lightning community leverage state-of-the-art AI hardware. You can learn more in our official documentation.
Next Steps
The Lightning Team is more committed than ever to providing the best possible experience to anyone doing optimization with PyTorch, and because the PyTorch Lightning API is already stable, breaking changes will be minimal.
If you’re interested in helping out with these efforts, find us on Slack!
Finally, before we go, let us introduce you to Grid.ai. Built by the creators of PyTorch Lightning, this platform enables you to scale your model training without worrying about infrastructure, similar to how Lightning automates training.
Grid.ai enables you to scale training from your laptop to the cloud without modifying a single line of code. By leveraging Lightning features such as Early Stopping, Integrated Logging, Automatic Checkpointing, and the CLI, it makes the traditional MLOps behind model training invisible.
About the Authors
Adrian Wälchli is a research engineer at Grid.ai and maintainer of PyTorch Lightning, the lightweight wrapper for boilerplate-free PyTorch research. Before that, Adrian was a Ph.D. student at the University of Bern, Switzerland, with an MSc in Computer Science, focusing on Deep Learning for Computer Vision.