PyTorch Lightning 1.6: Support Intel’s Habana Accelerator, New efficient DDP strategy (Bagua), Manual Fault-tolerance, Stability, and Reliability.
PyTorch Lightning 1.6 is the work of 99 contributors who have worked on features, bug fixes, and documentation for a total of over 750 commits since 1.5. This is our most active release yet. Here are some highlights:
Introducing Intel’s Habana Accelerator
Check out our previous post about leveraging Intel’s new hardware with PyTorch Lightning.
The Bagua Strategy
Check out our previous post about this new communication strategy!
Towards stable Accelerator, Strategy, and Plugin APIs
Check out our previous post about this re-engineering of some of PyTorch Lightning’s critical internals:
LightningCLI improvements
In the previous release, we added shorthand notation support for registered components. In this release, we added a flag to automatically register all available components:
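For example, assuming a script that instantiates LightningCLI (MyModel is a placeholder for your own LightningModule), the new flag can be passed like this:

```python
import pytorch_lightning as pl
from pytorch_lightning.utilities.cli import LightningCLI

class MyModel(pl.LightningModule):
    ...  # your own model definition

# auto_registry=True registers all available optimizers, schedulers,
# callbacks, etc., so they can be referenced by name on the command line.
cli = LightningCLI(MyModel, auto_registry=True)
```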
We have also added support for the ReduceLROnPlateau scheduler with shorthand notation:
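A sketch of the shorthand (train.py is a hypothetical LightningCLI script, and val_loss an assumed metric name your model logs; ReduceLROnPlateau requires one to monitor):

```shell
python train.py fit \
    --optimizer=Adam \
    --lr_scheduler=ReduceLROnPlateau \
    --lr_scheduler.monitor=val_loss
```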
If you need to customize the learning rate scheduler configuration, you can do so by overriding:
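A minimal sketch of such an override via the LightningCLI.configure_optimizers staticmethod; the monitored metric name here is an assumption:

```python
from pytorch_lightning.utilities.cli import LightningCLI

class MyLightningCLI(LightningCLI):
    @staticmethod
    def configure_optimizers(lightning_module, optimizer, lr_scheduler=None):
        if lr_scheduler is None:
            return optimizer
        # Attach a monitored metric ("val_loss" is an assumed name) so that
        # schedulers like ReduceLROnPlateau know what value to track.
        return {
            "optimizer": optimizer,
            "lr_scheduler": {"scheduler": lr_scheduler, "monitor": "val_loss"},
        }
```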
Finally, loggers are also now configurable with shorthand:
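For example (hypothetical script name; the logger class and its save_dir argument follow the same dotted shorthand as above):

```shell
python train.py fit \
    --trainer.logger=TensorBoardLogger \
    --trainer.logger.save_dir=logs/
```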
Control SLURM’s re-queueing
We’ve added the ability to turn the automatic resubmission on or off when a job gets interrupted by the SLURM controller (via signal handling). Users who prefer to let their code handle the resubmission (for example, when submitit is used) can now pass:
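For example, disabling automatic resubmission through the SLURMEnvironment plugin:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins.environments import SLURMEnvironment

# Disable Lightning's automatic re-queueing and let your own tooling
# (e.g. submitit) handle resubmission after the job receives SIGTERM.
trainer = Trainer(plugins=[SLURMEnvironment(auto_requeue=False)])
```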
Fault-tolerance improvements
Fault-tolerant training under manual optimization now tracks optimization progress. We also changed the graceful exit signal from SIGUSR1 to SIGTERM for better support inside cloud instances.
An additional feature we're excited to announce is support for consecutive trainer.fit() calls.
Loop customization improvements
The Loop's state is now included as part of the checkpoints saved by the library. This enables finer restoration of custom loops.
We’ve also made it easier to replace Lightning’s loops with your own. For example:
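A minimal sketch, where MyEpochLoop is a placeholder subclass of the built-in training epoch loop:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.loops import TrainingEpochLoop

class MyEpochLoop(TrainingEpochLoop):
    """Custom epoch loop; override advance(), on_run_end(), etc. as needed."""

trainer = Trainer()
# Swap the default epoch loop for the custom one before fitting.
trainer.fit_loop.replace(epoch_loop=MyEpochLoop)
```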
Data-Loading improvements
In previous versions, Lightning required that the DataLoader instance set its input arguments as instance attributes. This meant that custom DataLoaders also had this hidden requirement. In this release, we do this automatically for the user, easing the passing of custom loaders:
class MyDataLoader(torch.utils.data.DataLoader):
    def __init__(self, a=123, *args, **kwargs):
-       # this was required before
-       self.a = a
        super().__init__(*args, **kwargs)

trainer.fit(model, train_dataloader=MyDataLoader())
As of this release, Lightning no longer pre-fetches 1 extra batch if it doesn’t need to. Previously, doing so would conflict with the internal pre-fetching done by optimized data loaders such as FFCV’s. You could now define your own pre-fetching value like this:
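A rough sketch using the DataFetcher utility; the `_data_fetcher` attribute and `prefetch_batches` argument are assumptions here, so check the 1.6 data-loading docs for the exact API:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.utilities.fetching import DataFetcher

trainer = Trainer()
# Ask the fit loop to pre-fetch 2 batches instead of the default.
# Attribute and argument names are assumed; verify against the docs.
trainer.fit_loop._data_fetcher = DataFetcher(prefetch_batches=2)
```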
New Hooks
LightningModule.lr_scheduler_step
Lightning now allows the use of custom learning rate schedulers that aren’t natively available in PyTorch. A great example of this is Timm Schedulers.
When using custom learning rate schedulers relying on an API other than PyTorch's, you can now define LightningModule.lr_scheduler_step with your desired logic.
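For example, a sketch using one of the Timm schedulers (assuming timm is installed; in 1.6 the hook receives the scheduler, the optimizer index, and an optional monitored metric):

```python
import torch
import pytorch_lightning as pl
from timm.scheduler import TanhLRScheduler  # a timm scheduler, not a PyTorch one

class LitModel(pl.LightningModule):
    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        scheduler = TanhLRScheduler(optimizer, t_initial=50)
        return {
            "optimizer": optimizer,
            "lr_scheduler": {"scheduler": scheduler, "interval": "epoch"},
        }

    def lr_scheduler_step(self, scheduler, optimizer_idx, metric):
        # timm schedulers are stepped with the epoch index rather than
        # a bare .step() call, which PyTorch-style stepping assumes.
        scheduler.step(epoch=self.current_epoch)
```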
A new stateful API
This release introduces new hooks to standardize all stateful components to use state_dict and load_state_dict, mimicking the PyTorch API. The new hooks receive their own component's state and replace most usages of the previous on_save_checkpoint and on_load_checkpoint hooks.
class MyCallback(pl.Callback):
-   def on_save_checkpoint(self, trainer, pl_module, checkpoint):
-       return {'x': self.x}
-   def on_load_checkpoint(self, trainer, pl_module, checkpoint):
-       self.x = checkpoint['x']
+   def state_dict(self):
+       return {'x': self.x}
+   def load_state_dict(self, state_dict):
+       self.x = state_dict['x']
New properties
Trainer.estimated_stepping_batches
You can use the built-in Trainer.estimated_stepping_batches property to compute the total number of stepping batches needed for complete training. The property takes the gradient accumulation factor and the distributed setting into consideration so that you don't have to derive the value manually:
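For example, OneCycleLR needs the total number of steps up front, which this property can provide (LitModel is a placeholder module):

```python
import torch
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.parameters(), lr=0.1)
        # The property already accounts for gradient accumulation and the
        # distributed setting, so it can be passed straight to the scheduler.
        scheduler = torch.optim.lr_scheduler.OneCycleLR(
            optimizer,
            max_lr=0.1,
            total_steps=self.trainer.estimated_stepping_batches,
        )
        return {
            "optimizer": optimizer,
            "lr_scheduler": {"scheduler": scheduler, "interval": "step"},
        }
```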
Trainer.num_devices and Trainer.device_ids
In the past, retrieving the number of devices used, or their IDs, posed a considerable challenge: you had to know which property to access based on the current Trainer configuration. To simplify this, we've deprecated the per-accelerator properties in favor of accelerator-agnostic ones. For example:
- num_devices = max(1, trainer.num_gpus, trainer.num_processes)
- if trainer.tpu_cores:
-     num_devices = max(num_devices, trainer.tpu_cores)
+ num_devices = trainer.num_devices
Experimental Features
Manual Fault-tolerance
Fault Tolerance has limitations that require specific information about your data-loading structure.
It is now possible to work around those limitations by enabling manual fault tolerance, where you write your own logic specifying exactly how to checkpoint your datasets and samplers. You can enable it with this environment flag:
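Assuming the flag accepts a "manual" value as described in the fault-tolerance docs (train.py is a hypothetical training script):

```shell
PL_FAULT_TOLERANT_TRAINING=manual python train.py
```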
Check out this video for a dive into the internals of this flag.
Customizing the layer synchronization
We introduced a new plugin class for wrapping a model's layers with multiprocessing synchronization logic.
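A heavily hedged sketch of the idea: the plugin class name (LayerSync) and its apply/revert hooks are assumptions used to illustrate wrapping layers with synchronization logic, so verify them against the 1.6 plugin docs:

```python
import torch
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import LayerSync  # class name assumed

class MyLayerSync(LayerSync):
    def apply(self, model):
        # Wrap normalization layers so their statistics sync across processes.
        return torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

    def revert(self, model):
        # Undo the wrapping when no longer needed (no-op in this sketch).
        return model

trainer = Trainer(plugins=[MyLayerSync()])
```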
Registering Custom Accelerators
There has been much progress in the field of ML Accelerators, and the list of accelerators is constantly expanding.
We've made it easier for users to try out new accelerators by enabling support for registering custom Accelerator classes in Lightning.
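A sketch of what registration could look like; the AcceleratorRegistry name and register signature are assumptions, so verify against the 1.6 accelerator docs:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.accelerators import Accelerator, AcceleratorRegistry  # registry name assumed

class SOTAAccelerator(Accelerator):
    """Hypothetical accelerator backing some new hardware."""
    # implement the Accelerator interface here...

# Register under a string name so it can be selected like built-in accelerators.
AcceleratorRegistry.register("sota", SOTAAccelerator, description="my new hardware")
trainer = Trainer(accelerator="sota")
```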
Backward Incompatible Changes
Here is a selection of notable changes that are not backward compatible with previous versions. The full list of changes and removals can be found in the CHANGELOG below.
Drop PyTorch 1.7 support
In line with our support window of the last four PyTorch releases, this release supports PyTorch 1.8 through 1.11. Support for PyTorch 1.7 has been removed.
Drop Python 3.6 support
Following Python 3.6's end-of-life, support for Python 3.6 has been removed.
AcceleratorConnector rewrite
To support the new accelerator and strategy features, we completely rewrote our internal AcceleratorConnector class. No backward compatibility was maintained, so if your code relied on this class, it has likely broken.
Re-define the current_epoch boundary
To resolve fault-tolerance issues, we changed where the current epoch value gets increased.
trainer.current_epoch is now increased by 1 during on_train_end. This means that if a model is run for 3 epochs (0, 1, 2), trainer.current_epoch will now return 3 instead of 2 after trainer.fit(). This can also impact custom callbacks that access this property inside this hook.
This also impacts checkpoints saved during an epoch (e.g. in on_train_epoch_end). For example, a Trainer(max_epochs=1, limit_train_batches=1) instance that saves a checkpoint will have the current_epoch=0 value saved instead of current_epoch=1.
Re-define the global_step boundary
To resolve fault-tolerance issues, we changed where the global step value gets increased.
Accessing trainer.global_step during intra-training validation will now correctly return the number of optimizer steps taken so far. In pseudocode:
training_step()
+ global_step += 1
validation_if_necessary()
- global_step += 1
Saved checkpoints that use the global step value as part of the filename are now increased by 1 for the same reason. A checkpoint saved after 1 step will now be named step=1.ckpt instead of step=0.ckpt.
The trainer.global_step value will now account for TBPTT or multiple optimizers. Users setting Trainer({min,max}_steps=...) under these circumstances will need to adjust their values.
Removed automatic reduction of outputs in training_step when using DataParallel
When using Trainer(strategy="dp"), all the tensors returned by training_step were previously reduced to a scalar (#11594). This behavior was especially confusing when the outputs needed to be collected in the training_epoch_end hook.
From now on, outputs are no longer reduced except for the loss tensor, unless you implement training_step_end, in which case the loss won't get reduced either.
No longer fallback to CPU with no devices
Previous versions were lenient: if no GPU devices were found, Lightning silently fell back to running on the CPU. This meant that users' code could run much slower without them ever noticing it was running on the CPU.
We suggest passing Trainer(accelerator="auto") when this leniency is desired.
Next steps
PyTorch Lightning is a constantly evolving project. We generally keep a weekly release schedule of bug-fixes. As of the writing of this post, 1.6.2 is already released. Go ahead and give it a try!
Finally, before we go, let us introduce you to Grid.ai. Built by the creators of PyTorch Lightning, this platform enables you to scale your model training without worrying about infrastructure, similar to how Lightning automates training.
Grid.ai enables you to scale training from your laptop to the cloud without modifying a single line of code. Leveraging Lightning features such as Early Stopping, Integrated Logging, Automatic Checkpointing, and CLI enables you to make the traditional MLOps behind model training seem invisible.
About the author
Carlos Mocholí is a research engineer at Grid.ai and tech lead of PyTorch Lightning, the lightweight wrapper for boilerplate-free PyTorch research. Previously, Carlos worked as a Research Engineer on Handwritten Text Recognition. He holds an MSc in AI from the University of Edinburgh.