PyTorch Lightning 1.6: Support Intel’s Habana Accelerator, New efficient DDP strategy (Bagua), Manual Fault-tolerance, Stability, and Reliability.

Carlos Mocholí
PyTorch Lightning Developer Blog

--

PyTorch Lightning 1.6 is the work of 99 contributors who have worked on features, bug fixes, and documentation for a total of over 750 commits since 1.5. This is our most active release yet. Here are some highlights:

Introducing Intel’s Habana Accelerator

Check out our previous post about leveraging Intel’s new hardware with PyTorch Lightning.

The Bagua Strategy

Check out our previous post about this new communication strategy!

Towards stable Accelerator, Strategy, and Plugin APIs

Check out our previous post about this re-engineering of some of PyTorch Lightning’s critical internals.

LightningCLI improvements

In the previous release, we added shorthand notation support for registered components. In this release, we added a flag to automatically register all available components:
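
Roughly, it looks like this (a sketch: MyModel and MyDataModule are placeholders, and the auto_registry flag name is worth double-checking against the 1.6 LightningCLI docs):

from pytorch_lightning.utilities.cli import LightningCLI

# Register all available callbacks, optimizers, LR schedulers, etc. automatically,
# so they can be referenced with shorthand notation from the command line.
cli = LightningCLI(MyModel, MyDataModule, auto_registry=True)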

We have also added support for the ReduceLROnPlateau scheduler with shorthand notation:
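
For example, with a script built around LightningCLI (train.py and MyModel are placeholders, and the exact flag spelling should be verified against the LightningCLI docs):

# train.py
from pytorch_lightning.utilities.cli import LightningCLI

cli = LightningCLI(MyModel)

# The scheduler and its monitored metric can then be set from the command line:
#   python train.py fit --lr_scheduler=ReduceLROnPlateau --lr_scheduler.monitor=val_loss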

If you need to customize the learning rate scheduler configuration, you can do so by overriding:
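
One way to do this is to override the configure_optimizers static method in a LightningCLI subclass (a sketch: the method name and signature are assumptions to verify against the 1.6 LightningCLI docs):

from pytorch_lightning.utilities.cli import LightningCLI

class MyLightningCLI(LightningCLI):
    @staticmethod
    def configure_optimizers(lightning_module, optimizer, lr_scheduler=None):
        if lr_scheduler is None:
            return optimizer
        # Customize how the optimizer and scheduler are wired into the model,
        # e.g. to set the metric monitored by ReduceLROnPlateau.
        return {
            "optimizer": optimizer,
            "lr_scheduler": {"scheduler": lr_scheduler, "monitor": "val_loss"},
        }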

Finally, loggers are also now configurable with shorthand:
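
Again a sketch (flag spelling assumed; verify the shorthand syntax against the LightningCLI docs):

# train.py, as above
from pytorch_lightning.utilities.cli import LightningCLI

cli = LightningCLI(MyModel)

# Select and configure a logger entirely from the command line:
#   python train.py fit --trainer.logger=WandbLogger --trainer.logger.project=my_project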

Control SLURM’s re-queueing

We’ve added the ability to turn the automatic resubmission on or off when a job gets interrupted by the SLURM controller (via signal handling). Users who prefer to let their code handle the resubmission (for example, when submitit is used) can now pass:
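
A minimal sketch (the auto_requeue argument name should be confirmed against the 1.6 SLURMEnvironment docs):

from pytorch_lightning import Trainer
from pytorch_lightning.plugins.environments import SLURMEnvironment

# Let your own tooling (e.g. submitit) handle the resubmission instead of Lightning.
trainer = Trainer(plugins=[SLURMEnvironment(auto_requeue=False)])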

Fault-tolerance improvements

Fault-tolerant training under manual optimization now tracks optimization progress. We also changed the graceful exit signal from SIGUSR1 to SIGTERM for better support inside cloud instances.

An additional feature we're excited to announce is support for consecutive trainer.fit() calls.
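
For instance (a sketch; MyModel stands in for your LightningModule):

from pytorch_lightning import Trainer

model = MyModel()
trainer = Trainer(max_epochs=2)

trainer.fit(model)
# A consecutive call on the same Trainer is now supported under fault-tolerant training.
trainer.fit(model)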

Loop customization improvements

The Loop's state is now included as part of the checkpoints saved by the library. This enables finer restoration of custom loops.

We’ve also made it easier to replace Lightning’s loops with your own. For example:
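
Something along these lines (a sketch: the replace helper and the epoch_loop keyword should be checked against the 1.6 loop customization docs; MyEpochLoop and model are placeholders):

from pytorch_lightning import Trainer
from pytorch_lightning.loops import TrainingEpochLoop

class MyEpochLoop(TrainingEpochLoop):
    ...  # your customized epoch loop

trainer = Trainer()
# Swap Lightning's default epoch loop for the custom one before training.
trainer.fit_loop.replace(epoch_loop=MyEpochLoop)
trainer.fit(model)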

Data-Loading improvements

In previous versions, Lightning required that the DataLoader instance set its input arguments as instance attributes. This meant that custom DataLoaders also had this hidden requirement. In this release, we do this automatically for the user, easing the passing of custom loaders:

class MyDataLoader(torch.utils.data.DataLoader):
    def __init__(self, a=123, *args, **kwargs):
-       # this was required before
-       self.a = a
        super().__init__(*args, **kwargs)

trainer.fit(model, train_dataloaders=MyDataLoader())

As of this release, Lightning no longer pre-fetches 1 extra batch if it doesn’t need to. Previously, doing so would conflict with the internal pre-fetching done by optimized data loaders such as FFCV’s. You can now define your own pre-fetching value like this:
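
As a rough sketch only, assuming the DataFetcher utility and its prefetch_batches argument; how a custom fetcher is attached to the Trainer is not shown here, so consult the 1.6 data-loading docs for the supported hook:

from pytorch_lightning.utilities.fetching import DataFetcher

# Fetch 4 batches ahead of time instead of the default.
fetcher = DataFetcher(prefetch_batches=4)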

New Hooks

LightningModule.lr_scheduler_step

Lightning now allows the use of custom learning rate schedulers that aren’t natively available in PyTorch. A great example of this is Timm Schedulers.

When using custom learning rate schedulers relying on an API other than PyTorch’s, you can now define the LightningModule.lr_scheduler_step with your desired logic.
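
A sketch of the pattern (assuming timm is installed; MyModel is a placeholder):

import torch
import pytorch_lightning as pl
from timm.scheduler import TanhLRScheduler

class MyModel(pl.LightningModule):
    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=0.01)
        scheduler = TanhLRScheduler(optimizer, t_initial=50)
        return [optimizer], [{"scheduler": scheduler, "interval": "epoch"}]

    def lr_scheduler_step(self, scheduler, optimizer_idx, metric):
        # timm schedulers are stepped with the epoch value rather than a bare .step().
        scheduler.step(epoch=self.current_epoch)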

A new stateful API

This release introduces new hooks to standardize all stateful components to use state_dict and load_state_dict, mimicking the PyTorch API. The new hooks receive their own component's state and replace most usages of the previous on_save_checkpoint and on_load_checkpoint hooks.

class MyCallback(pl.Callback):
-    def on_save_checkpoint(self, trainer, pl_module, checkpoint):
-        return {'x': self.x}
-
-    def on_load_checkpoint(self, trainer, pl_module, checkpoint):
-        self.x = checkpoint['x']
+    def state_dict(self):
+        return {'x': self.x}
+
+    def load_state_dict(self, state_dict):
+        self.x = state_dict['x']

New properties

Trainer.estimated_stepping_batches

You can use the built-in Trainer.estimated_stepping_batches property to compute the total number of stepping batches needed for the complete training run.

The property takes the gradient accumulation factor and the distributed setting into consideration so that you don’t have to derive it manually:
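
For example, it can be used to configure a scheduler that needs the total number of steps up front (a sketch; MyModel is a placeholder):

import torch
import pytorch_lightning as pl

class MyModel(pl.LightningModule):
    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(self.parameters(), lr=1e-3)
        scheduler = torch.optim.lr_scheduler.OneCycleLR(
            optimizer,
            max_lr=1e-2,
            # Gradient accumulation and the distributed setup are already accounted for.
            total_steps=self.trainer.estimated_stepping_batches,
        )
        return [optimizer], [{"scheduler": scheduler, "interval": "step"}]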

Trainer.num_devices and Trainer.device_ids

In the past, retrieving the number of devices used or their IDs posed a considerable challenge. Additionally, doing so required knowing which property to access based on the current Trainer configuration.

To simplify this process, we’ve deprecated the per-accelerator properties in favor of accelerator-agnostic ones. For example:

- num_devices = max(1, trainer.num_gpus, trainer.num_processes)
- if trainer.tpu_cores:
-     num_devices = max(num_devices, trainer.tpu_cores)
+ num_devices = trainer.num_devices

Experimental Features

Manual Fault-tolerance

Automatic fault tolerance has limitations, as it requires specific knowledge about your data-loading structure.

It is now possible to resolve those limitations by enabling manual fault tolerance, where you can write your logic and specify how exactly to checkpoint your datasets and samplers. You can do so using this environment flag:
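
A minimal sketch (the flag name and accepted value are worth confirming against the fault-tolerance docs for your version):

import os

# Must be set before the Trainer is created.
os.environ["PL_FAULT_TOLERANT_TRAINING"] = "manual"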

Check out this video for a dive into the internals of this flag.

Customizing the layer synchronization

We introduced a new plugin class for wrapping a model's layers with multiprocessing synchronization logic.
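
A sketch of what this could look like, assuming the LayerSync base class with apply/revert methods and that custom instances are passed through plugins (verify both against the 1.6 docs):

from pytorch_lightning import Trainer
from pytorch_lightning.plugins import LayerSync

class MyLayerSync(LayerSync):
    def apply(self, model):
        ...  # wrap the layers that should be synchronized across processes
        return model

    def revert(self, model):
        ...  # undo the wrapping
        return model

trainer = Trainer(strategy="ddp", devices=2, plugins=[MyLayerSync()])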

Registering Custom Accelerators

There has been much progress in the field of ML Accelerators, and the list of accelerators is constantly expanding.

We’ve made it easier for users to try out new accelerators by enabling support for registering custom Accelerator classes in Lightning.
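
Roughly (a sketch: SuperFastAccelerator is hypothetical, the abstract methods a custom accelerator must implement are omitted, and passing instances directly to the Trainer is an assumption):

from pytorch_lightning import Trainer
from pytorch_lightning.accelerators import Accelerator

class SuperFastAccelerator(Accelerator):
    ...  # implement the abstract methods required by the Accelerator base class

trainer = Trainer(accelerator=SuperFastAccelerator(), devices=1)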

Backward Incompatible Changes

Here is a selection of notable changes that are not backward compatible with previous versions. The full list of changes and removals can be found in the CHANGELOG below.

Drop PyTorch 1.7 support

In line with our policy of supporting the 4 most recent PyTorch releases, this release supports PyTorch 1.8 to 1.11. Support for PyTorch 1.7 has been removed.

Drop Python 3.6 support

Following Python 3.6’s end-of-life, support for Python 3.6 has been removed.

AcceleratorConnector rewrite

To support the new accelerator and strategy features, we completely rewrote our internal AcceleratorConnector class. No backward compatibility was maintained, so this change is likely to have broken your code if it relied on this class.

Re-define the current_epoch boundary

To resolve fault-tolerance issues, we changed where the current epoch value gets increased.

trainer.current_epoch is now increased by 1 in on_train_end. This means that if a model is run for 3 epochs (0, 1, 2), trainer.current_epoch will now return 3 instead of 2 after trainer.fit(). This can also impact custom callbacks that access this property inside this hook.

This also impacts checkpoints saved during an epoch (e.g. on_train_epoch_end). For example, a Trainer(max_epochs=1, limit_train_batches=1) instance that saves a checkpoint will have the current_epoch=0 value saved instead of current_epoch=1.

Re-define the global_step boundary

To resolve fault-tolerance issues, we changed where the global step value gets increased.

Accessing trainer.global_step from a hook during an intra-training validation run will now correctly return the number of optimizer steps already taken. In pseudocode:

training_step()
+ global_step += 1
validation_if_necessary()
- global_step += 1

Saved checkpoints that use the global step value as part of the filename are now increased by 1 for the same reason. A checkpoint saved after 1 step will now be named step=1.ckpt instead of step=0.ckpt.

The trainer.global_step value will now account for TBPTT or multiple optimizers. Users setting Trainer({min,max}_steps=...) under these circumstances will need to adjust their values.

Removed automatic reduction of outputs in training_step when using DataParallel

When using Trainer(strategy="dp"), all the tensors returned by training_step were previously reduced to a scalar (#11594). This behavior was especially confusing when outputs needed to be collected into the training_epoch_end hook.

From now on, outputs are no longer reduced except for the loss tensor, unless you implement training_step_end, in which case the loss won't get reduced either.
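
For example, under strategy="dp" (a sketch; _shared_step is a hypothetical helper):

import pytorch_lightning as pl

class MyModel(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        loss, preds = self._shared_step(batch)  # one value per device under "dp"
        return {"loss": loss, "preds": preds}

    def training_step_end(self, step_output):
        # With training_step_end implemented, Lightning no longer reduces the loss,
        # so reduce it explicitly here.
        step_output["loss"] = step_output["loss"].mean()
        return step_output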

No longer fall back to CPU when no devices are available

Previous versions were lenient: when no GPU devices were found, training silently fell back to the CPU. This meant that users’ code could run much slower without them ever noticing that it was running on CPU.

We suggest passing Trainer(accelerator="auto") when this leniency is desired.

Next steps

PyTorch Lightning is a constantly evolving project. We generally keep a weekly release schedule of bug-fixes. As of the writing of this post, 1.6.2 is already released. Go ahead and give it a try!

Finally, before we go, let us introduce you to Grid.ai. Built by the creators of PyTorch Lightning, this platform enables you to scale your model training without worrying about infrastructure, similar to how Lightning automates training.

Grid.ai enables you to scale training from your laptop to the cloud without modifying a single line of code. Leveraging Lightning features such as Early Stopping, Integrated Logging, Automatic Checkpointing, and CLI enables you to make the traditional MLOps behind model training seem invisible.

About the author

Carlos Mocholí is a research engineer at Grid.ai and tech lead of PyTorch Lightning, the lightweight wrapper for boilerplate-free PyTorch research. Previously, Carlos worked as a Research Engineer on Handwritten Text Recognition. He holds an MSc in AI from the University of Edinburgh.
