Announcing Lightning 1.4

The Lightning 1.4 release adds TPU Pods, IPU hardware, DeepSpeed Infinity, Fully Sharded Data-Parallel, and more.

PyTorch Lightning team
PyTorch Lightning Developer Blog
5 min read · Jul 27, 2021


Today we are excited to announce Lightning 1.4, introducing support for TPU Pods, XLA profiling, IPUs, and new plugins to reach 10+ billion parameters, including DeepSpeed Infinity, Fully Sharded Data-Parallel, and more!

TPU Pod Training

Image source: https://cloud.google.com/tpu

With Lightning, you can now scale up training from a single Cloud TPU (8 cores) instance to Cloud TPU Pods of up to 2048 cores without any code changes!

To scale your training to TPU pods, first set the tpu_cores Trainer flag:
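
A minimal sketch, where the LightningModule and DataLoader below are placeholders for your own:

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset
    import pytorch_lightning as pl


    class LitModel(pl.LightningModule):
        """Placeholder model; substitute your own LightningModule."""

        def __init__(self):
            super().__init__()
            self.layer = nn.Linear(32, 2)

        def training_step(self, batch, batch_idx):
            x, y = batch
            return nn.functional.cross_entropy(self.layer(x), y)

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=1e-3)


    # Placeholder data
    train_loader = DataLoader(
        TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,))), batch_size=8
    )

    # 8 cores per TPU VM; the same script scales to a pod with no code changes
    trainer = pl.Trainer(tpu_cores=8)
    trainer.fit(LitModel(), train_loader)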

A single TPU v3-8 board consists of 8 cores, with 16 GiB of HBM for each core. In contrast, a TPU v3 Pod has up to 2048 TPU cores and 32 TiB of memory.

Then launch your script with the official XLA orchestrator, torch_xla.distributed.xla_dist, across all of the pod's VMs for distributed training.

TPU XLA Profiler

The XLA Profiler helps you debug and optimize the performance of Cloud TPU training workloads. It supports manual capture via TensorBoard, which lets you capture traces from a running program. Follow this guide for Cloud TPU setup.

To enable the profiler, pass the profiler="xla" flag to the Trainer.
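
For example, reusing the placeholder model and dataloader from the TPU sketch above:

    import pytorch_lightning as pl

    # "xla" selects the XLA profiler; assumes the script runs on a Cloud TPU VM
    trainer = pl.Trainer(tpu_cores=8, profiler="xla")
    trainer.fit(LitModel(), train_loader)  # placeholders from the earlier sketch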

You can then visualize the XLA profiler results directly from TensorBoard.

IPU Accelerator

IPUs, or Intelligence Processing Units, designed by Graphcore, “let innovators make breakthroughs in machine intelligence”. The latest generation of IPUs packs 59.4 billion transistors and almost 1,500 processing units onto a single die. If you want to give it a try for your research, Graphcore offers researchers and academics the opportunity to access IPUs in the cloud at no cost through their Academic Program.

Image source: https://www.graphcore.ai/products/ipu

Lightning now supports IPUs, built on top of PopTorch, Graphcore’s PyTorch interface. You can train your Lightning models on IPUs by simply specifying the ipus flag with the Lightning Trainer:
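
A minimal sketch, assuming a Graphcore system with the Poplar SDK and PopTorch installed (the model and dataloader are placeholders as before):

    import pytorch_lightning as pl

    # Train on 8 IPUs; requires PopTorch and the Poplar SDK
    trainer = pl.Trainer(ipus=8)
    trainer.fit(LitModel(), train_loader)  # placeholders from the earlier sketch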

Read our documentation for more information.

Fully Sharded Data Parallel [BETA]

https://engineering.fb.com/2021/07/15/open-source/fsdp/

With v1.4, Lightning now supports Fully Sharded Data Parallelism. The Fully Sharded Distributed plugin shards optimizer state, gradients, and parameters across data-parallel workers. This lets you fit much larger models into memory across multiple GPUs, reaching 40+ billion parameters on an A100 (read more in this blog post by Facebook)!
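
A minimal sketch of enabling it, assuming FairScale is installed and that the plugin is selected with the "ddp_fully_sharded" alias (check the docs for your installed version):

    import pytorch_lightning as pl

    # Shard optimizer state, gradients, and parameters across 4 data-parallel GPUs.
    # The "ddp_fully_sharded" plugin alias is an assumption; see the Lightning docs.
    trainer = pl.Trainer(gpus=4, precision=16, plugins="ddp_fully_sharded")
    trainer.fit(LitModel(), train_loader)  # placeholders from the earlier sketch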

Fully Sharded Training removes the need to balance layers across specific devices with some form of pipeline parallelism, and it optimizes distributed communication with minimal effort.

Fully Sharded Training also provides helper wrap and auto_wrap functions to assist in sharding parameters, so you can reach even larger model sizes:
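
A sketch of how the helpers can be used inside the configure_sharded_model hook, assuming FairScale exposes wrap and auto_wrap:

    import torch
    from torch import nn
    import pytorch_lightning as pl
    from fairscale.nn import auto_wrap, wrap  # assumes FairScale is installed


    class ShardedLitModel(pl.LightningModule):
        def configure_sharded_model(self):
            # Layers created here are instantiated after the plugin is set up,
            # so each wrapped block can be sharded individually.
            self.backbone = auto_wrap(
                nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 128))
            )
            self.head = wrap(nn.Linear(128, 2))

        def training_step(self, batch, batch_idx):
            x, y = batch
            logits = self.head(self.backbone(x))
            return nn.functional.cross_entropy(logits, y)

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=1e-3)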

Read our documentation for more information.

DeepSpeed Infinity [BETA]

With the new DeepSpeed Infinity plugin, you can fine-tune models with 1 trillion+ parameters on a single machine with 8 GPUs. It supports offloading to NVMe drives for even larger models, taking advantage of their large memory capacity. The sketch below shows how to enable DeepSpeed Infinity, assuming an NVMe drive is mounted at /local_nvme.
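
A minimal sketch; the DeepSpeedPlugin arguments below are based on the Lightning 1.4 documentation, so double-check them against your installed versions of Lightning and DeepSpeed:

    import pytorch_lightning as pl
    from pytorch_lightning.plugins import DeepSpeedPlugin

    # ZeRO Stage 3 with parameter and optimizer offloading to a local NVMe drive.
    # Assumes DeepSpeed is installed and an NVMe drive is mounted at /local_nvme.
    trainer = pl.Trainer(
        gpus=4,
        precision=16,
        plugins=DeepSpeedPlugin(
            stage=3,
            offload_optimizer=True,
            offload_parameters=True,
            remote_device="nvme",
            offload_params_device="nvme",
            offload_optimizer_device="nvme",
            nvme_path="/local_nvme",
        ),
    )
    trainer.fit(LitModel(), train_loader)  # placeholders from the earlier sketch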


Read our documentation for more information.

Cleaner Lightning Tutorials

To reduce the footprint of the PyTorch Lightning repo and enable better documentation, we've spun off the PyTorch Lightning Tutorials into a new repo.

If you’d like to get involved and contribute Lightning Tutorials to the official docs, check out our best practices guide below:

Other Features

  • Added support for checkpointing based on a provided time interval during training
  • Added support for setting a max_time for training jobs (both features are shown in the sketch after this list)
  • New cluster environments: KubeflowEnvironment, LSFEnvironment
  • Support for manual launching of DDP processes through LOCAL_RANK and NODE_RANK. You can now launch individual processes with a custom launch utility or script.
  • Added LightningCLI support for:
    — argument links applied on instantiation (#7895)
    — configurable callbacks that should always be present (#7964)
    — optimizers and learning rate schedulers (#8093)
    — default seed_everything(workers=True) (#7504)
  • New hooks: on_before_backward and on_before_optimizer_step (#8328); see the sketch after this list
    — the on_after_backward hook is now called on accumulating iterations; use the on_before_optimizer_step hook to mimic the old behaviour
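
A short sketch of the time-based checkpointing, max_time, and the new hooks; the train_time_interval argument, the "DD:HH:MM:SS" max_time format, and the hook signatures are assumptions based on the docs:

    from datetime import timedelta

    import pytorch_lightning as pl
    from pytorch_lightning.callbacks import ModelCheckpoint


    class HookedModel(pl.LightningModule):
        # New hooks in this release (signatures assumed from the docs)
        def on_before_backward(self, loss):
            ...  # inspect or log the loss right before backward()

        def on_before_optimizer_step(self, optimizer, optimizer_idx):
            ...  # runs after gradient accumulation, right before the optimizer step

        # training_step / configure_optimizers would go here as usual


    # Save a checkpoint every 30 minutes of training time
    time_based_checkpoint = ModelCheckpoint(train_time_interval=timedelta(minutes=30))

    trainer = pl.Trainer(
        callbacks=[time_based_checkpoint],
        max_time="00:12:00:00",  # stop training after 12 hours (DD:HH:MM:SS)
    )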

API Changes

  • Changed the Trainer’s checkpoint_callback argument to allow only boolean values (#7539)
  • Validation is now always run inside the training epoch scope (#7357)
  • ModelCheckpoint now runs at the end of the training epoch by default (#8389)
  • EarlyStopping now runs at the end of the training epoch by default (#8286)
  • LightningCLI now aborts with a clearer message if the config already exists, and disables saving the config during fast_dev_run (#7963)
  • Dropped official support/testing for PyTorch <1.6 (#8288)

See the full changelog here:

Advanced Loops

Trying to implement your own loops with Lightning? Please reach out to us with feature requests and API suggestions for how you are looking to customize your training loops!

Next Steps

As always, we would like to thank the Lightning community of 500+ contributors for their efforts in building and shaping Lightning. Interested in becoming a contributor? Then check out the new OSS contribution quick guide.

Then consider giving one of the following issues a try:

Keep an eye out for more exciting updates soon by following us on LinkedIn, Twitter, YouTube, and Slack.

We hope you enjoy this release; we look forward to hearing how Lightning can take your deep learning workflow to the next level.


We are the core contributors team developing PyTorch Lightning — the deep learning research framework to run complex models without the boilerplate.