Supercharge your training with zero code changes using Intel’s Habana Accelerator ⚡️

Kaushik Bokka
PyTorch Lightning Developer Blog


We recently added support for Habana’s Gaudi AI Processors, which can be used to accelerate deep learning training workloads.

We also covered the benefits of using Habana and how to leverage the Habana Accelerator with PyTorch Lightning in a recent livestream.

What is the Habana Gaudi AI Processor? 🤔

Habana Gaudi was designed from the ground up to maximize training throughput and efficiency. 🔥

The processors are built on a heterogeneous architecture that combines a cluster of fully programmable Tensor Processing Cores (TPC) with a configurable Matrix Math engine, along with the associated development tools and libraries.

The TPC core is a VLIW SIMD processor with an instruction set and hardware tailored to efficiently serve training workloads. The Gaudi memory architecture includes on-die SRAM and local memories in each TPC, and Gaudi is the first DL training processor that has integrated RDMA over Converged Ethernet (RoCE v2) engines on-chip.

On the software side, the PyTorch-Habana bridge interfaces between the framework and the SynapseAI software stack to enable the execution of deep learning models on the Habana Gaudi device.

Habana Gaudi’s compute and scaling efficiency brings new levels of price-performance for deep learning training — so you get to do more deep learning training while spending less.

What are Accelerators in PyTorch Lightning? 🚀

‘Accelerator’ refers to the hardware being used by PyTorch Lightning for training and inference applications. At the moment, Lightning supports several accelerators: CPUs, GPUs, TPUs, IPUs & HPUs.

As an ML practitioner, you'd rather focus on research than on engineering logic around hardware.

PyTorch Lightning abstracts the accelerator logic away from you, letting you write accelerator-agnostic code and keep your focus on research.
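As a minimal sketch (assuming the corresponding hardware and drivers are available), switching hardware is just a change of Trainer flags; the LightningModule itself stays the same:

```python
from pytorch_lightning import Trainer

# The same LightningModule runs unchanged on any of these targets;
# only the Trainer flags change (device counts are illustrative).
trainer = Trainer(accelerator="cpu")
trainer = Trainer(accelerator="gpu", devices=2)
trainer = Trainer(accelerator="tpu", devices=8)
trainer = Trainer(accelerator="hpu", devices=8)
```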

We aim to integrate the best-performing accelerators in the ML space for Lightning users to leverage, and to remain a leading framework in the PyTorch ecosystem with an extensive list of supported accelerators. 💪

How to get access to Habana Gaudi AI Processors?

You can get access to the processors through either Gaudi-powered AWS EC2 DL1 instances or a Supermicro X12 Gaudi server.

Check out the Get Started Guide with AWS and Habana.

Benchmarks 📊

The following charts compare Gaudi with the popular A100 and V100 GPUs.

Model Training Performance

Ref: https://developer.habana.ai/resources/habana-training-models/

Model Training Cost Savings 💰

ResNet50 Model: $/image (lower is better)

Ref: https://developer.habana.ai/resources/habana-training-models/

Training with Habana Gaudi AI Processor (HPU)

Let’s get to the good stuff: how to start training on Gaudi devices. In PyTorch Lightning, we use the alias HPU for the Habana Gaudi AI Processor.

To enable PyTorch Lightning to utilize the HPU accelerator, simply provide the accelerator="hpu" parameter to the Trainer class.
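In code, that looks roughly like this (a minimal sketch):

```python
from pytorch_lightning import Trainer

# Tell Lightning to run on the Habana Gaudi accelerator (HPU).
trainer = Trainer(accelerator="hpu")
```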

Selecting the HPU accelerator via Lightning Trainer

Passing devices=1 and accelerator="hpu" to the Trainer class enables the Habana accelerator for single Gaudi training.
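A minimal sketch, where MyLightningModule stands in for your own model:

```python
from pytorch_lightning import Trainer

model = MyLightningModule()  # placeholder for your own LightningModule

# Train on a single Gaudi device.
trainer = Trainer(accelerator="hpu", devices=1)
trainer.fit(model)
```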

Training on a single Gaudi device

Passing devices=8 and accelerator="hpu" to the Trainer class enables the Habana accelerator for distributed training with 8 Gaudis. Internally, it uses HPUParallelStrategy, which is based on the DDP strategy and adds Habana’s collective communications library (HCCL) to support scale-up within a node and scale-out across multiple nodes.
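A sketch of the same call scaled out to a full node (again, MyLightningModule is a placeholder):

```python
from pytorch_lightning import Trainer

model = MyLightningModule()  # placeholder for your own LightningModule

# Distributed training across 8 Gaudi devices;
# Lightning uses HPUParallelStrategy (DDP + HCCL) under the hood.
trainer = Trainer(accelerator="hpu", devices=8)
trainer.fit(model)
```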

Distributed Training on 8 Gaudi devices

Mixed Precision Plugin

PyTorch Lightning also allows mixed precision training with HPUs. By default, HPU training will use 32-bit precision. To enable mixed-precision, set the precision flag:
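For example (a minimal sketch):

```python
from pytorch_lightning import Trainer

# The default on HPU is 32-bit precision; precision=16 switches to
# mixed precision (bf16 via Habana Mixed Precision).
trainer = Trainer(accelerator="hpu", devices=1, precision=16)
```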

Selecting the precision type for HPU training

Internally, we use the Habana Mixed Precision (HMP) package in HPUPrecisionPlugin to enable mixed-precision training. Check out this section of the docs to understand how to customize mixed precision for training on HPUs.
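As a rough sketch of what that customization can look like: the keyword arguments and op-list file names below are assumptions about the HPUPrecisionPlugin interface and may differ between Lightning versions, so treat them as illustrative rather than definitive.

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import HPUPrecisionPlugin

# Assumed/illustrative arguments: the op-list files name which ops run
# in bf16 vs fp32; check your Lightning version for the exact signature.
plugin = HPUPrecisionPlugin(
    precision=16,
    opt_level="O1",
    bf16_file_path="ops_bf16.txt",  # placeholder file name
    fp32_file_path="ops_fp32.txt",  # placeholder file name
)
trainer = Trainer(accelerator="hpu", devices=1, plugins=[plugin])
```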

Next steps

Finally, before we go, let us introduce you to Grid.ai. Built by the creators of PyTorch Lightning, this platform enables you to scale your model training without worrying about infrastructure, similar to how Lightning automates training.

Grid.ai lets you scale training from your laptop to the cloud without modifying a single line of code. By leveraging Lightning features such as Early Stopping, Integrated Logging, Automatic Checkpointing, and the CLI, it makes the traditional MLOps behind model training seem invisible.

About the author

Kaushik Bokka is a Senior Research Engineer at Grid.ai and one of the core maintainers of the PyTorch Lightning framework. He has experience building production-scale Machine Learning and Computer Vision systems for several products ranging from Video Analytics to Fashion AI workflows.
