PyTorch is an open source software library for high performance tensor computation (like NumPy) with strong GPU acceleration. It also provides high-level building blocks for deep neural networks, built on a tape-based autograd system.

This document gives a quick introduction on how to get a first test program in PyTorch running on Piz Daint.

Setup

To use PyTorch on Piz Daint you have to load the corresponding module:

module load daint-gpu
module load PyTorch

or

module load daint-gpu
module load PyTorch/<module-version>

in order to specify a particular version. The command module avail PyTorch gives an overview of the available versions. For production runs, we recommend either pinning the version or printing it in the job's logs, since different releases of PyTorch might be incompatible. This can prove useful when revisiting old calculations.
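
One simple way to record the version is to print it at the start of your training script, for example:

import torch

# Record the exact PyTorch, CUDA and cuDNN versions in the job's log,
# so results can later be matched to the release that produced them.
print(f"PyTorch {torch.__version__}, CUDA {torch.version.cuda}, "
      f"cuDNN {torch.backends.cudnn.version()}")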

Testing PyTorch

Simple Import Test

On the Daint login node, try to import the torch Python module directly:

python -c 'import torch; print(torch.__version__)'

If you don't get an error, you should be able to use PyTorch in your code.
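
For a slightly stronger sanity check, you can also run a small tensor computation on the login node (CPU only):

python -c 'import torch; x = torch.rand(2, 3); print(torch.mm(x, x.t()))'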

Simple GPU Test

Similarly, we can verify that PyTorch finds the GPU on a compute node:

srun -N 1 -C gpu --account=<project> python -c 'import torch; print(torch.cuda.get_device_name())'
# Tesla P100-PCIE-16GB

Please replace the string <project> with the ID of the active project that will be charged for the allocation.
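
Beyond querying the device name, a short script can confirm that a computation actually runs on the GPU. Below is a minimal sketch; the file name gpu_check.py is only an example, and the script is launched on a compute node with srun as above:

# gpu_check.py: minimal GPU smoke test
import torch

assert torch.cuda.is_available(), "no CUDA device found"
device = torch.device("cuda")

# Allocate a tensor on the GPU and run a matrix multiplication on it
x = torch.rand(1000, 1000, device=device)
y = x @ x
print(torch.cuda.get_device_name(), float(y.sum()))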

Testing MNIST demo model

A more elaborate test is to actually train a model on the GPU. For that we can use the MNIST example from the PyTorch examples repository.

cd $SCRATCH
git clone https://github.com/pytorch/examples
cd examples/mnist

salloc -N 1 -C gpu --account=<project>
time srun python -u main.py --batch-size 2000 --epochs 2

At the end of the output, you should see something like:

Test set: Average loss: 0.0614, Accuracy: 9801/10000 (98%)
real	0m37.380s

Running on Piz Daint

The following script shows how to submit a PyTorch job to the queuing system. It asks for one node and one task, making 24 CPUs available to that task. Further, the job is constrained to the GPU nodes of Piz Daint and its running time is limited to 5 minutes.

#!/bin/bash -l
#SBATCH --job-name=test_pt
#SBATCH --time=00:05:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=24
#SBATCH --constraint=gpu
#SBATCH --account=<project>

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

module load daint-gpu PyTorch

srun python main.py

Running distributed PyTorch with Horovod

Horovod is a distributed training framework initially developed by Uber to make distributed deep learning fast and easy to use. It can be used to perform distributed training with PyTorch, TensorFlow and other frameworks. On Piz Daint, Horovod is already integrated in the PyTorch module.

Horovod was built on Piz Daint to use NCCL-2 to synchronize the model replicas running on each node over the Cray network. NCCL operations can be configured with a set of environment variables; some of them are shown in the next section.

Example of a distributed job

As an example, let's train a small convolutional neural network (CNN) model from the Horovod repository, again on the simple MNIST dataset. The required scripts can be obtained with:

cd $SCRATCH
git clone https://github.com/horovod/horovod -b v0.21.1  # specify your <horovod-version>
cd horovod/examples/pytorch/

The following Slurm submission script can be used to run a distributed training job:

#!/bin/bash -l

#SBATCH --job-name=test_pt_hvd
#SBATCH --time=00:10:00
#SBATCH --nodes=<num-nodes>
#SBATCH --ntasks-per-node=1
#SBATCH --constraint=gpu
#SBATCH --account=<project>
#SBATCH --output=test_pt_hvd_%j.out

module load daint-gpu PyTorch

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Environment variables needed by the NCCL backend
export NCCL_DEBUG=INFO
export NCCL_IB_HCA=ipogif0
export NCCL_IB_CUDA_SUPPORT=1

srun python -u pytorch_mnist.py \
    --batch-size 1000 \
    --lr 0.1 --momentum 0.9 \
    --epochs 2 \
    --fp16-allreduce

Distributing your own model

Horovod greatly simplifies distributing the training across several GPU nodes; however, it still requires some changes to your code. The patch below shows the required code additions for a typical supervised learning case:

diff --git b/examples/pytorch_mnist.py a/examples/pytorch_mnist.py
index a59112a..1a342c4 100644
--- b/examples/pytorch_mnist.py
+++ a/examples/pytorch_mnist.py
@@ -6,6 +6,7 @@ import torch.nn.functional as F
 import torch.optim as optim
 from torchvision import datasets, transforms
 import torch.utils.data.distributed
+import horovod.torch as hvd

 # Training settings
 parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
@@ -72,6 +73,8 @@ def train(epoch):

 def metric_average(val, name):
     tensor = torch.tensor(val)
+    # Horovod: average test metrics across nodes
+    tensor = hvd.allreduce(tensor, name=name)
     return tensor.item()

@@ -108,9 +110,13 @@ if __name__ == '__main__':
     args = parser.parse_args()
     args.cuda = not args.no_cuda and torch.cuda.is_available()

+    # Horovod: initialize library.
+    hvd.init()
     torch.manual_seed(args.seed)

@@ -131,8 +137,14 @@ if __name__ == '__main__':
             transforms.ToTensor(),
             transforms.Normalize((0.1307,), (0.3081,))
         ]))
+    # Horovod: use DistributedSampler to partition the training data.
+    train_sampler = torch.utils.data.distributed.DistributedSampler(
+        train_dataset, num_replicas=hvd.size(), rank=hvd.rank())
     train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=args.batch_size,
-                                              **kwargs)
+                                              sampler=train_sampler, **kwargs)

+    # Horovod: use DistributedSampler to partition the test data.
+    test_sampler = torch.utils.data.distributed.DistributedSampler(
+        test_dataset, num_replicas=hvd.size(), rank=hvd.rank())
     test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=args.test_batch_size,
-                                              **kwargs)
+                                              sampler=test_sampler, **kwargs)

     model = Net()

@@ -162,6 +169,19 @@ if __name__ == '__main__':
     optimizer = optim.SGD(model.parameters(), lr=args.lr * lr_scaler,
                           momentum=args.momentum)

+    # Horovod: broadcast parameters & optimizer state.
+    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
+    hvd.broadcast_optimizer_state(optimizer, root_rank=0)
+
+    # Horovod: (optional) compression algorithm.
+    compression = hvd.Compression.fp16 if args.fp16_allreduce else hvd.Compression.none
+
+    # Horovod: wrap optimizer with DistributedOptimizer.
+    optimizer = hvd.DistributedOptimizer(optimizer,
+                                         named_parameters=model.named_parameters(),
+                                         compression=compression,
+                                         op=hvd.Adasum if args.use_adasum else hvd.Average)
+
     for epoch in range(1, args.epochs + 1):
         train(epoch)
         test()
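
For reference, the same pattern is condensed below into a minimal, self-contained sketch. The linear model and the random dataset are placeholders for illustration only; they are not part of the MNIST example:

import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data.distributed
import horovod.torch as hvd

# Horovod: initialize the library and bind each rank to its local GPU
hvd.init()
torch.cuda.set_device(hvd.local_rank())

# Placeholder dataset and model (replace with your own)
dataset = torch.utils.data.TensorDataset(
    torch.randn(1024, 32), torch.randint(0, 10, (1024,)))
sampler = torch.utils.data.distributed.DistributedSampler(
    dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = torch.utils.data.DataLoader(dataset, batch_size=64, sampler=sampler)

model = nn.Linear(32, 10).cuda()

# Horovod: scale the learning rate by the number of workers
optimizer = optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Horovod: broadcast initial parameters and optimizer state from rank 0
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# Horovod: wrap the optimizer so gradients are averaged across ranks
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

loss_fn = nn.CrossEntropyLoss()
for epoch in range(2):
    sampler.set_epoch(epoch)  # reshuffle the partitions every epoch
    for data, target in loader:
        data, target = data.cuda(), target.cuda()
        optimizer.zero_grad()
        loss = loss_fn(model(data), target)
        loss.backward()
        optimizer.step()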

Horovod with Gloo

Depending on your dataset, you may need to set num_workers > 0 on torch.utils.data.DataLoader to avoid GPU starvation, as in the fragment below. The extra worker processes can cause problems with the MPI and InfiniBand implementation on some clusters, which can be avoided by using Gloo instead of the default MPI controller in Horovod.
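
For example, only the DataLoader construction in the MNIST script above needs to change; the value num_workers=4 is an arbitrary starting point, not a recommendation:

# Extra worker processes prefetch batches in the background and help
# keep the GPU busy; train_dataset and train_sampler come from the
# MNIST example above.
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=args.batch_size, sampler=train_sampler,
    num_workers=4, pin_memory=True)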

Switching to Gloo requires no code changes, but you have to call your training script with horovodrun instead of srun, as shown below:

#!/bin/bash -l

#SBATCH --job-name=test_pt_hvd
#SBATCH --time=00:10:00
#SBATCH --nodes=<num-nodes>
#SBATCH --ntasks-per-node=1
#SBATCH --constraint=gpu
#SBATCH --account=<project>
#SBATCH --output=test_pt_hvd_%j.out

module load daint-gpu PyTorch

# Avoid warnings on forks
export PMI_NO_PREINITIALIZE=1
unset CSCS_CUSTOM_ENV PELOCAL_PRGENV PROFILEREAD RCLOCAL_PRGENV RCLOCAL_BASEOPTS

# List nodes for Gloo
for node in $(scontrol show hostnames); do
   HOSTS="$HOSTS$node:$SLURM_NTASKS_PER_NODE,"
done
HOSTS=${HOSTS%?}  # trim trailing comma

horovodrun -np $SLURM_NTASKS -H $HOSTS --gloo --network-interface ipogif0 \
 python -u pytorch_mnist.py \
    --batch-size 1000 \
    --lr 0.1 --momentum 0.9 \
    --epochs 2 \
    --fp16-allreduce

ImageNet 1k ILSVRC2012

The ImageNet dataset with 1000 classes is available on Piz Daint under /scratch/snx3000/datasets/imagenet/ILSVRC2012_1k.

The 1,281,166 training images as well as the 50,000 validation images are repacked as TFRecord files, which greatly speeds up data loading during training. We do not recommend using the original large number of small files directly, as doing so generates a huge number of I/O operations on the distributed file system.

The example below shows how to use the NVIDIA DALI data loader to read the ImageNet TFRecords.

First, install the DALI pip package in a virtual environment and download the example script:


# install dependencies
module load daint-gpu PyTorch
python -m venv ~/dali_env
source ~/dali_env/bin/activate
pip install tensorboard tqdm
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist --upgrade nvidia-dali-cuda110

# get example
wget https://raw.githubusercontent.com/henrique/horovod/ptdali/examples/pytorch/pytorch_imagenet_resnet50_dali.py

Then run the sbatch script:

#!/bin/bash -l

#SBATCH --job-name=test_pt_hvd
#SBATCH --time=00:10:00
#SBATCH --nodes=<num-nodes>
#SBATCH --ntasks-per-node=1
#SBATCH --constraint=gpu
#SBATCH --account=<project>
#SBATCH --output=test_pt_hvd_%j.out

module load daint-gpu PyTorch
source ~/dali_env/bin/activate

srun python -u pytorch_imagenet_resnet50_dali.py \
    --train-recs '/scratch/snx3000/datasets/imagenet/ILSVRC2012_1k/train/*' --val-recs '/scratch/snx3000/datasets/imagenet/ILSVRC2012_1k/validation/*' \
    --train-idx '/scratch/snx3000/datasets/imagenet/ILSVRC2012_1k/idx_files/train/*' --val-idx '/scratch/snx3000/datasets/imagenet/ILSVRC2012_1k/idx_files/validation/*' \
    --log-dir $SCRATCH/runs/pytorch_imagenet_resnet50/ --batch-size 128 --epochs 60
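
For reference, the core of a DALI TFRecord input pipeline looks roughly like the sketch below. The feature keys and the preprocessing steps are assumptions based on a typical ImageNet TFRecord layout, not an excerpt from the downloaded example script, which remains the authoritative reference:

import glob

from nvidia.dali import pipeline_def, fn, types
import nvidia.dali.tfrecord as tfrec
from nvidia.dali.plugin.pytorch import DALIGenericIterator

DATA = "/scratch/snx3000/datasets/imagenet/ILSVRC2012_1k"

@pipeline_def
def imagenet_train_pipe(tfrecord_files, index_files):
    # Read and parse the TFRecords; the feature keys below follow the
    # usual TensorFlow ImageNet layout and may differ for other datasets.
    inputs = fn.readers.tfrecord(
        path=tfrecord_files,
        index_path=index_files,
        features={
            "image/encoded": tfrec.FixedLenFeature((), tfrec.string, ""),
            "image/class/label": tfrec.FixedLenFeature([1], tfrec.int64, -1),
        },
        random_shuffle=True,
        name="Reader")
    # Decode JPEGs with the "mixed" backend: CPU parsing plus GPU decoding
    images = fn.decoders.image(inputs["image/encoded"], device="mixed",
                               output_type=types.RGB)
    images = fn.random_resized_crop(images, size=(224, 224))
    images = fn.crop_mirror_normalize(
        images, dtype=types.FLOAT, output_layout="CHW",
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255])
    return images, inputs["image/class/label"]

pipe = imagenet_train_pipe(
    sorted(glob.glob(DATA + "/train/*")),
    sorted(glob.glob(DATA + "/idx_files/train/*")),
    batch_size=128, num_threads=4, device_id=0)
pipe.build()

loader = DALIGenericIterator(pipe, ["data", "label"], reader_name="Reader")
for batch in loader:
    images, labels = batch[0]["data"], batch[0]["label"]
    # feed images and labels to the model here
    break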

Known Issues

Occasionally, PyTorch-based workloads crash due to an NCCL timeout error similar to the following:

[rank27]:[E826 18:01:07.846331043 ProcessGroupNCCL.cpp:568] [Rank 27] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=2, NumelOut=2, Timeout(ms)=120000) ran for 120003 milliseconds before timing out.

In some situations only one rank raises this error, in others multiple or all ranks do. It is difficult to pinpoint the exact cause of this issue; however, any of the following reasons (or a combination thereof) can result in this behaviour:

  • There was a bug in the workload
  • The rank went out of memory
  • There was an issue with NCCL itself
  • There was a network related issue
  • A faulty node/GPU caused the issue

Unfortunately, we have not yet found a direct and reliable fix for this besides restarting the workload.

In some situations, increasing the timeout limit set by PyTorch for NCCL collectives can help. This can be easily done when initialising the torch distributed group:

import os
import torch
from datetime import timedelta

# Use the non-blocking wait mode for NCCL collectives
os.environ['TORCH_NCCL_BLOCKING_WAIT'] = '0'

# Increase the timeout for NCCL operations to 1h
torch.distributed.init_process_group(..., timeout=timedelta(seconds=3600))

More information on these options is available in the PyTorch documentation: https://pytorch.org/docs/stable/distributed.html

Further Documentation