PyTorch is an open-source library for high-performance tensor computation (like NumPy) with strong GPU acceleration. It also provides deep neural networks built on a tape-based autograd system.
This document gives a quick introduction on how to get a first test program in PyTorch running on Piz Daint.
Setup
To use PyTorch on Piz Daint you have to load the corresponding module:
module load daint-gpu
module load PyTorch
or
module load daint-gpu
module load PyTorch/<module-version>
in order to specify a version. The command
module avail PyTorch
can be used to get an overview of the available versions. For production runs, we recommend either specifying the version or printing it in the job's logs, since different releases of PyTorch might be incompatible. This can prove useful when revisiting old calculations.
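As a minimal sketch of the second option, the following lines could be added to a job script right after the module load commands to record the PyTorch release in the job's log (the grep filter is just one way of doing it):

# Record the exact PyTorch release in the job's log for later reference
module list 2>&1 | grep -i pytorch
python -c 'import torch; print("PyTorch", torch.__version__)'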
Testing PyTorch
Simple Import Test
On the Daint login node, try to import the torch Python package directly:
python -c 'import torch; print(torch.__version__)'
If you don't get an error, you should be able to use PyTorch in your code.
Simple GPU Test
Similarly, we can verify that PyTorch finds the GPU on a compute node:
srun -N 1 -C gpu --account=<project> python -c 'import torch; print(torch.cuda.get_device_name())'
# Tesla P100-PCIE-16GB
Please replace the string <project> with the ID of the active project that will be charged for the allocation.
Testing MNIST demo model
A more elaborate test is to actually train a model on the GPU. For that we can use the MNIST example from the PyTorch examples repository.
cd $SCRATCH
git clone https://github.com/pytorch/examples
cd examples/mnist
salloc -N 1 -C gpu --account=<project>
time srun python -u main.py --batch-size 2000 --epochs 2
At the end of the output, you should see something like:
Test set: Average loss: 0.0614, Accuracy: 9801/10000 (98%)

real    0m37.380s
Running on Piz Daint
The following script exemplifies how to submit a PyTorch job to the queuing system. The script asks for one node, making the node's 12 cores (24 logical CPUs with hyperthreading) available to the single Python task. Furthermore, the job is constrained to the GPU nodes of Piz Daint and its running time is limited to 5 minutes.
#!/bin/bash -l
#SBATCH --job-name=test_pt
#SBATCH --time=00:05:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=24
#SBATCH --constraint=gpu
#SBATCH --account=<project>

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

module load daint-gpu PyTorch

srun python main.py
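Assuming the script above is saved as, for example, submit_pytorch.sbatch (the file name is only illustrative), it can be submitted to the queue with:

sbatch submit_pytorch.sbatch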
Running distributed PyTorch with Horovod
Horovod is a distributed training framework initially developed by Uber to make distributed deep learning fast and easy to use. It can be used to perform distributed training with PyTorch, TensorFlow and other frameworks. On Piz Daint, Horovod is already integrated into the PyTorch module.
Horovod was built on Piz Daint to use NCCL-2 for synchronizing the model replicas running on each node over the Cray network. NCCL operations can be configured with a set of environment variables, some of which are shown in the next section.
Example of a distributed job
As an example, let's train a small convolutional neural network (CNN) from Horovod's repository, again using the simple MNIST dataset. The required scripts can be obtained with:
cd $SCRATCH
git clone https://github.com/horovod/horovod -b v0.21.1   # specify your <horovod-version>
cd horovod/examples/pytorch/
The following Slurm submission script can be used to run a distributed training job:
#!/bin/bash -l
#SBATCH --job-name=test_pt_hvd
#SBATCH --time=00:10:00
#SBATCH --nodes=<num-nodes>
#SBATCH --ntasks-per-node=1
#SBATCH --constraint=gpu
#SBATCH --account=<project>
#SBATCH --output=test_pt_hvd_%j.out

module load daint-gpu PyTorch

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Environment variables needed by the NCCL backend
export NCCL_DEBUG=INFO
export NCCL_IB_HCA=ipogif0
export NCCL_IB_CUDA_SUPPORT=1

srun python -u pytorch_mnist.py \
    --batch-size 1000 \
    --lr 0.1 --momentum 0.9 \
    --epochs 2 \
    --fp16-allreduce
Distributing your own model
Horovod greatly simplifies distributing the training across several GPU nodes; however, it still requires some changes to your code. The patch below shows the required code additions for a typical supervised learning case:
diff --git b/examples/pytorch_mnist.py a/examples/pytorch_mnist.py
index a59112a..1a342c4 100644
--- b/examples/pytorch_mnist.py
+++ a/examples/pytorch_mnist.py
@@ -6,6 +6,7 @@ import torch.nn.functional as F
 import torch.optim as optim
 from torchvision import datasets, transforms
 import torch.utils.data.distributed
+import horovod.torch as hvd
 
 # Training settings
 parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
@@ -72,6 +73,8 @@ def train(epoch):
 
 def metric_average(val, name):
     tensor = torch.tensor(val)
+    # Horovod: average test metrics across nodes
+    tensor = hvd.allreduce(tensor, name=name)
     return tensor.item()
 
@@ -108,9 +110,13 @@ if __name__ == '__main__':
     args = parser.parse_args()
     args.cuda = not args.no_cuda and torch.cuda.is_available()
 
+    # Horovod: initialize library.
+    hvd.init()
 
     torch.manual_seed(args.seed)
@@ -131,8 +137,14 @@ if __name__ == '__main__':
                            transforms.ToTensor(),
                            transforms.Normalize((0.1307,), (0.3081,))
                        ]))
+    # Horovod: use DistributedSampler to partition the training data.
+    train_sampler = torch.utils.data.distributed.DistributedSampler(
+        train_dataset, num_replicas=hvd.size(), rank=hvd.rank())
     train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=args.batch_size,
-                                               **kwargs)
+                                               sampler=train_sampler, **kwargs)
 
+    # Horovod: use DistributedSampler to partition the test data.
+    test_sampler = torch.utils.data.distributed.DistributedSampler(
+        test_dataset, num_replicas=hvd.size(), rank=hvd.rank())
     test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=args.test_batch_size,
-                                              **kwargs)
+                                              sampler=test_sampler, **kwargs)
 
     model = Net()
@@ -162,6 +169,19 @@ if __name__ == '__main__':
     optimizer = optim.SGD(model.parameters(), lr=args.lr * lr_scaler,
                           momentum=args.momentum)
 
+    # Horovod: broadcast parameters & optimizer state.
+    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
+    hvd.broadcast_optimizer_state(optimizer, root_rank=0)
+
+    # Horovod: (optional) compression algorithm.
+    compression = hvd.Compression.fp16 if args.fp16_allreduce else hvd.Compression.none
+
+    # Horovod: wrap optimizer with DistributedOptimizer.
+    optimizer = hvd.DistributedOptimizer(optimizer,
+                                         named_parameters=model.named_parameters(),
+                                         compression=compression,
+                                         op=hvd.Adasum if args.use_adasum else hvd.Average)
+
     for epoch in range(1, args.epochs + 1):
         train(epoch)
         test()
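In summary, and only as a minimal sketch rather than a drop-in replacement for the patch above, the typical Horovod additions to a PyTorch training script are the following (Net and train_dataset stand for the model and dataset of the example):

import torch
import horovod.torch as hvd

# One worker process per GPU: initialize Horovod and pin each process to its GPU
hvd.init()
torch.cuda.set_device(hvd.local_rank())

# Partition the training data across the workers
train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset, num_replicas=hvd.size(), rank=hvd.rank())
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64,
                                           sampler=train_sampler)

model = Net().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Start all workers from the same state and average gradients on each step
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())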
Horovod with Gloo
Depending on your dataset, you may need to use num_workers > 0 on torch.utils.data.DataLoader to avoid GPU starvation. On some clusters this can cause problems with the MPI and InfiniBand implementation, which can be avoided by using Gloo instead of the default MPI interface in Horovod.
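A minimal sketch of such a data loader, reusing the train_dataset and train_sampler names from the patch above (the worker count and batch size are illustrative):

import torch

# Eight worker processes prepare batches in the background so the GPU is not starved
train_loader = torch.utils.data.DataLoader(train_dataset,
                                           batch_size=128,
                                           sampler=train_sampler,
                                           num_workers=8,
                                           pin_memory=True)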
To switch to Gloo no code changes are needed, but you have to call your training script with horovodrun instead of srun, as shown below:
#!/bin/bash -l
#SBATCH --job-name=test_pt_hvd
#SBATCH --time=00:10:00
#SBATCH --nodes=<num-nodes>
#SBATCH --ntasks-per-node=1
#SBATCH --constraint=gpu
#SBATCH --account=<project>
#SBATCH --output=test_pt_hvd_%j.out

module load daint-gpu PyTorch

# Avoid warnings on forks
export PMI_NO_PREINITIALIZE=1
unset CSCS_CUSTOM_ENV PELOCAL_PRGENV PROFILEREAD RCLOCAL_PRGENV RCLOCAL_BASEOPTS

# List nodes for Gloo
for node in $(scontrol show hostnames); do
    HOSTS="$HOSTS$node:$SLURM_NTASKS_PER_NODE,"
done
HOSTS=${HOSTS%?}  # trim trailing comma

horovodrun -np $SLURM_NTASKS -H $HOSTS --gloo --network-interface ipogif0 \
    python -u pytorch_mnist.py \
    --batch-size 1000 \
    --lr 0.1 --momentum 0.9 \
    --epochs 2 \
    --fp16-allreduce
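With a two-node allocation, for instance, the resulting HOSTS string would look like nid00001:1,nid00002:1 (the node names here are purely illustrative), which tells horovodrun to start one worker on each node.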
ImageNet 1k ILSVRC2012
The ImageNet dataset with 1000 classes is available on Piz Daint under /scratch/snx3000/datasets/imagenet/ILSVRC2012_1k.
The 1,281,166 training images as well as the 50,000 validation images are repacked as TFRecord files, which greatly speeds up data loading during training. We do not recommend using the large number of small image files directly, as that causes a huge amount of I/O operations on the distributed file system.
Please see below an example of how to use the NVIDIA DALI data loader to read the ImageNet TFRecord files.
First, install its pip package in a virtual environment and download the example script:
# install dependencies
module load daint-gpu PyTorch
python -m venv ~/dali_env
source ~/dali_env/bin/activate
pip install tensorboard tqdm
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist --upgrade nvidia-dali-cuda110

# get example
wget https://raw.githubusercontent.com/henrique/horovod/ptdali/examples/pytorch/pytorch_imagenet_resnet50_dali.py
Then run the sbatch script:
#!/bin/bash -l
#SBATCH --job-name=test_pt_hvd
#SBATCH --time=00:10:00
#SBATCH --nodes=<num-nodes>
#SBATCH --ntasks-per-node=1
#SBATCH --constraint=gpu
#SBATCH --account=<project>
#SBATCH --output=test_pt_hvd_%j.out

module load daint-gpu PyTorch
source ~/dali_env/bin/activate

srun python -u pytorch_imagenet_resnet50_dali.py \
    --train-recs '/scratch/snx3000/datasets/imagenet/ILSVRC2012_1k/train/*' \
    --val-recs '/scratch/snx3000/datasets/imagenet/ILSVRC2012_1k/validation/*' \
    --train-idx '/scratch/snx3000/datasets/imagenet/ILSVRC2012_1k/idx_files/train/*' \
    --val-idx '/scratch/snx3000/datasets/imagenet/ILSVRC2012_1k/idx_files/validation/*' \
    --log-dir $SCRATCH/runs/pytorch_imagenet_resnet50/ \
    --batch-size 128 --epochs 60
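For orientation, the core of such a DALI-based loader reads the TFRecord shards roughly as sketched below. This is not the actual content of pytorch_imagenet_resnet50_dali.py; the feature keys, image sizes and augmentations are assumptions based on a common ImageNet TFRecord layout:

import glob

from nvidia.dali import pipeline_def, fn, types
import nvidia.dali.tfrecord as tfrec
from nvidia.dali.plugin.pytorch import DALIClassificationIterator

# Expand the TFRecord shards and their index files (paths from the dataset above)
train_recs = sorted(glob.glob('/scratch/snx3000/datasets/imagenet/ILSVRC2012_1k/train/*'))
train_idx = sorted(glob.glob('/scratch/snx3000/datasets/imagenet/ILSVRC2012_1k/idx_files/train/*'))

@pipeline_def
def train_pipeline(recs, idx):
    # Read the packed TFRecord shards together with their index files
    inputs = fn.readers.tfrecord(
        path=recs, index_path=idx, random_shuffle=True, name="Reader",
        features={
            "image/encoded": tfrec.FixedLenFeature((), tfrec.string, ""),
            "image/class/label": tfrec.FixedLenFeature([1], tfrec.int64, -1),
        })
    # Decode and augment on the GPU
    images = fn.decoders.image(inputs["image/encoded"], device="mixed")
    images = fn.random_resized_crop(images, size=(224, 224))
    images = fn.crop_mirror_normalize(images, dtype=types.FLOAT, output_layout="CHW",
                                      mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                                      std=[0.229 * 255, 0.224 * 255, 0.225 * 255])
    return images, inputs["image/class/label"]

pipe = train_pipeline(recs=train_recs, idx=train_idx,
                      batch_size=128, num_threads=4, device_id=0)
pipe.build()
train_loader = DALIClassificationIterator(pipe, reader_name="Reader")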
Known Issues
Occasionally, PyTorch-based workloads crash due to an NCCL timeout error similar to the following:
[rank27]:[E826 18:01:07.846331043 ProcessGroupNCCL.cpp:568] [Rank 27] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=2, NumelOut=2, Timeout(ms)=120000) ran for 120003 milliseconds before timing out.
In some situations only one rank raises this error, while in others multiple or all ranks do. It is difficult to pinpoint the exact cause of this issue; however, any of the following reasons (or a combination thereof) can result in this behaviour:
- There was a bug in the workload
- A rank ran out of memory
- There was an issue with NCCL itself
- There was a network related issue
- A faulty node/GPU caused the issue
Unfortunately, we have not yet found a direct and reliable fix for this besides restarting the workload.
In some situations, increasing the timeout limit set by PyTorch for NCCL collectives can help. This can be easily done when initialising the torch distributed group:
import os
import torch
from datetime import timedelta

# Disable the time limit
os.environ['TORCH_NCCL_BLOCKING_WAIT'] = '0'

# Increase the timeout for NCCL operations to 1h
torch.distributed.init_process_group(..., timeout=timedelta(seconds=3600))
More information on these options is available in the PyTorch documentation: https://pytorch.org/docs/stable/distributed.html