This page collects comprehensive, real-world examples of workflows that can be performed with containerized applications and the Container Engine on Alps.

NCCL tests with NGC-based image

The NVIDIA Collective Communication Library (NCCL) implements optimized collective operations (for example all-reduce, all-gather, broadcast) for multi-GPU communication. NCCL has been adopted in particular by deep learning frameworks to improve training performance on distributed systems.
One of the most frequent ways in which NCCL is accessed and used is through NVIDIA GPU Cloud (NGC) container images. Such images provide software stacks authored and optimized by NVIDIA for a variety of domains, with an emphasis on ML/AI frameworks and applications.

This example shows how to take advantage of the HPE Slingshot interconnect when using a container image derived from the NGC and communicating through NCCL.

The software used is the NCCL Tests, a collection of open-source synthetic benchmarks developed by NVIDIA to evaluate the correctness and performance of NCCL operations.

We use a custom image only to install the tests: all other components of the software stack (including the CUDA libraries and NCCL itself) come from the pytorch:24.08-py3 image from NGC.


The Containerfile is as follows:

FROM nvcr.io/nvidia/pytorch:24.08-py3

# to avoid interaction with apt-get
ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential \
        automake \
        autoconf \
        libtool \
        wget \
        libpmi2-0-dev \
        ca-certificates \
    && apt-get clean && rm -rf /var/lib/apt/lists/*

# Download, extract, and build the NCCL tests with MPI support,
# using the HPC-X Open MPI installation shipped in the NGC image
RUN wget -O nccl-tests-2.13.10.tar.gz https://github.com/NVIDIA/nccl-tests/archive/refs/tags/v2.13.10.tar.gz \
    && tar xf nccl-tests-2.13.10.tar.gz \
    && cd nccl-tests-2.13.10 \
    && MPI=1 MPI_HOME=/opt/hpcx/ompi make -j$(nproc) \
    && cd .. \
    && rm -rf nccl-tests-2.13.10.tar.gz
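
The image can then be built and converted into a squashfs file for use with the Container Engine, for example using Podman and enroot. The commands below are a sketch: the image tag and the output filename are illustrative.

[<vcluster>][<username>@<vcluster>-ln001 ~]$ podman build -f Containerfile -t nccl-tests:2.13.10-ngc-pytorch24.08 .
[<vcluster>][<username>@<vcluster>-ln001 ~]$ enroot import -x mount -o nccl-tests-2.13.10.sqsh podman://nccl-tests:2.13.10-ngc-pytorch24.08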

The following EDF is used (saved, for example, as $HOME/.edf/nccl-tests-ngc.toml, so that it can be referenced by the name nccl-tests-ngc):

image = "/capstor/scratch/cscs/amadonna/quay.io+madeeks+nccl-tests+2.13.10-ngc-pytorch24.08-arm64.sqsh"
entrypoint = false

[annotations]
com.hooks.aws_ofi_nccl.enabled = "true"
com.hooks.aws_ofi_nccl.variant = "cuda12"

Notice that the AWS OFI NCCL hook is enabled and instructed to use the NCCL plugin variant for the CUDA version in the container image.

Using the aforementioned EDF, a test run on 2 nodes, with 4 GH200 devices per node (8 GPUs in total), yields the following results:

[<vcluster>][<username>@<vcluster>-ln001 ~]$ srun -N2 --ntasks-per-node=4 --mpi=pmi2 --environment=nccl-tests-ngc /workspace/nccl-tests-2.13.10/build/all_reduce_perf -b 8 -e 128M -f 2
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 190002 on  nid007222 device  0 [0x01] GH200 120GB
#  Rank  1 Group  0 Pid 190003 on  nid007222 device  1 [0x01] GH200 120GB
#  Rank  2 Group  0 Pid 190004 on  nid007222 device  2 [0x01] GH200 120GB
#  Rank  3 Group  0 Pid 190005 on  nid007222 device  3 [0x01] GH200 120GB
#  Rank  4 Group  0 Pid 118700 on  nid007223 device  0 [0x01] GH200 120GB
#  Rank  5 Group  0 Pid 118701 on  nid007223 device  1 [0x01] GH200 120GB
#  Rank  6 Group  0 Pid 118702 on  nid007223 device  2 [0x01] GH200 120GB
#  Rank  7 Group  0 Pid 118703 on  nid007223 device  3 [0x01] GH200 120GB
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1    41.07    0.00    0.00      0    17.53    0.00    0.00      0
          16             4     float     sum      -1    17.22    0.00    0.00      0    17.00    0.00    0.00      0
          32             8     float     sum      -1    17.30    0.00    0.00      0    17.01    0.00    0.00      0
          64            16     float     sum      -1    17.54    0.00    0.01      0    17.60    0.00    0.01      0
         128            32     float     sum      -1    19.59    0.01    0.01      0    18.83    0.01    0.01      0
         256            64     float     sum      -1    19.29    0.01    0.02      0    19.13    0.01    0.02      0
         512           128     float     sum      -1    19.73    0.03    0.05      0    19.67    0.03    0.05      0
        1024           256     float     sum      -1    20.46    0.05    0.09      0    20.15    0.05    0.09      0
        2048           512     float     sum      -1    21.63    0.09    0.17      0    21.21    0.10    0.17      0
        4096          1024     float     sum      -1    28.77    0.14    0.25      0    28.17    0.15    0.25      0
        8192          2048     float     sum      -1    35.38    0.23    0.41      0    28.33    0.29    0.51      0
       16384          4096     float     sum      -1    34.83    0.47    0.82      0    29.04    0.56    0.99      0
       32768          8192     float     sum      -1    36.16    0.91    1.59      0    29.73    1.10    1.93      0
       65536         16384     float     sum      -1    45.66    1.44    2.51      0    34.26    1.91    3.35      0
      131072         32768     float     sum      -1    51.82    2.53    4.43      0    50.93    2.57    4.50      0
      262144         65536     float     sum      -1    68.79    3.81    6.67      0    67.06    3.91    6.84      0
      524288        131072     float     sum      -1    134.1    3.91    6.84      0    223.3    2.35    4.11      0
     1048576        262144     float     sum      -1    256.3    4.09    7.16      0    209.4    5.01    8.76      0
     2097152        524288     float     sum      -1    953.8    2.20    3.85      0   1017.6    2.06    3.61      0
     4194304       1048576     float     sum      -1    173.6   24.17   42.29      0    173.2   24.22   42.38      0
     8388608       2097152     float     sum      -1    202.3   41.47   72.58      0    201.7   41.58   72.77      0
    16777216       4194304     float     sum      -1    360.9   46.48   81.34      0    358.8   46.76   81.83      0
    33554432       8388608     float     sum      -1    688.0   48.77   85.34      0    687.1   48.84   85.46      0
    67108864      16777216     float     sum      -1   1342.2   50.00   87.50      0   1340.7   50.05   87.60      0
   134217728      33554432     float     sum      -1   1875.2   71.57  125.25      0   1875.9   71.55  125.21      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 21.1924 
#
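
To confirm that communication indeed went through the AWS OFI NCCL plugin, the benchmark can be re-run with NCCL's debug output enabled. NCCL_DEBUG is a standard NCCL environment variable; the exact messages vary between NCCL versions, but the initialization log should mention the OFI network plugin:

[<vcluster>][<username>@<vcluster>-ln001 ~]$ NCCL_DEBUG=INFO srun -N2 --ntasks-per-node=4 --mpi=pmi2 --environment=nccl-tests-ngc /workspace/nccl-tests-2.13.10/build/all_reduce_perf -b 8 -e 128M -f 2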

For comparison, without access to the Slingshot high-speed network (for example, by not enabling the AWS OFI NCCL hook in the EDF), the same test yields an average bus bandwidth roughly 30 times lower:

# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 206717 on  nid007121 device  0 [0x01] GH200 120GB
#  Rank  1 Group  0 Pid 206718 on  nid007121 device  1 [0x01] GH200 120GB
#  Rank  2 Group  0 Pid 206719 on  nid007121 device  2 [0x01] GH200 120GB
#  Rank  3 Group  0 Pid 206720 on  nid007121 device  3 [0x01] GH200 120GB
#  Rank  4 Group  0 Pid  92345 on  nid007122 device  0 [0x01] GH200 120GB
#  Rank  5 Group  0 Pid  92346 on  nid007122 device  1 [0x01] GH200 120GB
#  Rank  6 Group  0 Pid  92347 on  nid007122 device  2 [0x01] GH200 120GB
#  Rank  7 Group  0 Pid  92348 on  nid007122 device  3 [0x01] GH200 120GB
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1    86.52    0.00    0.00      0    56.44    0.00    0.00      0
          16             4     float     sum      -1    53.53    0.00    0.00      0    51.33    0.00    0.00      0
          32             8     float     sum      -1    54.26    0.00    0.00      0    53.75    0.00    0.00      0
          64            16     float     sum      -1    56.48    0.00    0.00      0    55.84    0.00    0.00      0
         128            32     float     sum      -1    58.59    0.00    0.00      0    57.00    0.00    0.00      0
         256            64     float     sum      -1    57.64    0.00    0.01      0    57.21    0.00    0.01      0
         512           128     float     sum      -1    58.26    0.01    0.02      0    57.93    0.01    0.02      0
        1024           256     float     sum      -1    58.90    0.02    0.03      0    58.04    0.02    0.03      0
        2048           512     float     sum      -1    63.17    0.03    0.06      0    61.67    0.03    0.06      0
        4096          1024     float     sum      -1    69.42    0.06    0.10      0    67.60    0.06    0.11      0
        8192          2048     float     sum      -1    90.76    0.09    0.16      0    88.20    0.09    0.16      0
       16384          4096     float     sum      -1    126.5    0.13    0.23      0    128.7    0.13    0.22      0
       32768          8192     float     sum      -1    220.5    0.15    0.26      0    206.7    0.16    0.28      0
       65536         16384     float     sum      -1    246.8    0.27    0.46      0    238.3    0.27    0.48      0
      131072         32768     float     sum      -1    469.5    0.28    0.49      0    491.5    0.27    0.47      0
      262144         65536     float     sum      -1   1043.8    0.25    0.44      0   1037.7    0.25    0.44      0
      524288        131072     float     sum      -1   1356.9    0.39    0.68      0   2635.1    0.20    0.35      0
     1048576        262144     float     sum      -1   1992.1    0.53    0.92      0   1939.3    0.54    0.95      0
     2097152        524288     float     sum      -1   6302.0    0.33    0.58      0    14262    0.15    0.26      0
     4194304       1048576     float     sum      -1   3531.5    1.19    2.08      0   4855.8    0.86    1.51      0
     8388608       2097152     float     sum      -1   6608.1    1.27    2.22      0   6484.8    1.29    2.26      0
    16777216       4194304     float     sum      -1    23592    0.71    1.24      0    23500    0.71    1.25      0
    33554432       8388608     float     sum      -1    25024    1.34    2.35      0    24177    1.39    2.43      0
    67108864      16777216     float     sum      -1    47853    1.40    2.45      0    47909    1.40    2.45      0
   134217728      33554432     float     sum      -1    72219    1.86    3.25      0    67601    1.99    3.47      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0.704906 
#
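
Such a run can be reproduced with an otherwise identical EDF that omits the hook annotations, leaving the AWS OFI NCCL hook disabled; a minimal sketch:

image = "/capstor/scratch/cscs/<username>/quay.io+madeeks+nccl-tests+2.13.10-ngc-pytorch24.08-arm64.sqsh"
entrypoint = false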

Building software in an NVIDIA HPC SDK environment

The NVIDIA HPC SDK is a suite of compilers, libraries, and tools to develop, debug, and analyze HPC applications, with an emphasis on GPU-accelerated software and the CUDA ecosystem.

The SDK is also available in the form of container images directly from the NGC catalog. This example demonstrates how to use one such image to jumpstart the creation of a development environment. In this case, we compile and run one of the sample programs for the cuSPARSE library included in the image.
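
The image can be imported from the NGC registry beforehand, for example with enroot. As a sketch: enroot separates the registry from the image name with a # character, and by default names the resulting squashfs file after the image, matching the path used in the EDF below.

[<vcluster>][<username>@<vcluster>-ln001 <username>]$ enroot import docker://nvcr.io#nvidia/nvhpc:24.7-devel-cuda12.5-ubuntu22.04

The following EDF then defines the development environment: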

[<vcluster>][<username>@<vcluster>-ln001 ~]$ cat $HOME/.edf/nvhpc-24.7.toml
image = "/capstor/scratch/cscs/<username>/nvidia+nvhpc+24.7-devel-cuda12.5-ubuntu22.04.sqsh"
mounts = ["/capstor/scratch/cscs/<username>:/capstor/scratch/cscs/<username>"]
workdir = "/capstor/scratch/cscs/<username>"
writable = true

Notice that writable is set to true: this makes the container filesystem writable, which is needed here because the sample program is compiled inside the image's own examples directory.

[<vcluster>][<username>@<vcluster>-ln001 <username>]$ srun -p nvgpu --environment=nvhpc-24.7 --pty bash

<username>@<hostname>:/capstor/scratch/cscs/<username>$ cd /opt/nvidia/hpc_sdk/Linux_aarch64/24.7/examples/CUDA-Libraries/cuSPARSE/test_sparse_omp_cpp
<username>@<hostname>:/opt/nvidia/hpc_sdk/Linux_aarch64/24.7/examples/CUDA-Libraries/cuSPARSE/test_sparse_omp_cpp$ make
nvc++ -fast -mp=gpu -cudalib=cusparse,cublas -o tcusparseompc1.exe tcusparseompc1.cpp
./tcusparseompc1.exe
iteration =   1, residual = 4.449882e+01
iteration =   2, residual = 3.245218e+00
iteration =   3, residual = 2.690220e-01
iteration =   4, residual = 2.307639e-02
iteration =   5, residual = 1.993140e-03
iteration =   6, residual = 1.846192e-04
iteration =   7, residual = 1.693379e-05
iteration =   8, residual = 1.600115e-06
Final residual: 1.600115e-06
&&&& uvm_cg test PASSED
Test Summary:  Error amount = 0.000000, result = SUCCESS

Binaries compiled within the containerized development environment can be saved to a shared filesystem and re-used in a different job later on. For example, continuing the session from the previous code block:

<username>@<hostname>:/opt/nvidia/hpc_sdk/Linux_aarch64/24.7/examples/CUDA-Libraries/cuSPARSE/test_sparse_omp_cpp$ cp ./tcusparseompc1.exe $SCRATCH

<username>@<hostname>:/opt/nvidia/hpc_sdk/Linux_aarch64/24.7/examples/CUDA-Libraries/cuSPARSE/test_sparse_omp_cpp$ exit

[<vcluster>][<username>@<vcluster>-ln001 ~]$ cd /capstor/scratch/cscs/<username>
[<vcluster>][<username>@<vcluster>-ln001 <username>]$ srun --environment=nvhpc-24.7 ./tcusparseompc1.exe 
iteration =   1, residual = 4.449882e+01
iteration =   2, residual = 3.245218e+00
iteration =   3, residual = 2.690220e-01
iteration =   4, residual = 2.307639e-02
iteration =   5, residual = 1.993140e-03
iteration =   6, residual = 1.846192e-04
iteration =   7, residual = 1.693379e-05
iteration =   8, residual = 1.600115e-06
Final residual: 1.600115e-06
&&&& uvm_cg test PASSED
Test Summary:  Error amount = 0.000000, result = SUCCESS

Using containerized environments to produce binaries, save them, and run them at a later time offers flexibility during experimentation and development. However, when converging towards more mature software, it is strongly recommended to package applications into their own container images, in order to take advantage of their portability, prescriptiveness, and ease of distribution. In turn, such images can be used to create application-specific EDFs, as sketched below.
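
As an illustration, an application-specific EDF could be as simple as the following (the image path and working directory are hypothetical):

image = "/capstor/scratch/cscs/<username>/my-application.sqsh"
workdir = "/workspace"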

Further reading