This page collects comprehensive, real-world examples of use cases that can be carried out with containerized applications and the Container Engine on Alps.
NCCL tests with an NGC-based image
The NVIDIA Collective Communication Library (NCCL) implements optimized collective operations (for example all-reduce, all-gather, broadcast) for multi-GPU communication. NCCL has been adopted in particular by deep learning frameworks to improve training performance on distributed systems.
NCCL is most commonly accessed through NVIDIA GPU Cloud (NGC) container images. Such images provide software stacks authored and optimized by NVIDIA for a variety of domains, with an emphasis on ML/AI frameworks and applications.
This example shows how to take advantage of the HPE Slingshot interconnect when using a container image derived from the NGC catalog and communicating through NCCL.
The software used is the NCCL Tests suite, a collection of open-source synthetic benchmarks developed by NVIDIA to evaluate the correctness and performance of NCCL operations.
We use a custom image only to install the tests: all the other components of the software stack (including the CUDA libraries and NCCL itself) come from the pytorch:24.08-py3 image from NGC.
The Containerfile is as follows:
FROM nvcr.io/nvidia/pytorch:24.08-py3

# to avoid interaction with apt-get
ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential \
        automake \
        autoconf \
        libtool \
        wget \
        libpmi2-0-dev \
        ca-certificates \
    && apt-get clean && rm -rf /var/lib/apt/lists/*

RUN wget -O nccl-tests-2.13.10.tar.gz https://github.com/NVIDIA/nccl-tests/archive/refs/tags/v2.13.10.tar.gz \
    && tar xf nccl-tests-2.13.10.tar.gz \
    && cd nccl-tests-2.13.10 \
    && MPI=1 MPI_HOME=/opt/hpcx/ompi make -j$(nproc) \
    && cd .. \
    && rm -rf nccl-tests-2.13.10.tar.gz
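Assuming Podman has been set up as described in the page about building container images on Alps, the image could be built and imported into a SquashFS file along the following lines. This is only a sketch: the image tag and the output path are placeholders, not the names used for the results below.

# Build the image from the Containerfile above (run in the directory containing it)
podman build -f Containerfile -t nccl-tests:2.13.10-ngc-pytorch24.08 .

# Import the image into a SquashFS file, which the EDF can then reference through its "image" entry
enroot import -o $SCRATCH/nccl-tests-2.13.10-ngc-pytorch24.08.sqsh podman://nccl-tests:2.13.10-ngc-pytorch24.08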
The following EDF is used:
image = "/capstor/scratch/cscs/amadonna/quay.io+madeeks+nccl-tests+2.13.10-ngc-pytorch24.08-arm64.sqsh"
entrypoint = false

[annotations]
com.hooks.aws_ofi_nccl.enabled = "true"
com.hooks.aws_ofi_nccl.variant = "cuda12"
Notice that the AWS OFI NCCL hook is enabled and instructed to use the NCCL plugin variant for the CUDA version in the container image.
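To verify at runtime that the plugin is actually picked up, one can set the standard NCCL_DEBUG=INFO variable, which makes NCCL log its initialization details, including the network plugin it loads. A minimal sketch, assuming that Slurm propagates the environment variable into the container (the default behavior):

# Print NCCL initialization details, including which network plugin is loaded
NCCL_DEBUG=INFO srun -N2 --ntasks-per-node=4 --mpi=pmi2 --environment=nccl-tests-ngc \
    /workspace/nccl-tests-2.13.10/build/all_reduce_perf -b 8 -e 128M -f 2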
Using the aforementioned EDF, a test run on 2 nodes and 8 GH200 devices yields the following results:
[<vcluster>][amadonna@<vcluster>-ln001 ~]$ srun -N2 --ntasks-per-node=4 --mpi=pmi2 --environment=nccl-tests-ngc /workspace/nccl-tests-2.13.10/build/all_reduce_perf -b 8 -e 128M -f 2
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 190002 on  nid007222 device  0 [0x01] GH200 120GB
#  Rank  1 Group  0 Pid 190003 on  nid007222 device  1 [0x01] GH200 120GB
#  Rank  2 Group  0 Pid 190004 on  nid007222 device  2 [0x01] GH200 120GB
#  Rank  3 Group  0 Pid 190005 on  nid007222 device  3 [0x01] GH200 120GB
#  Rank  4 Group  0 Pid 118700 on  nid007223 device  0 [0x01] GH200 120GB
#  Rank  5 Group  0 Pid 118701 on  nid007223 device  1 [0x01] GH200 120GB
#  Rank  6 Group  0 Pid 118702 on  nid007223 device  2 [0x01] GH200 120GB
#  Rank  7 Group  0 Pid 118703 on  nid007223 device  3 [0x01] GH200 120GB
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float     sum      -1    41.07    0.00    0.00      0    17.53    0.00    0.00      0
          16             4     float     sum      -1    17.22    0.00    0.00      0    17.00    0.00    0.00      0
          32             8     float     sum      -1    17.30    0.00    0.00      0    17.01    0.00    0.00      0
          64            16     float     sum      -1    17.54    0.00    0.01      0    17.60    0.00    0.01      0
         128            32     float     sum      -1    19.59    0.01    0.01      0    18.83    0.01    0.01      0
         256            64     float     sum      -1    19.29    0.01    0.02      0    19.13    0.01    0.02      0
         512           128     float     sum      -1    19.73    0.03    0.05      0    19.67    0.03    0.05      0
        1024           256     float     sum      -1    20.46    0.05    0.09      0    20.15    0.05    0.09      0
        2048           512     float     sum      -1    21.63    0.09    0.17      0    21.21    0.10    0.17      0
        4096          1024     float     sum      -1    28.77    0.14    0.25      0    28.17    0.15    0.25      0
        8192          2048     float     sum      -1    35.38    0.23    0.41      0    28.33    0.29    0.51      0
       16384          4096     float     sum      -1    34.83    0.47    0.82      0    29.04    0.56    0.99      0
       32768          8192     float     sum      -1    36.16    0.91    1.59      0    29.73    1.10    1.93      0
       65536         16384     float     sum      -1    45.66    1.44    2.51      0    34.26    1.91    3.35      0
      131072         32768     float     sum      -1    51.82    2.53    4.43      0    50.93    2.57    4.50      0
      262144         65536     float     sum      -1    68.79    3.81    6.67      0    67.06    3.91    6.84      0
      524288        131072     float     sum      -1    134.1    3.91    6.84      0    223.3    2.35    4.11      0
     1048576        262144     float     sum      -1    256.3    4.09    7.16      0    209.4    5.01    8.76      0
     2097152        524288     float     sum      -1    953.8    2.20    3.85      0   1017.6    2.06    3.61      0
     4194304       1048576     float     sum      -1    173.6   24.17   42.29      0    173.2   24.22   42.38      0
     8388608       2097152     float     sum      -1    202.3   41.47   72.58      0    201.7   41.58   72.77      0
    16777216       4194304     float     sum      -1    360.9   46.48   81.34      0    358.8   46.76   81.83      0
    33554432       8388608     float     sum      -1    688.0   48.77   85.34      0    687.1   48.84   85.46      0
    67108864      16777216     float     sum      -1   1342.2   50.00   87.50      0   1340.7   50.05   87.60      0
   134217728      33554432     float     sum      -1   1875.2   71.57  125.25      0   1875.9   71.55  125.21      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 21.1924
#
For comparison, without access to the Slingshot network (for example, by not enabling the AWS OFI NCCL hook), the same test yields:
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 206717 on  nid007121 device  0 [0x01] GH200 120GB
#  Rank  1 Group  0 Pid 206718 on  nid007121 device  1 [0x01] GH200 120GB
#  Rank  2 Group  0 Pid 206719 on  nid007121 device  2 [0x01] GH200 120GB
#  Rank  3 Group  0 Pid 206720 on  nid007121 device  3 [0x01] GH200 120GB
#  Rank  4 Group  0 Pid 92345 on  nid007122 device  0 [0x01] GH200 120GB
#  Rank  5 Group  0 Pid 92346 on  nid007122 device  1 [0x01] GH200 120GB
#  Rank  6 Group  0 Pid 92347 on  nid007122 device  2 [0x01] GH200 120GB
#  Rank  7 Group  0 Pid 92348 on  nid007122 device  3 [0x01] GH200 120GB
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float     sum      -1    86.52    0.00    0.00      0    56.44    0.00    0.00      0
          16             4     float     sum      -1    53.53    0.00    0.00      0    51.33    0.00    0.00      0
          32             8     float     sum      -1    54.26    0.00    0.00      0    53.75    0.00    0.00      0
          64            16     float     sum      -1    56.48    0.00    0.00      0    55.84    0.00    0.00      0
         128            32     float     sum      -1    58.59    0.00    0.00      0    57.00    0.00    0.00      0
         256            64     float     sum      -1    57.64    0.00    0.01      0    57.21    0.00    0.01      0
         512           128     float     sum      -1    58.26    0.01    0.02      0    57.93    0.01    0.02      0
        1024           256     float     sum      -1    58.90    0.02    0.03      0    58.04    0.02    0.03      0
        2048           512     float     sum      -1    63.17    0.03    0.06      0    61.67    0.03    0.06      0
        4096          1024     float     sum      -1    69.42    0.06    0.10      0    67.60    0.06    0.11      0
        8192          2048     float     sum      -1    90.76    0.09    0.16      0    88.20    0.09    0.16      0
       16384          4096     float     sum      -1    126.5    0.13    0.23      0    128.7    0.13    0.22      0
       32768          8192     float     sum      -1    220.5    0.15    0.26      0    206.7    0.16    0.28      0
       65536         16384     float     sum      -1    246.8    0.27    0.46      0    238.3    0.27    0.48      0
      131072         32768     float     sum      -1    469.5    0.28    0.49      0    491.5    0.27    0.47      0
      262144         65536     float     sum      -1   1043.8    0.25    0.44      0   1037.7    0.25    0.44      0
      524288        131072     float     sum      -1   1356.9    0.39    0.68      0   2635.1    0.20    0.35      0
     1048576        262144     float     sum      -1   1992.1    0.53    0.92      0   1939.3    0.54    0.95      0
     2097152        524288     float     sum      -1   6302.0    0.33    0.58      0    14262    0.15    0.26      0
     4194304       1048576     float     sum      -1   3531.5    1.19    2.08      0   4855.8    0.86    1.51      0
     8388608       2097152     float     sum      -1   6608.1    1.27    2.22      0   6484.8    1.29    2.26      0
    16777216       4194304     float     sum      -1    23592    0.71    1.24      0    23500    0.71    1.25      0
    33554432       8388608     float     sum      -1    25024    1.34    2.35      0    24177    1.39    2.43      0
    67108864      16777216     float     sum      -1    47853    1.40    2.45      0    47909    1.40    2.45      0
   134217728      33554432     float     sum      -1    72219    1.86    3.25      0    67601    1.99    3.47      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0.704906
#
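For reference, the comparison run above is obtained simply by leaving the hook annotations out of the EDF. A minimal sketch of such a variant, using the same image path as before, could look like this:

image = "/capstor/scratch/cscs/amadonna/quay.io+madeeks+nccl-tests+2.13.10-ngc-pytorch24.08-arm64.sqsh"
entrypoint = false

# No [annotations] table: the AWS OFI NCCL hook is not enabled, so NCCL cannot use
# the Libfabric-based plugin and falls back to slower transports for inter-node traffic.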
Building software in an NVIDIA HPC SDK environment
The NVIDIA HPC SDK is a suite of compilers, libraries and tools to develop, debug, and analyze HPC applications, with an emphasis on GPU-accelerated software and the CUDA ecosystem.
The SDK is also available as container images directly from the NGC catalog. This example demonstrates how to use one such image to jump-start the creation of a development environment. In this case, we compile and run one of the sample programs for the cuSPARSE library included in the image:
[<vcluster>][<username>@<vcluster>-ln001 ~]$ cat $HOME/.edf/nvhpc-24.7.toml
image = "/capstor/scratch/cscs/<username>/nvidia+nvhpc+24.7-devel-cuda12.5-ubuntu22.04.sqsh"
mounts = ["/capstor/scratch/cscs/<username>:/capstor/scratch/cscs/<username>"]
workdir = "/capstor/scratch/cscs/<username>"
writable=true

[<vcluster>][<username>@<vcluster>-ln001 <username>]$ srun -p nvgpu --environment=nvhpc-24.7 --pty bash

<username>@<hostname>:/capstor/scratch/cscs/<username>$ cd /opt/nvidia/hpc_sdk/Linux_aarch64/24.7/examples/CUDA-Libraries/cuSPARSE/test_sparse_omp_cpp

<username>@<hostname>:/opt/nvidia/hpc_sdk/Linux_aarch64/24.7/examples/CUDA-Libraries/cuSPARSE/test_sparse_omp_cpp$ make
nvc++ -fast -mp=gpu -cudalib=cusparse,cublas -o tcusparseompc1.exe tcusparseompc1.cpp
./tcusparseompc1.exe
iteration = 1, residual = 4.449882e+01
iteration = 2, residual = 3.245218e+00
iteration = 3, residual = 2.690220e-01
iteration = 4, residual = 2.307639e-02
iteration = 5, residual = 1.993140e-03
iteration = 6, residual = 1.846192e-04
iteration = 7, residual = 1.693379e-05
iteration = 8, residual = 1.600115e-06
Final residual: 1.600115e-06
&&&& uvm_cg test PASSED
Test Summary:  Error amount = 0.000000, result = SUCCESS
Binaries compiled within the containerized development environment can be saved to a shared filesystem and reused in a different job later on. For example, continuing the session from the previous code block:
<username>@<hostname>:/opt/nvidia/hpc_sdk/Linux_aarch64/24.7/examples/CUDA-Libraries/cuSPARSE/test_sparse_omp_cpp$ cp ./tcusparseompc1.exe $SCRATCH
<username>@<hostname>:/opt/nvidia/hpc_sdk/Linux_aarch64/24.7/examples/CUDA-Libraries/cuSPARSE/test_sparse_omp_cpp$ exit

[<vcluster>][<username>@<vcluster>-ln001 ~]$ cd /capstor/scratch/cscs/<username>
[<vcluster>][<username>@<vcluster>-ln001 <username>]$ srun --environment=nvhpc-24.7 ./tcusparseompc1.exe
iteration = 1, residual = 4.449882e+01
iteration = 2, residual = 3.245218e+00
iteration = 3, residual = 2.690220e-01
iteration = 4, residual = 2.307639e-02
iteration = 5, residual = 1.993140e-03
iteration = 6, residual = 1.846192e-04
iteration = 7, residual = 1.693379e-05
iteration = 8, residual = 1.600115e-06
Final residual: 1.600115e-06
&&&& uvm_cg test PASSED
Test Summary:  Error amount = 0.000000, result = SUCCESS
Using containerized environments to produce binaries, save them, and run them at a later time can offer flexibility during experimentation and development. However, as software matures, it is strongly recommended to package applications into their own container images, in order to take advantage of their portability, prescriptiveness, and ease of distribution. In turn, such images can be used to create application-specific EDFs.
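As a purely hypothetical sketch of that approach, the cuSPARSE example binary compiled earlier could be copied into an image based on an NVIDIA HPC SDK runtime variant. The image tag and paths below are assumptions, not a tested recipe:

# Containerfile sketch: package the previously compiled binary on top of a runtime image
FROM nvcr.io/nvidia/nvhpc:24.7-runtime-cuda12.5-ubuntu22.04
COPY tcusparseompc1.exe /usr/local/bin/tcusparseompc1.exe

A corresponding application-specific EDF would then simply reference the SquashFS file obtained by importing that image, for example:

# EDF sketch for the packaged application
image = "/capstor/scratch/cscs/<username>/tcusparseompc1.sqsh"
workdir = "/capstor/scratch/cscs/<username>"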
Further reading
- Building container images on Alps
- Run images with Container Engine