Todi is the first early access system with Grace-Hopper (GH200) nodes on Alps that is open to external users. To access the system, projects need to have applied for special access.

Todi is an early access system and represents a work in progress. CSCS engineers are working continuously on the system to improve the quality of service. Accounting is not enabled, and users are expected to perform their tests and benchmarks with care. 
We ask that you report any issues, and ask any questions, on the CSCS Service Desk.

Calendar and key events

Todi will be deployed in several stages.

  • Early Access: during early access Todi provides 1 login node and 17 GH200 compute nodes.
  • Scale out: Todi will be scaled up to full production size.
  • Return to early access: Todi will be returned to early access users with ca. 825 compute nodes; vClusters for the various platforms (High Performance Computing Platform, Machine Learning Platform and Weather and Climate Platform) will then be spun out from this system.


Regular Interventions on the system

Maintenance takes place between 10am and 2pm every Wednesday. During this time CSCS engineers perform larger upgrades and modifications. While we try to keep the system available during maintenance, please note that you might experience some instability.

Change log

Job requeue has been disabled by default. See https://slurm.schedmd.com/slurm.conf.html#OPT_JobRequeue
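
If your workflow relies on requeueing, you can still request it explicitly for individual jobs with the standard Slurm option, for example in your batch script:

#SBATCH --requeue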

The maximum wall time for the normal partition has been increased to 12 hours.

The aws-ofi-nccl plugin has been disabled. This means that containers using NCCL will not use it and will instead fall back to UDP, making the communication part of a job at least 10x slower. This is a temporary measure until the issues with the plugin are resolved.

Todi has been returned to service with ca. 2000 nodes in the normal and debug partitions. Jobs can be submitted without specifying a reservation.

The reservation daint is available again, exclusively for Daint users who need to run benchmarks for their project proposals, in addition to the debug partition and free nodes.
Please add the Slurm option --reservation=daint to your batch script if you want to use it.
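
For example, a minimal batch script header using this reservation (job name, time and account are placeholders):

#SBATCH --job-name=<job_name>
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --account=<account>
#SBATCH --reservation=daint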

New compute node image deployed to fix the issue with GPU-aware MPI.

The maximum job time limit has been decreased from 12 hours to 6 hours.

In order to complete the preparatory work necessary to deliver Alps in production, as of September 18 2024 the vCluster Daint on Alps will no longer be accessible until further notice. Early access will still be granted on Tödi using the Slurm reservation option --reservation=daint.

  A debug partition has been added for fast turnaround of short (30 min) jobs (--partition=debug). 

  The maintenance on Tödi has completed and the system has approximately 825 nodes available for use. Reservations have been removed. 

  A set of dedicated reservations is currently in place for specific projects in order to perform large-scale runs. You will be able to submit to --reservation=todi in between the dedicated reservations.

  Please run jobs on Tödi using the Slurm option --reservation=todi. The partition debug has been removed.

A dedicated reservation has been made available on Tödi for application benchmarking. Projects lp*a*, g154, g26, and sd36 can submit jobs with the Slurm option --reservation=benchmarking.

Minor changes to Slurm scheduling configuration to improve backfill.

 

  • The cray environment module was incorrectly loading the craype-x86-rome module rather than craype-arm-grace. This has now been corrected.

 

  • Batch Job Summary Report is now available: see below for further details.

The uenv tools provided on Todi have been updated. This update makes some significant changes that will affect user workflows.

There was a change to the database format used to record information about uenv images that have been pulled (uenv image pull):

  • an error message like the following will be printed the first time you use uenv: The repository /capstor/scratch/cscs/user/.uenv-images needs to be upgraded
  • run the command uenv repo upgrade to update your registry

The uenv module command was removed:

  • going forward, the same functionality will be provided as a view: uenv view modules or uenv start --view=modules prgenv-gnu/24.7:v2 are two ways to make modules available.
  • if you are using an old image you will have to manually load the modules once the uenv has started:
    • module use /user-environment/modules
    • manual loading is only required when uenv status does not list modules as an available view
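
For example, assuming the prgenv-gnu/24.7:v2 image mentioned above has already been pulled, modules can be made available as follows:

# start the uenv with the modules view enabled
$ uenv start --view=modules prgenv-gnu/24.7:v2
# alternatively, enable the view with the uenv view command
$ uenv view modules
# for older images without a modules view, load them manually
$ module use /user-environment/modules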

Known issues

NCCL 

We have identified critical issues with the NVIDIA Collective Communications Library (NCCL) on the Alps Slingshot network, which can impact performance and result in node failures on job completion. These issues affect all machine learning workflows that rely on NCCL, including those based on NGC containers as well as our PyTorch images provided by uenv. It also affects the materials modelling code VASP, which should not be run on multiple nodes until further notice. Most other conventional HPC codes use MPI, OpenMP or OpenACC for communication, and are thus not affected by this issue.

You should not run workflows that use NCCL across multiple nodes, including:
- NGC containers
- PyTorch images provided by uenv
- The materials modelling code VASP

Lustre

A bug in Lustre is causing the Capstor filesystem to hang intermittently. HPE is aware of the issue and working on a fix.

cray-mpich

There is a bug with cray-mpich that is triggered by inter-node CUDA-aware communication. The crashes occur in CUDA-aware MPI codes when the first call that exchanges data between GPU buffers on different nodes is made. This issue has been resolved.

Container runtime workaround

A change to libfabric has affected our container hooks. To run containers you will need to set $ENROOT_LIBRARY_PATH as a workaround until a permanent fix is deployed:

 ENROOT_LIBRARY_PATH=/capstor/scratch/cscs/fmohamed/enrootlibn srun --environment=...

Messages from syslogd@todi

You might see messages similar to the following when logged into a login node. There is a case open with the vendor.

Message from syslogd@todi-ln002 at Sep 11 15:14:58 ... kernel:[1042939.478037][T200921] LustreError: 200921:0:(statahead.c:268:sa_kill()) ASSERTION( sa_ready(entry) ) failed:
[todi][<username>@todi-ln002 ~]$
Message from syslogd@todi-ln002 at Aug 27 14:45:37 ...
 kernel:[688625.071349][    C0] BUG: workqueue lockup - pool cpus=80 node=1 flags=0x0 nice=0 stuck for 52s!

git clone via ssh (git@github.com) hangs on compute nodes

The compute nodes do not have direct internet access. The environment variables https_proxy, http_proxy and no_proxy enable most processes that access the internet to work.

Git clone via ssh, i.e. git clone git@github.com:org/repo.git, will still hang.
There are two fixes:

  • clone via https, i.e. git clone https://github.com/eth-cscs/uenv.git 
  • add the following to ~/.ssh/config
    Match Host *,!148.187.0.0/16,!172.28.0.0/16,!10.0.0.0/8
        ProxyCommand nc -X connect -x proxy.cscs.ch:8080 %h %p

uenv: error: argument command: invalid choice: 'modules'

A recent upgrade of uenv changed how modules are dealt with. If you are using an old uenv without a modules view and need to use modules, you can do what uenv modules use used to do manually, by starting the uenv and then running:

module use /user-environment/modules/

Access

Log into Todi as you would for other CSCS systems: first configure your SSH keys (see the MFA documentation for more information), then log into Todi using the front end server ela.cscs.ch as a jump host:

$ ssh -J username@ela.cscs.ch username@todi.cscs.ch


Simplifying log in

To log in directly to Todi without first specifying ela.cscs.ch as a jump host, you can add some configuration to the ~/.ssh/config file on your laptop or PC:

~/.ssh/config
Host todi
    HostName todi.cscs.ch
    User cscsusername
    IdentityFile ~/.ssh/cscs-key
    ProxyJump ela.cscs.ch
    AddKeysToAgent yes
    ForwardAgent yes

Where cscsusername is your CSCS account name.

Now you can access Todi directly from your laptop or PC:

ssh todi

Beware: with this approach you can't ssh between login nodes of different systems; for example, you won't be able to ssh eiger from a Todi login node.

Cluster Specifications

All nodes are identical, with 4 Grace-Hopper modules per node. Specifically:

  • login nodes: 4 repurposed GH compute nodes (todi-ln00[1-4])
  • compute nodes:
    The number of compute nodes will change over time; you can get an up-to-date count with the sinfo command (see the example below). The majority of the nodes are provided in a single Slurm partition called normal, and the nodes are also available in a Slurm reservation called todi. A smaller number is accessible through the Slurm partition debug (the debug partition is temporarily removed).
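
For example, a summarised view of the partitions with their current node counts and states can be obtained with:

$ sinfo -s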

Each node has approximately 800GB of free memory accessible from all sockets. Each Grace CPU has 72 cores with the following specification:

  • Arm V9.0 ISA compliant aarch64 (Neoverse V2 “Demeter” architecture)
  • Full SVE-2 Vector Extensions support, inclusive of NEON instructions
  • Supports 48-bit virtual and 48-bit physical address space

Each Hopper GPU has 96GB of RAM. NVLink provides all-to-all cache-coherent access between all host and device memory.

A login node is a shared resource. Do not run compute-intensive jobs on a login node, and do not start the CUDA MPS service there, as you might impact the work of others.

Slurm configuration

Currently there is no fair share policy applied.

Configuration    Value
Job duration     max. 12H
Job size         unlimited

Nodes are not shared and at least 1 node must be allocated for your job.

Running jobs

In the current Slurm configuration, OMP threads are placed consecutively on the cores and MPI ranks are placed in round-robin fashion across the 4 sockets. The example below demonstrates the output of running 8 MPI ranks with 4 cores per rank:

$ srun -N1 -c4 -n8 ./a.out
MPI rank : 6, OMP thread id : 0, CPU core id : 149
MPI rank : 6, OMP thread id : 3, CPU core id : 148
MPI rank : 6, OMP thread id : 1, CPU core id : 150
MPI rank : 6, OMP thread id : 2, CPU core id : 151
MPI rank : 0, OMP thread id : 1, CPU core id : 2
MPI rank : 0, OMP thread id : 0, CPU core id : 1
MPI rank : 0, OMP thread id : 3, CPU core id : 0
MPI rank : 0, OMP thread id : 2, CPU core id : 3
MPI rank : 4, OMP thread id : 0, CPU core id : 5
MPI rank : 4, OMP thread id : 2, CPU core id : 7
MPI rank : 4, OMP thread id : 3, CPU core id : 4
MPI rank : 4, OMP thread id : 1, CPU core id : 6
MPI rank : 5, OMP thread id : 0, CPU core id : 76
MPI rank : 5, OMP thread id : 3, CPU core id : 79
MPI rank : 5, OMP thread id : 2, CPU core id : 78
MPI rank : 5, OMP thread id : 1, CPU core id : 77
MPI rank : 1, OMP thread id : 0, CPU core id : 73
MPI rank : 1, OMP thread id : 3, CPU core id : 72
MPI rank : 1, OMP thread id : 2, CPU core id : 75
MPI rank : 1, OMP thread id : 1, CPU core id : 74
MPI rank : 3, OMP thread id : 0, CPU core id : 216
MPI rank : 3, OMP thread id : 1, CPU core id : 217
MPI rank : 3, OMP thread id : 2, CPU core id : 218
MPI rank : 3, OMP thread id : 3, CPU core id : 219
MPI rank : 7, OMP thread id : 0, CPU core id : 220
MPI rank : 7, OMP thread id : 3, CPU core id : 221
MPI rank : 7, OMP thread id : 1, CPU core id : 223
MPI rank : 7, OMP thread id : 2, CPU core id : 222
MPI rank : 2, OMP thread id : 0, CPU core id : 147
MPI rank : 2, OMP thread id : 3, CPU core id : 144
MPI rank : 2, OMP thread id : 2, CPU core id : 145
MPI rank : 2, OMP thread id : 1, CPU core id : 146
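
If you want to verify the binding that Slurm applies, the standard --cpu-bind=verbose option makes srun report the mask used for each task, for example:

$ srun -N1 -c4 -n8 --cpu-bind=verbose ./a.out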

Oversubscription of GPU cards

If you want to share GPU card(s) between multiple MPI ranks, you currently need to start the Multi-Process Service (MPS) daemon on the node yourself, because the CRAY_CUDA_MPS variable is no longer supported. To do so, you can use a simple wrapper script:

MPS wrapper script
#!/bin/bash
# Example mps-wrapper.sh usage:
# > srun [srun args] mps-wrapper.sh [cmd] [cmd args]
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
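# Map each local rank to one of the 4 GPUs on the node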
export CUDA_VISIBLE_DEVICES=$(( SLURM_LOCALID % 4 ))
# Launch MPS from a single rank per node
if [ $SLURM_LOCALID -eq 0 ]; then
    CUDA_VISIBLE_DEVICES=0,1,2,3 nvidia-cuda-mps-control -d
fi
# Wait for MPS to start
sleep 5
# Run the command
"$@"
# Quit MPS control daemon before exiting
if [ $SLURM_LOCALID -eq 0 ]; then
    echo quit | nvidia-cuda-mps-control
fi

and run your code using the following sample Slurm script:

Batch submission script
#!/bin/bash -l
#SBATCH --job-name=<job_name>
#SBATCH --time=01:30:00 #HH:MM:SS
#SBATCH --nodes=2
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks-per-node=32 #32 MPI ranks per node
#SBATCH --cpus-per-task=8 #8 OMP threads per rank
#SBATCH --account=<account> 
#SBATCH --hint=nomultithread 
#SBATCH --exclusive

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MPICH_MALLOC_FALLBACK=1

ulimit -s unlimited

srun --cpu-bind=socket ./mps-wrapper.sh <code> <args>

The batch job summary report provided automatically on Piz Daint is not available on Todi. For project proposals that need a batch job summary report, please use the jobreport tool discussed here.

Programming Environment

UENV

Uenv are used to provide programming environments and application software. Please refer to the main uenv documentation for detailed information on how to use the uenv tools installed on the system.
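
As a minimal sketch of the typical workflow, using the prgenv-gnu/24.7:v2 image mentioned elsewhere on this page purely as an illustration:

# list the uenv images available for the system
$ uenv image find
# pull an image into your local repository
$ uenv image pull prgenv-gnu/24.7:v2
# start a shell with the environment active
$ uenv start prgenv-gnu/24.7:v2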

On Todi the following uenv are provided:

Coming soon:

Cray Programming Environment (CPE)

CPE is provided on Todi; however, CSCS does not officially support CPE or provide software built with CPE on the system.

The supported method for building software will be uenv (see above).

This is a key difference from Piz Daint.


To enable Cray Programming Environment (CPE) on Todi, please run:

$ module load cray
$ module load PrgEnv-cray
$ module load craype-arm-grace

# to check that PE is loaded:
$ ftn --version
Cray Fortran : Version 17.0.0

Container engine

The Container Engine is available on Todi. Please see the Container Engine page for details on using container images with it.
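
Containers are typically selected with the --environment option to srun, as also used in the workaround above. A minimal sketch, assuming an environment definition named my-image has already been created as described on that page:

$ srun --environment=my-image nvidia-smi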

File systems

The following mount points are available on Todi:

  • /users/$USER - $HOME
  • /capstor/scratch/cscs/$USER - $SCRATCH
  • /store
  • /capstor/cscs/apps

You will always have access to /users and /capstor/scratch/cscs. Access to the other locations will depend on the allocations set in your project.

To check your usage (data volume and number of inodes), you can run the quota command on the login nodes.
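
For example, on a login node:

$ quota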
