Santis (santis.alps.cscs.ch) is the vCluster of the User Lab deployed on the Alps infrastructure as part of the climate and weather platform. Its name derives from Säntis, the highest mountain in the Alpstein massif of northeastern Switzerland.
Latest Updates
Maintenances
- Regular activities
  - When: Wednesday mornings from 08h00 to 12h00 CET
  - What: mainly adjustment of service deployments on Santis
  - Potential service interruptions can occur during these activities
- Exceptional activities
  - When: anytime
  - What: e.g., urgent security patches or work affecting the larger infrastructure
  - Announcement via mailing list, status page, etc.
Change log
No recent updates.
Known issues
ICON jobs
- Jobs hanging in I/O
  - Certain ICON jobs seem to encounter an issue when restarted from a previous snapshot. We are investigating possible root causes.
- Jobs hanging without crashing for specific node counts
  - Currently observed with coupled ICON simulations.
  - Try setting the following two environment variables for the libfabric of the Slingshot high-speed network (see the snippet after this list for how to apply them in a job script):

```bash
export FI_CXI_OFLOW_BUF_COUNT=10  # Number of CXI overflow buffers allocated to hold unexpected messages. Each buffer holds FI_CXI_OFLOW_BUF_SIZE bytes. Only applies to Slingshot 11.
export FI_MR_CACHE_MAX_COUNT=0    # Disable CXI provider memory registration (MR) caching.
```
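A minimal sketch of how these variables might be applied in practice; the executable name ./icon and the bare srun line are placeholders, not an official recommendation:

```bash
# In the batch script of the affected job, before the srun line:
export FI_CXI_OFLOW_BUF_COUNT=10   # extra CXI overflow buffers for unexpected messages
export FI_MR_CACHE_MAX_COUNT=0     # disable MR caching in the CXI provider

srun ./icon                        # placeholder: use your usual launch line
```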
Cluster specifications
Hardware
All nodes are identical, with 4 Grace-Hopper (GH) modules per node. Specifically:
- User Access Node (UAN):
  - Repurposed GH compute nodes that serve as login nodes (santis-ln00[1-4])
- Compute Node (CN):
  - The number of compute nodes will change over time. You can get an up-to-date number with the command `sinfo -s` on the UAN (see the example below).
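For example, running the following on a login node prints a summary of the partitions and their current node counts (the output changes over time, so none is reproduced here):

```bash
# Summarised view of partitions and node counts on Santis
sinfo -s
```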
A GH module consists of:
Name | Type | Compute | Memory |
---|---|---|---|
Grace | CPU | 72 ARM cores | 128GB LPDDR |
Hopper | GPU | - | 96GB HBM3 |
For more information, please also have a look at https://www.cscs.ch/computers/alps.
In the configuration of Santis, a GH node offers approximately 800GB of free, unified (CPU + GPU) memory: the four modules together provide 4 × (128GB LPDDR + 96GB HBM3) = 896GB of raw memory, of which roughly 800GB is accessible to applications.
Please note that the complete memory is accessible from all modules.
The ARM cores of a GH module have the following specifications:
- Arm V9.0 ISA compliant aarch64 (Neoverse V2 "Demeter" architecture)
- Full SVE-2 Vector Extensions support, inclusive of NEON instructions
- Supports 48-bit virtual and 48-bit physical address space
The Hopper GPUs are connected via NVLink, providing all-to-all, cache-coherent access to all host and device memory.
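As a hedged illustration of what this hardware means for building code (the compiler choice, flags, and file names are assumptions, not an official recommendation), binaries for a GH module are typically built for the Neoverse V2 cores and the Hopper GPU:

```bash
# Hypothetical build commands; adapt to the toolchain provided by your uenv
gcc  -O3 -mcpu=neoverse-v2 app.c  -o app_cpu   # Grace cores: Arm Neoverse V2 with SVE2
nvcc -O3 -arch=sm_90       app.cu -o app_gpu   # Hopper GPU: compute capability 9.0
```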
File systems
The following mount points can be found on Santis:
Mount point | Environment variable | File system | Features |
---|---|---|---|
/users/$USER | $HOME | NFS | Snapshot, Backup |
/capstor/scratch/cscs/$USER | $SCRATCH | Lustre | - |
/capstor/store/cscs/userlab/<GROUP_ID> | - | Lustre | Backup |
NB: On $SCRATCH a quota of 150TB and 1M inodes (files and folders) is applied. These are implemented as soft quotas, i.e., upon reaching either limit you are given a grace period of 1 week before write access to $SCRATCH is blocked for your user (you can still submit jobs).
Please make sure to check your quota regularly using the `quota` command; it is available on the login nodes of Santis as well as on the frontend ela.cscs.ch.
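For example (a trivial usage sketch; the exact output format depends on the site configuration and is not reproduced here):

```bash
# On a login node or on ela.cscs.ch: show current usage and limits
quota
```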
Slurm configuration
The following partitions are configured:
Partition | Max time | Max nodes | Comments |
---|---|---|---|
normal | 24h | - | Standard queue for production work |
debug | 30 min | 8 | Quick turnaround for development; 1 job per user |
xfer | 24h | 1 | Internal transfer of data between file systems and/or clusters |
NB: Nodes are not shared and at least 1 node must be allocated for your job (the exception is the xfer queue).
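As a hedged sketch of how these partitions are used (the account name is a placeholder, following the same <account> convention as the sample script further below, and my_job.sh is a hypothetical script name):

```bash
# Interactive allocation of one node in the debug partition for 30 minutes
salloc --partition=debug --nodes=1 --time=00:30:00 --account=<account>

# Submit a batch job to the normal partition instead
sbatch --partition=normal my_job.sh
```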
Task/Thread allocation on a node (see the launch sketch after this list):
- Threads (e.g. OpenMP) are placed consecutively on the cores.
- Tasks (e.g. MPI ranks) are placed in round-robin fashion between the 4 modules of the node.
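A minimal sketch of a hybrid MPI+OpenMP launch that follows this placement model; the executable ./app and the rank/thread counts are assumptions for illustration only:

```bash
# One MPI rank per GH module (4 per node), OpenMP threads on consecutive cores
export OMP_NUM_THREADS=16          # hypothetical thread count per rank
export OMP_PROC_BIND=close         # keep threads on consecutive cores
srun --nodes=1 --ntasks-per-node=4 --cpus-per-task=$OMP_NUM_THREADS ./app
```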
Oversubscription of GPU cards
If you want to share GPU card(s) between multiple MPI ranks, you currently need to start the NVIDIA Multi-Process Service (MPS) daemon on the node yourself.
To do so, you can use a simple wrapper script:
```bash
#!/bin/bash
# Example mps-wrapper.sh usage:
# > srun [srun args] mps-wrapper.sh [cmd] [cmd args]
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
export CUDA_VISIBLE_DEVICES=$(( SLURM_LOCALID % 4 ))
# Launch MPS from a single rank per node
if [ $SLURM_LOCALID -eq 0 ]; then
    CUDA_VISIBLE_DEVICES=0,1,2,3 nvidia-cuda-mps-control -d
fi
# Wait for MPS to start
sleep 5
# Run the command
"$@"
# Quit MPS control daemon before exiting
if [ $SLURM_LOCALID -eq 0 ]; then
    echo quit | nvidia-cuda-mps-control
fi
```
and run your code using the following sample slurm script:
```bash
#!/bin/bash -l
#SBATCH --job-name=<job_name>
#SBATCH --time=01:30:00        # HH:MM:SS
#SBATCH --nodes=2
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks-per-node=32   # 32 MPI ranks per node
#SBATCH --account=<account>
#SBATCH --hint=nomultithread
#SBATCH --hint=exclusive

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MPICH_MALLOC_FALLBACK=1
ulimit -s unlimited

srun --cpu-bind=socket ./mps-wrapper.sh <code> <args>
```
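Assuming the script above is saved as submit_mps.sh (a hypothetical filename) and mps-wrapper.sh sits next to it, the job is submitted and monitored in the usual way:

```bash
chmod +x mps-wrapper.sh      # make the wrapper executable once
sbatch submit_mps.sh         # submit the sample job script
squeue -u $USER              # check the state of your jobs
```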
User environments
User environments (uenv) are used to provide programming environments and applications on the system. Please refer to the uenv user environments documentation in this knowledge base for more detailed information on how to use the uenv tooling on the system.
User environments might not be tagged for the system Santis, but for other vClusters.
If there is a uenv image on a different vCluster that you want to use on Santis, send a request to have it deployed on Santis.
It is also possible to directly use uenv images built for other systems using the @system syntax, for example:

```bash
uenv image find @todi
```

The above command, run on Santis, lists all images initially prepared for the vCluster Tödi.
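A hedged sketch of how this could combine with the rest of the uenv workflow; the image name is a placeholder and the exact subcommands should be checked against the uenv documentation:

```bash
# List images built for the Tödi vCluster, pull one, and start a shell in it
uenv image find @todi
uenv image pull prgenv-gnu/24.7:v1@todi    # placeholder image name/tag
uenv start prgenv-gnu/24.7:v1@todi
```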
Container engine
The Container Engine (CE) is available on the system: this toolset is designed to enable computing jobs to run seamlessly inside Linux application containers, thus providing support for containerized user environments. Please see the dedicated page on how to use the CE.
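As a hedged sketch only (the environment definition file format, its location, and the image reference below are assumptions to be verified against the CE documentation), a containerized run typically pairs a small environment definition file with a Slurm launch:

```bash
# Hypothetical environment definition file, e.g. ~/.edf/ubuntu.toml:
#   image  = "library/ubuntu:24.04"
#   mounts = ["/capstor/scratch/cscs/<user>:/capstor/scratch/cscs/<user>"]

# Run a command inside the container described by that environment
srun --environment=ubuntu cat /etc/os-release
```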