Daint (`daint.alps`) is a vCluster deployed on Alps for the HPC Platform. It replaces the Cray XC Piz Daint, which has reached end-of-life.
Maintenance
- Tuesday morning 8-12 CET is reserved for periodic updates, with services potentially unavailable during this timeframe
- Exceptional and non-disruptive updates may happen outside this time frame and will be announced to the users mailing list
Change log
The version of `uenv` on `daint`, `clariden` and `santis` was updated this morning with new features, bug fixes and other improvements. It also introduces some breaking changes, the main one being that the `uenv view` command is no longer supported: provide the view when starting the environment instead, e.g. `uenv start --view=gromacs gromacs/2024:v1`. Please check the updated documentation available on the dedicated page uenv user environments.
The Slurm queue `xfer` is available for internal transfers on Daint (see the example job script below). The data mover nodes temporarily mount with read-only access `/project.CrayXC`, `/store.CrayXC` and `/users.CrayXC`, which correspond to the GPFS mount points `/project`, `/store` and `/users` currently available on the Piz Daint XC login nodes.
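A minimal sketch of an `xfer` transfer job, assuming you want to copy data from the read-only XC mount points to your scratch folder on Alps; the job name, account and source path are placeholders:

```bash
#!/bin/bash -l
#SBATCH --job-name=transfer        # illustrative job name
#SBATCH --partition=xfer           # internal transfer queue
#SBATCH --time=12:00:00
#SBATCH --nodes=1
#SBATCH --account=<account>

# Copy a directory from the read-only Piz Daint XC mount point to $SCRATCH
cp -r /project.CrayXC/<path_to_data> "$SCRATCH"/
```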
Known issues
Quota command
The quota command is unavailable on the login nodes.
Jobs silently crashing in Slurm prolog
We are investigating an issue where interactive jobs are silently crashing in the Slurm prolog and being removed from the queue.
```
[daint][user@daint-ln001 ~]$ srun -p debug --pty bash
srun: job 152105 queued and waiting for resources
<hanging, but the job has been cancelled by slurm>
```
Access
Log in as you would for other CSCS systems, by first configuring your SSH keys (see the MFA documentation for more information), then log into Daint via the front end server `ela.cscs.ch`:

```
$ ssh -A ela.cscs.ch
$ ssh daint.alps.cscs.ch
```
Simplifying log in
To log in directly to Daint without first logging into `ela.cscs.ch`, you can add the following configuration to the `~/.ssh/config` file on your laptop or PC:

```
Host ela
    HostName ela.cscs.ch
    User <<username>>
    IdentityFile ~/.ssh/cscs-key

Host daint.alps
    HostName daint.alps.cscs.ch
    User <<username>>
    IdentityFile ~/.ssh/cscs-key
    ProxyJump ela
    AddKeysToAgent yes
    ForwardAgent yes
```
Where `<<username>>` is your CSCS account name.
Now you can access Daint directly from your laptop or PC:
```
ssh daint.alps
```
Cluster Specifications
All nodes are identical, with 4 Grace-Hopper modules per node. Specifically:
- User Access Node (UAN): 4 repurposed GH compute nodes that serve as login nodes (daint-ln00[1-4])
- Compute Node (CN): The number of compute nodes will change over time. You can get an up-to-date number using the command `sinfo -s` on the UAN (see the example after this list). The majority of the nodes are provided in the Slurm partition `normal`, while a smaller number is accessible through the partition `debug`, meant for short test jobs with a quick turnaround
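For example, on a login node (the `-p` filter below is optional and only restricts the summary to the named partitions):

```bash
# Summarized partition and node-count overview
sinfo -s
# Restrict the summary to the normal and debug partitions
sinfo -s -p normal,debug
```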
Each node has approximately 800GB of free memory accessible from all sockets. Each Grace CPU has 72 cores with the following specification:
- Arm V9.0 ISA compliant aarch64 (Neoverse V2 “Demeter” architecture)
- Full SVE-2 Vector Extensions support, inclusive of NEON instructions
- Supports 48-bit virtual and 48-bit physical address space
Each Hopper GPU has 96GB of RAM. NVLINK provides all-to-all cache-coherent memory between all host and device memory.
A login node is a shared resource. Do not run compute-intensive jobs on a login node, and do not start the CUDA MPS service there, as you might impact the work of others. Please have a look at the Policies that apply on CSCS computing systems.
Programming Environment
uenv
User environments (`uenv`) are used to provide programming environments and application software. Please refer to the `uenv` documentation for detailed information on how to use the `uenv` tools on the system.
You can list the `uenv` provided on the system with the command `uenv image find`. A non-exhaustive list of the software provided by `uenv` images currently available on the system is shown below:
Please note that `uenv` images provided on the system Todi should also work on Daint. They can be accessed via `CLUSTER_NAME=todi uenv image find`. Please always prepend `CLUSTER_NAME=todi` to `uenv` commands that should address an image available on Todi.
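A minimal sketch of a typical workflow, assuming the `uenv image pull` subcommand described in the uenv documentation; the GROMACS image name is taken from the change log above and is only illustrative:

```bash
# List the uenv images available on Daint
uenv image find
# List images provided for Todi, which should also work on Daint
CLUSTER_NAME=todi uenv image find
# Pull an image and start it with a view
uenv image pull gromacs/2024:v1
uenv start --view=gromacs gromacs/2024:v1
```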
Cray Programming Environment (CPE)
CPE is provided on Daint, however CSCS does not officially support CPE or provide software built with CPE on the system.
The supported method for building software is uenv (see above): this is a key difference from the previous system Piz Daint (Cray XC).
To enable Cray Programming Environment (CPE), please run
```
$ module load cray
$ module load PrgEnv-cray
# to check that CPE is loaded:
$ ftn --version
Cray Fortran : Version 17.0.0
```
Container engine
The Container Engine (CE) is available on the system: this toolset is designed to enable computing jobs to seamlessly run inside Linux application containers, thus providing support for containerized user environments. Please see the dedicated page to use CE.
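As a rough sketch, assuming the environment definition file (EDF) format and the Slurm `--environment` option described in the CE documentation; the image reference, file name and location are illustrative and may need adjusting:

```bash
# Minimal environment definition file (EDF); image format per the CE documentation
mkdir -p ~/.edf
cat > ~/.edf/ubuntu.toml << 'EOF'
image = "library/ubuntu:24.04"
EOF

# Run a command inside the container through Slurm
srun --environment=ubuntu --pty bash
```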
File systems
The following mount points are available:
- `/users/$USER` - `$HOME`
- `/capstor/scratch/cscs/$USER` - `$SCRATCH`
- `/capstor/store` - `$PROJECT`
The environment variables `$HOME` and `$SCRATCH` give you access to the user dedicated folders `/users/$USER` and `/capstor/scratch/cscs/$USER` respectively. The `$PROJECT` environment variable targets your personal folder `/capstor/store/cscs/userlab/<group_id>/$USER`, only for UserLab customers on the `capstor` storage: please note that users need to create their own sub-folders under the project folder, as they are not created automatically (see the example below).
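A minimal sketch of creating the project folders yourself; the sub-folder name is only an example:

```bash
# $PROJECT points to /capstor/store/cscs/userlab/<group_id>/$USER,
# which is not created automatically
mkdir -p "$PROJECT"
# Create any sub-folders you need under it
mkdir -p "$PROJECT/my_project_data"
```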
Please check your occupancy (data volume and number of inodes) on the different file systems with the `quota` command, which is currently available on the front end Ela.
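For example, since `quota` is not available on the Daint login nodes (see the known issues above), you can run it interactively on Ela:

```bash
# Log in to the front end Ela, then run quota there
ssh ela.cscs.ch
quota
```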
Slurm configuration
Currently there is no fair share policy applied.
| Partition | Max time | Max nodes | Brief description |
|---|---|---|---|
| normal | 24 h | 568 | Standard queue for production work |
| debug | 30 min | 10 | Quick turnaround for test jobs (one per user) |
| xfer | 24 h | 1 | Internal transfer queue |
Nodes are not shared (except for the `xfer` queue) and at least 1 node must be allocated for your job.
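A minimal batch script for the `debug` partition, as a sketch; the job name and account are placeholders:

```bash
#!/bin/bash -l
#SBATCH --job-name=test            # illustrative job name
#SBATCH --partition=debug          # quick turnaround, max 30 min and 10 nodes
#SBATCH --nodes=1
#SBATCH --time=00:10:00
#SBATCH --account=<account>

srun hostname
```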
Running jobs
In the current Slurm configuration, OMP threads are placed consecutively on the cores and MPI ranks are placed in a round-robin fashion between the 4 sockets. The example below demonstrates running 8 MPI ranks with 4 cores per rank:
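A minimal sketch, assuming a single node; `./my_app` and the account are placeholders, and `--cpu-bind=verbose` only reports the resulting binding:

```bash
#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8        # 8 MPI ranks, placed round-robin over the 4 sockets
#SBATCH --cpus-per-task=4          # 4 consecutive cores per rank
#SBATCH --hint=nomultithread
#SBATCH --account=<account>

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Print the CPU binding chosen by Slurm for each rank
srun --cpu-bind=verbose ./my_app
```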
Oversubscription of GPU cards
If you want to share GPU card(s) between multiple MPI ranks, you currently need to start the multi-process daemon (MPS) on the node yourself, because the `CRAY_CUDA_MPS` variable is no longer supported. To do so, you can use a simple wrapper script:
```bash
#!/bin/bash
# Example mps-wrapper.sh usage:
# > srun [srun args] mps-wrapper.sh [cmd] [cmd args]
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
export CUDA_VISIBLE_DEVICES=$(( SLURM_LOCALID % 4 ))
# Launch MPS from a single rank per node
if [ $SLURM_LOCALID -eq 0 ]; then
    CUDA_VISIBLE_DEVICES=0,1,2,3 nvidia-cuda-mps-control -d
fi
# Wait for MPS to start
sleep 5
# Run the command
"$@"
# Quit MPS control daemon before exiting
if [ $SLURM_LOCALID -eq 0 ]; then
    echo quit | nvidia-cuda-mps-control
fi
```
and run your code using the following sample Slurm script:
```bash
#!/bin/bash -l
#SBATCH --job-name=<job_name>
#SBATCH --time=01:30:00            # HH:MM:SS
#SBATCH --nodes=2
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks-per-node=32       # 32 MPI ranks per node
#SBATCH --account=<account>
#SBATCH --hint=nomultithread
#SBATCH --hint=exclusive

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MPICH_MALLOC_FALLBACK=1
ulimit -s unlimited

srun --cpu-bind=socket ./mps-wrapper.sh <code> <args>
```