CP2K is a program to perform atomistic and molecular simulations of solid state, liquid, molecular and biological systems. CP2K is optimized for the mixed Gaussian and Plane-Waves (GPW) method based on pseudopotentials, but is able to run all-electron or pure plane-wave/Gaussian calculations as well.
It is written in Fortran 2008 and can be run efficiently on massively parallel systems. Key characteristics of CP2K:
- Density functional theory calculations
- Post Hartree-Fock (MP2, RPA) energies and gradients
- Hybrid and double-hybrid DFT
- Time-dependent DFT
- Various vibrational and electronic spectroscopies
- xTB tight-binding and classical force-fields
Please have a look at the detailed overview of features on the program web site.
Licensing Terms and Conditions
The source of CP2K is freely available under the GPL license: the code can be downloaded from the CP2K repository. Publications of results obtained with CP2K should acknowledge its use by an appropriate citation (see CP2K Homepage).
ALPS (GH200)
Setup
On Alps, CP2K is precompiled and available in a user environment (uenv), together with the libraries it is built with.
To find out which CP2K uenv is provided, you can use the following command:
```bash
uenv image find cp2k
```
To pull and start the uenv for a specific version of CP2K, you can use:

```bash
uenv image pull cp2k/2024.3:v1
uenv start cp2k/2024.3:v1
```
After pulling the image, you will find a new folder containing the file `store.squashfs` under `$SCRATCH/.uenv-images/images`.
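As a quick sanity check, you can start the uenv interactively and verify that the executable is found. This is only a sketch: it assumes the image provides the `cp2k` view used in the batch script below, and note that `uenv start` drops you into a new shell with the environment mounted.

```bash
# start the uenv with the cp2k view enabled (opens a new shell)
uenv start --view=cp2k cp2k/2024.3:v1

# inside that shell, the executable should be on the PATH
which cp2k.psmp       # should resolve to a path inside the uenv
cp2k.psmp --version   # prints the CP2K version information
```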
You can also use the Slurm `--uenv` option to start the environment inside a Slurm batch script, as shown in the template provided in the next section. Please note that you need to pull the corresponding image before being able to use it in your Slurm batch script.
How to run
To start a job, two bash scripts are typically required: a standard Slurm submission script, and a wrapper that starts the CUDA MPS daemon (needed to run multiple MPI ranks per GPU):
```bash
#!/bin/bash -l
#SBATCH --job-name=<job_name>
#SBATCH --time=01:30:00          # HH:MM:SS
#SBATCH --nodes=4
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks-per-node=32     # Number of MPI ranks per node
#SBATCH --cpus-per-task=8        # Number of OMP threads per rank
#SBATCH --account=<account>
#SBATCH --hint=nomultithread
#SBATCH --hint=exclusive
#SBATCH --no-requeue
#SBATCH --uenv=cp2k/2024.2:v1
#SBATCH --view=cp2k

# set environment
export CP2K_DATA_DIR=<absolute/path/to/CP2K/data/dir>
export CUDA_CACHE_PATH="/dev/shm/$RANDOM"
export MPICH_GPU_SUPPORT_ENABLED=1
export MPICH_MALLOC_FALLBACK=1
export OMP_NUM_THREADS=$((SLURM_CPUS_PER_TASK - 1))
ulimit -s unlimited

# run
srun --cpu-bind=socket ./mps-wrapper.sh cp2k.psmp -i input.inp -o output.out
```
The `mps-wrapper.sh` script can be found in Oversubscription of GPU cards.
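For orientation, a minimal sketch of what such a wrapper typically contains is shown below: it starts one CUDA MPS control daemon per node and then runs the wrapped command on every rank. The authoritative script is the one documented in the page linked above; details such as the pipe and log directories may differ.

```bash
#!/bin/bash
# mps-wrapper.sh (sketch): enable CUDA MPS so several MPI ranks can share each GPU
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log

# only the first rank on each node launches the MPS control daemon
if [[ $SLURM_LOCALID -eq 0 ]]; then
    nvidia-cuda-mps-control -d
fi
sleep 1  # give the daemon time to start

# run the actual command, e.g. cp2k.psmp -i input.inp -o output.out
"$@"

# the first rank on each node shuts the daemon down again
if [[ $SLURM_LOCALID -eq 0 ]]; then
    echo quit | nvidia-cuda-mps-control
fi
```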
With the above scripts, you can launch a calculation on 4 nodes, with 32 MPI ranks per node and 8 OMP threads per rank, with the command `sbatch launch.sh`. You may need to make the wrapper script executable with `chmod +x mps-wrapper.sh` first. Make sure to replace `<account>` with your CSCS account name. Note that the `--cpu-bind=socket` Slurm option is absolutely necessary to get good performance.
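Putting it together, a typical submission sequence looks like this (assuming the submission script above is saved as `launch.sh`):

```bash
chmod +x mps-wrapper.sh   # make the wrapper executable (only needed once)
sbatch launch.sh          # submit the job to the queue
squeue --me               # optionally, check the status of your jobs
```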
Please note that the environment variable `CUDA_CACHE_PATH` is set to a random folder to avoid using the default path under `$HOME`.
There are 4 GPUs per node and 72 cores per GPU. It is very important that the number of MPI ranks passed to Slurm with `--ntasks-per-node` is a multiple of 4, so that ranks can be distributed evenly over the GPUs.
Note that in the example above, we use 32 MPI ranks with 8 OMP threads each, for a total of 64 cores per GPU and 256 cores per node. Experiments have shown that CP2K performs and scales better when the number of MPI ranks is a power of 2, even if some cores are left idle.
If you want to run the CP2K regression tests with the CP2K executable provided by the uenv, make sure to use the version of the regression tests corresponding to the version of CP2K provided by the uenv.
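For example, to obtain the regression tests matching a 2024.3 uenv, you can check out the corresponding release tag of the CP2K sources. This is only a sketch; the tag naming and the options of the test driver may differ between versions, so check its help output.

```bash
# fetch the CP2K sources at the release tag matching the uenv version (here 2024.3)
git clone --branch v2024.3 --depth 1 https://github.com/cp2k/cp2k.git
cd cp2k/tests
./do_regtest.py --help   # lists the options for pointing the test driver at the uenv-provided cp2k.psmp
```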
The CP2K GRID CUDA backend is currently bugged on Alps. Systems using basis sets with high angular momenta (l >= 3) are known to run extremely slowly, especially for force calculations and MGGA functionals. As a workaround, you can set the following in the &GLOBAL section of your input file:
```
&GRID
  BACKEND CPU
&END GRID
```
If you want to build your own user environment (e.g. for the CP2K development version), please consult this documentation.
If you want to build CP2K from your own source code, please follow these instructions.
Scaling
We ran three representative CP2K tests, taken from the official benchmark folder. These tests cover DFT molecular dynamics, MP2 energies, and RPA energies, and exercise a variety of different libraries in the process.
QS/H2O-1024.inp
This is a DFT molecular dynamics input file, which mostly relies on the DBCSR block-sparse matrix-matrix multiplication library for performance. Note that `H2O-1024.inp` is the largest such example that fits on a single GH200 node. All calculations were run with 32 MPI ranks per node and 8 OMP threads per rank (found to be the best configuration for this benchmark).
| N nodes | Wall time (s) | Speedup | Efficiency |
| --- | --- | --- | --- |
| 1 | 793.1 | 1.00 | 1.00 |
| 2 | 535.2 | 1.48 | 0.74 |
| 4 | 543.9 | 1.45 | 0.36 |
| 8 | 487.3 | 1.62 | 0.20 |
| 16 | 616.7 | 1.28 | 0.08 |
We note that scaling is not ideal on more than 2 nodes. While `H2O-1024.inp` is the largest such calculation that fits on a single node, it does so comfortably. Therefore, there is only so much room for speedups by adding more (very large) GH200 nodes. In terms of time to solution, 1 GH200 node corresponds to 32 Piz Daint nodes.
QS_mp2_rpa/128-H2O/H2O-128-RI-MP2-TZ.inp
This is an MP2 energy input for 128 water molecules with triple-zeta basis sets, a very large calculation. Note that this file is not distributed with CP2K, but is straightforwardly adapted from the existing `H2O-64-RI-MP2-TZ.inp` file. For reference, this is the input file used: H2O-128-RI-MP2-TZ.inp. All calculations for this scaling test were run with 32 MPI ranks per node and 8 OMP threads per rank. The smallest number of nodes on which this calculation fits is 8.
| N nodes | Wall time (s) | Speedup | Efficiency |
| --- | --- | --- | --- |
| 8 | 2037.0 | 1.00 | 1.00 |
| 16 | 1096.2 | 1.85 | 0.92 |
| 32 | 611.5 | 3.33 | 0.83 |
| 64 | 410.5 | 4.96 | 0.62 |
| 128 | 290.9 | 7.00 | 0.43 |
MP2 calculations scale well on GH200, up to a large number of nodes (> 50% efficiency with 64 nodes). In terms of time to solution, 1 GH200 node corresponds to ~16 Piz Daint nodes.
QS_mp2_rpa/128-H2O/H2O-128-RI-dRPA-TZ.inp
This is an RPA energy input, traditionally used to benchmark the performance of the COSMA library. This is also a very large calculation, which requires at least 8 GH200 nodes to run. The calculations were run with 16 MPI ranks per node and 16 OMP threads per rank; we observed that for RPA workloads, a higher ratio of threads per rank is beneficial.
| N nodes | Wall time (s) | Speedup | Efficiency |
| --- | --- | --- | --- |
| 8 | 575.4 | 1.00 | 1.00 |
| 16 | 465.8 | 1.23 | 0.61 |
| 32 | 281.1 | 2.04 | 0.51 |
| 64 | 205.3 | 2.80 | 0.35 |
| 128 | 185.8 | 3.09 | 0.19 |
This RPA input scales well up to 32 GH200 nodes. In terms of time to solution, a GH200 node corresponds to 16-32 Piz Daint nodes.
Known DBCSR issue
On the GH200 architecture, it has been observed that the GPU-accelerated version of DBCSR (CP2K's block-sparse matrix-matrix multiplication library) does not perform optimally in some cases. For example, in the `QS/H2O-1024.inp` benchmark above, CP2K clearly does not scale well beyond 2 nodes. The CPU implementation of DBCSR does not suffer from this. A workaround was implemented in DBCSR to switch GPU acceleration on or off with an environment variable. When `export DBCSR_RUN_ON_GPU=0` is added to the Slurm submission script, the scaling of the `QS/H2O-1024.inp` benchmark changes markedly.
While GPU acceleration is very good on a small number of nodes, the CPU implementation scales better. Therefore, for CP2K jobs running on a large number of nodes, it is worth investigating the use of the `DBCSR_RUN_ON_GPU` environment variable.
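For example, in the submission script shown earlier, the variable can simply be exported before the srun line:

```bash
# fall back to the CPU implementation of DBCSR (the rest of CP2K still uses the GPUs)
export DBCSR_RUN_ON_GPU=0

srun --cpu-bind=socket ./mps-wrapper.sh cp2k.psmp -i input.inp -o output.out
```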
Additionally, some niche application cases, such as the `QS_low_scaling_postHF` benchmarks, only run efficiently with the CPU version of DBCSR. In this case, total runtimes can be reduced by half or more. Generally, if the function `dbcsr_multiply_generic` takes a significant portion of the timing report (at the end of the CP2K output file), it is also worth investigating the `DBCSR_RUN_ON_GPU` environment variable.