CP2K is a program to perform atomistic and molecular simulations of solid state, liquid, molecular and biological systems. CP2K is optimized for the mixed Gaussian and Plane-Waves (GPW) method based on pseudopotentials, but is able to run all-electron or pure plane-wave/Gaussian calculations as well.
It is written in Fortran 2008 and can be run efficiently on massively parallel systems. Key characteristics of CP2K:
- Density functional theory calculations
- Post Hartree-Fock (MP2, RPA) energies and gradients
- Hybrid and double-hybrid DFT
- Time-dependent DFT
- Various vibrational and electronic spectroscopies
- xTB tight-binding and classical force-fields
Please have a look at the detailed overview of features on the program web site.
Licensing Terms and Conditions
The source of CP2K is freely available under the GPL license: the code can be downloaded from the CP2K repository. Publications of results obtained with CP2K should acknowledge its use by an appropriate citation (see CP2K Homepage).
ALPS (GH200)
Setup
On Alps, CP2K is precompiled and available in a user environment (uenv), together with the libraries it is built with.
To find out which CP2K uenv is provided, you can use the following command:
```bash
uenv image find cp2k
```
To pull and start the uenv for a specific version of CP2K, you can use:

```bash
uenv image pull cp2k/2024.3:v1
uenv start cp2k/2024.3:v1
```
After pulling the image, you will find a new folder containing the file `store.squashfs` under `$SCRATCH/.uenv-images/images`.
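As a quick sanity check, you can start the uenv interactively and verify that the executable is found. This is only a sketch: it assumes the image provides the `cp2k` view used in the batch script below, and note that `uenv start` drops you into a new shell with the environment mounted.

```bash
# start the uenv with the cp2k view enabled (opens a new shell)
uenv start --view=cp2k cp2k/2024.3:v1

# inside that shell, the executable should be on the PATH
which cp2k.psmp       # should resolve to a path inside the uenv
cp2k.psmp --version   # prints the CP2K version information
```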
You can also use the Slurm `--uenv` option to start the environment inside a Slurm batch script, as shown in the template provided in the next section. Please note that you need to pull the corresponding image before being able to use it in your Slurm batch script.
How to run
To start a job, two bash scripts are typically required: a standard Slurm submission script, and a wrapper that starts the CUDA MPS daemon (needed to run multiple MPI ranks per GPU):
```bash
#!/bin/bash -l
#SBATCH --job-name=<job_name>
#SBATCH --time=01:30:00          # HH:MM:SS
#SBATCH --nodes=4
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks-per-node=32     # Number of MPI ranks per node
#SBATCH --cpus-per-task=8        # Number of OMP threads per rank
#SBATCH --account=<account>
#SBATCH --hint=nomultithread
#SBATCH --hint=exclusive
#SBATCH --no-requeue
#SBATCH --uenv=cp2k/2024.2:v1
#SBATCH --view=cp2k

# set environment
export CP2K_DATA_DIR=<absolute/path/to/CP2K/data/dir>
export CUDA_CACHE_PATH="/dev/shm/$RANDOM"
export MPICH_GPU_SUPPORT_ENABLED=1
export MPICH_MALLOC_FALLBACK=1
export OMP_NUM_THREADS=$((SLURM_CPUS_PER_TASK - 1))
ulimit -s unlimited

# run
srun --cpu-bind=socket ./mps-wrapper.sh cp2k.psmp -i input.inp -o output.out
```
The `mps-wrapper.sh` script can be found in Oversubscription of GPU cards.
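For orientation, a minimal sketch of what such a wrapper typically contains is shown below: it starts one CUDA MPS control daemon per node and then runs the wrapped command on every rank. The authoritative script is the one documented in the page linked above; details such as the pipe and log directories may differ.

```bash
#!/bin/bash
# mps-wrapper.sh (sketch): enable CUDA MPS so several MPI ranks can share each GPU
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log

# only the first rank on each node launches the MPS control daemon
if [[ $SLURM_LOCALID -eq 0 ]]; then
    nvidia-cuda-mps-control -d
fi
sleep 1  # give the daemon time to start

# run the actual command, e.g. cp2k.psmp -i input.inp -o output.out
"$@"

# the first rank on each node shuts the daemon down again
if [[ $SLURM_LOCALID -eq 0 ]]; then
    echo quit | nvidia-cuda-mps-control
fi
```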
With the above scripts, you can launch a calculation on 4 nodes, with 32 MPI ranks per node and 8 OMP threads per rank, with the command `sbatch launch.sh`. You may need to make the wrapper script executable with `chmod +x mps-wrapper.sh` first. Make sure to replace `<account>` with your CSCS account name. Note that the `--cpu-bind=socket` Slurm option is absolutely necessary to get good performance.
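Putting it together, a typical submission sequence looks like this (assuming the submission script above is saved as `launch.sh`):

```bash
chmod +x mps-wrapper.sh   # make the wrapper executable (only needed once)
sbatch launch.sh          # submit the job to the queue
squeue --me               # optionally, check the status of your jobs
```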
Please note that the environment variable `CUDA_CACHE_PATH` is set to a random folder to avoid using the default path under `$HOME`.
There are 4 GPUs per node and 72 cores per GPU. It is very important that the number of MPI ranks passed to Slurm with `--ntasks-per-node` is a multiple of 4, so that ranks can be distributed evenly over the GPUs.
Note that in the example above, we use 32 MPI ranks with 8 OMP threads each, for a total of 64 cores per GPU and 256 cores per node. Experiments have shown that CP2K performs and scales better when the number of MPI ranks is a power of 2, even if some cores are left idle.
If you want to run the CP2K regression tests with the CP2K executable provided by the uenv, make sure to use the version of the regression tests corresponding to the version of CP2K provided by the uenv.
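For example, to obtain the regression tests matching a 2024.3 uenv, you can check out the corresponding release tag of the CP2K sources. This is only a sketch; the tag naming and the options of the test driver may differ between versions, so check its help output.

```bash
# fetch the CP2K sources at the release tag matching the uenv version (here 2024.3)
git clone --branch v2024.3 --depth 1 https://github.com/cp2k/cp2k.git
cd cp2k/tests
./do_regtest.py --help   # lists the options for pointing the test driver at the uenv-provided cp2k.psmp
```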
The CP2K GRID CUDA backend is currently bugged on Alps. Systems using basis sets with high angular momenta (l >= 3) are known to run extremely slowly, especially for force calculations and MGGA functionals. As a workaround, you can set the following in the &GLOBAL section of your input file:
```
&GRID
  BACKEND CPU
&END GRID
```
If you want to build your own user environment (e.g. for the CP2K development version), please consult this documentation.
If you want to build CP2K from your own source code, please follow these instructions.
Scaling
We ran three representative CP2K tests, taken from the official benchmark folder. These tests cover DFT molecular dynamics, MP2 energies, and RPA energies, and exercise a variety of different libraries in the process.
QS/H2O-1024.inp
This is a DFT molecular dynamics input file, which mostly relies on the DBCSR block-sparse matrix-matrix multiplication library for performance. Note that `H2O-1024.inp` is the largest such example that fits on a single GH200 node. All calculations were run with 32 MPI ranks per node and 8 OMP threads per rank (found to be the best configuration for this benchmark).
| N nodes | Wall time (s) | Speedup | Efficiency |
| --- | --- | --- | --- |
| 1 | 793.1 | 1.00 | 1.00 |
| 2 | 535.2 | 1.48 | 0.74 |
| 4 | 543.9 | 1.45 | 0.36 |
| 8 | 487.3 | 1.62 | 0.20 |
| 16 | 616.7 | 1.28 | 0.08 |
We note that scaling is not ideal on more than 2 nodes. While `H2O-1024.inp` is the largest such calculation that fits on a single node, it does so comfortably. Therefore, there is only so much room for speedups by adding more (very large) GH200 nodes. In terms of time to solution, 1 GH200 node corresponds to 32 Piz Daint nodes.
QS_mp2_rpa/128-H2O/H2O-128-RI-MP2-TZ.inp
This is an MP2 energy input for 128 water molecules with triple-zeta basis sets, a very large calculation. Note that this file is not distributed with CP2K, but is straightforwardly adapted from the existing `H2O-64-RI-MP2-TZ.inp` file. For reference, this is the input file used: H2O-128-RI-MP2-TZ.inp. All calculations for this scaling test were run with 32 MPI ranks per node and 8 OMP threads per rank. The smallest number of nodes on which this calculation fits is 8.
| N nodes | Wall time (s) | Speedup | Efficiency |
| --- | --- | --- | --- |
| 8 | 2037.0 | 1.00 | 1.00 |
| 16 | 1096.2 | 1.85 | 0.92 |
| 32 | 611.5 | 3.33 | 0.83 |
| 64 | 410.5 | 4.96 | 0.62 |
| 128 | 290.9 | 7.00 | 0.43 |
MP2 calculations scale well on GH200, up to a large number of nodes (> 50% efficiency with 64 nodes). In terms of time to solution, 1 GH200 node corresponds to ~16 Piz Daint nodes.
QS_mp2_rpa/128-H2O/H2O-128-RI-dRPA-TZ.inp
This is an RPA energy input, traditionally used to benchmark the performance of the COSMA library. This is also a very large calculation, which requires at least 8 GH200 nodes to run. The calculations were run with 16 MPI ranks per node and 16 OMP threads per rank; we observed that for RPA workloads, a higher ratio of threads per rank is beneficial.
| N nodes | Wall time (s) | Speedup | Efficiency |
| --- | --- | --- | --- |
| 8 | 575.4 | 1.00 | 1.00 |
| 16 | 465.8 | 1.23 | 0.61 |
| 32 | 281.1 | 2.04 | 0.51 |
| 64 | 205.3 | 2.80 | 0.35 |
| 128 | 185.8 | 3.09 | 0.19 |
This RPA input scales well up to 32 GH200 nodes. In terms of time to solution, a GH200 node corresponds to 16-32 Piz Daint nodes.
Known DBCSR issue
On the GH200 architecture, it has been observed that the GPU-accelerated version of DBCSR (CP2K's block-sparse matrix-matrix multiplication library) does not perform optimally in some cases. For example, in the `QS/H2O-1024.inp` benchmark above, CP2K clearly does not scale well beyond 2 nodes. The CPU implementation of DBCSR does not suffer from this. A workaround was implemented in DBCSR to switch GPU acceleration on or off with an environment variable. When `export DBCSR_RUN_ON_GPU=0` is added to the Slurm submission script, the scaling of the `QS/H2O-1024.inp` benchmark changes markedly.
While GPU acceleration is very good on a small number of nodes, the CPU implementation scales better. Therefore, for CP2K jobs running on a large number of nodes, it is worth investigating the use of the `DBCSR_RUN_ON_GPU` environment variable.
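For example, in the submission script shown earlier, the variable can simply be exported before the srun line:

```bash
# fall back to the CPU implementation of DBCSR (the rest of CP2K still uses the GPUs)
export DBCSR_RUN_ON_GPU=0

srun --cpu-bind=socket ./mps-wrapper.sh cp2k.psmp -i input.inp -o output.out
```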
Additionally, some niche application cases, such as the `QS_low_scaling_postHF` benchmarks, only run efficiently with the CPU version of DBCSR. In this case, total runtimes can be reduced by half or more. Generally, if the function `dbcsr_multiply_generic` takes a significant portion of the timing report (at the end of the CP2K output file), it is also worth investigating the `DBCSR_RUN_ON_GPU` environment variable.