CP2K is a program to perform atomistic and molecular simulations of solid state, liquid, molecular and biological systems. CP2K is optimized for the mixed Gaussian and Plane-Waves (GPW) method based on pseudopotentials, but is able to run all-electron or pure plane-wave/Gaussian calculations as well.
It is written in Fortran 2008 and can be run efficiently on massively parallel systems. Key characteristics of CP2K:
- Density functional theory calculations
- Post Hartree-Fock (MP2, RPA) energies and gradients
- Hybrid and double-hybrid DFT
- Time-dependent DFT
- Various vibrational and electronic spectroscopies
- xTB tight-binding and classical force-fields
Please have a look at the detailed overview of features on the program web site.
Licensing Terms and Conditions
The source of CP2K is freely available under the GPL license: the code can be downloaded from the CP2K repository. Publications of results obtained with CP2K should acknowledge its use by an appropriate citation (see CP2K Homepage).
ALPS (GH200)
Setup
On ALPS, CP2K is precompiled and available in a user environment (uenv). The program is built with the following libraries:
To find out which CP2K uenv is provided, you can use the following command:
```bash
uenv image find cp2k
```

To get and start the uenv for a specific version of CP2K, you can use

```bash
uenv image pull cp2k/2024.3:v1
uenv start cp2k/2024.3:v1
```
After pulling the image, you will find a new folder containing the file `store.squashfs` under `$SCRATCH/.uenv-images/images`.
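To verify which images have already been pulled into your local repository, the uenv CLI provides a listing command (shown here as a usage sketch):

```bash
# List the uenv images available in the local repository.
uenv image ls
```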
You can also use the Slurm `--uenv` option to start the environment inside a Slurm batch script, as shown in the template provided in the next section. Note that you need to pull the corresponding image before you can use it in a Slurm batch script.
How to run
To start a job, two bash scripts are needed: a standard Slurm submission script, and a wrapper that starts the CUDA MPS daemon (required to run multiple MPI ranks per GPU):
```bash
#!/bin/bash -l
#SBATCH --job-name=<job_name>
#SBATCH --time=01:30:00          # HH:MM:SS
#SBATCH --nodes=4
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks-per-node=32     # Number of MPI ranks per node
#SBATCH --cpus-per-task=8        # Number of OMP threads per rank
#SBATCH --account=<account>
#SBATCH --hint=nomultithread
#SBATCH --exclusive
#SBATCH --no-requeue
#SBATCH --uenv=cp2k/2024.2:v1
#SBATCH --view=cp2k

# set environment
export CP2K_DATA_DIR=<absolute/path/to/CP2K/data/dir>
export CUDA_CACHE_PATH="/dev/shm/$RANDOM"
export MPICH_GPU_SUPPORT_ENABLED=1
export MPICH_MALLOC_FALLBACK=1
export OMP_NUM_THREADS=$((SLURM_CPUS_PER_TASK - 1))
ulimit -s unlimited

# run
srun --cpu-bind=socket ./mps-wrapper.sh cp2k.psmp -i input.inp -o output.out
```
The `mps-wrapper.sh` script can be found in Oversubscription of GPU cards.
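For reference, a minimal sketch of such a wrapper is shown below; it only illustrates the general pattern (one MPS control daemon per node, started and stopped by the local rank 0), and the official script from Oversubscription of GPU cards should be preferred for production runs.

```bash
#!/bin/bash
# Minimal sketch of an MPS wrapper (illustrative only; use the official
# mps-wrapper.sh from the CSCS documentation for production runs).
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log

# Only the first rank on each node starts the MPS control daemon.
if [ "${SLURM_LOCALID}" -eq 0 ]; then
    nvidia-cuda-mps-control -d
fi
# Give the daemon a moment to come up before the other ranks connect.
sleep 5

# Run the command passed as arguments (e.g. cp2k.psmp -i input.inp -o output.out).
"$@"
status=$?

# The first rank on each node shuts the daemon down again.
if [ "${SLURM_LOCALID}" -eq 0 ]; then
    echo quit | nvidia-cuda-mps-control
fi
exit $status
```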
With the above scripts, you can launch a calculation on 4 nodes, with 32 MPI ranks per node and 8 OMP threads per rank, using the command `sbatch launch.sh`. You may need to make the wrapper script executable with `chmod +x mps-wrapper.sh` first. Make sure to replace `<account>` with your CSCS account name. Note that the `--cpu-bind=socket` Slurm option is absolutely necessary to get good performance.

Please note that the environment variable `CUDA_CACHE_PATH` is set to a random folder to avoid using the default path under `$HOME`.
There are 4 GPUs per node, with 72 CPU cores per GPU. It is very important that the number of MPI ranks passed to Slurm with `--ntasks-per-node` is a multiple of 4.
Note that in the example above, we use 32 MPI ranks with 8 OMP threads each, for a total of 64 cores per GPU and 256 cores per node. Experiments have shown that CP2K performs and scales better when the number of MPI ranks per node is a power of 2, even if some cores are left idle.
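As an illustration, a small sanity check along these lines could be added to the job script before the `srun` line; the 288-cores-per-node figure is derived from the 4 x 72 core layout described above:

```bash
# Hypothetical sanity check: abort early if the rank/thread layout violates
# the guidelines above (ranks per node must be a multiple of 4; ranks x threads
# must not exceed the 4 x 72 = 288 cores of a GH200 node).
RANKS=${SLURM_NTASKS_PER_NODE:?}
THREADS=${SLURM_CPUS_PER_TASK:?}
if (( RANKS % 4 != 0 )); then
    echo "ERROR: --ntasks-per-node must be a multiple of 4 (4 GPUs per node)" >&2
    exit 1
fi
if (( RANKS * THREADS > 288 )); then
    echo "ERROR: ${RANKS} ranks x ${THREADS} threads exceed 288 cores per node" >&2
    exit 1
fi
echo "Using ${RANKS} MPI ranks x ${THREADS} OMP threads = $((RANKS * THREADS)) cores per node"
```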
If you want to run CP2K regression tests with the CP2K executable provided by the uenv, make sure to use the version of the regression tests corresponding to the version of CP2K provided by the uenv.
If you want to build your own user environment (e.g. for the CP2K development version), please consult this documentation.
If you want to build CP2K from your own source code, please follow these instructions.
Scaling
We ran 3 representative CP2K tests, taken from the official benchmark folder. These tests cover DFT molecular dynamics, MP2 energies, and RPA energies. A variety of different libraries are tested in the process.
QS/H2O-1024.inp
This is a DFT molecular dynamics input file, which mostly relies on the DBCSR block-sparse matrix-matrix multiplication library for performance. Note that `H2O-1024.inp` is the largest such example that fits on a single GH200 node. All calculations were run with 32 MPI ranks per node and 8 OMP threads per rank (found to be the best configuration for this benchmark).
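One possible way to scan the node counts in the table below is to override `--nodes` on the command line, since sbatch command-line options take precedence over the `#SBATCH` directives; a sketch, assuming the submission script from the previous section is saved as `launch.sh`:

```bash
# Submit the same job script at several node counts; the command-line value
# overrides the #SBATCH --nodes directive inside launch.sh.
for N in 1 2 4 8 16; do
    sbatch --nodes=${N} launch.sh
done
```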
| N nodes | Wall time (s) | Speedup | Efficiency |
|---|---|---|---|
| 1 | 793.1 | 1.00 | 1.00 |
| 2 | 535.2 | 1.48 | 0.74 |
| 4 | 543.9 | 1.45 | 0.36 |
| 8 | 487.3 | 1.62 | 0.20 |
| 16 | 616.7 | 1.28 | 0.08 |
We note that scaling is not ideal on more than 2 nodes. While `H2O-1024.inp` is the largest such calculation that fits on a single node, it does so comfortably. Therefore, there is only so much room for speedups by adding more (very large) GH200 nodes. In terms of time to solution, 1 GH200 node corresponds to 32 Piz Daint nodes.
QS_mp2_rpa/128-H2O/H2O-128-RI-MP2-TZ.inp
This is an MP2 energy input for 128 water molecules with triple-zeta basis sets, a very large calculation. Note that this file is not distributed with CP2K, but is straightforwardly adapted from the existing `H2O-64-RI-MP2-TZ.inp` file. For reference, this is the input file used: H2O-128-RI-MP2-TZ.inp. All calculations for this scaling test were run with 32 MPI ranks per node and 8 OMP threads per rank. The smallest number of nodes on which this calculation fits is 8.
| N nodes | Wall time (s) | Speedup | Efficiency |
|---|---|---|---|
| 8 | 2037.0 | 1.00 | 1.00 |
| 16 | 1096.2 | 1.85 | 0.92 |
| 32 | 611.5 | 3.33 | 0.83 |
| 64 | 410.5 | 4.96 | 0.62 |
| 128 | 290.9 | 7.00 | 0.43 |
MP2 calculations scale well on GH200, up to a large number of nodes (> 50% efficiency with 64 nodes). In terms of time to solution, 1 GH200 node corresponds to ~16 Piz Daint nodes.
QS_mp2_rpa/128-H2O/H2O-128-RI-dRPA-TZ.inp
This is an RPA energy input, traditionally used to benchmark the performance of the COSMA library. This is also a very large calculation, requiring at least 8 GH200 nodes to run. The calculations were run with 16 MPI ranks per node and 16 OMP threads per rank; we observed that a higher ratio of threads to ranks is beneficial for RPA workloads.
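Relative to the job script shown in the How to run section, only the rank/thread layout needs to change; a minimal sketch of the modified lines:

```bash
#SBATCH --ntasks-per-node=16   # 16 MPI ranks per node for the RPA benchmark
#SBATCH --cpus-per-task=16     # 16 cores (OMP threads) per rank
```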
| N nodes | Wall time (s) | Speedup | Efficiency |
|---|---|---|---|
| 8 | 575.4 | 1.00 | 1.00 |
| 16 | 465.8 | 1.23 | 0.61 |
| 32 | 281.1 | 2.04 | 0.51 |
| 64 | 205.3 | 2.80 | 0.35 |
| 128 | 185.8 | 3.09 | 0.19 |
This RPA input scales well up to 32 GH200 nodes. In terms of time to solution, a GH200 node corresponds to 16-32 Piz Daint nodes.
Piz Daint
Setup
You can see a list of the available versions of the program installed on the machine after loading the `daint-gpu` or `daint-mc` modulefile. In the examples below we use the `daint-gpu` modulefile:
```bash
module load daint-gpu    # or daint-mc
module avail CP2K
```
The previous set of commands will show the GPU-enabled modules of the application. The following module command will then load the environment of the default version of the program:
```bash
module load CP2K
```
You can either type this command every time you intend to use the program within a new session, or you can automatically load it by including it in your shell configuration file.
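For example, assuming a bash shell, the two module commands can be appended to `~/.bashrc`:

```bash
# Load the GPU software stack and the default CP2K module in every new shell.
module load daint-gpu
module load CP2K
```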
The following module commands will print the environment variables set by loading the program and a help message:
```bash
module show CP2K
module help CP2K
```
How to run
The following job script asks for 8 nodes, using 6 MPI tasks per node and 2 OpenMP threads per MPI task. If you use more MPI tasks per node, each task will have less memory available. Since we employ multiple MPI tasks per node, we need to set `CRAY_CUDA_MPS=1` so that the MPI tasks can access the GPU device on each node at the same time.
```bash
#!/bin/bash -l
#
# CP2K on Piz Daint: 8 nodes, 6 MPI tasks per node, 2 OpenMP threads per task
#
#SBATCH --job-name=cp2k
#SBATCH --time=00:30:00
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=6
#SBATCH --cpus-per-task=2
#SBATCH --constraint=gpu
#SBATCH --account=<project>
#
# load modules and run simulation
module load daint-gpu
module load CP2K
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export CRAY_CUDA_MPS=1
ulimit -s unlimited
srun cp2k.psmp cp2k.inp
```
Please replace the string `<project>` with the ID of the active project that will be charged for the allocation.
Simulations using the default ELPA diagonalization library on GPU nodes might print a warning or even stop with messages like `Skipping GPU init, should have already been initialized` in the output file. Users experiencing issues with ELPA can try setting `ELPA_KERNEL AVX2_BLOCK2` in the GLOBAL section of the input file, or switch to ScaLAPACK with `PREFERRED_DIAG_LIBRARY SL`.
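For illustration, either workaround goes into the `&GLOBAL` section of the CP2K input file; the `PROJECT` and `RUN_TYPE` values below are placeholders:

```
&GLOBAL
  PROJECT  my_project        ! placeholder
  RUN_TYPE ENERGY            ! placeholder
  ! Fall back to ScaLAPACK for diagonalization ...
  PREFERRED_DIAG_LIBRARY SL
  ! ... or keep ELPA and select a CPU kernel instead:
  ! ELPA_KERNEL AVX2_BLOCK2
&END GLOBAL
```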
Scaling
We provide a CP2K scaling example simulating the dynamics of a small water system of 256 molecules. The input file H2O-256.inp is provided under the QS benchmarks of the CP2K GitHub repository, while the basis sets and the potential are available under the data folder:
- GTH_BASIS_SETS
- H2O-256.inp
- POTENTIAL
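For convenience, the files can also be downloaded directly; the repository paths below are assumptions based on the current layout of the CP2K GitHub repository and may change between releases:

```bash
# Fetch the benchmark input and the matching basis set / potential files
# (paths assumed; check the CP2K GitHub repository if they have moved).
wget https://raw.githubusercontent.com/cp2k/cp2k/master/benchmarks/QS/H2O-256.inp
wget https://raw.githubusercontent.com/cp2k/cp2k/master/data/GTH_BASIS_SETS
wget https://raw.githubusercontent.com/cp2k/cp2k/master/data/POTENTIAL
```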
We run the scaling jobs with the `gpu` constraint on the Cray XC50, using 6 MPI tasks per node and 2 OpenMP threads per MPI task, as shown in the template Slurm script above. The wall time of each job is retrieved from the TIMING section of the CP2K output file with the command `grep "CP2K " slurm.out`, while the relative speed-up is computed taking the longest runtime as reference. The scaling data are reported in the table below:
| Nodes | Wall time (s) | Speed-up | Parallel efficiency |
|---|---|---|---|
| 1 | 763 | 1.00 | 1.00 |
| 2 | 469 | 1.63 | 0.82 |
| 4 | 282 | 2.71 | 0.68 |
| 8 | 182 | 4.19 | 0.52 |
| 16 | 133 | 5.78 | 0.36 |
The parallel efficiency is the ratio of the measured speed-up to the ideal speed-up, which grows linearly with the number of nodes. For example, on 8 nodes the speed-up is 763/182 ≈ 4.19 and the parallel efficiency 4.19/8 ≈ 0.52, so this small example already drops to the ∼50% efficiency limit on 8 nodes. Strong scaling speed-up is plotted against ideal scaling below: