Applications must perform their I/O on the default scratch file system, which is a Cray Sonexion 3000 Lustre file system (`snx3000`), except during maintenance operations. Scratch is connected via network to the compute nodes. The personal user space is directly accessible through the environment variable `$SCRATCH` (it defaults to `/scratch/snx3000/$USER`). The `snx3000` currently contains 40 object storage targets (OSTs), each featuring multiple hard disks for data storage. When the system is in production, an application can approach the peak read/write performance only if the following conditions are simultaneously satisfied:
- it accesses all or most OSTs in parallel and in an optimal manner
- other users do not perform significant I/O operations at the same time
The Lustre file system enables data striping, i.e. the distributed storage of a file on multiple OSTs (creating a file with a stripe count of x distributes the file over x OSTs). Hence, even if all processes of an application access only a single shared file, all OSTs can be used concurrently and optimal performance should therefore be possible. Below we give some recommendations on striping; advanced users may find more details on the Lustre Wiki.
There are two common approaches to perform I/O in parallel:
File-per-process
In the file-per-process approach, each process accesses its own file(s). No parallel I/O library is needed: the standard C/C++/Fortran I/O read/write routines may be used.
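For illustration, below is a minimal file-per-process sketch in C with MPI; the folder layout, file names and payload are illustrative assumptions, not a prescribed scheme. Each rank writes its own binary file, grouped into subfolders to keep the number of files per directory small (see the recommendations below).

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Illustrative payload: each rank writes its own data. */
    const size_t n = 1 << 20;
    double *data = malloc(n * sizeof(double));
    for (size_t i = 0; i < n; ++i) data[i] = (double)rank;

    /* Group per-rank files in subfolders (here one folder per 1000
       ranks) to avoid thousands of files in a single directory. */
    char dir[256], path[512];
    snprintf(dir, sizeof dir, "output/group_%03d", rank / 1000);
    mkdir("output", 0755);  /* EEXIST is harmless here */
    mkdir(dir, 0755);

    snprintf(path, sizeof path, "%s/rank_%06d.bin", dir, rank);
    FILE *f = fopen(path, "wb");  /* write-only, binary */
    if (f) {
        fwrite(data, sizeof(double), n, f);
        fclose(f);
    }

    free(data);
    MPI_Finalize();
    return 0;
}
```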
The main advantage of this approach is its simplicity at the application level. A major drawback, however, is the large number of files that is typically created when this approach is used at large scale, which can cause a severe slowdown of the shared file system if it is not handled with great care and expertise (particularly in post-processing!).
Please keep in mind the following recommendations when adopting the file-per-process approach:
- Do not use striping: every file is normally created on a different OST, so it is possible to use all OSTs without striping. Striping will typically create unnecessary network traffic and reduce the overall I/O performance. By default, striping is not activated on the `scratch` file system (the stripe count is 1).
- Do not create thousands of files in a single folder; the Lustre file system is not designed for that, and severe slowdowns of the file system may occur when accessing them. Group your files in subfolders instead.
- Do not access thousands of files simultaneously, as this may create contention on the file system and result in bad overall I/O performance.
- Be aware that if you cause contention on the file system, all users will suffer the slowdowns as all I/O resources are shared.
Shared file
In the shared file approach, all processes access the same shared file(s). Parallel versions of I/O libraries (e.g. MPI-IO, HDF5, NetCDF and ADIOS) are commonly used to manage the complexity of accessing shared file(s) concurrently from different processes.
The main advantages of the shared file(s) approach using parallel I/O libraries are the following:
- the number of files can be kept small in a straightforward manner
- the files can be post-processed particularly easily with common tools, as i) extensive metadata can easily be added to the file(s), and ii) multiple processes can conveniently create a shared file that is indistinguishable from a file created by a single process
A major disadvantage, however, is that careful tuning is often required to reach good I/O performance.
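As a starting point, here is a minimal shared-file sketch in C using the parallel HDF5 API (available through the `cray-hdf5-parallel` module listed below); the file name, dataset name and sizes are illustrative assumptions. Every rank writes its contiguous slab of one global dataset with a collective transfer.

```c
#include <hdf5.h>
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const hsize_t nlocal = 1 << 20;  /* elements per process (illustrative) */
    double *data = malloc(nlocal * sizeof(double));
    for (hsize_t i = 0; i < nlocal; ++i) data[i] = (double)rank;

    /* Open one shared file collectively through the MPI-IO driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("output.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* One global 1D dataset; each rank owns a contiguous slab. */
    hsize_t gdims[1] = { nlocal * (hsize_t)nprocs };
    hid_t filespace = H5Screate_simple(1, gdims, NULL);
    hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_DOUBLE, filespace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    hsize_t start[1] = { nlocal * (hsize_t)rank }, count[1] = { nlocal };
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t memspace = H5Screate_simple(1, count, NULL);

    /* Collective write: ranks cooperate on fewer, larger I/O requests. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, data);

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
    free(data);
    MPI_Finalize();
    return 0;
}
```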
Please keep in mind the following recommendations when adopting the shared file approach:
- Figure out the optimal striping configuration for your application. If you access only one large file at a time, then you should be able to make great use of striping and set the stripe count to 32 or 40 in order to use nearly all or all OSTs. If you access multiple files concurrently, then a smaller stripe count may be better. For small files, the stripe count should be 1 in any case.
- A convenient way to set the desired striping configuration is to define it for your simulation's output folder(s). Files created inside will inherit the striping configuration of the folder(s) itself (do not copy your application's executable(s) in there!). E.g., to set the stripe count of a simulation's output folder to 32, do `lfs setstripe --stripe-count 32 <output folder>`. To check the striping configuration of an existing file or folder, you can do `lfs getstripe <file/folder>`. Type `lfs setstripe` and `lfs getstripe` for more information; type `lfs --help` for general information about commands for the Lustre file system.
- Use collective I/O operations. They enable the merging of I/O requests from different processes into fewer, larger ones, which normally improves shared-file I/O performance significantly (see the sketch after this list).
- Refer to the official websites of the used parallel I/O libraries for details on how to make best usage of them.
- Follow widely adopted Metadata conventions when creating files (cf. NetCDF CF Metadata Conventions, XDMF and GADGET). This enables straightforward pre- and post-processing with common tools and portability of data between applications in general.
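The following C sketch illustrates the collective I/O recommendation with plain MPI-IO; the file name and sizes are illustrative, and the `striping_factor` hint is an assumption based on the MPI-IO implementations commonly found on Lustre systems (such as Cray MPI-IO/ROMIO), which honour it when the file is created and silently ignore it otherwise.

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const MPI_Offset nlocal = 1 << 20;  /* doubles per process (illustrative) */
    double *data = malloc(nlocal * sizeof(double));
    for (MPI_Offset i = 0; i < nlocal; ++i) data[i] = (double)rank;

    /* Hint requesting a stripe count of 32 when the file is created
       (recognised by Lustre-aware MPI-IO; ignored otherwise). */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "32");

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* Collective write: each rank writes one contiguous block; the MPI
       library merges the requests into fewer, larger ones. */
    MPI_Offset offset = (MPI_Offset)rank * nlocal * (MPI_Offset)sizeof(double);
    MPI_File_write_at_all(fh, offset, data, (int)nlocal, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    free(data);
    MPI_Finalize();
    return 0;
}
```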
Available parallel I/O libraries on Piz Daint
Cray provides the following modules:
- `cray-hdf5-parallel` (HDF5)
- `cray-netcdf-hdf5parallel` (NetCDF using HDF5 underneath)
- `cray-parallel-netcdf` (NetCDF using PnetCDF underneath)
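These are typically made available with, e.g., `module load cray-hdf5-parallel`; the Cray compiler wrappers then add the corresponding compile and link flags automatically.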
General recommendations for efficient I/O on Piz Daint
- Write binary files rather than ASCII files (except for a few very small parameter / metadata files); writing ASCII files can easily be 10 to 100 times slower than writing binary files.
- Avoid opening and closing files frequently.
- Do not open files for combined read and write access, but instead for read-only or write-only access.
- Avoid small and frequent I/O requests.
- Avoid random file access; regular access patterns work best in general.
- Avoid multiple processes accessing the same data.
- Read small files from just one process and broadcast the data to the remaining processes (see the sketch after this list).
- Limit file metadata access as much as possible on the Lustre file system; in particular, avoid the usage of `ls -l` and use instead `ls` or `lfs find` whenever possible.
- Be aware that if you cause contention on the file system, all users will suffer the slowdowns as I/O resources are shared.
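To illustrate the small-file recommendation above, here is a minimal C sketch (the file name is an illustrative assumption; error handling is omitted for brevity) in which only rank 0 touches the file system and the contents are distributed with `MPI_Bcast`.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Read a small parameter file on rank 0 only and broadcast its contents,
   instead of having every process open the same file. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    long size = 0;
    char *buf = NULL;

    if (rank == 0) {
        FILE *f = fopen("params.txt", "rb");  /* illustrative file name */
        fseek(f, 0, SEEK_END);
        size = ftell(f);
        fseek(f, 0, SEEK_SET);
        buf = malloc(size);
        fread(buf, 1, size, f);
        fclose(f);
    }

    /* First broadcast the size, then the data itself. */
    MPI_Bcast(&size, 1, MPI_LONG, 0, MPI_COMM_WORLD);
    if (rank != 0) buf = malloc(size);
    MPI_Bcast(buf, (int)size, MPI_CHAR, 0, MPI_COMM_WORLD);

    /* ... parse buf on every rank ... */

    free(buf);
    MPI_Finalize();
    return 0;
}
```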