Date | News
07-11 November 2024

Live update of the Cray Operating System (COS) to release 3.1 on Eiger. The process is transparent to users as no downtime is required: compute nodes have been reserved in groups to proceed with the COS update. The paths of some system libraries might change due to the COS update: please don't hesitate to contact us if you have any questions. The update will improve the stability of compute nodes and will fix the functionality of the MPI hook of Sarus.

25 June 2024

Eiger is back in service. Please note the following changes impacting the user environment:

- The Cray Operating System (COS) is updated to release 3.0, the Cray Programming Environment (CPE) is now by default 23.12, and the Slurm workload manager is 23.02. Please note that supported applications provided with the default CPE have been re-compiled: we also advise you to rebuild your applications with the new programming environment

- Some applications are no longer provided, e.g. Amber, matplotlib, Scalasca, VisIt. Please check the list of available applications under the corresponding section of the “Alps (Eiger) User Guide” at https://confluence.cscs.ch/x/_gD0E

- Due to incompatibilities with some JupyterLab extensions, JupyterHub can (currently) only launch notebooks, not the JupyterLab interface

- The MPI hook of the Sarus container engine is not fully functional; a fix will be deployed on the system soon

- With Slurm 23.02, the number of CPUs per task specified for salloc or sbatch is not automatically inherited by srun. If needed, it must be requested again, either by specifying --cpus-per-task when calling srun, or by setting the SRUN_CPUS_PER_TASK environment variable. See the template batch script provided at https://confluence.cscs.ch/x/XQBYLw
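
As a minimal sketch of the environment-variable option, a batch script could forward the value before calling srun. This assumes that Slurm exports SLURM_CPUS_PER_TASK inside the job when --cpus-per-task is requested; the task counts and the executable ./my_app are hypothetical placeholders, and the template linked above remains the reference.

```shell
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=8

# Forward the per-task CPU count to srun, which no longer inherits it.
# Falls back to 1 when SLURM_CPUS_PER_TASK is unset (e.g. outside a job).
export SRUN_CPUS_PER_TASK="${SLURM_CPUS_PER_TASK:-1}"
echo "srun will use ${SRUN_CPUS_PER_TASK} CPU(s) per task"

# Launch only where srun exists; ./my_app is a placeholder executable.
if command -v srun >/dev/null 2>&1; then
    srun ./my_app
fi
```

The alternative is to repeat the option on the launch line, for example srun --cpus-per-task=8 ./my_app.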

28 May 2024

Eiger will be under maintenance from Monday, June 10 at 08:00 CEST until Thursday, June 13 (end of business).

10 April 2024

The upgrade of Alps is progressing well and we have the opportunity to reduce the downtime by adjusting the upgrade schedule and returning the system today.

Please note that the user environment has not changed:
- We have completed the upgrade of the CSM (the management system of Alps) and the Slingshot interconnect.
- We will require another (shorter) downtime in order to complete the update of the Cray Operating System (COS), Cray Programming Environment (CPE) and Slurm workload manager: it will likely be scheduled during the second half of May.

Please do not hesitate to report any unexpected system behaviour, and note that access to Piz Daint for Eiger projects will remain valid for the rest of this quarter.

11 March 2024

The Alps infrastructure is undergoing a major upgrade as we prepare it for full deployment. The system will be upgraded from Monday, 18 March 2024, at 08:00 CET to Monday, 6 May 2024, with no access to compute nodes, login nodes or the scratch file system for the duration of the upgrade. Therefore, please make sure to transfer any important data from the capstor/scratch file system on Eiger before the start of the maintenance using the Data Transfer service.

Eiger will be returned to service with an updated HPE Slingshot, Cray Operating System (COS), Cray Programming Environment (CPE), and Slurm workload manager: as a consequence, you will need to recompile your applications. Due to the extent of this downtime, users will be granted access to the multicore partition of Piz Daint during the upgrade. We apologise for the inconvenience that this may cause

28 February 2024

The system will undergo a live intervention on Wednesday, February 28 at 09:00 CET. We expect no disruption in accessing the system or to running jobs. Following the intervention, the buildah and stackinator tools will be unavailable until further notice. Please do not hesitate to contact us if your workflow is adversely affected after the intervention.

24-25 January 2024

The system will be under maintenance from 08:00 CET on Wednesday 24 January 2024 until the end of business on Thursday 25 January 2024, in order to update the scratch file system capstor: a reservation will prevent batch jobs from running during the maintenance window and access to login nodes will be suspended.

22 November 2023

The system will be under maintenance on Wednesday 22 November from 07:00 to 20:00 CET. The maintenance is required to bring the FirecREST and Globus Online services back into operation.

25 October 2023

The next site-wide maintenance is planned on Wednesday October 25th. Eiger will not be available during the maintenance: please note that the default programming environment will not change, but we will deploy a new scratch filesystem and the $SCRATCH environment variable will be modified accordingly, while the old scratch filesystem /scratch/e1000 will be read-only after the maintenance. The tentative schedule of the operations follows below:
  • Wednesday, 11 October: /capstor/scratch/cscs (“new scratch”) will be mounted on all Eiger compute nodes (CNs) in parallel with the current /scratch/e1000, so users can start accessing it. The $SCRATCH environment variable will still point to the old folder /scratch/e1000/$USER
  • Wednesday, 25 October (site-wide maintenance): The $SCRATCH environment variable will be modified to point to the new scratch folder /capstor/scratch/cscs/$USER on all Eiger user access nodes (UANs) and CNs. /scratch/e1000 (“old scratch”) will switch to read-only on all Eiger nodes, so you will no longer be able to use it in your jobs. If you use the $SCRATCH variable, please make sure that all files required by your simulations are available in the new scratch from this point in time; if the old /scratch/e1000 path is hard-coded in your scripts, update your workflow, otherwise your jobs will fail
  • Wednesday, 8 November: The old scratch will be unmounted from all Eiger UANs and CNs and its data deleted permanently

The new scratch will have a soft quota on both disk occupancy and inodes (files and folders), with a grace period to allow data transfer: we will start with soft quotas of 150 TB and 1 million inodes respectively, with a grace time of two weeks. These parameters might be adjusted in the future, if needed
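
To find scripts that still hard-code the old scratch path mentioned in the schedule above, something along these lines may help. It is only a sketch: the function name find_old_scratch_refs and the example directory are illustrative, and only the path /scratch/e1000 comes from the announcement.

```shell
# List files under a directory that still mention the old scratch path.
# OLD_SCRATCH is taken from the announcement; the function name is illustrative.
OLD_SCRATCH="/scratch/e1000"

find_old_scratch_refs() {
    # -r: recurse, -l: print file names only, -F: match the path literally.
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -rlF -- "$OLD_SCRATCH" "${1:-.}" 2>/dev/null || true
}

# Example: scan the job scripts in a (hypothetical) directory
# find_old_scratch_refs "$HOME/jobs"
```

Each file reported can then be updated to use $SCRATCH or the new /capstor/scratch/cscs/$USER folder.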

02 - 16 August 2023 (extended to 18 August)

The system will be under maintenance from Wednesday August 2nd at 08:00 until the end of business on August 16th, due to the update of the management plane in preparation for the forthcoming upgrade of Alps. Please note that the programming environment won’t change.

Eiger will be available during the rest of the month of August, but we cannot exclude the possibility of minor interferences with users’ jobs.

14 June 2023

The system will be updated with the following changes impacting the user environment:

- No modules will be loaded by default at login, and only the modules cray/21.12 (default) and cray/22.05 will be available. You will need to load the module cray first; you will then be able to load the modules currently available in the default Cray programming environment (21.12). Therefore, please add the command module load cray to your scripts and workflows after the intervention. This change is required to prepare the new user environment presented at the User Lab Day 2022 and drafted in the Knowledge Base article uenv user environments. Except for the additional command module load cray, this change should be transparent to you

- The Eiger login nodes will be renamed eiger-ln001, eiger-ln002, etc. Please adapt your scripts accordingly if needed

- A fix of the Sarus container engine will be deployed on the system, bringing Sarus back to full functionality

Note that the modules of the Cray programming environment won’t change. We will send additional communications to the mailing list as soon as the date of the intervention has been decided
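
One possible way to add the module load cray step to an existing script is sketched below. The function name load_cray_env is illustrative, and the guard around the module command is an assumption made so that the same script also runs on machines where the module command is not defined.

```shell
# Load the cray module when the `module` command is available (e.g. on Eiger);
# elsewhere, report on stderr that nothing was loaded and carry on.
load_cray_env() {
    if command -v module >/dev/null 2>&1; then
        module load cray
    else
        echo "module command not found; skipping 'module load cray'" >&2
        return 0
    fi
}

load_cray_env
# Modules from the default programming environment can be loaded after this point.
```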

3 April 2023

The default programming environment is still CPE 21.12 after the upgrade of the Slingshot interconnect: we advise users to rebuild their applications with the default programming environment in case of problems. Please note that Sarus is not yet fully functional; we are currently addressing the issue.

27 - 31 March 2023 (extended to 3 April)

The system will be under maintenance from Monday March 27th at 07:00 until the end of business on Friday March 31st, due to the upgrade of the Slingshot interconnect and a major intervention involving the scratch filesystem. Please note that the programming environment won’t change. Maintenance extended to Monday April 3rd.

1 February 2023

The system will not be available on Wednesday February 1st, 2023 from 08:00 to 13:00 due to a planned system intervention.

19 October 2022

The next site-wide maintenance is planned on Wednesday October 19th. Eiger will not be available during the maintenance: however, please note that the default programming environment will still be CPE 21.12. The system administrators have re-enabled the user access nodes and restored the Sarus container service.

29 August 2022

The programming environment features CPE 21.12 (default) and 22.05: the supported applications are provided with the default programming environment 21.12. We advise users to rebuild their applications with the default programming environment in case of problems. Please note that Sarus is not yet functional; we are currently addressing the issue. UCX is no longer supported by the system interconnect, which is still being tuned, so the performance of some applications might be affected.

16 February 2022

The programming environment features CPE 21.08, 21.09, 21.10, 21.11, 21.12 (default) and 22.02. The supported applications provided with the default programming environment 21.12 have been updated: we also advise users to rebuild their applications with the default programming environment in case of problems.

3 February 2022

The next maintenance of Eiger is planned on Wednesday February 16th. The default programming environment after the intervention will be Cray PE 21.12, providing updated compilers and libraries, therefore users might need to rebuild their codes. Supported applications provided by CSCS staff will also be updated to the latest stable release: please check the available modules with the command module spider on the login nodes of the system after the intervention.

4 October 2021

The latest CPE 21.09 has been installed as non-default on the system. The CSCS software stack is still provided with the default CPE 21.08 and is available after loading a toolchain module (cpeAMD, cpeCray, cpeGNU or cpeIntel).

8 September 2021

The programming environment features only CPE 21.08, providing fixes to issues previously reported. We advise you to rebuild your applications with the updated programming environment in case of problems.

15 July 2021

HPE Cray has provided a pre-release from CPE 21.08 of the numerical library cray-libsci that fixes an issue reported by early users:

  • Users can load the non-default pre-release of cray-libsci by specifying the version with the module command: ml cray-libsci/21.08.1.1

  • Existing executable files linked dynamically to older releases of cray-libsci can be linked at runtime to the pre-release by setting LD_LIBRARY_PATH:
# load the pre-release, then put the Cray library paths first for the runtime linker
ml cray-libsci/21.08.1.1
export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH

13 July 2021
  • Sarus
    • For the PID namespace issue that occurs when running multiple ranks per node with the MPI hook (--mpi), a workaround is now available: set MPICH_NOLOCAL=1 (refer to Sarus for more information)
  • Buildah
    • Added Eiger-specific documentation (refer to Buildah for more information)
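
The Sarus workaround above could be applied in a job script roughly as follows. This is a sketch only: the image reference, the application name and the number of tasks per node are hypothetical, and the launch is guarded so that the snippet does nothing on machines without srun and sarus.

```shell
# Workaround named in the announcement for the PID namespace issue.
export MPICH_NOLOCAL=1

# Hypothetical image reference and command, for illustration only.
image="my_image:latest"
app="./my_mpi_app"

# Run only where both srun and sarus are installed.
if command -v srun >/dev/null 2>&1 && command -v sarus >/dev/null 2>&1; then
    srun --ntasks-per-node=2 sarus run --mpi "$image" "$app"
fi
```
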
8 July 2021
  • Introduction of Slurm policies (low partition and options --account=<project> or -A <project> and --constraint=mc or -C mc)
  • CPE update
    • CPE 21.04 will remain the default PE
    • CPE 21.05 is added
    • CPE 21.06 is added (provides gcc/10.3.0 as non-default)
  • Container solutions including Sarus, Singularity and Buildah. 
    • Please refer to the documentation below for Eiger's specific Sarus information. 
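
For the Slurm policy options listed in the first bullet of this entry, a submission command could be composed as in the sketch below. The helper name eiger_sbatch_cmd, the project name myproj and the script name job.sh are hypothetical; only the --account and --constraint=mc options come from the announcement.

```shell
# Compose an sbatch command line with the account and constraint options.
# $1: project to charge, $2: batch script to submit (both illustrative).
eiger_sbatch_cmd() {
    printf 'sbatch --account=%s --constraint=mc %s\n' "$1" "$2"
}

# Print the command for a hypothetical project and script
eiger_sbatch_cmd myproj job.sh
```

The same options can also be written as #SBATCH directives at the top of the batch script instead of on the command line.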