Operational Strategy RWTH HPC Computing
The high-performance computer is operated by the IT Center at RWTH Aachen University and is available to members of RWTH and to scientists from all over Germany. Following the "one-cluster concept", the operational strategy makes all resources of the cluster available to users through a single interface, so that different expansion stages, innovative architectures and data can all be used through the same processes.
The One-Cluster Concept
Because the high-performance computer has been expanded in successive stages, the IT Center faces the challenge of operating a heterogeneous system landscape, integrating innovative architectures and providing access to various user groups in different ways. The one-cluster concept developed from these requirements. It aims to operate all components within one large cluster and offers the following advantages:
- The same interfaces are available to users throughout the cluster with regard to identity management, dialog systems, workload management system, operating system, software stack, and file systems. Users need to learn only one set of key components, the focus on one interface facilitates communication, and documentation is easier to maintain.
- Because the IT Center uses its own solution for cluster management, operational processes scale well and can be adapted to different scenarios. For example, the various cluster management tools share the same technical foundation, so they can be linked to roll out changes across the entire cluster based on monitoring data.
- This has made it possible for years to operate the HPC system without fixed maintenance windows, resulting in very high availability with very few interruptions in operation. Such interruptions are required only in exceptional cases, such as maintenance work on the file systems or major modifications such as a change of operating system. Small maintenance tasks, such as installing new kernel versions, are scheduled through the batch system, so users experience no restrictions.
- Clustering in this way leads to highly scalable operational processes and makes new and innovative functions available immediately on all suitable architectures and expansion stages. It also makes it possible to set up a large number of new systems in a very short time and to integrate them into the cluster, for example when a new expansion stage is added.
- On the user side, distinctions such as those between processor architectures or server types are made on the basis of the technical and scientific assessment carried out during the approval process; on the operational side, such distinctions are kept to a minimum.
The structure of the cluster reflects the one-cluster concept. The dialog systems constitute the user interface to the high-performance computer. They can be used to prepare, submit, control and evaluate compute jobs and to run development and analysis applications. Special copy nodes with high-bandwidth connections to the university and research networks allow large amounts of data to be transferred to and from the high-performance computer. The large groups of backend systems (CLAIX-2016, CLAIX-2018, Tier-3, Innovative Architectures (GPU, KNL), Integrative Hosting) are made available through the workload management system; they are not directly accessible. The file systems can be accessed from the entire cluster and are addressed by users as $HOME, $WORK and $HPCWORK. Large parts of the individual backend groups are interconnected via redundant, high-performance Omni-Path networks.
For the storage of data, the users of the high-performance computer can choose between different file systems, depending on the intended usage scenarios. The file systems vary in certain characteristics, including performance metrics, available space, and data protection concepts. The following file systems can be used:
$HOME is an NFS-based file system that gives users 150 GB of disk space by default for their most important data, such as source code and configuration files. Snapshot mechanisms and storage in the backup system of RWTH Aachen University provide a very high level of data security. The file system was also 100% available in the years 2016 to 2019.
$WORK is also an NFS file system, but its technical structure is intended for storing larger files, for example the results of completed compute jobs. With 250 GB by default, it offers users more storage space than $HOME. This data is not backed up, so it should be reproducible. Accidentally deleted files can nevertheless be recovered from snapshots.
$HPCWORK comprises two file systems based on the parallel high-performance file system Lustre, which is characterized by high read and write rates. With a standard storage capacity of 1 TB, the space available here is significantly larger than on the other file systems. Due to the amount of data, however, central data backup is not possible here either.
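The trade-offs between the three file systems can be summarised as a small lookup. The following Python sketch encodes the default quotas and protection properties described above and returns the first file system that satisfies a request. The class and function names are hypothetical, 1 TB is taken as 1000 GB, and $HPCWORK is assumed to have no snapshots because the text does not mention any; actual quotas and policies on the cluster may differ.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class FileSystem:
    env_var: str           # environment variable that points to the file system
    default_quota_gb: int  # default quota as described in the text
    backed_up: bool        # stored in the central backup system
    snapshots: bool        # deleted files recoverable from snapshots
    parallel: bool         # parallel (Lustre) file system with high I/O rates


# Ordered from most protected to largest capacity.
FILE_SYSTEMS = [
    FileSystem("HOME", 150, backed_up=True, snapshots=True, parallel=False),
    FileSystem("WORK", 250, backed_up=False, snapshots=True, parallel=False),
    # Assumption: no snapshots on $HPCWORK (the text does not mention any).
    FileSystem("HPCWORK", 1000, backed_up=False, snapshots=False, parallel=True),
]


def choose_file_system(size_gb, needs_backup=False, needs_high_io=False):
    """Return the first file system that fits the request."""
    for fs in FILE_SYSTEMS:
        if size_gb > fs.default_quota_gb:
            continue
        if needs_backup and not fs.backed_up:
            continue
        if needs_high_io and not fs.parallel:
            continue
        return fs
    raise ValueError("no file system satisfies the request with default quotas")


def resolve_path(fs, env=os.environ):
    """Look up the directory behind the file system's environment variable."""
    return env.get(fs.env_var)  # None if the variable is not set


if __name__ == "__main__":
    fs = choose_file_system(200)
    print(fs.env_var, resolve_path(fs))
```

For example, a request for 200 GB of reproducible results selects $WORK, while a request that requires backup and exceeds 150 GB raises an error, which mirrors the fact that only $HOME is backed up.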
The software stack of the cluster is managed by the IT Center and is partly developed by the IT Center itself. This approach has long been followed and offers a number of advantages:
- The independence from specific manufacturers ensures flexibility and allows quick adaptations to the frequently changing requirements in research and teaching (e.g. integration of innovative architectures)
- Saving of license and maintenance fees for software (e.g. operating system)
- Access to all layers of the software stack allows effective and efficient error and performance analysis as well as comprehensive changes in response to analysis results
- Consistent pursuit and implementation of an open-source strategy
The operating system used is CentOS, an open-source Linux distribution based on Red Hat Enterprise Linux.
The workload management system SLURM, which manages the computing jobs on the backend systems, successfully replaced its predecessor, IBM Platform LSF, in 2019. Experience has shown that a professional solution is appropriate given the one-cluster concept, the various integrated system architectures and the different requirements placed on the batch system (fair sharing, backfilling, use of different MPIs, etc.).
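To illustrate how a compute job reaches the backend systems through the batch system, the following Python sketch renders a minimal SLURM batch script. It uses only standard `sbatch` options (`--job-name`, `--ntasks`, `--time`, `--output`) and `srun`; the function name, the job parameters and the program `./my_app` are hypothetical placeholders, and the cluster may require further options (such as a project account or partition) that are not shown here.

```python
def render_batch_script(job_name, ntasks, time_limit, command):
    """Render a minimal SLURM batch script as a string.

    time_limit uses SLURM's HH:MM:SS notation. A realistic limit helps
    the scheduler when it backfills short jobs into gaps.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --ntasks={ntasks}",
        f"#SBATCH --time={time_limit}",
        # %x expands to the job name, %j to the job ID.
        "#SBATCH --output=%x_%j.log",
        "",
        # srun starts the program on the allocated tasks.
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical job: 4 tasks, 30 minutes, placeholder executable.
    print(render_batch_script("demo", 4, "00:30:00", "./my_app"), end="")
```

The resulting text could be saved to a file and submitted with `sbatch` from one of the dialog systems.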
About 100 different ISV and open-source software packages are made available to users in various subject-specific categories. The IT Center provides and maintains such software if there is sufficient demand and operates the license servers where required. In particular, tools for using the cluster (e.g. graphical interfaces), parallelization (various MPI implementations), programming (compilers, libraries) and application analysis (debuggers, performance analysis and visualization) are provided centrally to users.
Individual institutes or projects often require their own systems. The Integrative Hosting service is based on the one-cluster concept and uses the possibilities for scalable expansion of the cluster. Based on framework agreements, the IT Center procures, installs and operates additional HPC resources for university requirements to support research and teaching. This offering pursues several objectives: The central supply of resources for the university allows synergy effects to be exploited with regard to energy consumption and operational infrastructure. The service allows users to concentrate on their area of application without having to deal with administrative concerns. For cooperation with external users, Integrative Hosting offers a platform for collaboration.
IT Service Management: The resources specified in service level agreements are made available within the context of specific projects. These projects can be managed by the responsible persons themselves: they designate other users as project members and decide whether the resources are used exclusively or shared with other projects via Fairshare.
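The idea behind sharing resources via Fairshare can be illustrated with a toy model. The Python sketch below uses one common formulation, F = 2^(-U/S), in which U is a project's fraction of recent usage and S is its fraction of the allocated shares. This is an assumption made for illustration only; the text does not state which fair-share algorithm or parameters the cluster uses, and the function name is hypothetical.

```python
def fairshare_factor(usage_fraction, share_fraction):
    """Toy fair-share factor F = 2 ** (-U / S).

    usage_fraction: the project's fraction of recent usage (U, >= 0).
    share_fraction: the project's fraction of allocated shares (S, > 0).
    Returns 1.0 for an unused allocation, 0.5 when usage matches the
    share exactly, and smaller values as the project overuses its share.
    """
    if share_fraction <= 0:
        raise ValueError("share_fraction must be positive")
    if usage_fraction < 0:
        raise ValueError("usage_fraction must not be negative")
    return 2.0 ** (-usage_fraction / share_fraction)


if __name__ == "__main__":
    # Hypothetical project holding a quarter of the shared resources.
    for usage in (0.0, 0.25, 0.5):
        print(usage, fairshare_factor(usage, 0.25))
```

In such a model, jobs from projects that have used less than their share receive a higher factor and are therefore preferred when shared resources become free.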