Comparing and modelling the performance of different ML frameworks and hardware accelerators in a coupled OpenFoam+ML application

  • Vergleich und Modellierung der Leistung verschiedener ML-Frameworks und Hardwarebeschleuniger in einer gekoppelten OpenFoam+ML-Applikation

Brose, Kim Sebastian; Müller, Matthias S. (Thesis advisor); Hasse, Christian (Thesis advisor); Orland, Fabian (Consultant)

Aachen (2022)
Master Thesis

Master's thesis, RWTH Aachen University, 2022


We examine and model the performance of a coupled HPC+ML application. It is based on a reactive thermo-fluid simulation that is limited by the memory footprint of the lookup table integral to the state-of-the-art implementation. Machine learning is an old concept that has become more attainable in recent years thanks to the advent of powerful, massively parallel processors and vectorization capabilities that allow efficient computation of the matrix operations common to machine learning algorithms. We replace the lookup-table implementation with a machine learning model by implementing couplings that connect the existing software to established as well as novel machine learning frameworks running on different kinds of hardware, such as CPUs, GPUs, and vector engines (VEs).

We measure and model the performance of the different couplings under different conditions. As we increase the size of the machine learning model, the runtime grows significantly more slowly than the quadratically increasing computational complexity. The execution of the model for multiple input data samples can be accelerated by choosing a batch size such that the resulting amount of work fits within the lower and upper bounds given by the architecture of the hardware accelerator.

We also examine the couplings' performance with regard to strong scaling. Since the machine learning part is almost completely parallelized, it scales nearly perfectly as we add more devices. The communication overhead becomes measurable as we add more MPI ranks to the HPC part of the application, but it remains insignificant compared to the actual framework time. We find that under certain circumstances it can be more efficient to run the machine learning framework on CPUs, even though GPUs are generally faster. Overall, the GPUs come by far the closest to reaching their peak performance with our application, followed by the CPUs, while the VEs remain the farthest from theirs.
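The batch-size consideration above can be sketched as a simple clamping heuristic. This is an illustrative sketch only: the bound and granularity values are hypothetical placeholders, not figures from the thesis, and real accelerators expose these limits through their own APIs.

```python
def choose_batch_size(n_samples: int, lower: int, upper: int, granularity: int) -> int:
    """Pick a batch size that fits the device's working range.

    lower/upper: hypothetical bounds below/above which the accelerator is
    under- or over-subscribed; granularity: preferred hardware multiple
    (e.g. a warp size or vector length). All values are illustrative.
    """
    # Clamp the available work to the device's working range.
    batch = max(lower, min(n_samples, upper))
    # Round down to a multiple of the granularity so hardware lanes stay full.
    batch -= batch % granularity
    return max(batch, granularity)
```

For example, with hypothetical bounds of 256–4096 and a granularity of 32, a workload of 1000 samples would be batched at 992, while very large workloads saturate at the upper bound of 4096.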
We investigate the performance metrics established by the EU Center of Excellence POP (Performance Optimisation and Productivity) as a performance model for the coupled HPC+ML application in the different scaling settings. The couplings account for only a small share of the total runtime and thus have no discernible impact on the metrics measured for the whole application; a more in-depth analysis is required to obtain conclusive results.
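The top-level POP metrics referred to above are multiplicative efficiencies derived from per-rank timings. The following minimal sketch computes the standard load-balance and communication efficiencies from a list of per-rank useful-computation times; the input values are invented for illustration and are not measurements from the thesis.

```python
def pop_efficiencies(useful, runtime):
    """POP-style efficiencies from per-rank useful-compute times.

    useful:  seconds each MPI rank spends in useful computation (illustrative).
    runtime: total wall-clock time of the parallel region.
    """
    avg_useful = sum(useful) / len(useful)
    max_useful = max(useful)
    # Load balance: how evenly the useful work is spread across ranks.
    load_balance = avg_useful / max_useful
    # Communication efficiency: time lost to communication/synchronization.
    comm_eff = max_useful / runtime
    # Parallel efficiency is their product (equivalently avg_useful / runtime).
    parallel_eff = load_balance * comm_eff
    return load_balance, comm_eff, parallel_eff
```

With two ranks spending 8 s and 10 s in useful computation over a 12.5 s run, this yields a load balance of 0.9, a communication efficiency of 0.8, and a parallel efficiency of 0.72.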