DFG & RFBR (2021-2024)
Performance analysis of HPC applications in noisy environments
Many contemporary HPC systems expose their jobs to substantial amounts of interference, leading to significant run-to-run variation. For example, application runtimes on Theta, a Cray XC40 system at Argonne National Laboratory, vary by up to 70%, caused by a mix of node-level and system-level effects, including network and file-system congestion in the presence of concurrently running jobs. This makes performance measurements generally irreproducible, heavily complicating performance analysis and modeling. On noisy systems, performance analysts usually have to repeat performance measurements several times and then apply statistics to capture trends. This is expensive, and extracting trends from a limited series of experiments is far from trivial, as the noise can follow quite irregular patterns. Attempts to learn from performance data how a program would perform under different execution configurations are seriously perturbed, resulting in models that reflect noise rather than intrinsic application behavior.
On the other hand, although noise heavily influences execution time and energy consumption, it does not change the computational effort a program performs. Effort metrics that count how many operations a machine executes on behalf of a program, such as floating-point operations, the exchange of MPI messages, or file reads and writes, remain largely unaffected and, rare non-determinism aside, reproducible. The basic research question we want to address in this project is to what extent such noise-resilient metrics can already be used to answer typical performance analysis questions, especially those related to load imbalance and scalability, and how they can help to better interpret volatile time and energy measurements.
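As a rough illustration of the contrast between time and effort metrics, the following minimal sketch compares the run-to-run spread of wall-clock times with that of an operation count, and derives a load-imbalance figure from per-rank effort. The function names, the spread and imbalance formulas, and all numbers are assumptions chosen for illustration; they are not measurements or methods taken from the project.

```python
from statistics import mean


def relative_spread(values):
    """Run-to-run variation, expressed as (max - min) / min."""
    lo, hi = min(values), max(values)
    return (hi - lo) / lo


def load_imbalance(per_rank_effort):
    """Imbalance as max / mean - 1; 0.0 means perfectly balanced."""
    return max(per_rank_effort) / mean(per_rank_effort) - 1.0


# Hypothetical values for illustration only, not real measurements.
runtimes_s = [100.0, 131.0, 118.0, 170.0]          # four repeated runs
flop_counts = [2_000_000] * 4                      # same runs, counted operations
per_rank_flops = [400, 500, 600, 500]              # one run, four MPI ranks

print("runtime spread:", relative_spread(runtimes_s))
print("effort spread: ", relative_spread(flop_counts))
print("load imbalance:", load_imbalance(per_rank_flops))
```

In this toy setting the runtime varies between runs while the counted effort does not, so an imbalance figure computed from the effort metric would be the same in every repetition.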
To support this goal, we also want to deepen our understanding of noise on the system level. For this purpose, we will make environmental metrics that indicate the potential for external interference, collected via system-monitoring tools, available to application performance measurement. This will also help gauge the noise sensitivity of applications, another line of research in this project, which is needed to choose the right noise-resilience strategy for application performance analysis. Moreover, we aim to develop automatic methods to identify applications with high passive or active interference capacity. Finally, from the knowledge of how applications respond to noise, we will derive insights about the performance of algorithms in noisy environments, adding another dimension to the field of algorithm engineering.
Our efforts will build on and extend established tools and projects: the system-monitoring infrastructure JobDigest and the algorithm encyclopedia AlgoWiki, contributed by our Russian partner, and the application performance-analysis tools Score-P, Scalasca, and Extra-P, contributed by our German partners.
- Technical University of Darmstadt
Laboratory for Parallel Programming
(Prof. Dr. Felix Wolf)
- Forschungszentrum Jülich
Jülich Supercomputing Centre
(Dr. Bernd Mohr)
- Moscow State University
Research Computing Center
(Dr. Dmitry Nikitenko)