SC22 full-day tutorial: Hands-on Practical Hybrid Parallel Application Performance Engineering (Dallas, TX, USA)

SC22 Program Entry Location: room D171, Dallas Convention Center

Date

Monday 14th November 2022

Presenters

Sameer Shende, University of Oregon
Anke Visser, Jülich Supercomputing Centre
Bert Wesarg, Technische Universität Dresden
Brian Wylie, Jülich Supercomputing Centre

Assistants

Marc Schlütter, Jülich Supercomputing Centre [remote]
Bill Williams, Technische Universität Dresden [remote]
Frank Winkler, Technische Universität Dresden

Logistics

This page will be updated as information becomes available, so check back before traveling to attend the tutorial. Tutorials are planned to be live-streamed as part of the SC22 Digital Experience; however, remote participants will not receive assistance with the hands-on parts. The currently available software and exercises are being updated in preparation for the tutorial.

The full-day hands-on tutorial takes place as part of the SC22 conference, in room D171 of the Kay Bailey Hutchison Convention Center, Dallas, Texas, USA. Registration via the conference website is possible for the tutorial alone or together with the conference technical program, exhibition and workshops.

Hands-on exercises will use temporary accounts provided by Jülich Supercomputing Centre (JSC) on the JUWELS-Booster modular supercomputer. Participants will build and run an MPI+CUDA example code on two compute nodes, each with dual AMD EPYC 7402 24-core 'Rome' CPUs and quad Nvidia A100 'Ampere' GPUs, measuring and analysing intra-node and inter-node performance with VI-HPS tools. Access will be via the Jupyter-JSC service, which provides an Xpra remote graphical shell environment that runs within common web browsers. Tutorial participants are expected to use their own notebook computers, connecting via the SC22 conference wireless network, but no additional software needs to be installed.

Tutorial participants are strongly encouraged to register for a JUDOOR account to access the training project and its allocation on JUWELS-Booster. (Note that the other SC22 tutorial on Distributed GPU Programming, which will also use this system, is scheduled to run concurrently and will use a different training project.)

Abstract

This tutorial presents state-of-the-art performance tools for leading-edge HPC systems founded on the community-developed Score-P instrumentation and measurement infrastructure, demonstrating how they can be used for performance engineering of effective scientific applications based on standard MPI, OpenMP, hybrid MPI+OpenMP, and increasingly common usage of accelerators. Parallel performance evaluation tools from the VI-HPS (Virtual Institute - High Productivity Supercomputing) are introduced and featured in hands-on exercises with Scalasca, Vampir and TAU. We present the complete workflow of performance engineering, including instrumentation, measurement (profiling and tracing, timing and PAPI hardware counters), data storage, analysis, and visualization. Emphasis is placed on how tools are used in combination for identifying performance problems and investigating optimization alternatives. Using their own notebook computers, participants will conduct exercises on quad-A100 GPU nodes of the JUWELS-Booster modular supercomputer. This will help to prepare participants to locate and diagnose performance bottlenecks in their own parallel programs.

Programme (tentative)

08:30	Introduction & basic measurement
	[45] Introduction to VI-HPS & parallel application engineering [Wylie]
	[15] Setup for hands-on exercises with Jupyter-JSC & JUWELS-Booster [Visser]
	[30] Instrumentation & measurement of applications with Score-P [Wesarg]
10:00	(break)
10:30	Profile analyses
	[30] Exploration & visualization of call-path profiles with CUBE [Visser]
	[30] Configuration & customization of Score-P measurements [Wesarg]
	[30] Examination & visualization of profiles with TAU [Shende]
12:00	(lunch)
13:30	Trace analyses
	[15] Recap of exercise setup and collection of traces with Score-P [Wesarg]
	[45] Interactive visualization and time-interval statistics with Vampir [Wesarg]
	[30] Automated analysis of traces for inefficiencies with Scalasca [Wylie]
15:00	(break)
15:30	Further steps
	[15] Performance data management with TAU PerfExplorer [Shende]
	[30] Specialized Score-P measurements and analyses [Wesarg]
	[30] Finding typical parallel performance bottlenecks [Wesarg]
	[15] Review & conclusion [Wylie]
17:00	(adjourn)