

### FEPA: A framework for systematic energy and performance analysis of extreme-scale applications in HPC computing centers

J. Treibig, RRZE. A joint project with:





Bundesministerium für Bildung und Forschung



- Provide tooling infrastructure for globally profiling application performance in large supercomputing centers
- Embed application profiling in a pattern-driven performance engineering process aiming for maximum resource utilization
- Provide knowledge that enables significantly more efficient use of HPC compute resources across all application domains



#### **Technical Project Overview**





Builds upon the results of the previous BMBF projects ISAR (LRZ) and TIMaCS (NEC)

Opportunity to establish the LIKWID Open Source project as an alternative to established solutions

All components will be Open Source and can also be used standalone



#### Philosophy

- Motivated by a resource-driven view
- Provide a structured iterative process based on:
  - Performance patterns
  - A diagnostic performance model
- Performance patterns are typical performance-limiting bottlenecks
- Patterns are indicated by signatures which can consist of:
  - HPM data
  - Scaling behavior
  - Other data
- Uses one of the most powerful tools available:

Your brain!



You are an investigator making sense of what's going on. And there is no alternative to that.



1. Maximum resource utilization
2. Hazards
3. Work related (application or processor)

#### The system offers two basic resources:

- Execution of instructions (primary)
- Transferring data (secondary)

A good architecture allows you to fully exploit the design capabilities without roadblocks or detours.

SSE, AVX, AVX2 Alignment/Gather











# Notions of work

Application work

Processor work



#### Model: quantitative

## Find the relevant limiting bottleneck!
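
One way to make the quantitative model concrete is a simple bottleneck estimate over the two basic resources named earlier: instruction execution and data transfer. The sketch below is not taken from the FEPA project. It assumes a roofline-style model in which execution and data transfer overlap perfectly, so the slower resource determines the runtime. All names and numbers (`Machine`, `predict_bottleneck`, the peak rates, the example kernel) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    peak_flops: float      # instruction throughput limit in flop/s (assumed value)
    peak_bandwidth: float  # data transfer limit in byte/s (assumed value)


def predict_bottleneck(machine, flops, bytes_moved):
    """Return (limiting resource, predicted runtime in seconds).

    Assumes execution and data transfer overlap perfectly, so the
    slower of the two determines the runtime.
    """
    t_exec = flops / machine.peak_flops
    t_data = bytes_moved / machine.peak_bandwidth
    if t_exec >= t_data:
        return "instruction execution", t_exec
    return "data transfer", t_data


# Hypothetical machine: 100 Gflop/s peak, 40 GB/s memory bandwidth.
machine = Machine(peak_flops=100e9, peak_bandwidth=40e9)

# Hypothetical streaming kernel: 2 flops and 24 bytes per iteration, 1e9 iterations.
resource, runtime = predict_bottleneck(machine, flops=2e9, bytes_moved=24e9)
print(resource, runtime)
```

For this made-up kernel the data path is the limiting resource. A measured runtime far above the prediction would suggest that one of the other patterns is in play.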

#### **Overview Performance Patterns**



| Pattern                              | Sub-pattern              | Behavior                                                                                        |
|--------------------------------------|--------------------------|-------------------------------------------------------------------------------------------------|
| Bandwidth saturation                 |                          | saturating speedup across cores sharing a data path                                             |
| Limited instruction throughput       | Pipeline saturation      | throughput at design limit                                                                      |
|                                      | Pipelining hazards       | in-core throughput far from design limit, performance insensitive to data size                  |
|                                      | Control flow issues      |                                                                                                 |
| Inefficient data access              | Strided Access           | simple BW models far too optimistic                                                             |
|                                      | Erratic Access           |                                                                                                 |
| Microarchitectural anomalies         |                          | large discrepancy from simple performance models                                                |
| False cacheline sharing              |                          | very low speedup, or slowdown / discrepancy from model only in parallel case                    |
| Bad ccNUMA page placement            |                          | bad/no scaling across locality domains, better performance with interleaved placement           |
| Load imbalance                       |                          | saturating/sub-linear speedup                                                                   |
| Synchronization overhead             |                          | speedup going down as more cores are added / no speedup with small problem sizes                |
| Code composition issues              | Instruction overhead     | low application performance, good scaling across cores, performance insensitive to problem size |
|                                      | Expensive instructions   |                                                                                                 |
|                                      | Ineffective instructions |                                                                                                 |
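
The scaling-related signatures in the table can be pre-screened mechanically before a human looks at the details. The following sketch is an illustration only and not part of FEPA or LIKWID: the function name `classify_scaling`, the threshold, and the sample runtimes are assumptions. It computes speedups from runtimes measured at increasing core counts and labels the curve as declining, saturating, or scaling.

```python
def classify_scaling(runtimes, flat_tolerance=0.05):
    """Label a strong-scaling curve from runtimes at 1, 2, 3, ... cores.

    Returns "declining" if the speedup has dropped below its best value,
    "saturating" if the last step gains less than flat_tolerance
    (relative), and "scaling" otherwise. The threshold is an assumed
    value and would need tuning for real measurements.
    """
    if len(runtimes) < 3:
        raise ValueError("need at least three measurements")
    speedups = [runtimes[0] / t for t in runtimes]
    best = max(speedups)
    last, previous = speedups[-1], speedups[-2]
    if last < best * (1.0 - flat_tolerance):
        return "declining"
    if (last - previous) / previous < flat_tolerance:
        return "saturating"
    return "scaling"


# Made-up runtimes in seconds for 1..N cores.
print(classify_scaling([8.0, 4.2, 3.0, 2.9, 2.9]))  # flattening curve
print(classify_scaling([8.0, 5.0, 4.5, 5.5, 7.0]))  # gets slower with more cores
print(classify_scaling([8.0, 4.1, 2.8, 2.1]))       # still improving
```

A label only narrows the candidates: a saturating curve is consistent with bandwidth saturation or load imbalance, and a declining one with synchronization overhead. Additional signature data, such as HPM measurements, is still needed to decide which pattern applies.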

#### **Connection to application monitoring**





#### Conclusion



- FEPA will provide a low-overhead framework for measuring system-wide application performance and energy data
- Performance patterns enable the effective interpretation of the raw profiling data
- The effectiveness of the approach will be evaluated at several Gauss member HPC centers

#### HPC is computing at a bottleneck





### Thank you for your attention!

## **Any Questions?**

Visit us tomorrow at the poster reception, 5:15 PM: Pattern-Driven Node-Level Performance Engineering

