

### Performance patterns and hardware metrics on modern multicore processors: Best practices for performance engineering

Jan Treibig, <u>Georg Hager</u>, Gerhard Wellein

Erlangen Regional Computing Center (RRZE), University of Erlangen-Nuremberg, Erlangen, Germany

PROPER Workshop at Euro-Par 2012, August 28, 2012

Rhodes Island, Greece





#### Hardware performance metrics ...

- ... are ubiquitous as a starting point for performance analysis (including automatic analysis)
- ... are supported by many tools
- ... are often reduced to cache misses (what could be worse than cache misses?)

#### **Reality:**

- Modern parallel computing is plagued by bottlenecks
- There are typical performance patterns that cover a large part of possible performance behaviors
  - HPM signatures
  - Scaling behavior
  - Other sources of information

Together, these make up a "performance pattern".



- LIKWID: Lightweight command line tools for Linux
- Help to face the challenges without getting in the way
- Focus on x86 architecture
- Philosophy:
  - Simple
  - Efficient
  - Portable
  - Extensible



Open source project (GPL v2): http://code.google.com/p/likwid/



#### Topology and Affinity:

- likwid-topology
- likwid-pin
- likwid-mpirun

#### Performance Profiling/Benchmarking:

- likwid-perfctr
- likwid-bench
- likwid-powermeter



#### How do we find out about the performance properties and requirements of a parallel code?

Profiling via advanced tools is often overkill

#### A coarse overview is often sufficient

- likwid-perfctr (similar to "perfex" on IRIX, "hpmcount" on AIX, "lipfpm" on Linux/Altix)
- Simple end-to-end measurement of hardware performance metrics
- Operating modes:
  - Wrapper
  - Stethoscope
  - Timeline
  - Marker API
- Preconfigured and extensible metric groups; list them with `likwid-perfctr -a`

```
BRANCH: Branch prediction miss rate/ratio
CACHE: Data cache miss rate/ratio
CLOCK: Clock of cores
DATA: Load to store ratio
FLOPS_DP: Double Precision MFlops/s
FLOPS_SP: Single Precision MFlops/s
FLOPS_X87: X87 MFlops/s
L2: L2 cache bandwidth in MBytes/s
L2CACHE: L2 cache miss rate/ratio
L3: L3 cache bandwidth in MBytes/s
L3CACHE: L3 cache miss rate/ratio
MEM: Main memory bandwidth in MBytes/s
TLB: TLB miss rate/ratio
```







- To measure only parts of an application, a marker API is available.
- The API only turns counters on and off. The counters are still configured by the likwid-perfctr application.
- Multiple named regions can be measured.
- Results of multiple calls are accumulated.
- Inclusive and overlapping regions are allowed.

```
likwid_markerInit(); // must be called from serial region

likwid_markerStartRegion("Compute");
. . .
likwid_markerStopRegion("Compute");

likwid_markerStartRegion("postprocess");
. . .
likwid_markerStopRegion("postprocess");

likwid_markerClose(); // must be called from serial region
```

#### likwid-perfctr Group files



    SHORT PSTI

    EVENTSET
    FIXC0 INSTR_RETIRED_ANY
    FIXC1 CPU_CLK_UNHALTED_CORE
    FIXC2 CPU_CLK_UNHALTED_REF
    PMC0  FP_COMP_OPS_EXE_SSE_FP_PACKED
    PMC1  FP_COMP_OPS_EXE_SSE_FP_SCALAR
    PMC2  FP_COMP_OPS_EXE_SSE_SINGLE_PRECISION
    PMC3  FP_COMP_OPS_EXE_SSE_DOUBLE_PRECISION
    UPMC0 UNC_QMC_NORMAL_READS_ANY
    UPMC1 UNC_QMC_WRITES_FULL_ANY
    UPMC2 UNC_QHL_REQUESTS_REMOTE_READS
    UPMC3 UNC_QHL_REQUESTS_LOCAL_READS

    METRICS
    Runtime [s] FIXC1*inverseClock
    CPI FIXC1/FIXC0
    Clock [MHz] 1.E-06*(FIXC1/FIXC2)/inverseClock
    DP MFlops/s (DP assumed) 1.0E-06*(PMC0*2.0+PMC1)/time
    Packed MUOPS/s 1.0E-06*PMC0/time
    Scalar MUOPS/s 1.0E-06*PMC1/time
    SP MUOPS/s 1.0E-06*PMC2/time
    DP MUOPS/s 1.0E-06*PMC3/time
    Memory bandwidth [MBytes/s] 1.0E-06*(UPMC0+UPMC1)*64/time;
    Remote Read BW [MBytes/s] 1.0E-06*(UPMC2)*64/time;

    LONG
    Formula: DP MFlops/s = (FP_COMP_OPS_EXE_SSE_FP_PACKED*2 + FP_COMP_OPS_EXE_SSE_FP_SCALAR) / runtime.

- Groups are architecture-specific
- They are defined in simple text files
- Code is generated on recompile
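The METRICS formulas of the group file above are plain arithmetic on raw counter values, so they can be checked by hand. The following is a minimal sketch, not LIKWID code: the function name `derive_metrics`, the dict-based interface, and all counter values are made up for illustration; only the formulas and counter names are taken from the group file.

```python
# Sketch: evaluate some METRICS formulas of the group file from raw counts.
# Function name, interface and the sample counts are illustrative assumptions.

CACHE_LINE_BYTES = 64  # the factor 64 in the bandwidth formulas


def derive_metrics(c, time, inverse_clock):
    """c: dict counter name -> raw count; time in s; inverse_clock in 1/Hz."""
    return {
        "Runtime [s]": c["FIXC1"] * inverse_clock,
        "CPI": c["FIXC1"] / c["FIXC0"],
        "Clock [MHz]": 1.0e-6 * (c["FIXC1"] / c["FIXC2"]) / inverse_clock,
        "DP MFlops/s": 1.0e-6 * (c["PMC0"] * 2.0 + c["PMC1"]) / time,
        "Memory bandwidth [MBytes/s]":
            1.0e-6 * (c["UPMC0"] + c["UPMC1"]) * CACHE_LINE_BYTES / time,
        "Remote Read BW [MBytes/s]":
            1.0e-6 * c["UPMC2"] * CACHE_LINE_BYTES / time,
    }


# Made-up counts: one core, 2 s of runtime at a nominal 2 GHz clock.
counts = {
    "FIXC0": 2.0e9,   # instructions retired
    "FIXC1": 4.0e9,   # core cycles
    "FIXC2": 4.0e9,   # reference cycles
    "PMC0": 1.0e9,    # packed SSE FP uops
    "PMC1": 5.0e8,    # scalar SSE FP uops
    "UPMC0": 1.0e9,   # memory controller reads (cache lines)
    "UPMC1": 5.0e8,   # memory controller writes (cache lines)
    "UPMC2": 2.5e8,   # remote reads (cache lines)
}

if __name__ == "__main__":
    for name, value in derive_metrics(counts, 2.0, 1.0 / 2.0e9).items():
        print(f"{name:30s} {value:12.3f}")
```

Note that the "DP assumed" label matters: a packed SSE operation is counted as two flops, which is only correct for double precision.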



| Pattern | Performance behavior | Metric signature |
|---|---|---|
| Load imbalance | Saturating/sub-linear speedup | Different amount of "work" on the cores (FLOPS_DP, FLOPS_SP, FLOPS_AVX); note that instruction count is not reliable! |
| BW saturation in outer-level cache | Saturating speedup across cores of OL cache group | OLC bandwidth meets BW of suitable streaming benchmark (L3) |
| Memory BW saturation | Saturating speedup across cores on a memory interface | Memory BW meets BW of suitable streaming benchmark (MEM) |
| Strided or erratic data access | Simple BW performance model much too optimistic | Low BW utilization / Low cache hit ratio, frequent CL evicts or replacements (CACHE, DATA, MEM) |
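Two of the signatures in this table reduce to simple comparisons: per-core "work" against its mean (load imbalance), and measured bandwidth against a streaming benchmark (bandwidth saturation). A minimal sketch under stated assumptions: the function names, the 10% tolerance, and all numbers are hypothetical, and the inputs would come from groups such as FLOPS_DP or MEM.

```python
# Sketch: two metric signatures as simple checks.
# Names, the tolerance and the sample numbers are illustrative assumptions.


def imbalance(work_per_core):
    """Max-to-mean ratio of per-core work; 1.0 means perfectly balanced."""
    mean = sum(work_per_core) / len(work_per_core)
    return max(work_per_core) / mean


def bandwidth_saturated(measured_mbs, streaming_mbs, tolerance=0.1):
    """True if the measured bandwidth reaches the streaming benchmark
    bandwidth to within the given relative tolerance."""
    return measured_mbs >= (1.0 - tolerance) * streaming_mbs


if __name__ == "__main__":
    # Hypothetical per-core MFlops/s: two cores do about half the work.
    print(imbalance([800.0, 790.0, 410.0, 400.0]))
    # Hypothetical memory bandwidths in MBytes/s against a 20000 MBytes/s
    # streaming benchmark result.
    print(bandwidth_saturated(19500.0, 20000.0))
    print(bandwidth_saturated(5300.0, 20000.0))
```

Such checks only flag candidates. As the table stresses, the work measure should be floating-point operations rather than instruction count.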



| Pattern | Performance behavior | Metric signature |
|---|---|---|
| Bad instruction mix | Performance insensitive to problem size vs. cache levels | Large ratio of instructions retired to FP instructions if the useful work is FP / Many cycles per instruction (CPI) if the problem is large-latency arithmetic / Scalar instructions dominating in data-parallel loops (FLOPS_DP, FLOPS_SP, CPI) |
| Limited instruction throughput | Large discrepancy from simple performance model based on LD/ST and arithmetic throughput | Low CPI near theoretical limit if instruction throughput is the problem / Static code analysis predicting large pressure on single execution port / High CPI due to bad pipelining (FLOPS_DP, FLOPS_SP, DATA) |
| Micro-architectural anomalies | Large discrepancy from performance model | Relevant events are very hardware-specific, e.g., stalls due to 4k memory aliasing, conflict misses, unaligned vs. aligned LD/ST, requeue events. Code review required, with architectural features in mind. |



| Pattern | Performance behavior | Metric signature |
|---|---|---|
| Synchronization overhead | Speedup going down as more cores are added / No speedup with small problem sizes / Cores busy but low FP performance | Large non-FP instruction count (growing with number of cores used) / Low CPI (FLOPS_DP, FLOPS_SP, CPI) |
| False sharing of cache lines | Small speedup or slowdown when adding cores | Frequent (remote) CL evicts (CACHE) |
| Bad ccNUMA page placement | Bad or no scaling across NUMA domains | Unbalanced bandwidth on memory interfaces / High remote traffic (MEM) |
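The "performance behavior" column is mostly about scaling, which can be read off a series of runtimes at increasing core counts. The sketch below is one possible way to label such a series; the labels, the 5% threshold, and the runtimes are assumptions made for illustration. A label alone does not identify a pattern; it has to be combined with the metric signature.

```python
# Sketch: label the scaling behavior of a runtime series.
# Labels, threshold and sample runtimes are illustrative assumptions.


def speedups(runtimes):
    """runtimes: dict core count -> runtime [s]; must contain the 1-core run."""
    t1 = runtimes[1]
    return {n: t1 / t for n, t in sorted(runtimes.items())}


def classify_scaling(runtimes, flat=0.05):
    """Return 'slowdown', 'saturating' or 'scaling'."""
    s = list(speedups(runtimes).values())
    # Relative speedup gain from each core count to the next larger one.
    gains = [b / a - 1.0 for a, b in zip(s, s[1:])]
    if any(g < -flat for g in gains):
        return "slowdown"      # e.g. synchronization overhead, false sharing
    if gains and gains[-1] < flat:
        return "saturating"    # e.g. bandwidth saturation, load imbalance
    return "scaling"


if __name__ == "__main__":
    print(classify_scaling({1: 10.0, 2: 5.2, 4: 2.7, 8: 1.4}))
    print(classify_scaling({1: 10.0, 2: 5.5, 4: 4.0, 8: 3.9}))
    print(classify_scaling({1: 10.0, 2: 6.0, 4: 7.0, 8: 9.0}))
```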

#### The problem of instructions retired (1)

- Instructions retired / CPI may not be a good indication of useful workload, at least for numerical / FP-intensive codes.
- Floating-point operations executed is often a better indicator.
- Waiting / "spinning" in a barrier generates a high instruction count.



#### The problem of instructions retired (2)



| Event | core 0 | core 1 | core 2 | core 3 | core 4 | core 5 |
|---|---|---|---|---|---|---|
| INSTR_RETIRED_ANY | 1.83124e+10 | 1.74784e+10 | 1.68453e+10 | 1.66794e+10 | … | 1.91736e+10 |
| CPU_CLK_UNHALTED_CORE | 2.24797e+10 | 2.23789e+10 | 2.23802e+10 | 2.23808e+10 | … | 2.23805e+10 |
| CPU_CLK_UNHALTED_REF | 2.04416e+10 | 2.03445e+10 | 2.03456e+10 | 2.03462e+10 | 2.03453e+10 | 2.03459e+10 |
| FP_COMP_OPS_EXE_SSE_FP_PACKED | 3.45348e+09 | 3.43035e+09 | 3.37573e+09 | 3.39272e+09 | 3.26132e+09 | 3.2377e+09 |
| FP_COMP_OPS_EXE_SSE_FP_SCALAR | 2.93108e+07 | 3.06063e+07 | 2.9704e+07 | 2.96507e+07 | 2.41141e+07 | 2.37397e+07 |
| FP_COMP_OPS_EXE_SSE_SINGLE_PRECISION | 19 | 0 | 0 | 0 | 0 | 0 |
| FP_COMP_OPS_EXE_SSE_DOUBLE_PRECISION | 3.48279e+09 | 3.46096e+09 | 3.40543e+09 | 3.42237e+09 | … | 3.26144e+09 |

| Metric | core 0 | core 1 | core 2 | core 3 | core 4 | core 5 |
|---|---|---|---|---|---|---|
| Runtime [s] | … | 8.39157 | 8.39206 | 8.3923 | 8.39193 | … |
| Clock [MHz] | 2932.73 | 2933.5 | 2933.51 | 2933.51 | 2933.51 | 2933.51 |
| CPI | 1.22757 | 1.28037 | 1.32857 | 1.34182 | 1.26666 | 1.16726 |
| DP MFlops/s (DP assumed) | 850.727 | 845.212 | 831.703 | 835.865 | 802.952 | 797.113 |
| Packed MUOPS/s | … | … | 414.03 | 416.114 | 399.997 | … |
| Scalar MUOPS/s | 3.59494 | 3.75383 | 3.64317 | 3.63663 | 2.95757 | … |
| SP MUOPS/s | 2.33033e-06 | 0 | 0 | 0 | 0 | 0 |
| DP MUOPS/s | … | … | 417.673 | 419.751 | 402.955 | … |

Higher CPI but better performance: core 3 has a CPI of 1.34182 at 835.865 MFlops/s, core 5 a CPI of 1.16726 at 797.113 MFlops/s.

Kernel:

    !$OMP PARALLEL DO
    DO I = 1, N
      DO J = 1, N
        x(I) = x(I) + A(J,I) * y(J)
      ENDDO
    ENDDO
    !$OMP END PARALLEL DO
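The per-core measurements above can be used to show why CPI alone is misleading. The sketch below copies the CPI and DP MFlops/s values and the event counts of cores 3 and 5 from the likwid-perfctr output (core 0 values as far as they can be read) and compares the rankings; the helper names are assumptions.

```python
# Sketch: CPI ranking vs. floating-point ranking for the measurement above.
# Values are copied from the likwid-perfctr output; helper names are assumed.

cpi = [1.22757, 1.28037, 1.32857, 1.34182, 1.26666, 1.16726]
dp_mflops = [850.727, 845.212, 831.703, 835.865, 802.952, 797.113]


def instructions_per_flop(instr_retired, packed, scalar):
    """Retired instructions per DP flop; packed SSE uops count as 2 flops."""
    return instr_retired / (2.0 * packed + scalar)


lowest_cpi_core = min(range(len(cpi)), key=lambda i: cpi[i])
highest_flops_core = max(range(len(dp_mflops)), key=lambda i: dp_mflops[i])
lowest_flops_core = min(range(len(dp_mflops)), key=lambda i: dp_mflops[i])

# Event counts of core 3 and core 5 from the event table.
ipf_core3 = instructions_per_flop(1.66794e10, 3.39272e9, 2.96507e7)
ipf_core5 = instructions_per_flop(1.91736e10, 3.2377e9, 2.37397e7)

if __name__ == "__main__":
    print("lowest CPI on core", lowest_cpi_core)
    print("highest DP MFlops/s on core", highest_flops_core)
    print("lowest DP MFlops/s on core", lowest_flops_core)
    print("instructions per flop, core 3:", round(ipf_core3, 3))
    print("instructions per flop, core 5:", round(ipf_core5, 3))
```

The core with the "best" CPI delivers the lowest floating-point rate; it retires more instructions per flop, which is consistent with extra non-FP instructions such as spinning in a barrier.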



C++ codes which suffer from overhead (inlining problems, complex abstractions) need many more instructions overall relative to the arithmetic instructions.

- Often (but not always) "good" (i.e., low) CPI → "Bad instruction mix" pattern
- Lower bandwidth
- Instruction throughput limited
- High-level optimizations complex or impossible → "Strided access" pattern

Example: Matrix-matrix multiply with expression template frameworks on a 2.93 GHz Westmere core

|             | Total retired instructions [10<sup>11</sup>] | CPI  | Memory bandwidth [MB/s] | MFlops/s |
|-------------|----------------------------------------------|------|-------------------------|----------|
| Classic     | 12.5                                         | 0.44 | 5300                    | 1250     |
| Boost uBLAS | 10.1                                         | 4.6  | 630                     | 156      |
| Eigen3      | 2.1                                          | 0.41 | 371                     | 8555     |
| Blaze/DGEMM | 2.0                                          | 0.32 | 531                     | 11260    |
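The table allows a rough estimate of the instruction mix: instructions per second are clock/CPI, so instructions per flop are (clock/CPI) divided by the flop rate. The sketch below assumes that the core runs at its nominal 2.93 GHz throughout (no turbo or throttling), which the slides do not state; the function name is also an assumption.

```python
# Sketch: retired instructions per flop, estimated from CPI and MFlops/s.
# Assumption: the core runs at the nominal 2.93 GHz for the whole run.

CLOCK_HZ = 2.93e9

# (CPI, MFlops/s) per implementation, copied from the table.
results = {
    "Classic": (0.44, 1250.0),
    "Boost uBLAS": (4.6, 156.0),
    "Eigen3": (0.41, 8555.0),
    "Blaze/DGEMM": (0.32, 11260.0),
}


def instructions_per_flop(cpi, mflops, clock_hz=CLOCK_HZ):
    """(instructions per second) / (flops per second)."""
    return (clock_hz / cpi) / (mflops * 1.0e6)


if __name__ == "__main__":
    for name, (cpi, mflops) in results.items():
        print(f"{name:12s} {instructions_per_flop(cpi, mflops):6.2f}")
```

Under this assumption the classic and uBLAS versions retire more than four instructions per flop, while Eigen3 and Blaze need fewer than one, although the classic version has a "good" CPI.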

#### Example 2: Image reconstruction by backprojection





- Simple roofline analysis
  - → Memory-bound algorithm → "Memory BW saturation" pattern
- Work reduction optimization
  - → "Load imbalance" pattern identified by likwid-perfctr FLOPS\_SP group → corrected by round-robin schedule
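Both steps of this example can be illustrated with a few lines of arithmetic. The sketch below shows a simple roofline estimate and compares a contiguous block schedule with a round-robin schedule for work items of unequal cost. It is not the backprojection code: the machine numbers, the cost model (cost growing linearly with the item index), and all function names are assumptions.

```python
# Sketch: simple roofline estimate and block vs. round-robin scheduling.
# Machine numbers, cost model and function names are illustrative assumptions.
import math


def roofline(peak_flops, bandwidth, intensity):
    """Attainable performance: min(peak, intensity * bandwidth).
    Units must match, e.g. GFlop/s, GByte/s and flop/byte."""
    return min(peak_flops, intensity * bandwidth)


def block_schedule(costs, nthreads):
    """Each thread gets one contiguous chunk; returns the load per thread."""
    chunk = math.ceil(len(costs) / nthreads)
    return [sum(costs[t * chunk:(t + 1) * chunk]) for t in range(nthreads)]


def round_robin_schedule(costs, nthreads):
    """Item i goes to thread i % nthreads; returns the load per thread."""
    return [sum(costs[t::nthreads]) for t in range(nthreads)]


def imbalance(loads):
    """Max-to-mean load ratio; 1.0 means perfectly balanced."""
    return max(loads) / (sum(loads) / len(loads))


# Hypothetical work items whose cost differs after a work reduction step.
costs = list(range(1, 25))

if __name__ == "__main__":
    # Low intensity: the bandwidth ceiling applies (memory bound).
    print(roofline(100.0, 20.0, 0.5))
    print(imbalance(block_schedule(costs, 4)))
    print(imbalance(round_robin_schedule(costs, 4)))
```

With this cost model the block schedule leaves one thread with 1.72 times the mean load, while round-robin distribution reduces that to 1.12.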



 Automatic analysis is useful for the beginner, but will never match an experienced analyst

- Performance patterns are more than simple numbers
  - Scaling behavior
  - Bottleneck saturation
  - HPM signatures
- The set presented here is just a suggestion; it will have to be tested against more codes
- Power/energy patterns are still missing, but will have to be included



# Thank you.