# Hardware Performance Monitoring: Current and Future

Shirley Moore

shirley@cs.utk.edu

**VI-HPS** Inauguration

4 July 2007





# History of PAPI

- http://icl.cs.utk.edu/papi/
- Started as a Parallel Tools Consortium project in 1998
- Goal was to produce a specification for a portable interface to the hardware performance counters available on most modern microprocessors.







Timeline of releases for each tool represented in the project. The vertical dashed lines indicate SC conference dates where the tools are regularly demonstrated. TAU's v1.0 release occurred at SC'97.

SDCI HPC Improvement: High-Productivity Performance Engineering (Tools, Methods, Training) for NSF HPC Applications. Allen D. Malony, Sameer Shende, Shirley Moore, Nick Nystrom, Rick Kufrin.

# **PAPI Counter Interfaces**

PAPI provides three interfaces to the underlying counter hardware:

1. The low-level interface manages hardware events in user-defined groups called *EventSets*, and provides access to advanced features.
2. The high-level interface provides the ability to start, stop, and read the counters for a specified list of events.
3. Graphical and end-user tools provide facile data collection and visualization.






# PAPI Hardware Events

- Preset Events
  - Standard set of over 100 events for application performance tuning
  - No standardization of the exact definition
  - Mapped to either single or linear combinations of native events on each platform
  - Use *papi\_avail* utility to see what preset events are available on a given platform
- Native Events
  - Any event countable by the CPU
  - Same interface as for preset events
  - Use *papi\_native\_avail* utility to see all available native events
  - Use *papi\_event\_chooser* utility to select a compatible set of events
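
As noted above, a preset maps to either a single native event or a linear combination of native events. The sketch below illustrates that idea only; `derived_term` and `eval_linear_preset` are hypothetical names chosen for this example and are not PAPI data structures or API calls.

```c
#include <assert.h>
#include <stddef.h>

/* One term of a linear combination: coeff * native counter value.
 * Hypothetical type for illustration; not a PAPI data structure. */
typedef struct {
    int       native_index;  /* which native counter this term reads */
    long long coeff;         /* +1 to add, -1 to subtract, and so on */
} derived_term;

/* Evaluate a preset defined as a linear combination of native counts. */
long long eval_linear_preset(const derived_term *terms, size_t nterms,
                             const long long *native_counts, size_t ncounts)
{
    long long value = 0;
    for (size_t i = 0; i < nterms; i++) {
        /* Every term must refer to a counter that was actually read. */
        assert(terms[i].native_index >= 0 &&
               (size_t)terms[i].native_index < ncounts);
        value += terms[i].coeff * native_counts[terms[i].native_index];
    }
    return value;
}
```

For instance, on a hypothetical platform that counts only cache accesses and cache misses natively, a "hits" preset could be expressed as the two terms `{accesses, +1}` and `{misses, -1}`.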



### **PAPI Preset Events**

• Of ~100 events, over half are cache related:

| PAPI_L1_DCH: | Level 1 data cache hits             |
|--------------|-------------------------------------|
| PAPI_L1_DCA: | Level 1 data cache accesses         |
| PAPI_L1_DCR: | Level 1 data cache reads            |
| PAPI_L1_DCW: | Level 1 data cache writes           |
| PAPI_L1_DCM: | Level 1 data cache misses           |
|              |                                     |
| PAPI_L1_ICH: | Level 1 instruction cache hits      |
| PAPI_L1_ICA: | Level 1 instruction cache accesses  |
| PAPI_L1_ICR: | Level 1 instruction cache reads     |
| PAPI_L1_ICW: | Level 1 instruction cache writes    |
| PAPI_L1_ICM: | Level 1 instruction cache misses    |
|              |                                     |
| PAPI_L1_TCH: | Level 1 total cache hits            |
| PAPI_L1_TCA: | Level 1 total cache accesses        |
| PAPI_L1_TCR: | Level 1 total cache reads           |
| PAPI_L1_TCW: | Level 1 total cache writes          |
| PAPI_L1_TCM: | Level 1 cache misses                |
|              |                                     |
| PAPI_L1_LDM: | Level 1 load misses                 |
| PAPI_L1_STM: | Level 1 store misses                |

Corresponding events are defined for Levels 2 and 3.

THE UNIVERSITY of TENNESSEE Computer Science Department

### PAPI Preset Events (ii)

• Other cache and memory events:

| Shared<br>cache    | PAPI_CA_SNP:<br>PAPI_CA_SHR:<br>PAPI_CA_CLN:<br>PAPI_CA_INV:<br>PAPI_CA_ITV:                                      | Requests for a snoop<br>Requests for exclusive access to shared cache line<br>Requests for exclusive access to clean cache line<br>Requests for cache line invalidation<br>Requests for cache line intervention                                     |
|--------------------|-------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| TLB                | PAPI_TLB_DM:<br>PAPI_TLB_IM:<br>PAPI_TLB_TL:<br>PAPI_TLB_SD:                                                      | Data translation lookaside buffer misses<br>Instruction translation lookaside buffer misses<br>Total translation lookaside buffer misses<br>Translation lookaside buffer shootdowns                                                                 |
| Resource<br>Stalls | PAPI_LD_INS:<br>PAPI_SR_INS:<br>PAPI_MEM_SCY:<br>PAPI_MEM_RCY:<br>PAPI_MEM_WCY:<br>PAPI_RES_STL:<br>PAPI_FP_STAL: | Load instructions<br>Store instructions<br>Cycles stalled waiting for memory accesses<br>Cycles stalled waiting for memory reads<br>Cycles stalled waiting for memory writes<br>Cycles stalled on any resource<br>Cycles the FP unit(s) are stalled |

### PAPI Preset Events (iii)

• Program flow:

| Branches | PAPI_BR_INS: | Branch instructions                                 |
|----------|--------------|-----------------------------------------------------|
|          | PAPI_BR_UCN: | Unconditional branch instructions                   |
|          | PAPI_BR_CN:  | Conditional branch instructions                     |
|          | PAPI_BR_TKN: | Conditional branch instructions taken               |
|          | PAPI_BR_NTK: | Conditional branch instructions not taken           |
|          | PAPI_BR_MSP: | Conditional branch instructions mispredicted        |
|          | PAPI_BR_PRC: | Conditional branch instructions correctly predicted |
|          | PAPI_BTAC_M: | Branch target address cache misses                  |

| Conditional<br>Stores | PAPI_CSR_FAL: | Failed store conditional instructions     |
|-----------------------|---------------|-------------------------------------------|
|                       | PAPI_CSR_SUC: | Successful store conditional instructions |
|                       | PAPI_CSR_TOT: | Total store conditional instructions      |


### PAPI Preset Events (iv)

• Timing, efficiency, pipeline:

| PAPI_TOT_CYC: | Total cycles                           |
|---------------|----------------------------------------|
| PAPI_TOT_IIS: | Instructions issued                    |
| PAPI_TOT_INS: | Instructions completed                 |
| PAPI_INT_INS: | Integer instructions                   |
| PAPI_LST_INS: | Load/store instructions completed      |
| PAPI_SYC_INS: | Synchronization instructions completed |
|               |                                        |

| PAPI_BRU_IDL: | Cycles branch units are idle         |
|---------------|--------------------------------------|
| PAPI_FXU_IDL: | Cycles integer units are idle        |
| PAPI_FPU_IDL: | Cycles floating point units are idle |
| PAPI_LSU_IDL: | Cycles load/store units are idle     |

| PAPI_STL_ICY: | Cycles with no instruction issue           |
|---------------|--------------------------------------------|
| PAPI_FUL_ICY: | Cycles with maximum instruction issue      |
| PAPI_STL_CCY: | Cycles with no instructions completed      |
| PAPI_FUL_CCY: | Cycles with maximum instructions completed |

| PAPI_HW_INT: | Hardware interrupts |
|--------------|---------------------|

### PAPI Preset Events (v)

• Floating point:

| PAPI_FP_INS:  | Floating point instructions             |
|---------------|-----------------------------------------|
| PAPI_FP_OPS:  | Floating point operations               |
| PAPI_FML_INS: | Floating point multiply instructions    |
| PAPI_FAD_INS: | Floating point add instructions         |
| PAPI_FDV_INS: | Floating point divide instructions      |
| PAPI_FSQ_INS: | Floating point square root instructions |
| PAPI_FNV_INS: | Floating point inverse instructions     |
| PAPI_FMA_INS: | FMA instructions completed              |
| PAPI_VEC_INS: | Vector/SIMD instructions                |
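
Preset counts such as these are typically combined into ratios rather than read in isolation. The following is a minimal sketch of three common derived metrics; the function names are made up for this example, and the choice of metrics is an illustration rather than something prescribed by PAPI.

```c
#include <assert.h>

/* Instructions per cycle from PAPI_TOT_INS and PAPI_TOT_CYC counts. */
double ipc(long long tot_ins, long long tot_cyc)
{
    assert(tot_cyc > 0);
    return (double)tot_ins / (double)tot_cyc;
}

/* MFLOP/s from a PAPI_FP_OPS count and elapsed wall-clock seconds. */
double mflops(long long fp_ops, double seconds)
{
    assert(seconds > 0.0);
    return (double)fp_ops / seconds / 1.0e6;
}

/* L1 data cache miss ratio from PAPI_L1_DCM and PAPI_L1_DCA counts. */
double l1d_miss_ratio(long long l1_dcm, long long l1_dca)
{
    assert(l1_dca > 0);
    return (double)l1_dcm / (double)l1_dca;
}
```

Because the exact definition of a preset is not standardized across platforms, ratios computed this way are best compared between runs on the same machine.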



# What's a Native Event?



#### PMD: AMD Athlon, Opteron

| 31:24    | 23  | 22 | 21       | 20  | 19 | 18 | 17 | 16  | 15:8                     | 7:0                                |
|----------|-----|----|----------|-----|----|----|----|-----|--------------------------|------------------------------------|
| CNT_MASK | INV | EN | reserved | INT | PC | E  | OS | USR | UNIT_MASK<br>8 mask bits | EVENT_SELECT<br>8 bits: 256 events |
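
A native event is ultimately a bit pattern written to a control register with this layout. The sketch below packs the fields into a 32-bit value, using the conventional meanings of the flag bits (USR = count in user mode, OS = count in kernel mode, E = edge detect, PC = pin control, INT = interrupt on overflow, EN = enable, INV = invert the counter-mask comparison). The macro and function names are invented for this example; no vendor header is being quoted.

```c
#include <assert.h>
#include <stdint.h>

/* Single-bit fields of the event-select register (names are illustrative). */
#define EVTSEL_USR (1u << 16)  /* count while in user mode */
#define EVTSEL_OS  (1u << 17)  /* count while in kernel mode */
#define EVTSEL_E   (1u << 18)  /* edge detect */
#define EVTSEL_PC  (1u << 19)  /* pin control */
#define EVTSEL_INT (1u << 20)  /* interrupt on counter overflow */
#define EVTSEL_EN  (1u << 22)  /* enable the counter */
#define EVTSEL_INV (1u << 23)  /* invert the counter-mask comparison */

/* Pack event select (bits 7:0), unit mask (15:8), flags, and
 * counter mask (31:24) into one register value. */
uint32_t evtsel_encode(uint8_t event_select, uint8_t unit_mask,
                       uint8_t cnt_mask, uint32_t flags)
{
    /* Flags may only touch the single-bit fields; bit 21 is reserved. */
    const uint32_t allowed = EVTSEL_USR | EVTSEL_OS | EVTSEL_E | EVTSEL_PC |
                             EVTSEL_INT | EVTSEL_EN | EVTSEL_INV;
    assert((flags & ~allowed) == 0);
    return (uint32_t)event_select
         | ((uint32_t)unit_mask << 8)
         | flags
         | ((uint32_t)cnt_mask << 24);
}
```

With only 8 event-select bits and 8 unit-mask bits, the register itself explains why native events are named as an event code plus a set of unit masks.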

#### PMC: Intel Pentium II, III, M, Core; AMD Athlon, Opteron



### Intel Pentium Core: L2\_ST

```
{ .pme_name = "L2_ST",
  .pme_code = 0x2a,
  .pme_flags = PFMLIB_CORE_CSPEC,
  .pme_desc = "L2 store requests",
  .pme_umasks = {
    { .pme_uname = "MESI",
      .pme_udesc = "Any cacheline access",
      .pme_ucode = 0xf
    },
    { .pme_uname = "I_STATE",
      .pme_udesc = "Invalid cacheline",
      .pme_ucode = 0x1
    },
    { .pme_uname = "S_STATE",
      .pme_udesc = "Shared cacheline",
      .pme_ucode = 0x2
    },
    { .pme_uname = "E_STATE",
      .pme_udesc = "Exclusive cacheline",
      .pme_ucode = 0x4
    },
    { .pme_uname = "M_STATE",
      .pme_udesc = "Modified cacheline",
      .pme_ucode = 0x8
    },
    { .pme_uname = "SELF",
      .pme_udesc = "This core",
      .pme_ucode = 0x40
    },
    { .pme_uname = "BOTH_CORES",
      .pme_udesc = "Both cores",
      .pme_ucode = 0xc0
    }
  },
  .pme_numasks = 7
},
```

```
. . .
```

```
PRESET,
PAPI_L2_DCA,
DERIVED_ADD,
L2_LD:SELF:ANY:MESI,
L2_ST:SELF:MESI
```


# PAPI and BG/L



- Performance Counters:
  - 48 UPC Counters
    - Shared by both CPUs
    - External to CPU cores
    - 32 bits
  - 2 Counters on each FPU
    - 1 counts load/stores
    - 1 counts arithmetic operations
  - Accessed via bgl\_perfctr
  - 15 Preset Events
    - 10 PAPI presets
    - 5 Custom BG/L presets
  - 328 native events available


### PAPI Data and Instruction Range Qualification

- Implemented a generalized PAPI interface for data structure and instruction address range qualification
- Applied that interface to the specific instance of the Itanium2 platform
- Extended an existing PAPI call, PAPI\_set\_opt(), with the capability of specifying starting and ending addresses of data structures or instructions to be instrumented

```
PAPI_option_t option;

/* Restrict counting in EventSet to the address range covered by array */
option.addr.eventset = EventSet;
option.addr.start = (caddr_t)array;
option.addr.end = (caddr_t)(array + size_array);
retval = PAPI_set_opt(PAPI_DATA_ADDRESS, &option);
```


- An instruction range can be set using PAPI\_INSTR\_ADDRESS
- papi\_native\_avail was modified to list events that support data or instruction address range qualification.

# Tools that use PAPI

- TAU (U Oregon) http://www.cs.uoregon.edu/research/tau/
- KOJAK (UTK, FZ Juelich) http://icl.cs.utk.edu/kojak/
- PerfSuite (NCSA) http://perfsuite.ncsa.uiuc.edu/
- Titanium (UC Berkeley) http://www.cs.berkeley.edu/Research/Projects/titanium/
- SCALEA (Thomas Fahringer, U Innsbruck) http://www.par.univie.ac.at/project/scalea/
- Open|Speedshop (SGI) http://oss.sgi.com/projects/openspeedshop/
- SvPablo (UNC Renaissance Computing Institute) http://www.renci.unc.edu/Software/Pablo/pablo.htm



# Component PAPI (PAPI-C)

- Goals:
  - Support simultaneous access to on- and off-processor counters
  - Isolate hardware dependent code in a separable 'substrate' module
  - Extend platform independent code to support multiple simultaneous substrates
  - Add or modify API calls to support access to any of several substrates
  - Modify build environment for easy selection and configuration of multiple available substrates
- Will be released as PAPI 4.0

### Architecture for Support of Multiple Components




# **PAPI-C Status**

- PAPI 3.9 pre-release available with documentation
- Implemented Myrinet substrate (native counters)
- Implemented ACPI temperature sensor substrate
- Working on InfiniBand and Cray SeaStar substrates (access to SeaStar counters is not available under Catamount but is expected under CNL)
- Asked by Cray engineers for input on desired metrics for next network switch
- Tested on HPC Challenge benchmarks
- Tested platforms include Pentium III, Pentium 4, Core2Duo, Itanium (I and II) and AMD Opteron



### **PAPI-C** New Routines

- PAPI\_get\_component\_info()
- PAPI\_num\_cmp\_hwctrs()
- PAPI\_get\_cmp\_opt()
- PAPI\_set\_cmp\_opt()
- PAPI\_set\_cmp\_domain()
- PAPI\_set\_cmp\_granularity()



# **PAPI-C** Building and Linking

- CPU components are automatically detected by *configure* and included in the build
- CPU component assumed to be present and always configured as component 0
- To include additional components, use configure option --with-<cmp>=yes
- Currently supported components
  - --with-acpi=yes
  - --with-mx=yes
  - --with-net=yes
- The make process compiles and links sources for all requested components into a single library

## Myrinet MX Counters

LANAI\_UPTIME, COUNTERS\_UPTIME, BAD\_CRC8, BAD\_CRC32, UNSTRIPPED\_ROUTE, PKT\_DESC\_INVALID, RECV\_PKT\_ERRORS, PKT\_MISROUTED, DATA\_SRC\_UNKNOWN, DATA\_BAD\_ENDPT, DATA\_ENDPT\_CLOSED, DATA\_BAD\_SESSION, PUSH\_BAD\_WINDOW, PUSH\_DUPLICATE, PUSH\_OBSOLETE, PUSH\_RACE\_DRIVER, PUSH\_BAD\_SEND\_HANDLE\_MAGIC, PUSH\_BAD\_SRC\_MAGIC, PULL\_OBSOLETE, PULL\_NOTIFY\_OBSOLETE, PULL\_RACE\_DRIVER, ACK\_BAD\_TYPE, ACK\_BAD\_MAGIC, ACK\_RESEND\_RACE, LATE\_ACK

ACK\_NACK\_FRAMES\_IN\_PIPE, NACK\_BAD\_ENDPT, NACK\_ENDPT\_CLOSED, NACK\_BAD\_SESSION, NACK\_BAD\_RDMAWIN, NACK\_EVENTQ\_FULL, SEND\_BAD\_RDMAWIN, CONNECT\_TIMEOUT, CONNECT\_SRC\_UNKNOWN, QUERY\_BAD\_MAGIC, QUERY\_TIMED\_OUT, QUERY\_SRC\_UNKNOWN, RAW\_SENDS, RAW\_RECEIVES, RAW\_OVERSIZED\_PACKETS, RAW\_RECV\_OVERRUN, RAW\_DISABLED, CONNECT\_SEND, CONNECT\_RECV, ACK\_SEND, ACK\_RECV, PUSH\_SEND, PUSH\_RECV, QUERY\_SEND, QUERY\_RECV

REPLY\_SEND, REPLY\_RECV, QUERY\_UNKNOWN, DATA\_SEND\_NULL, DATA\_SEND\_SMALL, DATA\_SEND\_MEDIUM, DATA\_SEND\_RNDV, DATA\_SEND\_PULL, DATA\_RECV\_NULL, DATA\_RECV\_SMALL\_INLINE, DATA\_RECV\_SMALL\_COPY, DATA\_RECV\_MEDIUM, DATA\_RECV\_RNDV, DATA\_RECV\_PULL, ETHER\_SEND\_UNICAST\_CNT, ETHER\_SEND\_MULTICAST\_CNT, ETHER\_RECV\_SMALL\_CNT, ETHER\_RECV\_BIG\_CNT, ETHER\_OVERRUN, ETHER\_OVERSIZED, DATA\_RECV\_NO\_CREDITS, PACKETS\_RESENT, PACKETS\_DROPPED, MAPPER\_ROUTES\_UPDATE

ROUTE\_DISPERSION, OUT\_OF\_SEND\_HANDLES, OUT\_OF\_PULL\_HANDLES, OUT\_OF\_PUSH\_HANDLES, MEDIUM\_CONT\_RACE, CMD\_TYPE\_UNKNOWN, UREQ\_TYPE\_UNKNOWN, INTERRUPTS\_OVERRUN, WAITING\_FOR\_INTERRUPT\_DMA, WAITING\_FOR\_INTERRUPT\_ACK, WAITING\_FOR\_INTERRUPT\_TIMER, SLABS\_RECYCLING, SLABS\_PRESSURE, SLABS\_STARVATION, OUT\_OF\_RDMA\_HANDLES, EVENTQ\_FULL, BUFFER\_DROP, MEMORY\_DROP, HARDWARE\_FLOW\_CONTROL, SIMULATED\_PACKETS\_LOST, LOGGING\_FRAMES\_DUMPED, WAKE\_INTERRUPTS, AVERTED\_WAKEUP\_RACE, DMA\_METADATA\_RACE





#### **Multiple Measurements**

- HPCC HPL benchmark on Opteron with 3 performance metrics:
  - FLOPS; Temperature; Network Sends/Receives
    - Temperature is from an on-chip thermal diode



#### **Multiple Measurements (2)**

- HPCC HPL benchmark on Opteron with 3 performance metrics:
  - FLOPS; Temperature; Network Sends/Receives
    - Temperature is from an on-chip thermal diode



# Perfctr

- Written by Mikael Pettersson
  - Labor of love...
  - First available: Fall 1999
  - First PAPI use: Fall 2000
- Supports:
  - Intel Pentium II, III, 4, M, Core
  - AMD K7 (Athlon), K8 (Opteron)
  - IBM PowerPC 970, POWER4, POWER5



# Perfctr Features

- Patches the Linux kernel
  - Saves perf counters on context switch
  - Virtualizes counters to 64 bits
  - Memory-maps counters for fast access
  - Supports counter overflow interrupts where available
- User space library
  - PAPI uses about a dozen calls
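
Virtualizing a narrow hardware counter to 64 bits amounts to accumulating the difference between successive readings in software. The sketch below shows that general technique for a counter assumed to be 32 bits wide; it is not taken from the perfctr sources, and `virt_counter` and `virt_counter_update` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Software state for one virtualized counter (hypothetical layout). */
typedef struct {
    uint64_t total;     /* accumulated 64-bit value */
    uint32_t last_raw;  /* hardware value seen at the previous update */
} virt_counter;

/* Fold a fresh 32-bit hardware reading into the 64-bit total.
 * Unsigned subtraction modulo 2^32 absorbs one wraparound between updates. */
uint64_t virt_counter_update(virt_counter *vc, uint32_t raw)
{
    assert(vc != NULL);
    uint32_t delta = raw - vc->last_raw;
    vc->total += delta;
    vc->last_raw = raw;
    return vc->total;
}
```

The update has to run at least once per hardware wrap period, which is one reason to do it on every context switch or on an overflow interrupt.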



# Perfctr Timeline

- Steady development
  - 1999 to 2004
- Concerted effort for kernel inclusion
  - May 2004 to May 2005
- Ported to Cray Catamount; Power Linux
  - ~2005
- Maintenance only
  - 2005 onward





## Perfmon

- Written by Stephane Eranian @ HP
- Originally Itanium only
  - Built-in to the Linux-ia64 kernel since 2.4.0
- System call interface
- libpfm helper library for bookkeeping



# Perfmon2\*

- Provides a generic interface to access PMU
  - Not dedicated to one app, avoid fragmentation
- Must be portable across all PMU models:
  - Almost all PMU-specific knowledge in user level libraries
- Supports per-thread monitoring
  - Self-monitoring, unmodified binaries, attach/detach
  - multi-threaded and multi-process workloads
- Supports system-wide monitoring
- Supports counting and sampling
- No modification to applications or system
- Built-in, efficient, robust, secure, simple, documented

\* Slide contents courtesy of Stephane Eranian

## Perfmon2

- Setup done through external support library
- Uses a system call for counting operations
  - More flexibility, ties with ctxsw, exit, fork
  - Kernel compile-time option on Linux
- Perfmon2 context encapsulates all PMU state
  - Each context uniquely identified by file descriptor
- int perfmonctl(int fd, int cmd, void \*arg, int narg)

| PFM_CREATE_CONTEXT | PFM_READ_PMDS      | PFM_START          |
|--------------------|--------------------|--------------------|
| PFM_WRITE_PMCS     | PFM_LOAD_CONTEXT   | PFM_STOP           |
| PFM_WRITE_PMDS     | PFM_UNLOAD_CONTEXT | PFM_RESTART        |
| PFM_CREATE_EVTSET  | PFM_DELETE_EVTSET  | PFM_GETINFO_EVTSET |
| PFM_GETINFO_PMCS   | PFM_GETINFO_PMDS   |                    |
| PFM_GET_CONFIG     | PFM_SET_CONFIG     |                    |



# Perfmon2 Features

- Support today for:
  - Intel Itanium, P6, M, Core, Pentium4, AMD Opteron, IBM Power, MIPS
- Full native event tables for supported processors
- Kernel based Multiplexing
  - Event set chaining
- Kernel based Sampling/Overflow
  - Time or event based
  - Custom sampling buffers
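
When more events are requested than there are physical counters, multiplexing time-slices the event sets, so each event is observed for only part of the run. A common way to report such counts is to extrapolate by the fraction of time the event was active. The sketch below shows that generic estimate; it is an assumption about how multiplexed counts are usually scaled, not a description of what perfmon2 or PAPI compute internally, and the function name is hypothetical.

```c
#include <assert.h>

/* Estimate the full-run count of a multiplexed event.
 * raw_count was observed only during time_active out of time_total
 * (both in the same unit, e.g. cycles or nanoseconds). */
long long scale_multiplexed(long long raw_count,
                            long long time_active, long long time_total)
{
    assert(time_active > 0 && time_active <= time_total);
    return (long long)((double)raw_count *
                       (double)time_total / (double)time_active);
}
```

The estimate assumes the event rate is roughly uniform over the run; short or bursty regions can be badly misrepresented.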



# Next Steps

- Kernel integration
  - Discussion underway \*now\*
  - Possible inclusion in 2.6.22 kernel
- Implementation in Cray CNK, X2
- Cell
  - IBM engineers have started a port
- Leverage libpfm for PAPI native events
  - Migration underway for P6, Core, P4, Opteron
- Begin testing on perfmon2 patched kernels
  - Torc10 currently being tested
  - Woodstock dual-boot?



# Cell Broadband Engine

- Each Cell contains 1 PPU and 8 SPUs.
  - ...and 1 PMU external to all of these.
  - 8 16-bit counters configurable as 4 32-bit counters
  - 1024 slot 128-bit trace buffer
  - 400 native events
- Working with IBM engineers on
  - developing perfmon2 pfmlib layer for Cell BE
  - Linux Cell BE kernel modifications
  - Porting PAPI-C (LANL grant)
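
The choice between eight 16-bit counters and four 32-bit counters is a trade-off between the number of simultaneous events and how quickly a counter wraps. The arithmetic can be sketched as follows; the function names are invented for this example and the event rates are whatever the caller supplies, not measured Cell figures.

```c
#include <assert.h>
#include <stdint.h>

/* Number of events a counter of the given width holds before wrapping. */
uint64_t counter_capacity(unsigned bits)
{
    assert(bits >= 1 && bits <= 63);
    return (uint64_t)1 << bits;
}

/* Seconds until a counter of the given width wraps at a steady event rate. */
double seconds_to_wrap(unsigned bits, double events_per_second)
{
    assert(events_per_second > 0.0);
    return (double)counter_capacity(bits) / events_per_second;
}
```

A 16-bit counter holds 65,536 events and a 32-bit counter 4,294,967,296, so pairing counters stretches the interval between overflows by a factor of 65,536 at the cost of halving the number of events monitored at once.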



## **Eclipse PTP IDE**




# Performance Evaluation within Eclipse PTP

# TAU and PAPI Plugins for Eclipse PTP

Screenshot: the Eclipse launch configuration dialog for a program to be instrumented and profiled by TAU (configuration name lammps-10Nov05withTAU), showing the "PAPI Counters" dialog ("Select the PAPI counters to use with TAU") together with the counter descriptions listed below.

| Counter      | Definition                                      |
|--------------|-------------------------------------------------|
| PAPI_L1_DCM  | Level 1 data cache misses                       |
| PAPI_L1_ICM  | Level 1 instruction cache misses                |
| PAPI_L2_DCM  | Level 2 data cache misses                       |
| PAPI_L2_ICM  | Level 2 instruction cache misses                |
| PAPI_L1_TCM  | Level 1 cache misses                            |
| PAPI_L2_TCM  | Level 2 cache misses                            |
| PAPI_FPU_IDL | Cycles floating point units are idle            |
| PAPI_TLB_DM  | Data translation lookaside buffer misses        |
| PAPI_TLB_IM  | Instruction translation lookaside buffer misses |
| PAPI_TLB_TL  | Total translation lookaside buffer misses       |
| PAPI_L1_LDM  | Level 1 load misses                             |
| PAPI_L1_STM  | Level 1 store misses                            |
| PAPI_L2_LDM  | Level 2 load misses                             |
| PAPI_L2_STM  | Level 2 store misses                            |
| PAPI_STL_ICY | Cycles with no instruction issue                |
| PAPI_HW_INT  | Hardware interrupts                             |
| PAPI_BR_TKN  | Conditional branch instructions taken           |
| PAPI_BR_MSP  | Conditional branch instructions mispredicted    |
| PAPI_TOT_INS | Instructions completed                          |
| PAPI_FP_INS  | Floating point instructions                     |
| PAPI_BR_INS  | Branch instructions                             |
| PAPI_VEC_INS | Vector/SIMD instructions                        |
| PAPI_RES_STL | Cycles stalled on any resource                  |
| PAPI_TOT_CYC | Total cycles                                    |
| PAPI_L1_DCH  | Level 1 data cache hits                         |
| PAPI_L2_DCH  | Level 2 data cache hits                         |
| PAPI_L1_DCA  | Level 1 data cache accesses                     |
| PAPI_L2_DCA  | Level 2 data cache accesses                     |
| PAPI_L2_DCR  | Level 2 data cache reads                        |
| PAPI_L2_DCW  | Level 2 data cache writes                       |


## Conclusions

- PAPI has a long track record of successful adoption and use.
- New architectures pose a challenge for off-processor hardware monitoring as well as for the interpretation of counter values.
- Integration of perfmon2 into the Linux kernel will broaden the base of PAPI users still further.



# **Potential VI-HPS Interactions**

- Hardware monitoring support for performance analysis tools, debugging, applications
  - Tell us what you need and we will implement or talk to vendors.
  - Error counters may be useful for debugging.
- Deeper understanding of architectures → better mapping of applications onto them

