








- Presenters
  - Marc-André Hermanns (German Research School for Sim. Sci.)
  - Emmanuel Oseret & Andres Charif-Rubial (UVSQ)
  - Sameer Shende (University of Oregon PRL)
  - Alexandre Strube (Jülich Supercomputing Centre)
  - Ronny Tschüter & Matthias Weber (TU Dresden)
- Thanks
  - Local arrangements & facilities (MdS)
    - Michel Kern, Aurélie Monteiro
    - Julien Derouillat, Pierre Kestener
    - Systems: MdS, IDRIS
  - Sponsors: CEA, GENCI, PRACE

#### Outline



## Monday 7 April

- 09:00 (early registration & set-up, individual preparation)
- 12:00-13:00 (lunch)
- 13:00-13:30 (setup)
- Welcome & introduction to VI-HPS
- Introduction to parallel performance engineering
- 15:00-15:30 (break)
- Lab setup: computer systems & software environment
- Building & running NPB-MZ-MPI/BT-MZ example code
- 17:30 (adjourn)


## Tuesday 8 April

- 09:00-10:30 Score-P & CUBE
- 11:00-12:30 Score-P & ParaProf/PerfExplorer

## Wednesday 9 April

- 09:00-10:30 Scalasca & Vampir
- 11:00-12:30 MAQAO

## Thursday 10 April

- 09:00-10:30 **TAU**
- 11:00-12:30 Conclusion:

Using accelerators Engineering workflow

- Hands-on exercises part of each tool presentation every morning session
- Hands-on coaching to apply tools to analyse & tune your own codes each afternoon to 17:30

- We'd like to know a little about you, your application(s), and your expectations and desires from this tutorial
- What programming paradigms do you use in your app(s)?
  - only MPI, only OpenMP, mixed-mode/hybrid OpenMP/MPI, ...
  - Fortran, C, C++, multi-language, ...
- What platforms/systems *must* your app(s) run well on?
  - Cray XT/XE/XK, IBM BlueGene, SGI Altix, Linux cluster™, ...
- Who's already familiar with *serial* performance analysis?
  - Which tools have you used?
    - time, print/printf, prof/gprof, VTune, ...
- Who's already familiar with *parallel* performance analysis?
  - Which tools have you used?
    - time, print/printf, prof/gprof, Periscope, Scalasca, TAU, Vampir, ...

- Ensure your application codes build and run to completion with appropriate datasets
  - initial configuration should ideally run in less than 15 minutes with 1-4 compute nodes (up to 64 processes/threads)
    - to facilitate rapid turnaround and quick experimentation
  - larger/longer scalability configurations are also interesting
    - turnaround may be limited due to busyness of batch queues
- Compare your application performance on other systems
  - VI-HPS tools already installed on a number of HPC systems
    - if not, ask your system administrator to install them (or install a personal copy yourself)



# Tools will *not* automatically make you, your applications or computer systems more *productive*.

However, they can help you understand how your parallel code executes and when / where it's necessary to work on correctness and performance issues.

# DON'T PANIC!

## The workshop presenters are here to assist you.

NB: On the assumption that nothing terrible is going to happen and everything's suddenly going to be alright really, all advice may be safely ignored.

15th VI-HPS Tuning Workshop (7-10 April 2014) MdS, Saclay, France

### Workshop system (hardware)



| System           | *poincare*    | *curie* (fat nodes) | *curie* (thin nodes) |
|------------------|---------------|---------------------|----------------------|
| Domain           | mds.cea.fr    | ccc.cea.fr          | ccc.cea.fr           |
| Vendor / network | Intel         | Bull / Infiniband   | Bull / Infiniband    |
| Processors       | Intel E5-2670 | Intel X7560         | Intel E5-2680        |
| Frequency        | 2.6 GHz       | 2.26 GHz            | 2.7 GHz              |
| Compute nodes    | 92            | 360                 | 5040                 |
| Chips per node   | 2             | 4                   | 2                    |
| Cores per chip   | 8             | 8                   | 8                    |
| Threads per core | 2             | 2                   | 2                    |
| Memory per node  | 32 GB         | 128 GB              | 64 GB                |
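
The per-node figures in the table above can be combined into core and hardware-thread counts. The following is a minimal sketch of that arithmetic; the `NodeType` class and its property names are illustrative assumptions, and only the numbers come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeType:
    """One column of the hardware table (hypothetical helper, not a VI-HPS API)."""
    name: str
    nodes: int
    chips_per_node: int
    cores_per_chip: int
    threads_per_core: int

    @property
    def cores_per_node(self) -> int:
        # physical cores = chips x cores per chip
        return self.chips_per_node * self.cores_per_chip

    @property
    def hw_threads_per_node(self) -> int:
        # hardware threads = physical cores x threads per core
        return self.cores_per_node * self.threads_per_core

    @property
    def total_cores(self) -> int:
        # physical cores across all compute nodes of this type
        return self.nodes * self.cores_per_node


# Values copied from the hardware table.
SYSTEMS = [
    NodeType("poincare", 92, 2, 8, 2),
    NodeType("curie (fat nodes)", 360, 4, 8, 2),
    NodeType("curie (thin nodes)", 5040, 2, 8, 2),
]

for s in SYSTEMS:
    print(f"{s.name}: {s.cores_per_node} cores/node, "
          f"{s.hw_threads_per_node} hardware threads/node, "
          f"{s.total_cores} cores in total")
```

Under these figures, four *poincare* nodes provide 4 × 16 = 64 physical cores, which appears consistent with the "up to 64 processes/threads" suggested for initial configurations.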

| System                                   | *poincare*    | *curie*                |
|------------------------------------------|---------------|------------------------|
| Domain                                   | mds.cea.fr    | ccc.cea.fr             |
| **Filesystem** / parallel filesystem     | GPFS          | *Lustre* / `$WORKDIR`  |
| **Compiler**                             | *Intel*       | *Intel*                |
| OpenMP flag                              | -openmp       | -openmp                |
| **MPI**                                  | *Intel*       | *Bullx*                |
| C compiler                               | mpiicc        | mpicc                  |
| C++ compiler                             | mpiicpc       | mpicxx                 |
| F77 compiler                             | mpiifort      | mpif77                 |
| F90 compiler                             | mpiifort      | mpif90                 |
| **Queue**                                | *LoadLeveler* | *SLURM*                |
| job submit                               | llsubmit job  | ccc_msub job           |
| list jobs                                | llq           | qstat                  |
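
As a rough illustration of how the software table maps onto actual command lines, the sketch below assembles a compile command and a job-submission command for each system. The compiler wrappers, the `-openmp` flag and the submit commands (`llsubmit` for LoadLeveler, `ccc_msub` on *curie*) are taken from the table; the `-o` output option, the file names and the helper functions themselves are assumptions made for the example.

```python
import shlex

# Per-system settings copied from the software table.
TOOLCHAINS = {
    "poincare": {
        "c": "mpiicc", "c++": "mpiicpc", "f77": "mpiifort", "f90": "mpiifort",
        "openmp_flag": "-openmp", "submit": "llsubmit",
    },
    "curie": {
        "c": "mpicc", "c++": "mpicxx", "f77": "mpif77", "f90": "mpif90",
        "openmp_flag": "-openmp", "submit": "ccc_msub",
    },
}


def compile_command(system, language, source, output, openmp=True):
    """Return a compile command line (hypothetical helper; '-o' is assumed)."""
    tc = TOOLCHAINS[system]
    cmd = [tc[language]]
    if openmp:
        cmd.append(tc["openmp_flag"])
    cmd += ["-o", output, source]
    return shlex.join(cmd)


def submit_command(system, job_script):
    """Return the batch submission command for a job script."""
    return shlex.join([TOOLCHAINS[system]["submit"], job_script])


# Example with placeholder file names.
print(compile_command("poincare", "c", "hello.c", "hello"))
print(submit_command("poincare", "job.sh"))
print(compile_command("curie", "f90", "hello.f90", "hello", openmp=False))
print(submit_command("curie", "job.sh"))
```

The strings are only printed, not executed, so the sketch can be run on any machine to check which wrapper and submit command apply before logging in to the workshop systems.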
