Virtual Institute – High Productivity Supercomputing

6th Workshop on Extreme-Scale Programming Tools



Sunday, November 12, 2017
09:00 - 17:00


Held in conjunction with SC17: The International Conference for High Performance Computing, Networking, Storage and Analysis
Denver, Colorado, USA


The path to exascale computing will challenge HPC application developers in their quest to achieve the maximum potential that the machines have to offer. Factors such as limited power budgets, heterogeneity, hierarchical memories, shrinking I/O bandwidths, and performance variability will make it increasingly difficult to create productive applications on future platforms. Tools for debugging, performance measurement and analysis, and tuning will be needed to overcome the architectural, system, and programming complexities envisioned in these exascale environments. At the same time, research and development progress for HPC tools faces equally difficult challenges from exascale factors. Increased emphasis on autotuning, dynamic monitoring and adaptation, heterogeneous analysis, and so on will require new methodologies, techniques, and engagement with application teams. This workshop will serve as a forum for HPC application developers, system designers, and tools researchers to discuss the requirements for exascale-ready/exascale-enabled tools and the roadblocks that need to be addressed.

The workshop is the sixth in a series of SC conference workshops organized by the Virtual Institute - High Productivity Supercomputing (VI-HPS), an international initiative of HPC researchers and developers focused on parallel programming and performance tools for large-scale systems.

Important Dates

  • October 1, 2017 : Paper submissions due (extended deadline)
  • October 23, 2017 : Author notification (revised)
  • November 12, 2017 : ESPT Workshop, 9:00 - 17:30, Sunday
  • December 18, 2017 : Final version of papers due

Workshop Program

09:00 – 09:15 Welcome and introduction by Allen Malony
09:15 – 10:00
"Scalability dimensions of HPC: what tool developers should keep in mind when developing tools to support HPC code development"

by Matthias S. Müller
The performance growth of HPC systems and applications in recent years has been driven by increasing parallelism. Therefore, when we think about scalability, we first think about scaling the number of parallel execution units that a parallel application exploits during execution. In my talk, I will focus on two things: first, I will argue that scalability has more dimensions than just the number of cores. Second, I will point out where tool developers can make essential contributions to the development of future programming models and their usability.
10:00 – 10:30
"Improved Accuracy for Automated Communication Pattern Characterization Using Communication Graphs and Aggressive Search Space Pruning"
by Philip Roth
An understanding of a parallel application's communication behavior is useful for a range of activities including debugging and optimization, job scheduling, target system selection, and system design. Because it can be challenging to understand communication behavior, especially for those who lack expertise or who are not familiar with the application, we recently developed an automated, search-based approach for recognizing and parameterizing application communication behavior using a library of common communication patterns. Our initial approach was effective for characterizing the behavior of many workloads, but we identified some combinations of communication patterns for which the method was inefficient or would fail. In this paper, we discuss one such troublesome pattern combination and propose modifications to our recognition method to handle it. Specifically, we propose an alternative approach that uses communication graphs instead of traditional communication matrices to improve recognition accuracy for collective communication operations, and that uses a non-greedy recognition technique to avoid search space dead-ends that trap our original greedy recognition approach. Our modified approach uses aggressive search space pruning and heuristics to control the potential for state explosion caused by its non-greedy pattern recognition method. We demonstrate the improved recognition accuracy and pruning efficacy of our modified approach using several synthetic and real-world communication pattern combinations.
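As a rough illustration of the matrix-based, greedy style of recognition that the abstract describes as the starting point, the following Python sketch subtracts scaled library patterns from an observed communication matrix. It is not the authors' implementation: the function names, the two-pattern library, and the message counts are all hypothetical, and the sketch does not reproduce the dead-end cases or the graph-based, non-greedy approach proposed in the paper.

```python
import numpy as np


def ring_pattern(n):
    """Hypothetical library pattern: rank i sends to rank (i + 1) % n."""
    m = np.zeros((n, n), dtype=int)
    for i in range(n):
        m[i, (i + 1) % n] = 1
    return m


def broadcast_pattern(n, root=0):
    """Hypothetical library pattern: the root sends to every other rank."""
    m = np.zeros((n, n), dtype=int)
    for j in range(n):
        if j != root:
            m[root, j] = 1
    return m


def greedy_recognize(matrix, library):
    """Greedily remove each pattern at the largest scale that still fits.

    Returns the list of (name, scale) pairs found and the residual matrix
    that no pattern in the library explained.
    """
    residual = matrix.copy()
    found = []
    for name, pattern in library:
        mask = pattern > 0
        if not mask.any():
            continue  # empty pattern, nothing to match
        scale = int(residual[mask].min())  # largest multiple that fits
        if scale > 0:
            residual = residual - scale * pattern
            found.append((name, scale))
    return found, residual


# Assumed example: 4 ranks, 3 ring exchanges plus 2 broadcasts from rank 0.
n = 4
observed = 3 * ring_pattern(n) + 2 * broadcast_pattern(n)
library = [("ring", ring_pattern(n)), ("broadcast", broadcast_pattern(n))]
found, residual = greedy_recognize(observed, library)
```

In this small case the greedy pass explains the whole matrix and leaves an all-zero residual. A greedy choice made early cannot be undone, which is the kind of limitation the non-greedy search in the paper is meant to address.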
10:30 – 11:00 Coffee break
11:00 – 11:30
"CAASCADE: A System for Static Analysis of HPC Software Portfolios"
by M. Graham Lopez, Oscar Hernandez, Reuben D. Budiardja, and Jack Wells
With the increasing complexity of upcoming HPC systems, so-called "co-design" efforts to develop the hardware and applications in concert for these systems also become more challenging. It is currently difficult to gather information about the usage of programming model features, libraries, and data structure considerations in a quantitative way across a variety of applications, and this information is needed to prioritize development efforts in systems software and hardware optimizations. In this paper, we propose CAASCADE, a system that can harvest this information automatically in production HPC environments, and we show some early results from a prototype of the system based on GNU compilers and a MySQL database.
11:30 – 12:00
"Visual Comparison of Trace Files in Vampir"
by Matthias Weber, Ronny Brendel, Michael Wagner, Robert Dietrich, Ronny Tschueter and Holger Brunst
Comparing data is a key activity of performance analysis. It is required to relate performance results before and after optimizations, while porting to new hardware, and when using new programming models and libraries. While comparing profiles is straightforward, relating detailed trace data remains challenging. This work introduces the Comparison View. This new view extends the trace visualizer Vampir to enable comparative visual performance analysis. It displays multiple traces in one synchronized view and adds a range of alignment techniques to aid visual inspection. We demonstrate the Comparison View's value in three real-world performance analysis scenarios.
12:00 – 12:30
"Enhancing PAPI with Low-Overhead rdpmc Reads"
by Yan Liu and Vincent Weaver
The PAPI performance library is a widely used tool for gathering self-monitored performance data from running applications. A key aspect of self-monitoring is the ability to read hardware performance counters with the minimum possible overhead. If read overhead becomes too large, the measurement time will start to interfere with the gathered results, adversely affecting the performance analysis.
On Linux systems PAPI uses the perf_event subsystem to access the counter values via the read() system call. On x86 systems the special rdpmc instruction allows userspace measurement of counters without the overhead of entering the operating system kernel. We modify PAPI to use rdpmc rather than read() and find it typically improves the latency by at least a factor of three (and often a factor of six or more) on most modern systems. We analyze the effectiveness and limitations of the rdpmc interface and propose that it be enabled by default in PAPI.
12:30 – 13:30 Lunch break
13:30 – 14:00
"Moya - A JIT Compiler for HPC"
by Tarun Prabhu and William Gropp
We describe Moya, an annotation-driven JIT compiler for compiled languages such as Fortran, C and C++. We show that a combination of a small number of easy-to-use annotations coupled with aggressive static analysis that enables dynamic optimization can be used to improve the performance of computationally intensive, long-running numerical applications. We obtain speedups of up to 1.5 on JIT'ted functions and overcome the overheads of the JIT compilation within 25 timesteps in a combustion-simulation application.
14:00 – 14:30
"Polyhedral Optimization of TensorFlow Computation Graphs"
by Benoit Pradelle, Benoit Meister, Muthu Baskaran, Jonathan Springer and Richard Lethin
We present R-Stream·TF, a polyhedral optimization tool for neural network computations. R-Stream·TF transforms computations performed in a neural network graph into C programs suited to the polyhedral representation and uses R-Stream, a polyhedral compiler, to parallelize and optimize the computations performed in the graph. R-Stream·TF can exploit the optimizations available with R-Stream to generate a highly optimized version of the computation graph, specifically mapped to the targeted architecture. During our experiments, R-Stream·TF was able to automatically reach performance levels close to the hand-optimized implementations, demonstrating its utility in porting neural network computations to parallel architectures.
14:30 – 15:00
"Compiler-Assisted Preloading in the Shared Memory for Thread-Dense Memory Requests"
by Hyunjun Kim, Sungin Hong, Seongsoo Park, Jeonghwan Park and Hwansoo Han
Most GPU architectures provide on-chip memory along with off-chip memory. The shared memory is an on-chip memory that is faster than off-chip global memory by a factor of 10 to 100 but has limited capacity. In this work, we devise a data preloading technique for the shared memory. Our technique primarily targets memory requests shared by many threads (we call these data thread-dense) and converts them to preloading instructions to utilize the full memory bandwidth between on-chip memory and off-chip memory. First, we develop a memory trace generator for GPUs to extract the traces for off-chip memory accesses. Then, we identify thread-dense memory requests from the traces. We also develop a data selection algorithm to choose an appropriate amount of data, which does not hurt the thread-level parallelism. Finally, our compiler transforms GPU kernels to preload the selected data in the shared memory. Among 38 applications in our study, our proposed technique found that 13 applications have thread-dense memory requests. Compared to the 18 baseline kernels from the 13 applications, our proposed method achieves a geometric mean speedup of 1.22x and a highest speedup of 2.83x on an Nvidia GTX980.
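The trace-analysis step mentioned in the abstract (identifying thread-dense memory requests) might be pictured with the following Python sketch. The trace format, the function name, and the threshold are assumptions made for illustration only; the abstract does not state how the authors define the cutoff for "many threads", and this sketch says nothing about the data selection algorithm or the compiler transformation.

```python
from collections import defaultdict


def thread_dense_addresses(trace, min_threads):
    """Return addresses requested by at least `min_threads` distinct threads.

    `trace` is an iterable of (thread_id, address) pairs, a hypothetical
    format standing in for an off-chip memory access trace.
    """
    threads_per_addr = defaultdict(set)
    for thread_id, address in trace:
        threads_per_addr[address].add(thread_id)
    return sorted(
        addr
        for addr, threads in threads_per_addr.items()
        if len(threads) >= min_threads
    )


# Assumed example: four threads each read one shared coefficient at 0x100
# and one private element at 0x200 + 4 * thread_id.
trace = []
for tid in range(4):
    trace.append((tid, 0x100))            # shared by every thread
    trace.append((tid, 0x200 + 4 * tid))  # private to this thread

dense = thread_dense_addresses(trace, min_threads=4)
```

Only the address touched by all four threads is reported here, which makes it the natural candidate for preloading into shared memory in this toy setting.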
15:00 – 15:30 Coffee break
15:30 – 16:00
"Generic Library Interception for Improved Performance Measurement and Insight"
by Ronny Brendel, Bert Wesarg, Ronny Tschüter, Matthias Weber, Thomas Ilsche and Sebastian Oeste

As applications grow in capability, they also grow in complexity. This complexity in turn gets pushed into modules and libraries. In addition, hardware configurations become increasingly elaborate, too. These two trends make understanding, debugging and analyzing the performance of applications more and more difficult.

To enable detailed insight into library usage of applications, we present an approach and implementation in Score-P that supports intuitive and robust creation of wrappers for arbitrary C/C++ libraries. Runtime analysis then uses these wrappers to keep track of how applications interact with libraries and how they interact with each other, and to record the exact timing of their functions.
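The general idea of intercepting library calls through wrappers can be sketched in Python as below. This is only an analogy: it does not use Score-P, does not wrap C/C++ libraries, and the names `wrap_library` and `_timed` are hypothetical. It wraps the public callables of a Python module and records how long each call takes.

```python
import functools
import math
import time
import types


def _timed(name, func, log):
    """Return a wrapper that appends (name, elapsed_seconds) to `log`."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Record the timing even if the wrapped call raises.
            log.append((name, time.perf_counter() - start))
    return wrapper


def wrap_library(module, log):
    """Build a namespace whose public callables are timed wrappers."""
    wrapped = types.SimpleNamespace()
    for name in dir(module):
        attr = getattr(module, name)
        if callable(attr) and not name.startswith("_"):
            setattr(wrapped, name, _timed(name, attr, log))
    return wrapped


# Assumed example: intercept calls into the standard `math` module.
call_log = []
timed_math = wrap_library(math, call_log)
result = timed_math.sqrt(9.0)
```

After the call, `call_log` holds one entry naming `sqrt` together with its measured duration, while the caller still receives the unmodified return value.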

16:00 – 16:30
"A Brief History of the Virtual Institute – High Productivity Supercomputing"
by Felix Wolf

With initial funding from the Helmholtz Association, the Virtual Institute – High Productivity Supercomputing (VI-HPS) was founded in 2007 on the initiative of Forschungszentrum Jülich together with RWTH Aachen University, TU Dresden, and the University of Tennessee as founding members. The institute was established to increase the productivity of application programmers in high-performance computing (HPC), helping them to focus on the science they want to accomplish instead of spending major portions of their time solving problems related to their software.

To achieve this, the members of the institute developed powerful programming tools, in particular for the purpose of analyzing HPC application correctness and performance, which are today used across the globe. Major emphasis was given to the definition of common interfaces and exchange formats between these tools to improve the interoperability between them and lower their development cost. A series of international tuning workshops taught hundreds of application developers how to use them. Finally, the institute organized numerous academic workshops to foster the HPC tools community and offer especially young researchers a forum to present novel program analysis methods. Today, the institute encompasses twelve member organizations from five countries.

16:30 – 17:00 General discussion and conclusion

Note that the LNCS publication of accepted papers will occur after the ESPT workshop. Final versions of the papers will be due in December. This will give the authors the opportunity to update their papers based on workshop discussions.

Organizing committee

Allen D. Malony, University of Oregon, USA
Judit Gimenez, Barcelona Supercomputing Center, Spain
William Jalby, Université de Versailles St-Quentin-en-Yvelines, France
Martin Schulz, Lawrence Livermore National Laboratory, USA


Contact: Allen D. Malony (phone +1-541-346-4407)

Program committee

Jean-Baptiste Besnard, ParaTools SAS, France
Michael Gerndt, Technische Universität München, Germany
Judit Gimenez, Barcelona Supercomputing Center, Spain
Kevin Huck, University of Oregon, USA
Heike Jagode, University of Tennessee, USA
William Jalby, Université de Versailles St-Quentin-en-Yvelines, France
Andreas Knüpfer, Technische Universität Dresden, Germany
Allen D. Malony, University of Oregon, USA
Barton P. Miller, University of Wisconsin, Madison, USA
Pablo Oliveira, Université de Versailles St-Quentin-en-Yvelines, France
Martin Schulz, Lawrence Livermore National Laboratory, USA
Sameer Shende, University of Oregon, USA
Jan Treibig, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Felix Wolf, Technische Universität Darmstadt, Germany
Brian Wylie, Jülich Supercomputing Centre, Germany

Previous workshops