Monday, 16 November 2015
Held in conjunction with SC15: The International Conference for High Performance Computing, Networking, Storage and Analysis
Salon B, Hilton Hotel close to the Austin Convention Center
Austin, Texas, USA
Call for Papers
Architectural complexity in HPC is growing, bringing challenges such as tight power budgets, variability in CPU clock frequencies, load balancing in heterogeneous systems, hierarchical memories, and shrinking I/O bandwidths. These challenges are especially prominent on the path to exascale, making tool support for debugging and performance optimization more necessary than ever. However, the same challenges also apply to tools development and, in particular, raise the importance of topics such as automatic tuning and methodologies for tool-aided application development. This workshop will serve as a forum for HPC application developers, system designers, and tools researchers to discuss the requirements for exascale-enabled tools and the roadblocks that need to be addressed on the way. We also highly encourage application developers to share their experiences with existing tools. The event will serve as a community forum for everyone interested in interoperable tool sets ready for an exascale software stack.
Workshop format and topics
The workshop consists of regular paper presentations and work-in-progress presentations as well as two keynote presentations. Each regular paper submission underwent a peer-review process. Accepted contributions will be published in the SC workshop proceedings, in cooperation with SIGHPC, through the ACM Digital Library and IEEE Xplore.
Workshop topics include
- Programming tools, such as performance analysis and tuning tools, debuggers, correctness checking tools, IDEs, and more
- Methodologies for performance engineering
- Tool technologies tackling extreme-scale challenges, such as scalability, resilience, power, etc.
- Tool infrastructures
- Application developer experiences with programming tools
The workshop is the fourth in a series at SC conferences organized by the Virtual Institute - High Productivity Supercomputing in collaboration with the Priority Programme "Software for Exascale Computing" of the German Research Foundation (DFG).
09:00 – 09:20
Welcome and introduction by Andreas Knüpfer [PDF]
09:20 – 10:00
Keynote presentation: "Providing a Robust Tools Landscape for CORAL Machines"
by Michael J. Brim, Dong H. Ahn, Scott Parker, and Gregory Watson
CORAL, the Collaboration of Oak Ridge, Argonne, and Lawrence Livermore national laboratories, is a joint effort to procure and deploy the next generation of capability-class supercomputers for the Department of Energy's Office of Science and National Nuclear Security Administration. The joint procurement selected two diverse system architectures from IBM and Intel. ORNL and LLNL will deploy machines (Summit and Sierra, respectively) based on the IBM design, which includes multiple IBM POWER9 processors and NVIDIA Volta GPUs per node and a Mellanox EDR InfiniBand interconnect. ANL will deploy the Aurora machine based on the Intel design, which includes 3rd-generation Xeon Phi processors and a 2nd-generation Intel Omni-Path interconnect. In this talk, we will present a brief overview of each machine's hardware, system software, and programming environment, followed by our strategy to provide a robust tools landscape for these machines by the time they are deployed for use in 2018. In particular, we will discuss tools community engagement opportunities, such as:
- CORAL tools community announcements and discussion venues
- Obtaining more detailed information about the CORAL machines
- Providing tool support requirements and test cases that can be used during acceptance testing
- Potential interactions with early-science application teams
- Wish lists for CORAL tools based on prior experiences with existing tools on Mira, Titan, and Sequoia
- Potential access to early evaluation systems to aid in porting and feature development
10:10 – 10:30
Work-in-progress presentation: "Large-scale debugging with graphs"
by Nikoli Dryden
Understanding and manipulating the state of applications is critical to identifying and fixing bugs. As applications scale, it grows increasingly difficult to usefully debug them due to the large quantities of information generated by a debugger. We address this by representing debugger output and program state as graphs, which provide a simple route to scaling both the debugger and the presentation of results. Graphs are cut when collective operations are performed and are represented as equivalence classes to reduce data volume and improve comprehensibility. This approach is currently being implemented in PGDB, an existing open-source parallel debugger for MPI applications, and early results indicate this is a viable approach to debugging large-scale applications.
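The idea of collapsing per-process debugger output into equivalence classes can be illustrated with a short sketch. This is not PGDB's actual implementation; the ranks and call stacks below are hypothetical, and the grouping key (the full call stack) is an assumption chosen for simplicity.

```python
# Illustrative sketch: group MPI ranks into equivalence classes by their
# call stacks, so a debugger can present one representative per class
# instead of one entry per rank.
from collections import defaultdict

def equivalence_classes(stacks):
    """Map each distinct call stack to the list of ranks exhibiting it."""
    classes = defaultdict(list)
    for rank, stack in stacks.items():
        classes[tuple(stack)].append(rank)
    return dict(classes)

# Hypothetical snapshot: six ranks, but only two distinct program states.
snapshot = {
    0: ["main", "solve", "MPI_Allreduce"],
    1: ["main", "solve", "MPI_Allreduce"],
    2: ["main", "io_write"],
    3: ["main", "solve", "MPI_Allreduce"],
    4: ["main", "io_write"],
    5: ["main", "solve", "MPI_Allreduce"],
}
for stack, ranks in equivalence_classes(snapshot).items():
    print(f"{len(ranks)} ranks at {' > '.join(stack)}: {ranks}")
```

However many ranks participate, the amount of output is bounded by the number of distinct states, which is what makes the approach attractive at scale.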
10:30 – 11:00
11:00 – 11:30
Full-paper presentation: “Preventing the explosion of exascale profile data with smart thread-level aggregation”
by Daniel Lorenz, Sergei Shudler, Felix Wolf
State-of-the-art performance analysis tools, such as Score-P, record performance profiles on a per-thread basis. However, for exascale systems the number of threads is expected to be on the order of a billion, which would result in extremely large performance profiles. In most cases, the user never inspects the individual per-thread data. In this paper, we propose to aggregate per-thread performance data in each process to reduce its amount to a reasonable size. Our goal is to aggregate the threads such that thread-level performance issues remain visible and analyzable. To this end, we implemented four aggregation strategies in Score-P: (i) SUM – aggregates all threads of a process into a process profile; (ii) SET – calculates statistical key data as well as the sum; (iii) KEY – identifies three threads (i.e., key threads) of particular interest for performance analysis and aggregates the rest of the threads; (iv) CALLTREE – clusters threads that have the same call-tree structure. For each of these strategies we evaluate the compression ratio and how well thread-level performance behavior information is maintained. The aggregation does not incur any additional performance overhead at application run time.
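The first two strategies can be sketched in a few lines. This is an illustrative example, not Score-P's implementation; the per-thread times and the choice of statistics in SET (sum, min, max, mean) are assumptions for the sketch.

```python
# Illustrative sketch of the SUM and SET aggregation strategies: per-thread
# metric values for one call path are collapsed into a process-level profile.
import statistics

def aggregate_sum(thread_values):
    """SUM: fold all threads of a process into a single process value."""
    return sum(thread_values)

def aggregate_set(thread_values):
    """SET: keep statistical key data alongside the sum."""
    return {
        "sum": sum(thread_values),
        "min": min(thread_values),
        "max": max(thread_values),
        "mean": statistics.mean(thread_values),
    }

# Hypothetical per-thread times (seconds) for one call path on one process;
# one straggler thread dominates.
times = [1.9, 2.1, 2.0, 7.5]
print(aggregate_sum(times))   # 13.5
print(aggregate_set(times))
```

Note how SUM hides the straggler entirely, while SET's min/max spread still exposes the imbalance at a fraction of the per-thread data volume.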
11:30 – 12:00
Work-in-progress presentation: "Progress report on the Integrative Model for Parallelism"
by Victor Eijkhout
The Integrative Model for Parallelism (IMP) is a parallel programming system, primarily aimed at scientific applications, that offers an integrated model for programming many different modes of parallelism. In the IMP system, a single source file can be compiled into message-passing, DAG-based, and hybrid execution models. Moreover, performance-enhancing techniques such as redundant computation and latency hiding are applied automatically, without the need for programmer intervention.
12:00 – 12:30
Work-in-progress presentation: "A Principled Approach to HPC Monitoring"
by Aaron Gonzales
As exascale computing looms on the horizon, understanding the effectiveness of current system and resource monitoring strategies becomes crucial. In partnership with Los Alamos National Laboratory, the University of New Mexico is investigating what current monitoring systems lack in predictive or analytic power and what contemporary types of monitoring, e.g., the Lightweight Distributed Metric Service or MRNet, can give HPC operators and users. More precisely, our studies aim to identify the qualitative and quantitative resource monitoring needs of exascale systems, tools, and services. In addition, we are constructing an Open Data project to publicly release monitoring data from a large cluster of machines, which will enable other HPC researchers to investigate this problem space.
12:30 – 13:40
13:40 – 14:30
Keynote presentation: "Performance Optimization and Productivity"
by Judit Gimenez and Jesus Labarta
In the current context, where the complexity of applications and systems is dramatically exploding, code developers and users frequently lack the time or methodologies to perform a deep analysis and understand the actual behavior of their codes. On October 1st, a Center of Excellence on Performance Optimization and Productivity (POP) was established within the framework of EU H2020, gathering leading experts in performance tools/analysis and programming models. POP offers free services to the academic and industrial communities to help them better understand the behavior of their applications, suggest the most productive directions for optimizing the performance of their codes, and help implement those transformations in the most productive way. The talk will present the POP project as well as techniques that can be used to gain deeper insight into applications' performance.
14:30 – 15:00
Work-in-progress presentation: "The OpenACC 2.5 Profiling Interface: A Tool Study"
by Robert Dietrich, Bert Wesarg and Guido Juckeland
With the release of the OpenACC 2.5 standard, an interface for performance analysis becomes part of the specification. This standardized interface makes it possible to build portable and robust tools. The new tool interface is dedicated to instrumentation-based tools for profile and trace data collection. Based on event callbacks, it enables recording of OpenACC runtime events that occur during the execution of an OpenACC application. In this talk, we present the implementation of the interface in the measurement infrastructure Score-P to explore its practical benefit for performance analysis. We will showcase these new analysis capabilities and illustrate the synergy with performance analysis for lower-level programming models such as CUDA and OpenCL. Score-P can generate profiles in the CUBE4 format and traces that can be visualized with Vampir. It can therefore fully exploit the capabilities of the new interface, providing the program developer with a clearer understanding of the dynamic runtime behavior of the application and highlighting potential bottlenecks. Experiments with the OpenACC benchmarks of the SPEC ACCEL suite and an extremely short-running reduction kernel were used to evaluate the instrumentation overhead.
15:00 – 15:30
15:30 – 16:00
Full-paper presentation: "HPC I/O Trace Extrapolation"
by Xiaoqing Luo, Frank Mueller, Philip Carns, John Jenkins, Robert Latham, Robert Ross and Shane Snyder
The rapid development of today’s supercomputers has made I/O performance a major bottleneck for many scientific applications. Trace analysis tools have thus become vital for diagnosing root causes of I/O problems. This work contributes an I/O tracing framework with elastic traces: after gathering a set of smaller traces, we extrapolate the application trace to a large number of nodes. The traces can in principle be extrapolated even beyond the scale of present-day systems. Experiments with I/O benchmarks on up to 320 processors indicate that extrapolated I/O trace replays closely resemble the I/O behavior of equivalent applications.
16:00 – 16:30
Work-in-progress presentation: "Exploring the Impact of Overlay Network Topology on Tool and Application Performance"
by Whit Schonbein and Dorian Arnold
Software infrastructures such as MRNet provide tools and applications with a lightweight tree-based overlay network for gathering data from leaf processes (or back-ends) and delivering potentially aggregated data to a root process (or front-end). Tree structures are common to many software architectures, but there are only limited studies on how different types of workloads perform under different tree-based overlay network topologies -- particularly when the internal processes of the tree have data aggregation capabilities. We are empirically assessing this relationship. Specifically, we have designed a benchmarking tool that lets us generate different tree topologies and input workloads whose parameters vary along important dimensions such as packet size, packet rate, and aggregation latency. The benchmark outputs performance data for each configuration. As part of this study, we also investigate several real-world tool case studies and use them to inform our synthetic workloads as well as to generate real ones.
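The core mechanism such overlay networks exploit can be sketched briefly. This is not MRNet code; the fan-out, workload, and combining function below are assumptions chosen to show how aggregation at internal processes bounds the data reaching the front-end.

```python
# Illustrative sketch: reduce values up a k-ary tree-based overlay network.
# Each internal process combines the data of its children before forwarding
# a single packet upward, so the root receives one packet per round
# regardless of the number of back-ends.
def reduce_up_tree(leaf_values, fanout, combine):
    """Aggregate leaf values level by level; return (result, tree depth)."""
    level = list(leaf_values)
    depth = 0
    while len(level) > 1:
        # Group the current level into chunks of `fanout` children each and
        # replace every chunk with its combined value (one parent per chunk).
        level = [combine(level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
        depth += 1
    return level[0], depth

# Hypothetical workload: 16 back-ends with counter values 0..15, fan-out 4,
# summing at each internal process.
total, depth = reduce_up_tree(range(16), fanout=4, combine=sum)
print(total, depth)  # 120 2
```

The depth returned is the number of aggregation hops, one of the topology parameters (alongside fan-out) whose performance impact the study above measures.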
16:30 – 17:00
Work-in-progress presentation: "Initial Validation of DRAM and GPU RAPL Power Measurements"
by Spencer Desrochers, Chad Paradis, and Vincent Weaver
Recent Intel processors support gathering estimated energy readings via the RAPL (Running Average Power Limit) interface. There is some existing work on validating that these energy results are sane, although such work tends to be brief and to concentrate on the CPU values. We have instrumented a Haswell desktop machine for detailed external power readings and use these to validate not only the RAPL CPU results but also the RAPL estimates of DRAM and GPU energy consumption. We will present the various complications found when instrumenting both this desktop machine and a high-end Sandy Bridge-EP server, including problems with intercepting power lines, proprietary connectors, and limitations of the perf tool and the Linux perf_event_open() interface. We have taken actual and RAPL measurements across STREAM, HPL Linpack, and various GPU tests. Our initial findings show that RAPL seems to underestimate DRAM energy values, especially when the memory subsystem is idle.
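The basic arithmetic behind comparing RAPL against external meters can be sketched as follows. On Linux, RAPL counters are exposed as monotonically increasing microjoule values (e.g. via the powercap sysfs file `energy_uj`); the wrap-around constant below is an assumption for the example, since the actual counter width varies by domain and CPU generation.

```python
# Illustrative sketch: derive average power from two RAPL energy samples.
# RAPL reports cumulative energy in microjoules; average power over an
# interval is the (wrap-corrected) energy delta divided by elapsed time.
ENERGY_WRAP_UJ = 2 ** 32  # assumed counter range; varies by domain/CPU

def average_power_w(energy_uj_start, energy_uj_end, seconds):
    """Average power in watts between two cumulative microjoule samples."""
    delta = energy_uj_end - energy_uj_start
    if delta < 0:                 # counter wrapped between the two samples
        delta += ENERGY_WRAP_UJ
    return (delta / 1e6) / seconds

# Hypothetical samples taken 2 s apart: 30 J consumed -> 15 W average.
print(average_power_w(1_000_000, 31_000_000, 2.0))  # 15.0
```

A validation study like the one above compares this derived RAPL power, per domain (package, DRAM, GPU), against the externally metered power over the same interval.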
17:00 – 17:30
Workshop organizers
Andreas Knüpfer, TU Dresden, Germany (email@example.com)
Martin Schulz, Lawrence Livermore National Laboratory, USA
Felix Wolf, TU Darmstadt, Germany
Brian Wylie, Jülich Supercomputing Centre, Germany
Program committee
Luiz DeRose, Cray Inc., USA
Karl Fürlinger, LMU Munich, Germany
Jim Galarowicz, Krell Institute, USA
Judit Gimenez, Barcelona Supercomputing Center, Spain
Chris Gottbrath, Rogue Wave Software, Inc., USA
Oscar Hernandez, Oak Ridge National Laboratory, USA
Heike Jagode, University of Tennessee, Knoxville, USA
Christos Kartsaklis, Oak Ridge National Laboratory, USA
Andreas Knüpfer, TU Dresden, Germany
David Lecomber, Allinea Software Ltd., UK
John Mellor-Crummey, Rice University, USA
Barton Miller, University of Wisconsin-Madison, USA
Martin Schulz, Lawrence Livermore National Laboratory, USA
Jan Eitzinger, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Felix Wolf, TU Darmstadt, Germany
Brian Wylie, Jülich Supercomputing Centre, Germany
Contact
Andreas Knüpfer (Email firstname.lastname@example.org, phone +49 351 463 38323)