Virtual Institute — High Productivity Supercomputing

5th Workshop on Extreme-Scale Programming Tools

SC16

Date

Sunday, November 13, 2016
14:00 – 17:30

Location

Held in conjunction with SC16: The International Conference for High Performance Computing, Networking, Storage and Analysis
Salt Lake City Convention Center, Room 155-E
Salt Lake City, Utah, USA

In cooperation with:
SIGHPC

Description

The path to exascale computing will challenge HPC application developers in their quest to achieve the maximum potential that the machines have to offer. Factors such as limited power budgets, heterogeneity, hierarchical memories, shrinking I/O bandwidths, and performance variability will make it increasingly difficult to create productive applications on future platforms. Tools for debugging, performance measurement and analysis, and tuning will be needed to overcome the architectural, system, and programming complexities envisioned in these exascale environments. At the same time, research and development progress for HPC tools faces equally difficult challenges from exascale factors. Increased emphasis on autotuning, dynamic monitoring and adaptation, heterogeneous analysis, and so on will require new methodologies, techniques, and engagement with application teams. This workshop will serve as a forum for HPC application developers, system designers, and tools researchers to discuss the requirements for exascale-ready/exascale-enabled tools and the roadblocks that need to be addressed.

The workshop is the fifth in a series of SC conference workshops organized by the Virtual Institute - High Productivity Supercomputing (VI-HPS), an international initiative of HPC researchers and developers focused on parallel programming and performance tools for large-scale systems. The ESPT-2016 proceedings are in the IEEE Digital Library.

Workshop Program

14:00 – 14:05 Welcome and introduction by Allen Malony
14:05 – 14:40
Keynote presentation:
"Exascale Application Drivers for Software Technologies"
by Doug Kothe
14:40 – 15:05
"Methodology and Application of HPC I/O Characterization with MPIProf and IOT"
by Yan-Tyng Sherry Chang, Henry Jin and John Bauer
Combining the strengths of MPIProf and IOT, an efficient and systematic method is devised for I/O characterization at the per-job, per-rank, per-file and per-call levels of programs running on the high-performance computing resources at the NASA Advanced Supercomputing (NAS) facility. This method is applied to four I/O questions in this paper. A total of 13 MPI programs and 15 cases, ranging from 24 to 5968 ranks, are analyzed to establish the I/O landscape from answers to the four questions. Four of the 13 programs use MPI I/O, and the behavior of their collective writes depends on the specific implementation of the MPI library used. The SGI MPT library, the prevailing MPI library for NAS systems, was found to automatically gather small writes from a large number of ranks in order to perform larger writes by a small subset of collective buffering ranks. The number of collective buffering ranks invoked by MPT depends on the Lustre stripe count and the number of nodes used for the run. A demonstration of varying the stripe count to achieve a double-digit speedup of one program's I/O is presented. Another program, which concurrently opens private files from all ranks and could potentially create a heavy load on the Lustre servers, is identified. The ability to systematically characterize I/O for a large number of programs running on a supercomputer, seek I/O optimization opportunities, and identify programs that could cause a high load and instability on the filesystems is important for pursuing exascale in a real production environment.
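
As a rough illustration of the collective-write tuning discussed in this abstract, the following C sketch (assumed, not taken from the paper) issues a small collective write per rank while requesting a Lustre stripe count through the standard MPI-IO "striping_factor" hint; the file name, stripe count, and the ROMIO-style "romio_cb_write" hint are assumptions, and whether they take effect depends on the MPI library and file system.

    /* Hedged sketch (not from the paper): each rank contributes a small block
     * to a collective write while a Lustre stripe count is requested through
     * the reserved MPI-IO hint "striping_factor".  With implementations such
     * as SGI MPT, the stripe count can also influence how many collective
     * buffering ranks aggregate these small writes into larger ones. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int count = 1024;                      /* small per-rank block */
        double *buf = malloc(count * sizeof(double));
        for (int i = 0; i < count; i++)
            buf[i] = rank + i * 1e-6;

        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "striping_factor", "8");     /* stripe count: assumed value */
        MPI_Info_set(info, "romio_cb_write", "enable"); /* ROMIO-style hint: assumption */

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "out.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

        MPI_Offset offset = (MPI_Offset)rank * count * sizeof(double);
        MPI_File_write_at_all(fh, offset, buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Info_free(&info);
        free(buf);
        MPI_Finalize();
        return 0;
    }
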
15:05 – 15:30 Coffee break
15:30 – 15:55
"Modular HPC I/O Characterization with Darshan"
by Shane Snyder, Philip Carns, Kevin Harms, Robert Ross, Glenn Lockwood and Nicholas Wright
Contemporary high-performance computing (HPC) applications encompass a broad range of distinct I/O strategies and are often executed on a number of different compute platforms in their lifetime. These large-scale HPC platforms employ increasingly complex I/O subsystems to provide a suitable level of I/O performance to applications. Tuning I/O workloads for such a system is nontrivial, and the results generally are not portable to other HPC systems. I/O profiling tools can help to address this challenge, but most existing tools instrument only specific components within the I/O subsystem, offering a limited perspective on I/O performance. The increasing diversity of scientific applications and computing platforms calls for greater flexibility and scope in I/O characterization.
In this work, we consider how the I/O profiling tool Darshan can be improved to allow for more flexible, comprehensive instrumentation of current and future HPC I/O workloads. We evaluate the performance and scalability of our design to ensure that it is lightweight enough for full-time deployment on production HPC systems. We also present two case studies illustrating how a more comprehensive instrumentation of application I/O workloads can enable insights into I/O behavior that were not previously possible. Our results indicate that Darshan’s modular instrumentation methods can provide valuable feedback to both users and system administrators, while imposing negligible overheads on user applications.
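
To make the scope problem concrete, the hedged C sketch below (an illustration, not code from the paper) mixes POSIX stdio and MPI-IO in a single program; a profiler that hooks only one of these layers would see only part of the application's I/O, which is the gap that modular instrumentation like Darshan's is meant to close. The file names are invented for illustration.

    /* Hedged sketch: an application whose I/O crosses interfaces.  A tool that
     * instruments only the MPI-IO layer would miss the per-rank POSIX traffic. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Per-rank log written through POSIX stdio (file name is an assumption). */
        char path[64];
        snprintf(path, sizeof(path), "log.%d.txt", rank);
        FILE *log = fopen(path, "w");
        fprintf(log, "rank %d starting\n", rank);
        fclose(log);

        /* Shared checkpoint written through MPI-IO. */
        double value = (double)rank;
        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_write_at_all(fh, (MPI_Offset)rank * sizeof(double),
                              &value, 1, MPI_DOUBLE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        MPI_Finalize();
        return 0;
    }
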
15:55 – 16:20
"Floating-Point Shadow Value Analysis"
by Michael Lam and Barry Rountree
Real-valued arithmetic has a fundamental impact on the performance and accuracy of scientific computation. As scientific application developers prepare their applications for exascale computing, many are investigating the possibility of using either lower precision (for better performance) or higher precision (for more accuracy). However, exploring alternative representations often requires significant code revision. We present a novel program analysis technique that emulates execution with alternative real number implementations at the binary level. We also present a Pin-based implementation of this technique that supports x86_64 programs and a variety of alternative representations.
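
The precision sensitivity that motivates this kind of shadow-value analysis can be seen in a small self-contained C example (an illustration, not the authors' Pin-based tool): the same reduction accumulated in single and double precision diverges measurably, which is exactly the gap a higher-precision shadow value would expose.

    /* Hedged illustration (not the Pin-based tool described above): the same
     * summation carried in a float accumulator and in a double "shadow"
     * accumulator, showing the divergence such an analysis would report. */
    #include <stdio.h>

    int main(void)
    {
        const int n = 10000000;
        float  sum_f = 0.0f;   /* candidate lower-precision representation */
        double sum_d = 0.0;    /* higher-precision shadow value */

        for (int i = 1; i <= n; i++) {
            float term = 1.0f / (float)i;
            sum_f += term;               /* native execution */
            sum_d += (double)term;       /* same inputs, wider accumulator */
        }

        printf("float accumulator : %.8f\n", sum_f);
        printf("double accumulator: %.8f\n", sum_d);
        printf("difference        : %.8f\n", sum_d - (double)sum_f);
        return 0;
    }
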
16:20 – 16:45
"Runtime Verification of Scientific Computing: Towards an Extreme Scale"
by Minh Ngoc Dinh, Chao Jin, David Abramson and Clinton Jeffery
Relative debugging helps trace software errors by comparing two concurrent executions of a program -- one code being a reference version and the other faulty. By locating data divergence between the runs, relative debugging is effective at finding coding errors when a program is scaled up to solve larger problem sizes or migrated from one platform to another. In this work, we envision potential changes to our current relative debugging scheme in order to address exascale factors such as the increased incidence of faults and non-deterministic outputs. First, we propose a statistics-based comparison scheme to support verifying results that are stochastic. Second, we leverage a scalable data reduction network to adapt to the complex network hierarchy of an exascale system, and extend our debugger to support the statistics-based comparison in an environment subject to failures.
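
As a minimal sketch of the statistics-based comparison idea (an illustration only, not the authors' debugger), the C fragment below accepts a run as matching the reference when the mean absolute difference stays within a tolerance derived from the spread of the reference values; the arrays and the tolerance factor are invented.

    /* Hedged sketch of statistically comparing a variable across two runs,
     * in the spirit of relative debugging with non-deterministic outputs.
     * The data and tolerance factor are invented for illustration. */
    #include <math.h>
    #include <stdio.h>

    /* Returns 1 if the mean absolute difference between the runs stays within
     * k standard deviations of the reference run's values, 0 otherwise. */
    static int runs_agree(const double *ref, const double *test, int n, double k)
    {
        double mean = 0.0, var = 0.0, diff = 0.0;
        for (int i = 0; i < n; i++)
            mean += ref[i];
        mean /= n;
        for (int i = 0; i < n; i++) {
            var  += (ref[i] - mean) * (ref[i] - mean);
            diff += fabs(ref[i] - test[i]);
        }
        var  /= n;
        diff /= n;
        return diff <= k * sqrt(var);
    }

    int main(void)
    {
        double reference[4] = {1.00, 2.00, 3.00, 4.00};   /* "correct" run */
        double suspect[4]   = {1.01, 1.99, 3.02, 3.98};   /* scaled or ported run */

        if (runs_agree(reference, suspect, 4, 0.05))
            printf("runs agree within tolerance\n");
        else
            printf("divergence detected\n");
        return 0;
    }
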
16:45 – 17:10
"Automatic Code Generation and Data Management for an Asynchronous Task-based Runtime"
by Muthu Baskaran, Benoit Pradelle, Benoit Meister, Athanasios Konstantinidis and Richard Lethin
Hardware scaling and low-power considerations associated with the quest for exascale and extreme scale computing are driving system designers to consider new runtime and execution models such as the event-driven-task (EDT) models that enable more concurrency and reduce the amount of synchronization. Further, for performance, productivity, and code sustainability reasons, there is an increasing demand for auto-parallelizing compiler technologies to automatically produce code for EDT-based runtimes. However, achieving scalable performance in extreme-scale systems with auto-generated codes is a non-trivial challenge. Key requirements for achieving good scalable performance across many EDT-based systems are: (1) scalable dynamic creation of the task-dependence graph and spawning of tasks, (2) scalable creation and management of data and communications, and (3) dynamic scheduling of tasks and movement of data for scalable asynchronous execution. In this paper, we develop capabilities within R-Stream -- an automatic source-to-source optimization compiler -- for automatic generation and optimization of code and data management targeted towards Open Community Runtime (OCR) -- an exascale-ready asynchronous task-based runtime. We demonstrate the effectiveness of our techniques through performance improvements on various benchmarks and proxy application kernels that are relevant to the extreme-scale computing community.
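
For readers unfamiliar with event-driven-task execution, the short C sketch below is a generic analogy using OpenMP task dependences (it is not R-Stream output and not OCR code): each task declares its inputs and outputs, and the runtime schedules it asynchronously once those dependences are satisfied, which is the execution style the auto-generated OCR code targets.

    /* Hedged analogy (not R-Stream-generated OCR code): dependence-driven tasks
     * expressed with OpenMP "task depend"; compile with an OpenMP-enabled
     * compiler, e.g. cc -fopenmp. */
    #include <stdio.h>

    int main(void)
    {
        double a = 0.0, b = 0.0, c = 0.0;

        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task depend(out: a)
            a = 1.0;                       /* producer task */

            #pragma omp task depend(out: b)
            b = 2.0;                       /* independent producer, may overlap */

            #pragma omp task depend(in: a, b) depend(out: c)
            c = a + b;                     /* consumer runs once a and b are ready */

            #pragma omp taskwait
        }

        printf("c = %f\n", c);
        return 0;
    }
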
17:10 – 17:35
"A Scalable Observation System for Introspection and In Situ Analytics"
by Chad Wood, Sudhanshu Sane, Daniel Ellsworth, Alfredo Gimenez, Kevin Huck, Todd Gamblin and Allen Malony
SOS is a new model for the online in situ characterization and analysis of complex high-performance computing applications. SOS employs a data framework with distributed information management and structured query and access capabilities. The primary design objectives of SOS are flexibility, scalability, and programmability. SOS provides a complete framework that can be configured with and used directly by an application, allowing for a detailed workflow analysis of scientific applications. This paper describes the model of SOS and the experiments used to validate and explore the performance characteristics of its implementation in SOSflow. Experimental results demonstrate that SOS is capable of observation, introspection, feedback and control of complex high-performance applications, and that it has desirable scaling properties.
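
The publish/introspect pattern described in this abstract can be pictured with a deliberately hypothetical C sketch; the names below (obs_publish, obs_query_latest) are invented for illustration and are not the SOSflow API. An application pushes named values into an observation layer each timestep, and an analysis component queries them online.

    /* Deliberately hypothetical sketch of the publish/introspect pattern the
     * paper describes; these names are NOT the SOSflow API, just an
     * illustration of an application feeding an in situ observation layer. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_OBS 64

    struct observation {
        char   name[32];
        double value;
    };

    static struct observation store[MAX_OBS];  /* stand-in for the distributed store */
    static int n_obs = 0;

    /* Application side: publish a named value each timestep. */
    static void obs_publish(const char *name, double value)
    {
        if (n_obs < MAX_OBS) {
            snprintf(store[n_obs].name, sizeof(store[n_obs].name), "%s", name);
            store[n_obs].value = value;
            n_obs++;
        }
    }

    /* Analysis side: query the most recent value published under a name. */
    static int obs_query_latest(const char *name, double *out)
    {
        for (int i = n_obs - 1; i >= 0; i--) {
            if (strcmp(store[i].name, name) == 0) {
                *out = store[i].value;
                return 1;
            }
        }
        return 0;
    }

    int main(void)
    {
        /* Simulated timesteps publishing a residual for online introspection. */
        for (int step = 0; step < 5; step++)
            obs_publish("solver.residual", 1.0 / (step + 1));

        double latest;
        if (obs_query_latest("solver.residual", &latest))
            printf("latest solver.residual = %f\n", latest);
        return 0;
    }
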

Organizing committee

Allen D. Malony, University of Oregon, USA
Martin Schulz, Lawrence Livermore National Laboratory, USA
Felix Wolf, TU Darmstadt, Germany
William Jalby, Université de Versailles St-Quentin-en-Yvelines, France

Program committee

Luiz DeRose, Cray Inc., USA
Michael Gerndt, Technische Universität München, Germany
Jeffrey K. Hollingsworth, University of Maryland, USA
William Jalby, Université de Versailles St-Quentin-en-Yvelines, France
Andreas Knüpfer, Technische Universität Dresden, Germany
David Lecomber, Allinea Software, UK
Allen D. Malony, University of Oregon, USA
John Mellor-Crummey, Rice University, USA
Martin Schulz, Lawrence Livermore National Laboratory, USA
Sameer Shende, University of Oregon, USA
Felix Wolf, Technische Universität Darmstadt, Germany
Brian Wylie, Jülich Supercomputing Centre, Germany

Previous workshops

Contact

Allen D. Malony (Email malony@cs.uoregon.edu, phone +1-541-346-4407)