Abstracts
Benchmarking the Stack Trace Analysis Tool for BlueGene/L
Gregory L. Lee (1), Dong H. Ahn (1), Dorian C. Arnold (2), Bronis R. de Supinski (1), Barton P. Miller (2), Martin Schulz (1)
(1) Lawrence Livermore National Laboratory
(2) University of Wisconsin, Madison
We present STATBench, an emulator of a scalable, lightweight, and effective tool to help debug
extreme-scale parallel applications, the Stack Trace Analysis Tool (STAT). STAT periodically samples
stack traces from application processes and organizes the samples into a call graph prefix tree
that depicts process equivalence classes based on trace similarities. We have developed STATBench,
which requires only limited resources yet allows us to evaluate the feasibility of, and identify
potential roadblocks to, deploying STAT on entire large-scale systems like the 131,072-processor
BlueGene/L (BG/L) at Lawrence Livermore National Laboratory.
In this paper, we describe the implementation of STATBench and show how our design strategy
is generally useful for emulating tool scaling behavior. We validate STATBench's emulation of STAT
by comparing execution results from STATBench with previously collected data from STAT on the
same platform. We then use STATBench to emulate STAT on configurations up to the full BG/L
system size; at this scale, STATBench predicts latencies below three seconds.
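The call graph prefix tree that STAT builds from sampled stack traces can be illustrated with a minimal sketch. This is not STAT's or STATBench's implementation; the class and function names (`PrefixTreeNode`, `merge_traces`, `equivalence_classes`) and the sample traces are hypothetical, and each trace is assumed to be a list of frame names ordered from outermost to innermost.

```python
class PrefixTreeNode:
    """One call frame in the merged tree, plus the ranks whose trace passes through it."""

    def __init__(self, frame):
        self.frame = frame
        self.ranks = set()
        self.children = {}


def merge_traces(traces):
    """Merge per-process stack traces into a single prefix tree.

    traces: dict mapping a process rank to its list of frames (outermost first).
    """
    root = PrefixTreeNode("<root>")
    for rank, frames in traces.items():
        node = root
        node.ranks.add(rank)
        for frame in frames:
            if frame not in node.children:
                node.children[frame] = PrefixTreeNode(frame)
            node = node.children[frame]
            node.ranks.add(rank)
    return root


def equivalence_classes(root):
    """Group ranks by the complete call path at which their trace ends."""
    classes = {}

    def walk(node, path):
        in_children = set()
        for child in node.children.values():
            in_children |= child.ranks
        ended_here = node.ranks - in_children
        if ended_here and path:
            classes[tuple(path)] = ended_here
        for frame, child in node.children.items():
            walk(child, path + [frame])

    walk(root, [])
    return classes


# Hypothetical sample: four processes, three distinct behaviors.
sample_traces = {
    0: ["main", "solve", "MPI_Barrier"],
    1: ["main", "solve", "MPI_Barrier"],
    2: ["main", "solve", "MPI_Recv"],
    3: ["main", "io_write"],
}
tree = merge_traces(sample_traces)
classes = equivalence_classes(tree)
```

Processes with identical traces collapse into one class, so a debugger user would only need to inspect one representative per class rather than every process.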
Scalable, Automated Performance Analysis with TAU and PerfExplorer
Kevin A. Huck, Allen D. Malony
University of Oregon
Scalable performance analysis is a challenge for parallel development tools. The potential size of data
sets and the need to compare results from multiple experiments make it difficult to manage and
process the information, and to characterize the performance of parallel applications running on potentially
hundreds of thousands of processor cores. In addition, many exploratory analysis processes
represent potentially repeatable processes which can and should be automated. In this paper, we
will discuss the current version of PerfExplorer, a performance analysis framework which provides
dimension reduction, clustering and correlation analysis of individual trials of large dimensions, and
can perform relative performance analysis between multiple application executions. PerfExplorer
analysis processes can be captured in the form of Python scripts, automating what would otherwise
be time-consuming tasks. We will give examples of large-scale analysis results, and discuss the future
development of the framework, including the encoding and processing of expert performance
rules, and the increasing use of performance metadata.
Developing Scalable Applications with Vampir
Matthias Müller, Holger Brunst, Matthias Jurenz, Andreas Knüpfer, Wolfgang E. Nagel
Technische Universität Dresden
With petaflop systems consisting of hundreds of thousands of processors at the high end and multicore
CPUs entering the market at the low end, application developers face the challenge of exploiting
this vast range of parallelism by writing scalable applications. Any tool they apply has to provide
the same scalability. This is especially true for performance analysis tools, since many performance
problems and limitations are only revealed at high processor counts. Vampir, a tool for the analysis
of large trace files, was developed at ZIH. On the one hand, we perform scalability studies with large
trace files containing events from many thousands of processors. On the other hand, we analyze the
usability for real applications using data collected from them, e.g. the thirteen applications
contained in the SPEC MPI benchmark suite.
The analysis covers all phases of performance analysis: instrumenting the application, collecting
the performance data, and finally viewing and analyzing the data. Examined aspects include instrumentation
effort, monitoring overhead, trace file sizes, load time, and response time during analysis.
Scalable Collation and Presentation of Call-Path Profile Data with CUBE
Markus Geimer (1), Björn Kuhlmann (1,3), Farzona Pulatova (1,2), Felix Wolf (1,3), Brian Wylie (1)
(1) Forschungszentrum Jülich
(2) University of Tennessee
(3) RWTH Aachen University
Developing performance-analysis tools for applications running on thousands of processors is extremely
challenging due to the vast amount of performance data being generated. One aspect where
this is particularly obvious is the visual presentation of analysis results. For example, interactive
response times may become unacceptably long, the amount of data may exceed the available memory,
or insufficient display size may prevent a meaningful presentation. Even writing the files that
store the data for later presentation can consume a substantial amount of time on modern large-scale
systems. In this talk, we describe how CUBE, a presentation component for call-path profiles which is
primarily used to display runtime summaries and trace-analysis results in the SCALASCA toolkit, has
been modified to handle data sets from thousands of processes more efficiently. The modifications
target both the scalable collation of input data files suitable for CUBE as well as the interactive display
of the corresponding data.
Challenges addressed to increase the scalability of the collation step include avoiding the creation of
large numbers of files and coping with the memory limitations of individual nodes. Instead of writing one
file per process, the process-local data is now centrally collected in small portions via MPI gather operations,
which can exploit special network hardware, such as the global tree network of Blue Gene/L. The
file itself is written incrementally by a single master node as the portions sent by other nodes arrive,
minimizing the amount of data held in memory at a time.
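The collation scheme described above can be sketched without MPI as a plain simulation. This is an illustration under stated assumptions, not CUBE's code or file format: the function `collate`, the tab-separated record layout, and the slicing that stands in for an MPI gather are all hypothetical.

```python
import io


def collate(process_data, portion_size, out):
    """Emulate collecting process-local data in small portions.

    process_data: list indexed by rank; each entry is that rank's list of records.
    portion_size: maximum number of records each rank contributes per round.
    out:          file-like object the master writes to incrementally.

    Returns the peak number of records the master held at any one time.
    """
    longest = max((len(records) for records in process_data), default=0)
    peak_held = 0
    offset = 0
    while offset < longest:
        # Stand-in for one MPI gather: every rank sends its next small portion.
        gathered = [records[offset:offset + portion_size] for records in process_data]
        peak_held = max(peak_held, sum(len(portion) for portion in gathered))
        # The master writes each portion as soon as it has arrived, then drops it.
        for rank, portion in enumerate(gathered):
            for record in portion:
                out.write(f"{rank}\t{record}\n")
        offset += portion_size
    return peak_held


# Hypothetical sample: 4 ranks with 5 records each, gathered 2 records at a time.
data = [[f"r{rank}v{i}" for i in range(5)] for rank in range(4)]
buffer = io.StringIO()
peak = collate(data, portion_size=2, out=buffer)
lines = buffer.getvalue().splitlines()
```

With these numbers the master never holds more than 8 records at once, although 20 records end up in the single output file; a smaller portion size lowers the peak at the cost of more gather rounds.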
The capability of the display to show large data sets is extended by reducing the memory footprint
of the data and increasing the available memory. The reduction of the memory footprint is
achieved through an optimization of the internal data structures used to hold the data. The amount of
available memory is increased by using a remote server with a more generous memory configuration
in combination with a graphical client running on the local desktop to store only the data currently
on display.
Coupling DDT and Marmot for Debugging of MPI Applications
Bettina Krammer (1), Valentin Himmler (1), David Lecomber (2)
(1) HLRS - High Performance Computing Center Stuttgart
(2) Allinea Software
Parallel programming is a complex task and, since the dawn of the multi-core era, an increasingly
common one; it can be eased considerably by tools supporting the application development
and porting process. Therefore, we plan to couple existing tools, namely the MPI (Message Passing
Interface) correctness checker Marmot and the parallel debugger DDT, to provide MPI application
developers with a powerful and user-friendly environment. So far, both tools have been used on
a wide range of platforms as stand-alone tools to cover different aspects of correctness debugging.
While (parallel) debuggers are a great help in examining code at source level, e.g. by monitoring
the execution, tracking values of variables, displaying the stack, finding memory leaks, etc., they
give little insight into why a program actually gives wrong results or crashes when the failure is
due to incorrect usage of the MPI API. To unravel such kinds of errors, the MARMOT library has
been developed. The tool checks at run-time for errors frequently made in MPI applications, e.g.
deadlocks or the incorrect construction and destruction of resources, and also issues warnings in
case of non-portable constructs.
In the final paper we will describe these two tools in more detail and report first experiences and
results with their integration.
Compiler Support for Efficient Profiling and Tracing
Oscar Hernandez, Barbara Chapman
University of Houston
We are developing an integrated environment for application tuning that combines robust, existing,
open-source software: the OpenUH compiler, the Dragon program analysis tool, and three performance
tools (TAU, KOJAK and PerfSuite). As a result, we are able to realize a scalable strategy for
performance analysis, which is essential if performance tuning tools are to address the needs of
emerging very large scale systems. The performance tools provide different levels of detail of performance
information, each at a cost, with tracing being the most accurate but also the most expensive.
We have discovered that one benefit of working with compiler technology is that it can
direct the performance tools to measure regions of code selectively, combining
both coarse-grain (parallel region level, call path/procedure level) and fine-grain (control
flow level) regions of the code. Using the internal cost models in the compiler's interprocedural analyzer,
we can estimate the importance of a region by estimating cost vectors, which include its size and
how often it is invoked. Using this analysis, we can set different thresholds that a region must meet
in order to be instrumented. This approach has been shown to reduce overheads significantly, to
acceptable levels for both profiling and tracing. In this paper we present how the compiler helped to
select the important regions of the code to measure in the NAS parallel benchmarks and in a weather
code, reducing the overhead by approximately a factor of 10, to within 5%.
The goal of the system is to provide automated, scalable performance measurement
and optimization that increases user productivity by reducing the manual effort of existing approaches.
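The threshold-based selection of regions can be illustrated with a small sketch. It is not the OpenUH cost model: the `Region` fields, the two thresholds (`min_size`, `max_invocations`), and the sample values are assumptions chosen only to show how a size and invocation-count estimate could decide what gets instrumented.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    size: int          # estimated size of the region (hypothetical unit: statements)
    invocations: int   # estimated number of times the region is entered


def select_regions(regions, min_size, max_invocations):
    """Return the names of regions worth instrumenting.

    A region is kept only if it is large enough to matter and is not entered
    so often that the probes themselves would dominate its run time.
    """
    return [
        region.name
        for region in regions
        if region.size >= min_size and region.invocations <= max_invocations
    ]


# Hypothetical estimates for four regions of a program.
candidates = [
    Region("main_loop", size=400, invocations=1),
    Region("compute_flux", size=120, invocations=5_000),
    Region("get_index", size=3, invocations=2_000_000),
    Region("swap", size=5, invocations=10),
]
selected = select_regions(candidates, min_size=50, max_invocations=100_000)
```

Here the tiny, very frequently called `get_index` is excluded because probing it would add the most overhead, and `swap` is excluded because it is too small to be of interest.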
Comparing Intel Thread Checker and Sun Thread Analyzer
Christian Terboven
RWTH Aachen University
Multiprocessor compute servers have been available for many years now. It is expected that the
number of cores per processor chip will increase in the future and at least some multicore architectures
will even support multiple threads running simultaneously. Hence, parallel programming will
become more wide-spread and land on almost any programmer's desk. Both multicore systems and
also larger SMP or ccNUMA systems can be programmed employing shared-memory parallelization
paradigms.
Posix-Threads and OpenMP are the most widespread programming paradigms for shared-memory
parallelization. At first sight, programming for Posix-Threads or OpenMP may seem
to be easily understandable. But for non-trivial applications, reasoning about the correctness of a
parallel program is much harder than for sequential control flow. The typical programming errors of
shared-memory parallelization are Data Races, where the result of a computation is non-deterministic
and dependent on the timing of other events, or Deadlocks, where two or more threads are waiting
for each other. Finding those errors with traditional debuggers is hard, if not impossible.
This talk will compare the two software tools Intel Thread Checker and Sun Thread Analyzer,
which help the programmer find errors like Data Races and Deadlocks in multi-threaded programs.
Experiences using these tools on OpenMP and Posix-Threads applications will be presented
together with findings on the strengths and limitations of each individual product. Recommendations
for embedding such tools into the software development process will be given.
Continuous Runtime Profiling of OpenMP Applications
Karl Fürlinger, Shirley Moore
University of Tennessee
Profiling and tracing are the two common techniques for performance analysis of parallel applications.
Profiling is often preferred over tracing because it produces smaller amounts of data, making
manual interpretation easier. Tracing, on the other hand, allows the full temporal behavior of the
application to be reconstructed at the expense of larger amounts of performance data and an often
more intrusive collection process.
In this paper we investigate the possibility of combining the advantages of tracing and profiling
with the goal of limiting the data volume and enabling manual interpretation while retaining
some temporal information about the program execution. Our starting point is a profiling tool for
OpenMP applications called ompP. Instead of capturing profiles only at the end of program execution
("one-shot" profiling), in the new approach profiles are captured at several points in time while
the application executes. We call our technique incremental or continuous profiling and demonstrate
its usefulness on a number of benchmark applications.
We discuss in general the dimensions of performance data and which new kinds of performance
displays can be derived by adding a temporal dimension to profiling-type data. Among the most
useful new displays are overheads over time, which show when overheads such as
synchronization arise in the target application, and performance counter heatmaps, which show performance
counters for each thread over time.
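An "overheads over time" view can be derived from incrementally captured profiles by differencing consecutive snapshots. The sketch below is not ompP's implementation; the snapshot layout (a timestamp plus cumulative overhead per category) and the function name `overheads_over_time` are assumptions made for illustration.

```python
def overheads_over_time(snapshots):
    """Turn cumulative profile snapshots into per-interval overheads.

    snapshots: list of (timestamp, {category: cumulative_overhead}) tuples,
               ordered by timestamp.
    Returns a list of (start, end, {category: overhead_in_interval}) tuples.
    """
    intervals = []
    prev_time, prev_profile = snapshots[0]
    for time, profile in snapshots[1:]:
        delta = {
            category: value - prev_profile.get(category, 0)
            for category, value in profile.items()
        }
        intervals.append((prev_time, time, delta))
        prev_time, prev_profile = time, profile
    return intervals


# Hypothetical profiles captured every 10 time units (overheads in milliseconds).
captured = [
    (0, {"sync": 0, "idle": 0}),
    (10, {"sync": 2, "idle": 1}),
    (20, {"sync": 9, "idle": 1}),
]
timeline = overheads_over_time(captured)
```

A one-shot profile would report only the totals (9 ms of synchronization, 1 ms idle); the differenced view shows that most of the synchronization overhead arose in the second interval.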
Understanding Memory Access Bottlenecks on Multi-core
Josef Weidendorfer
Technische Universität München
This talk focuses on scalability and usability issues of analyzing the memory access behavior of
multi-threaded applications on multi-core chips. The main objective is to help in the development
of optimization strategies for application controlled prefetching agents running on dedicated cores,
ensuring optimal exploitation of the limited connection to the main memory.
To reach this goal, the multi-core simulation collects metrics such as read/write bandwidth requirements
and working set size of the threads as well as working set overlap. The data is
associated with the execution stream of the threads in an aggregated way, in order to pinpoint code
regions where cache optimization is required, and where prefetch requests can usefully be handled
by the prefetching agent.
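Two of the metrics named above, per-thread working set size and working set overlap, can be illustrated with a minimal sketch. It does not reflect how the simulator computes them: the access-trace format (thread id, byte address), the 64-byte cache line size, and the function names are assumptions.

```python
from collections import defaultdict
from itertools import combinations

LINE_SIZE = 64  # assumed cache line size in bytes


def working_sets(accesses, line_size=LINE_SIZE):
    """Map each thread id to the set of cache lines it touched.

    accesses: iterable of (thread_id, byte_address) pairs.
    """
    lines_by_thread = defaultdict(set)
    for thread_id, address in accesses:
        lines_by_thread[thread_id].add(address // line_size)
    return dict(lines_by_thread)


def working_set_overlap(lines_by_thread):
    """Count the cache lines shared by each pair of threads."""
    return {
        (a, b): len(lines_by_thread[a] & lines_by_thread[b])
        for a, b in combinations(sorted(lines_by_thread), 2)
    }


# Hypothetical trace: thread 0 touches lines 0 and 1, thread 1 touches lines 1 and 2.
trace = [(0, 0), (0, 8), (0, 64), (1, 64), (1, 128), (1, 130)]
sets_by_thread = working_sets(trace)
sizes = {tid: len(lines) for tid, lines in sets_by_thread.items()}
overlap = working_set_overlap(sets_by_thread)
```

Storing only the set of touched lines per thread, rather than every access, is a simple form of the online aggregation that keeps the amount of measurement data manageable.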
Although the tool scenario does not target parallel systems with thousands of processors, the
issues that need to be solved regarding the amount of collected information and
methods for easy-to-understand visualization are closely related. In both cases, the amount of measurement
data has to be kept at a manageable size by using techniques for online aggregation. To allow quick
browsing in the visualization tool, fast data structures with persistent indexing have to be used, as
well as aggregation views with support for interactive selection and filtering of data.
The tool is being developed in the scope of the Munich Multicore Initiative as an extension
of the suite consisting of Callgrind, based on Valgrind, and KCachegrind, a visualization tool for
profiling data. As it is work in progress, we focus on existing parts, active development issues and
design alternatives.