Abstracts

Benchmarking the Stack Trace Analysis Tool for BlueGene/L

Gregory L. Lee1, Dong H. Ahn1, Dorian C. Arnold2, Bronis R. de Supinski1, Barton P. Miller2, Martin Schulz1

1Lawrence Livermore National Laboratory
2University of Wisconsin, Madison

We present STATBench, an emulator of the Stack Trace Analysis Tool (STAT), a scalable, lightweight, and effective tool to help debug extreme-scale parallel applications. STAT periodically samples stack traces from application processes and organizes the samples into a call graph prefix tree that depicts process equivalence classes based on trace similarities. STATBench requires only limited resources, yet allows us to evaluate the feasibility of, and identify potential roadblocks to, deploying STAT on entire large-scale systems such as the 131,072-processor BlueGene/L (BG/L) at Lawrence Livermore National Laboratory.
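As a minimal sketch of the prefix-tree idea described above (not STAT's actual implementation), the following Python merges per-process stack traces into a tree and groups processes whose traces end on the same call path. The names TraceNode, build_prefix_tree, and equivalence_classes, as well as the sample traces, are hypothetical and chosen for illustration only.

```python
class TraceNode:
    """One frame in the merged call graph prefix tree."""
    def __init__(self, frame):
        self.frame = frame
        self.ranks = set()      # processes whose trace passes through this frame
        self.children = {}      # frame name -> TraceNode


def build_prefix_tree(traces):
    """traces maps a process rank to its list of frames, outermost first."""
    root = TraceNode("<root>")
    for rank, frames in traces.items():
        root.ranks.add(rank)
        node = root
        for frame in frames:
            if frame not in node.children:
                node.children[frame] = TraceNode(frame)
            node = node.children[frame]
            node.ranks.add(rank)
    return root


def equivalence_classes(root):
    """Group ranks by the complete call path on which their trace ends."""
    classes = {}

    def walk(node, path):
        continued = set()
        for child in node.children.values():
            continued |= child.ranks
            walk(child, path + (child.frame,))
        ending_here = node.ranks - continued
        if path and ending_here:
            classes[path] = sorted(ending_here)

    walk(root, ())
    return classes


# Hypothetical sample: four processes, three distinct behaviours.
sample = {
    0: ["main", "solve", "MPI_Barrier"],
    1: ["main", "solve", "MPI_Barrier"],
    2: ["main", "solve", "compute"],
    3: ["main", "io", "MPI_Recv"],
}
tree = build_prefix_tree(sample)
classes = equivalence_classes(tree)
```

In this toy input, processes 0 and 1 fall into one class, so only one representative of that class would need closer inspection in a debugger.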

In this paper, we describe the implementation of STATBench and show how our design strategy is generally useful for emulating tool scaling behavior. We validate STATBench's emulation of STAT by comparing execution results from STATBench with previously collected data from STAT on the same platform. We then use STATBench to emulate STAT on configurations up to the full BG/L system size - at this scale, STATBench predicts latencies below three seconds.

Scalable, Automated Performance Analysis with TAU and PerfExplorer

Kevin A. Huck, Allen D. Malony

University of Oregon

Scalable performance analysis is a challenge for parallel development tools. The potential size of data sets and the need to compare results from multiple experiments make it challenging to manage and process the information, and to characterize the performance of parallel applications running on potentially hundreds of thousands of processor cores. In addition, many exploratory analysis processes are potentially repeatable and can and should be automated. In this paper, we discuss the current version of PerfExplorer, a performance analysis framework which provides dimension reduction, clustering, and correlation analysis of individual trials of large dimensions, and can perform relative performance analysis between multiple application executions. PerfExplorer analysis processes can be captured in the form of Python scripts, automating what would otherwise be time-consuming tasks. We give examples of large-scale analysis results and discuss the future development of the framework, including the encoding and processing of expert performance rules and the increasing use of performance metadata.

Developing Scalable Applications with Vampir

Matthias Müller, Holger Brunst, Matthias Jurenz, Andreas Knüpfer, Wolfgang E. Nagel

Technische Universität Dresden

With petaflop systems consisting of hundreds of thousands of processors at the high end and multicore CPUs entering the market at the low end, application developers face the challenge of exploiting this vast range of parallelism by writing scalable applications. Any applied tool has to provide the same scalability. Since many performance problems and limitations are only revealed at high processor counts, this is especially true for performance analysis tools. At ZIH, the tool Vampir was developed for the analysis of large trace files. On the one hand, we perform scalability studies with large trace files containing events from many thousands of processors. On the other hand, we analyze the usability for real applications with data collected from real applications, e.g. the thirteen applications contained in the SPEC MPI benchmark suite.

The analysis covers all phases of performance analysis: instrumenting the application, collecting the performance data, and finally viewing and analyzing the data. Examined aspects include instrumentation effort, monitoring overhead, trace file sizes, load time, and response time during analysis.

Scalable Collation and Presentation of Call-Path Profile Data with CUBE

Markus Geimer1, Björn Kuhlmann1,3, Farzona Pulatova1,2, Felix Wolf1,3, Brian Wylie1

1Forschungszentrum Jülich
2University of Tennessee
3RWTH Aachen University

Developing performance-analysis tools for applications running on thousands of processors is extremely challenging due to the vast amount of performance data being generated. One aspect where this is particularly obvious is the visual presentation of analysis results. For example, interactive response times may become unacceptably long, the amount of data may exceed the available memory, or insufficient display size may prevent a meaningful presentation. Even writing the files that store the data for later presentation can consume a substantial amount of time on modern large-scale systems. In this talk, we describe how CUBE, a presentation component for call-path profiles that is primarily used to display runtime summaries and trace-analysis results in the SCALASCA toolkit, has been modified to handle data sets from thousands of processes more efficiently. The modifications target both the scalable collation of input data files suitable for CUBE and the interactive display of the corresponding data.

Challenges addressed to increase the scalability of the collation step include avoiding the writing of large numbers of files and coping with the memory limitations of individual nodes. Instead of writing one file per process, the process-local data is now centrally collected in small portions via MPI gather operations, which can utilize special network hardware such as the global tree network of Blue Gene/L. The file itself is written incrementally by a single master node as the portions sent by other nodes arrive, minimizing the amount of data held in memory at any one time.
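The incremental writing scheme can be illustrated with a plain-Python simulation that uses no MPI at all: process-local data arrives in small portions, and a single writer emits each portion immediately instead of buffering everything. This is only a sketch of the pattern, not the CUBE or SCALASCA code; the function names, the chunk size, and the output format are assumptions.

```python
import io


def portions(process_data, chunk_size):
    """Yield (rank, chunk) pairs, standing in for portions gathered from each process."""
    for rank, values in enumerate(process_data):
        for start in range(0, len(values), chunk_size):
            yield rank, values[start:start + chunk_size]


def collate(process_data, out, chunk_size=2):
    """Write each portion as soon as it 'arrives'; return the peak number of values held."""
    peak = 0
    for rank, chunk in portions(process_data, chunk_size):
        peak = max(peak, len(chunk))          # only one portion is in memory at a time
        out.write(f"{rank}: {' '.join(map(str, chunk))}\n")
    return peak


# Hypothetical process-local data for three processes.
data = [[1, 2, 3], [4, 5], [6]]
buffer = io.StringIO()
peak_held = collate(data, buffer, chunk_size=2)
```

The peak number of values held by the writer is bounded by the chunk size rather than by the total data volume, which is the property the collation step relies on.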

The capability of the display to show large data sets is extended by reducing the memory footprint of the data and increasing the available memory. The reduction of the memory footprint is achieved through an optimization of the internal data structures used to hold the data. The amount of available memory is increased by using a remote server with a more generous memory configuration in combination with a graphical client running on the local desktop to store only the data currently on display.

Coupling DDT and Marmot for Debugging of MPI Applications

Bettina Krammer1, Valentin Himmler1, David Lecomber2

1HLRS - High Performance Computing Center Stuttgart
2Allinea Software

Parallel programming is a complex task and, since the dawn of the multi-core era, an increasingly common one; tools that support the application development and porting process can ease it considerably. We therefore plan to couple two existing tools, the MPI (Message Passing Interface) correctness checker Marmot and the parallel debugger DDT, to provide MPI application developers with a powerful and user-friendly environment. So far, both tools have been used on a wide range of platforms as stand-alone tools to cover different aspects of correctness debugging. While (parallel) debuggers are a great help in examining code at source level, e.g. by monitoring the execution, tracking values of variables, displaying the stack, finding memory leaks, etc., they give little insight into why a program actually gives wrong results or crashes when the failure is due to incorrect usage of the MPI API. To unravel such errors, the Marmot library has been developed. The tool checks at run-time for errors frequently made in MPI applications, e.g. deadlocks or the incorrect construction and destruction of resources, and also issues warnings about non-portable constructs.

In the final paper we will describe these two tools in more detail and report first experiences and results with their integration.

Compiler Support for Efficient Profiling and Tracing

Oscar Hernandez, Barbara Chapman

University of Houston

We are developing an integrated environment for application tuning that combines robust, existing, open source software: the OpenUH compiler, the Dragon program analysis tool, and three performance tools, TAU, KOJAK and PerfSuite. As a result, we are able to realize a scalable strategy for performance analysis, which is essential if performance tuning tools are to address the needs of emerging very large scale systems. The performance tools provide different levels of detail of performance information, each at a corresponding cost, with tracing being the most accurate but also the most expensive.

We have discovered that one of the benefits of working with compiler technology is that it can direct the performance tools in deciding which regions of code to measure, selectively combining both coarse-grain regions (parallel region level, call path/procedure level) and fine-grain regions (control flow level) of the code. Using the internal cost models in the compiler's interprocedural analyzer, we can estimate the importance of a region by estimating cost vectors, which include its size and how often it is invoked. Using this analysis, we can set different thresholds that a region must meet in order to be instrumented. This approach has been shown to reduce overheads significantly, to acceptable levels for both profiling and tracing. In this paper we present how the compiler helped to select the important regions of the code to measure in the NAS parallel benchmarks and in a weather code, reducing the overhead by approximately a factor of 10, to an acceptable level within 5%. The goal of the system is to provide automated, scalable performance measurement and optimization to increase user productivity by reducing the manual effort of existing approaches.
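A threshold test of this kind might look like the following sketch. It is not the OpenUH implementation: the Region fields, the threshold names min_size and max_calls, and the sample estimates are all hypothetical, and the rule shown (skip small or very frequently invoked regions) is just one plausible choice.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    est_size: int    # estimated work per invocation (hypothetical units)
    est_calls: int   # estimated number of invocations


def select_regions(regions, min_size, max_calls):
    """Keep regions large enough to matter and called rarely enough to be cheap to measure."""
    return [
        r.name
        for r in regions
        if r.est_size >= min_size and r.est_calls <= max_calls
    ]


# Hypothetical cost estimates for four code regions.
candidates = [
    Region("time_step_loop", est_size=5000, est_calls=100),
    Region("halo_exchange", est_size=800, est_calls=400),
    Region("get_index", est_size=3, est_calls=2_000_000),
    Region("init_grid", est_size=1200, est_calls=1),
]
selected = select_regions(candidates, min_size=500, max_calls=10_000)
```

Here the tiny, frequently called get_index is excluded, since instrumenting it would add probe cost on every one of its many invocations.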

Comparing Intel Thread Checker and Sun Thread Analyzer

Christian Terboven

RWTH Aachen University

Multiprocessor compute servers have been available for many years now. It is expected that the number of cores per processor chip will increase in the future and at least some multicore architectures will even support multiple threads running simultaneously. Hence, parallel programming will become more wide-spread and land on almost any programmer's desk. Both multicore systems and also larger SMP or ccNUMA systems can be programmed employing shared-memory parallelization paradigms.

Posix-Threads and OpenMP are the most wide-spread programming paradigms for shared-memory parallelization. At first sight, programming with Posix-Threads or OpenMP may seem easy to understand. But for non-trivial applications, reasoning about the correctness of a parallel program is much harder than for sequential control flow. The typical programming errors of shared-memory parallelization are Data Races, where the result of a computation is non-deterministic and dependent on the timing of other events, and Deadlocks, where two or more threads are waiting for each other. Finding those errors with traditional debuggers is hard, if not impossible.
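The timing dependence of a Data Race can be made visible with a deterministic simulation. The sketch below does not use real threads: each simulated thread is a generator that reads a shared counter, pauses, and then writes back the incremented value, and a hypothetical schedule list decides which thread runs next. All names are illustrative.

```python
def increment(shared):
    """Unsynchronized read-modify-write, split at the point where a thread may be preempted."""
    value = shared["counter"]        # read
    yield                            # another thread may run here
    shared["counter"] = value + 1    # write back


def run(schedule):
    """Run two simulated threads in the order given by schedule; return the final counter."""
    shared = {"counter": 0}
    threads = [increment(shared), increment(shared)]
    for thread_id in schedule:
        next(threads[thread_id], None)   # advance one step; ignore finished threads
    return shared["counter"]


sequential = run([0, 0, 1, 1])    # thread 0 finishes before thread 1 starts
interleaved = run([0, 1, 0, 1])   # both threads read before either writes
```

With the sequential schedule both increments take effect, whereas the interleaved schedule loses one update; the same code yields different results depending only on timing.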

This talk will compare two software tools, Intel Thread Checker and Sun Thread Analyzer, which help the programmer find errors such as Data Races and Deadlocks in multi-threaded programs. Experiences using these tools on OpenMP and Posix-Threads applications will be presented together with findings on the strengths and limitations of each product. Recommendations for embedding such tools into the software development process will be given.

Continuous Runtime Profiling of OpenMP Applications

Karl Fürlinger, Shirley Moore

University of Tennessee

Profiling and tracing are the two common techniques for performance analysis of parallel applications. Profiling is often preferred over tracing because it gives smaller amounts of data, making a manual interpretation easier. Tracing, on the other hand, allows the full temporal behavior of the application to be reconstructed at the expense of larger amounts of performance data and an often more intrusive collection process.

In this paper we investigate the possibility of combining the advantages of tracing and profiling, with the goal of limiting the data volume and enabling manual interpretation while retaining some temporal information about the program execution. Our starting point is a profiling tool for OpenMP applications called ompP. Instead of capturing profiles only at the end of program execution ("one-shot" profiling), in the new approach profiles are captured at several points in time while the application executes. We call our technique incremental or continuous profiling and demonstrate its usefulness on a number of benchmark applications.
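One way to obtain temporal information from such periodically captured profiles is to difference consecutive snapshots of cumulative counters. The following is a minimal sketch under that assumption, not ompP's implementation; the function name, the region names, and the sample values are hypothetical.

```python
def interval_profiles(snapshots):
    """Turn cumulative profile snapshots into per-interval increments."""
    increments = []
    previous = {}
    for snapshot in snapshots:
        increments.append(
            {region: total - previous.get(region, 0)
             for region, total in snapshot.items()}
        )
        previous = snapshot
    return increments


# Hypothetical cumulative times (arbitrary units) captured at three points in time.
captured = [
    {"barrier": 1, "loop": 4},
    {"barrier": 1, "loop": 9},
    {"barrier": 6, "loop": 10},
]
per_interval = interval_profiles(captured)
```

In this toy data the barrier time accrues almost entirely in the third interval, a fact that a single end-of-run profile would not show.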

We discuss in general the dimensions of performance data and which new kinds of performance displays can be derived by adding a temporal dimension to profiling-type data. Among the most useful new displays are overheads over time, which show when overheads such as synchronization arise in the target application, and performance counter heatmaps, which show performance counters for each thread over time.

Understanding Memory Access Bottlenecks on Multi-core

Josef Weidendorfer

Technische Universität München

This talk focuses on scalability and usability issues of analyzing the memory access behavior of multi-threaded applications on multi-core chips. The main objective is to help in the development of optimization strategies for application controlled prefetching agents running on dedicated cores, ensuring optimal exploitation of the limited connection to the main memory.

To reach this goal, the multi-core simulation collects metrics such as the read/write bandwidth requirements and working set size of the threads, as well as the overlap of their working sets. The data is associated with the execution stream of the threads in an aggregated way, in order to pinpoint code regions where cache optimization is required and where prefetch requests could usefully be handled by the prefetching agent.
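Two of these metrics can be sketched directly from per-thread address streams: the working set as the set of distinct cache lines touched, and the overlap as the lines touched by both threads. This is an illustrative sketch, not the simulator's code; the 64-byte line size, the function names, and the sample addresses are assumptions.

```python
LINE_SIZE = 64  # assumed cache-line size in bytes


def working_set(addresses, line_size=LINE_SIZE):
    """Return the set of distinct cache lines touched by a sequence of byte addresses."""
    return {address // line_size for address in addresses}


def working_set_overlap(addresses_a, addresses_b, line_size=LINE_SIZE):
    """Count the cache lines touched by both address streams."""
    return len(working_set(addresses_a, line_size) & working_set(addresses_b, line_size))


# Hypothetical address streams of two threads.
thread0 = [0, 8, 64, 128]
thread1 = [64, 72, 256]
size0 = len(working_set(thread0))
size1 = len(working_set(thread1))
shared_lines = working_set_overlap(thread0, thread1)
```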

Although the tool scenario does not target parallel systems with thousands of processors, the issues which need to be solved regarding the amount of collected information, as well as regarding methods for easy-to-understand visualization, are closely related. In both scenarios, the amount of measurement data has to be kept at a manageable size by using techniques for online aggregation. To allow quick browsing in the visualization tool, fast data structures with persistent indexing have to be used, as well as aggregation views with support for interactive selection and filtering of data.

The tool is being developed in the scope of the Munich Multicore Initiative as an extension of the suite consisting of Callgrind, based on Valgrind, and KCachegrind, a visualization tool for profiling data. As it is work in progress, we focus on existing parts, active development issues and design alternatives.