Stack Trace Analysis for Large Scale Debugging
Dorian Arnold, Dong H. Ahn, Bronis R. de Supinski, Gregory L. Lee, Barton P. Miller, Martin Schulz +5 more
- 26 Mar 2007
- pp 1-10
TL;DR: The paper presents the Stack Trace Analysis Tool (STAT), which aids in debugging extreme-scale applications and leverages MRNet, an infrastructure for tool control and data analyses, to overcome the scalability barriers faced by heavy-weight debuggers.
Abstract: We present the Stack Trace Analysis Tool (STAT) to aid in debugging extreme-scale applications. STAT can reduce problem exploration spaces from thousands of processes to a few by sampling stack traces to form process equivalence classes, groups of processes exhibiting similar behavior. We can then use full-featured debuggers on representatives from these behavior classes for root cause analysis. STAT scalably collects stack traces over a sampling period to assemble a profile of the application's behavior. STAT routines process the samples to form a call graph prefix tree that encodes common behavior classes over the program's process space and time. STAT leverages MRNet, an infrastructure for tool control and data analyses, to overcome scalability barriers faced by heavy-weight debuggers. We present STAT's design and an evaluation that shows STAT gathers informative process traces from thousands of processes with sub-second latencies, a significant improvement over existing tools. Our case studies of production codes verify that STAT supports the quick identification of errors that were previously difficult to locate.
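The core idea in the abstract, merging sampled stack traces into a call graph prefix tree and reading process equivalence classes off that tree, can be illustrated with a small sketch. This is not STAT's implementation and does not use MRNet: the class `TraceNode`, the function `equivalence_classes`, the sample data, and the two simulated "daemons" are all hypothetical, and the sketch assumes one sampled trace per rank.

```python
class TraceNode:
    """One function frame in a merged call-graph prefix tree (hypothetical type)."""

    def __init__(self, frame=None):
        self.frame = frame      # function name; None marks the synthetic root
        self.ranks = set()      # ranks whose sampled trace passes through this frame
        self.children = {}      # frame name -> TraceNode

    def add_trace(self, rank, frames):
        """Insert one stack trace, outermost frame first, for the given rank."""
        node = self
        node.ranks.add(rank)
        for frame in frames:
            if frame not in node.children:
                node.children[frame] = TraceNode(frame)
            node = node.children[frame]
            node.ranks.add(rank)

    def merge(self, other):
        """Fold another prefix tree into this one, as a tree-reduction step might."""
        self.ranks |= other.ranks
        for frame, child in other.children.items():
            if frame in self.children:
                self.children[frame].merge(child)
            else:
                self.children[frame] = child


def equivalence_classes(root):
    """Map each distinct call path to the set of ranks whose trace ends there."""
    classes = {}
    stack = [(root, ())]
    while stack:
        node, path = stack.pop()
        if node.frame is not None:
            path = path + (node.frame,)
        # Ranks that continue into a deeper frame do not end at this node.
        deeper = set().union(*(c.ranks for c in node.children.values()))
        ending = node.ranks - deeper
        if ending and path:
            classes[path] = ending
        stack.extend((child, path) for child in node.children.values())
    return classes


# Made-up samples: ranks 0-2 wait in a barrier, rank 3 is stuck in a receive.
samples = {
    0: ["main", "solve", "MPI_Barrier"],
    1: ["main", "solve", "MPI_Barrier"],
    2: ["main", "solve", "MPI_Barrier"],
    3: ["main", "solve", "compute", "MPI_Recv"],
}

# Simulate two collectors that each build a partial tree, then merge them.
left, right = TraceNode(), TraceNode()
for rank, frames in samples.items():
    (left if rank < 2 else right).add_trace(rank, frames)
left.merge(right)

classes = equivalence_classes(left)
# One representative rank per class is a candidate for a full-featured debugger.
representatives = {path: min(ranks) for path, ranks in classes.items()}
for path, ranks in sorted(classes.items()):
    print(" -> ".join(path), sorted(ranks))
```

In this toy run, four processes collapse into two behavior classes, so only two representatives need closer inspection. Handling several samples per rank over time would require keying classes on the set of paths each rank visits, which the sketch leaves out.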
Citations
24/7 Characterization of petascale I/O workloads
Philip Carns,Robert Latham,Robert Ross,Kamil Iskra,Samuel Lang,Katherine Riley +5 more
- 16 Oct 2009
TL;DR: Darshan is shown to characterize the I/O behavior of four scientific applications while inducing negligible overhead for I/O-intensive jobs with as many as 65,536 processes.
Binary code is not easy
Xiaozhu Meng,Barton P. Miller +1 more
- 18 Jul 2016
TL;DR: New code parsing algorithms in the open source Dyninst tool kit are presented, including a new model for describing jump tables that improves the ability to precisely determine the control flow targets, a new interprocedural analysis to determine when a function is non-returning, and techniques for handling tail calls.
ScalaTrace: Scalable compression and replay of communication traces for high-performance computing
Michael Noeth,Prasun Ratn,Frank Mueller,Martin Schulz,Bronis R. de Supinski +4 more
- 01 Aug 2009
TL;DR: An approach is contributed that provides orders of magnitude smaller, if not near-constant size, communication traces regardless of the number of nodes while preserving structural information.
ScalaTrace: Scalable Compression and Replay of Communication Traces for High Performance Computing
Michael Noeth,Prasun Ratn,Frank Mueller,Martin Schulz,Bronis R. de Supinski +4 more
- 16 May 2008
TL;DR: The authors introduce intra- and inter-node compression techniques for MPI events that can extract an application's communication structure, present a replay mechanism for the resulting traces, and discuss results of their implementation on BlueGene/L.
Hardware transactional memory for GPU architectures
Wilson W. L. Fung,Inderpreet Singh,Andrew Brownsword,Tor M. Aamodt +3 more
- 03 Dec 2011
TL;DR: KILO TM is proposed, a novel hardware TM design for GPUs that scales to thousands of concurrent transactions and uses word-level, value-based conflict detection to avoid broadcast communication and reduce on-chip storage overhead.
References
Pinpoint: problem determination in large, dynamic Internet services
Mike Y. Chen,Emre Kiciman,Eugene Fratkin,Armando Fox,Eric Brewer +4 more
- 23 Jun 2002
TL;DR: This work presents a dynamic analysis methodology that automates problem determination in these environments by coarse-grained tagging of numerous real client requests as they travel through the system and using data mining techniques to correlate the believed failures and successes of these requests to determine which components are most likely to be at fault.
An API for Runtime Code Patching
Bryan R. Buck,Jeffrey K. Hollingsworth +1 more
- 01 Nov 2000
TL;DR: The authors present a postcompiler program manipulation tool called Dyninst, which provides a C++ class library for program instrumentation that permits machine-independent binary instrumentation programs to be written.
The Community Climate System Model
Maurice L. Blackmon,Byron A. Boville,Frank O. Bryan,Robert E. Dickinson,Peter R. Gent,Jeffrey T. Kiehl,Richard E. Moritz,David A. Randall,Jagadish Shukla,Susan Solomon,Gordon B. Bonan,Scott C. Doney,Inez Fung,James J. Hack,Elizabeth Hunke,James W. Hurrell,John E. Kutzbach,Jerry Meehl,Bette L. Otto-Bliesner,Ramalingam Saravanan,Edwin K. Schneider,Lisa C. Sloan,Michael A. Spall,Karl E. Taylor,Joseph Tribbia,Warren M. Washington +25 more
TL;DR: The history of the CCSM, its current capabilities, and plans for its future development and applications are outlined, with the goal of providing a summary useful to present and future users.
Automated known problem diagnosis with event traces
Chun Yuan,Ni Lao,Ji-Rong Wen,Jiwei Li,Zheng Zhang,Yi-Min Wang,Wei-Ying Ma +6 more
- 18 Apr 2006
TL;DR: This work proposes using system behavior information, such as system event traces, to build correlations with solved problems rather than relying only on vague text descriptions as in existing practice. This enables automatic identification of a problem's root cause when the problem is a known one, which in turn leads to its resolution.
Automated, scalable debugging of MPI programs with Intel® Message Checker
Jayant DeSouza,Bob Kuhn,Bronis R. de Supinski,Victor Samofalov,Sergey N. Zheltov,Stanislav Viktorovich Bratanov +5 more
- 15 May 2005
TL;DR: The paper describes how automated tools can detect MPI usage errors and how Intel Message Checker (IMC) automatically detects several kinds of them, including various types of mismatches, race conditions, deadlocks and potential deadlocks, and resource misuse.