Journal Article, DOI: 10.1109/TDSC.2009.4
A Large-Scale Study of Failures in High-Performance Computing Systems
Bianca Schroeder, Garth A. Gibson
906
TL;DR: Analysis of failure data collected at two large high-performance computing sites finds that average failure rates differ wildly across systems, ranging from 20-1000 failures per year, and that time between failures is modeled well by a Weibull distribution with decreasing hazard rate.
Abstract: Designing highly dependable systems requires a good understanding of failure characteristics. Unfortunately, little raw data on failures in large IT installations are publicly available. This paper analyzes failure data collected at two large high-performance computing sites. The first data set has been collected over the past nine years at Los Alamos National Laboratory (LANL) and has recently been made publicly available. It covers 23,000 failures recorded on more than 20 different systems at LANL, mostly large clusters of SMP and NUMA nodes. The second data set has been collected over the period of one year on one large supercomputing system comprising 20 nodes and more than 10,000 processors. We study the statistics of the data, including the root cause of failures, the mean time between failures, and the mean time to repair. We find, for example, that average failure rates differ wildly across systems, ranging from 20-1000 failures per year, and that time between failures is modeled well by a Weibull distribution with decreasing hazard rate. From one system to another, mean repair time varies from less than an hour to more than a day, and repair times are well modeled by a lognormal distribution.
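The two distributional findings in the abstract can be illustrated with a minimal sketch. It assumes the standard two-parameter Weibull and lognormal forms. The parameter values (SHAPE, SCALE, MU, SIGMA) are hypothetical and chosen only for illustration, not fitted values from the paper. A Weibull shape parameter below 1 yields a hazard rate that falls as time since the last failure grows, and a lognormal repair-time distribution has a mean that exceeds its median because of its long right tail.

```python
import math


def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def lognormal_mean(mu, sigma):
    """Mean of a lognormal distribution whose underlying normal has parameters mu, sigma."""
    return math.exp(mu + sigma ** 2 / 2)


def lognormal_median(mu):
    """Median of a lognormal distribution; it depends only on mu."""
    return math.exp(mu)


# Hypothetical parameters, for illustration only (not taken from the paper).
SHAPE = 0.7    # shape < 1 gives a decreasing hazard rate
SCALE = 100.0  # scale, in hours
MU = 1.0       # log-scale location of repair times
SIGMA = 1.5    # log-scale spread of repair times

# Hazard evaluated at increasing times since the last failure.
times = (1.0, 10.0, 100.0, 1000.0)
hazards = [weibull_hazard(t, SHAPE, SCALE) for t in times]

for t, h in zip(times, hazards):
    print(f"t = {t:7.1f} h  hazard = {h:.5f} per hour")

print(f"lognormal mean   = {lognormal_mean(MU, SIGMA):.2f}")
print(f"lognormal median = {lognormal_median(MU):.2f}")
```

With a shape of exactly 1 the Weibull reduces to the exponential distribution, whose hazard is constant at 1/scale. A decreasing hazard therefore means the longer a system has run since its last failure, the less likely it is to fail in the next instant.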
Citations
Combined Fault Tolerance and Scheduling Techniques for Workflow Applications on Computational Grids
Yang Zhang, Anirban Mandal, Charles Koelbel, Keith D. Cooper
- 18 May 2009
TL;DR: This work proposes new approaches that combine fault tolerance techniques with existing workflow scheduling algorithms that have an impact on the reliability of workflow execution, workflow performance and resource usage under different reliability models, failure prediction accuracies and workflow application types.
67
An Analysis of Failure-Related Energy Waste in a Large-Scale Cloud Environment
TL;DR: This paper presents the first comprehensive analysis of the impact of failures on energy consumption in a real-world large-scale cloud system (comprising over 12 500 servers), including the study of failure and energy trends of the spatial and temporal environmental characteristics.
66
Diagnosing the root-causes of failures from cluster log files
Edward Chuah, Shyh-hao Kuo, Paul Hiew, William-Chandra Tjhi, Gary Lee, John Hammond, Marek T. Michalewicz, Terence Hung, James C. Browne
- 01 Dec 2010
TL;DR: A diagnostics tool, FDiag, is developed to extract the log entries as structured message templates and uses statistical correlation analysis to establish probable cause and effect relationships for the fault being analyzed.
66
Failure prediction using machine learning in a virtualised HPC system and application
TL;DR: Experimental results indicate that the model using SVM achieves an average accuracy of 90% when predicting failures and is effective compared to other algorithms, implying that the method can effectively predict all possible future system and application failures within the system.
Matrix Multiplication on GPUs with On-Line Fault Tolerance
Chong Ding, Christer Karlsson, Hui Liu, Teresa Davies, Zizhong Chen
- 26 May 2011
TL;DR: The main contribution of the paper is to extend the traditional algorithm-based fault tolerance (ABFT) from offline to online and apply it to matrix multiplication on GPUs.
References
Introduction to Probability Models.
A. Csenki, Sheldon M. Ross
3.9K
Self-similarity through high-variability: statistical analysis of Ethernet LAN traffic at the source level
TL;DR: In this article, the authors provide a plausible physical explanation for the occurrence of self-similarity in local-area network (LAN) traffic. The explanation is based on convergence results for processes that exhibit high variability and is supported by detailed statistical analyses of real-time traffic measurements from Ethernet LANs at the level of individual sources.
Self-similarity through high-variability: statistical analysis of ethernet LAN traffic at the source level
Walter Willinger, Murad S. Taqqu, Robert Sherman, Daniel V. Wilson
- 01 Oct 1995
TL;DR: This paper provides a plausible physical explanation for the occurrence of self-similarity in high-speed network traffic. The explanation is based on convergence results for processes that exhibit high variability and is supported by detailed statistical analyses of real-time traffic measurements from Ethernet LANs at the level of individual sources.
1.1K
Proceedings Article
Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?
Bianca Schroeder, Garth A. Gibson
- 13 Feb 2007
TL;DR: In this article, the authors present and analyze field-gathered disk replacement data from a number of large production systems, including high-performance computing sites and internet services sites, and find that in the field, annual disk replacement rates typically exceed 1%, with 2-4% common and up to 13% observed on some systems.
Why do computers stop and what can be done about it
Jim Gray
- 01 Jan 1985
TL;DR: It is pointed out that faults in production software are often soft (transient) and that a transaction mechanism combined with persistent process-pairs provides fault-tolerant execution, the key to software fault tolerance.