A Large-Scale Study of Failures in High-Performance Computing Systems

doi:10.1109/TDSC.2009.4

Journal Article•10.1109/TDSC.2009.4

Bianca Schroeder, +1 more

- 01 Oct 2010

- IEEE Transactions on Dependable and Secu...

- Vol. 7, Iss: 4, pp 337-351

906

TL;DR: Analysis of failure data collected at two large high-performance computing sites finds that average failure rates differ wildly across systems, ranging from 20-1000 failures per year, and that time between failures is modeled well by a Weibull distribution with decreasing hazard rate.
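
As a rough illustration of what a decreasing hazard rate means for a Weibull model, the sketch below evaluates the standard Weibull hazard function h(t) = (k/λ)(t/λ)^(k-1) at increasing times since the last failure. When the shape parameter k is below 1, the hazard falls as time passes, so a failure becomes less likely the longer a system has gone without one. The function name and the shape and scale values are hypothetical, chosen only for illustration, and are not taken from the paper.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard rate h(t) = (shape / scale) * (t / scale) ** (shape - 1)."""
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape, and scale must be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)


# Hypothetical parameters (assumptions for illustration, not values from the paper).
SHAPE = 0.7    # shape < 1 gives a decreasing hazard rate
SCALE = 100.0  # scale, in the same time unit as t (e.g. hours)

# Hazard evaluated at increasing times since the last failure.
times = (1.0, 10.0, 100.0, 1000.0)
hazards = [weibull_hazard(t, SHAPE, SCALE) for t in times]

for t, h in zip(times, hazards):
    print(f"t = {t:7.1f}  hazard = {h:.6f}")
```

With a shape of exactly 1 the same function returns a constant hazard (the exponential case), which is the memoryless baseline that a decreasing-hazard Weibull departs from.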

Citations

Proceedings Article•10.1109/CCGRID.2009.59

Combined Fault Tolerance and Scheduling Techniques for Workflow Applications on Computational Grids

Yang Zhang, +3 more

- 18 May 2009

TL;DR: This work proposes new approaches that combine fault tolerance techniques with existing workflow scheduling algorithms that have an impact on the reliability of workflow execution, workflow performance and resource usage under different reliability models, failure prediction accuracies and workflow application types.

67

•Journal Article•10.1109/TETC.2014.2304500

An Analysis of Failure-Related Energy Waste in a Large-Scale Cloud Environment

Peter Garraghan, +3 more

- 04 Feb 2014

- IEEE Transactions on Emerging Topics in ...

TL;DR: This paper presents the first comprehensive analysis of the impact of failures on energy consumption in a real-world large-scale cloud system (comprising over 12 500 servers), including the study of failure and energy trends of the spatial and temporal environmental characteristics.

66

Proceedings Article•10.1109/HIPC.2010.5713159

Diagnosing the root-causes of failures from cluster log files

Edward Chuah, +8 more

- 01 Dec 2010

TL;DR: A diagnostics tool, FDiag, is developed to extract the log entries as structured message templates and uses statistical correlation analysis to establish probable cause and effect relationships for the fault being analyzed.

66

•Journal Article•10.1007/S10586-019-02917-1

Failure prediction using machine learning in a virtualised HPC system and application

Bashir Mohammed, +3 more

- 01 Jun 2019

- Cluster Computing

TL;DR: Experimental results indicate that the model using SVM achieves an average failure-prediction accuracy of 90% and is effective compared to other algorithms, implying that the method can effectively predict all possible future system and application failures within the system.

64

Proceedings Article•10.1109/ISPA.2011.50

Matrix Multiplication on GPUs with On-Line Fault Tolerance

Chong Ding, +4 more

- 26 May 2011

TL;DR: The main contribution of the paper is to extend the traditional algorithm-based fault tolerance (ABFT) from offline to online and apply it to matrix multiplication on GPUs.

64

References

Journal Article•10.2307/2348601

Introduction to Probability Models.

A. Csenki, +1 more

- 01 Jan 1994

- The Statistician

3.9K

•Journal Article•10.1109/90.554723

Self-similarity through high-variability: statistical analysis of Ethernet LAN traffic at the source level

Walter Willinger, +3 more

- 01 Feb 1997

- IEEE ACM Transactions on Networking

TL;DR: In this article, the authors provide a plausible physical explanation for the occurrence of self-similarity in local-area network (LAN) traffic. The explanation is based on convergence results for processes that exhibit high variability and is supported by detailed statistical analyses of real-time traffic measurements from Ethernet LANs at the level of individual sources.

1.7K

Proceedings Article•10.1145/217382.217418

Self-similarity through high-variability: statistical analysis of ethernet LAN traffic at the source level

Walter Willinger, +3 more

- 01 Oct 1995

TL;DR: This paper provides a plausible physical explanation for the occurrence of self-similarity in high-speed network traffic. The explanation is based on convergence results for processes that exhibit high variability and is supported by detailed statistical analyses of real-time traffic measurements from Ethernet LANs at the level of individual sources.

1.1K

•Proceedings Article

Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?

Bianca Schroeder, +1 more

- 13 Feb 2007

TL;DR: In this article, the authors present and analyze field-gathered disk replacement data from a number of large production systems, including high-performance computing sites and internet services sites, and find that in the field, annual disk replacement rates typically exceed 1%, with 2-4% common and up to 13% observed on some systems.

920

Why do computers stop and what can be done about it

Jim Gray

- 01 Jan 1985

TL;DR: It is pointed out that faults in production software are often soft (transient) and that a transaction mechanism combined with persistent process pairs provides fault-tolerant execution, the key to software fault tolerance.

849

Related Papers (5)

A survey of rollback-recovery protocols in message-passing systems

A higher order estimate of the optimum checkpoint interval for restart dumps

What Supercomputers Say: A Study of Five System Logs

Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?

A survey of online failure prediction methods