A Large-Scale Study of Failures in High-Performance Computing Systems

doi:10.1109/TDSC.2009.4

Journal Article•10.1109/TDSC.2009.4

Bianca Schroeder, +1 more

- 01 Oct 2010

- IEEE Transactions on Dependable and Secu...

- Vol. 7, Iss: 4, pp 337-351

906

TL;DR: Analysis of failure data collected at two large high-performance computing sites finds that average failure rates differ wildly across systems, ranging from 20-1000 failures per year, and that time between failures is modeled well by a Weibull distribution with decreasing hazard rate.

Abstract: Designing highly dependable systems requires a good understanding of failure characteristics. Unfortunately, little raw data on failures in large IT installations are publicly available. This paper analyzes failure data collected at two large high-performance computing sites. The first data set has been collected over the past nine years at Los Alamos National Laboratory (LANL) and has recently been made publicly available. It covers 23,000 failures recorded on more than 20 different systems at LANL, mostly large clusters of SMP and NUMA nodes. The second data set has been collected over the period of one year on one large supercomputing system comprising 20 nodes and more than 10,000 processors. We study the statistics of the data, including the root cause of failures, the mean time between failures, and the mean time to repair. We find, for example, that average failure rates differ wildly across systems, ranging from 20-1000 failures per year, and that time between failures is modeled well by a Weibull distribution with decreasing hazard rate. From one system to another, mean repair time varies from less than an hour to more than a day, and repair times are well modeled by a lognormal distribution.
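As a minimal sketch of what the abstract's two distributional findings imply: a Weibull distribution has a decreasing hazard rate exactly when its shape parameter is below 1, and a lognormal distribution has a mean above its median because of its long right tail. The function names and all parameter values below are hypothetical, chosen only for illustration; they are not taken from the paper or its data sets.

```python
import math


def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1), t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def lognormal_mean_median(mu, sigma):
    """Mean and median of a lognormal with log-space parameters mu and sigma."""
    return math.exp(mu + sigma ** 2 / 2), math.exp(mu)


# Hypothetical parameters for illustration only (not from the paper).
SHAPE, SCALE = 0.7, 100.0  # shape < 1 yields a decreasing hazard rate
hazards = [weibull_hazard(t, SHAPE, SCALE) for t in (1.0, 10.0, 100.0, 1000.0)]
mean_repair, median_repair = lognormal_mean_median(mu=0.0, sigma=1.5)

if __name__ == "__main__":
    print("hazard at increasing t:", hazards)
    print("lognormal mean vs. median:", mean_repair, median_repair)
```

With a shape below 1, the hazard falls as time since the last failure grows, so the longer a system has run without failing, the lower its instantaneous failure rate. With shape equal to 1 the Weibull reduces to the exponential distribution and the hazard is constant.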


Citations

Proceedings Article•10.1109/HPCC-SMARTCITY-DSS.2017.70

A Diskless Checkpointing Scheme Based on Vertical Encoding to Lower Fault Tolerance Overhead

Jin-Min Yang, +1 more

- 01 Dec 2017

TL;DR: Experimental results show that the proposed scheme significantly reduces the communication overhead of both checkpointing and fault recovery, with no encoding overhead introduced.


Proceedings Article•10.1109/ICMULT.2010.5630628

Design of a 50° Field-of-View Object for Head-Mounted Projective Displays and Investigation of Retro-Reflective Materials

Changjiang Fan, +1 more

- 11 Nov 2010

TL;DR: In this paper, a head-mounted projective display lens with a 50° field of view is designed for medical applications; it satisfies the requirements of a 0.9-inch miniature display with an SXGA display mode.


Estimating Time to Repair Failures in a Distributed System

Matilda Söderholm, +1 more

- 01 Jan 2016

TL;DR: High availability is critical to ensuring the quality of important services, and one aspect of availability is system downtime, which can be measured as the time to recover.


Proceedings Article•10.1109/FCST.2009.46

DFHR: A Design Framework for HPC Reliability

Yongqin Huang

- 17 Dec 2009

TL;DR: It is demonstrated that DFHR is suitable for the cost-effective reliability design of HPC systems.


Book Chapter•10.1007/978-3-030-71590-8_2

The Framework of the MDATA Computing Model

Yan Jia, +6 more

- 07 Mar 2021

TL;DR: In this paper, the authors proposed a computing architecture named fog-cloud computing for big data in ubiquitous cyberspace, where multiple knowledge actors in the fog, middle layer, and cloud are realized based on the collaborative computing language and models.


References

Journal Article•10.2307/2348601

Introduction to Probability Models.

A. Csenki, +1 more

- 01 Jan 1994

- The Statistician



3.9K

Journal Article•10.1109/90.554723

Self-similarity through high-variability: statistical analysis of Ethernet LAN traffic at the source level

Walter Willinger, +3 more

- 01 Feb 1997

- IEEE ACM Transactions on Networking

TL;DR: In this article, the authors provide a plausible physical explanation for the occurrence of self-similarity in local-area network (LAN) traffic. The explanation is based on convergence results for processes that exhibit high variability and is supported by detailed statistical analyses of real-time traffic measurements from Ethernet LANs at the level of individual sources.


1.7K

Proceedings Article•10.1145/217382.217418

Self-similarity through high-variability: statistical analysis of ethernet LAN traffic at the source level

Walter Willinger, +3 more

- 01 Oct 1995

TL;DR: This paper provides a plausible physical explanation for the occurrence of self-similarity in high-speed network traffic. The explanation is based on convergence results for processes that exhibit high variability and is supported by detailed statistical analyses of real-time traffic measurements from Ethernet LANs at the level of individual sources.


1.1K

Proceedings Article

Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?

Bianca Schroeder, +1 more

- 13 Feb 2007

TL;DR: In this article, the authors present and analyze field-gathered disk replacement data from a number of large production systems, including high-performance computing sites and internet services sites, and find that in the field, annual disk replacement rates typically exceed 1%, with 2-4% common and up to 13% observed on some systems.


920

Why do computers stop and what can be done about it

Jim Gray

- 01 Jan 1985

TL;DR: It is pointed out that faults in production software are often soft (transient) and that a transaction mechanism combined with persistent process pairs provides fault-tolerant execution, the key to software fault tolerance.


849


Related Papers (5)

A survey of rollback-recovery protocols in message-passing systems

A higher order estimate of the optimum checkpoint interval for restart dumps

What Supercomputers Say: A Study of Five System Logs

Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?

A survey of online failure prediction methods