Proceedings Article · DOI: 10.1109/DSN.2006.61
Reliability for Networked Storage Nodes
KK Rao,James Lee Hafner,Richard A. Golding +2 more
- 25 Jun 2006
- pp 237-248
TL;DR: This paper presents alternatives for distributing redundancy across networked storage nodes, develops models to determine the reliability of such systems, and performs sensitivity analyses in which selected parameters are varied to observe their effect on reliability.
Abstract: High-end enterprise storage has traditionally consisted of monolithic systems with customized hardware, multiple redundant components and paths, and no single point of failure. Distributed storage systems realized through networked storage nodes offer several advantages over monolithic systems such as lower cost and increased scalability. In order to achieve reliability goals associated with enterprise-class storage systems, redundancy will have to be distributed across the collection of nodes to tolerate both node and drive failures. In this paper, we present alternatives for distributing this redundancy, and models to determine the reliability of such systems. We specify a reliability target and determine the configurations that meet this target. Further, we perform sensitivity analyses where selected parameters are varied to observe their effect on reliability.
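The abstract does not give the paper's models, so the following is only a minimal sketch of the general kind of calculation involved, not the authors' method. It assumes a textbook birth-death Markov chain: independent units with exponentially distributed failure and repair times, one repair in progress at a time, and data loss once more than `fault_tolerance` units are down. The function name `mttdl_hours`, its parameters, and all numeric values are hypothetical.

```python
def mttdl_hours(n_units, fault_tolerance, mttf_hours, mttr_hours):
    """Mean time to data loss (hours) for a simple birth-death Markov chain.

    State i means i units are currently failed. Data is lost on reaching
    state fault_tolerance + 1. Assumptions (not taken from the paper):
    exponential failures and repairs, independent units, one repair at a time.
    """
    if n_units <= fault_tolerance:
        raise ValueError("n_units must exceed fault_tolerance")
    lam = 1.0 / mttf_hours   # per-unit failure rate
    mu = 1.0 / mttr_hours    # repair rate

    total = 0.0
    prev_step = 0.0          # expected time to first move from state i-1 to i
    for failed in range(fault_tolerance + 1):
        fail_rate = (n_units - failed) * lam
        repair_rate = mu if failed > 0 else 0.0
        # Expected time to first reach state failed+1 from state failed:
        #   D_i = 1/f_i + (r_i / f_i) * D_{i-1}
        step = 1.0 / fail_rate + (repair_rate / fail_rate) * prev_step
        total += step
        prev_step = step
    return total


def meets_target(n_units, fault_tolerance, mttf_hours, mttr_hours, target_hours):
    """Return True if the modeled MTTDL reaches a (hypothetical) target."""
    return mttdl_hours(n_units, fault_tolerance, mttf_hours, mttr_hours) >= target_hours
```

A sensitivity analysis in this style would call `mttdl_hours` repeatedly while varying one input, for example `mttr_hours`, and holding the others fixed. MTTDL is used here only because it is simple to compute; it is one possible reliability metric among several.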
Citations
•Proceedings Article
Mean time to meaningless: MTTDL, Markov models, and storage system reliability
Kevin M. Greenan,James S. Plank,Jay J. Wylie +2 more
- 22 Jun 2010
TL;DR: The storage community needs to replace MTTDL with a metric that can be used to accurately compare the reliability of systems in a way that reflects the impact of data loss in the real world.
Flat XOR-based erasure codes in storage systems: Constructions, efficient recovery, and tradeoffs
Kevin M. Greenan,Xiaozhou Li,Jay J. Wylie +2 more
- 03 May 2010
TL;DR: This paper describes constructions of two novel flat XOR codes, Stepped Combination and HD-Combination codes, and describes an algorithm for flat XOR codes that enumerates recovery equations, i.e., sets of disks that can recover a failed disk.
Higher reliability redundant disk arrays: Organization, operation, and coding
Alexander Thomasian,Mario Blaum +1 more
TL;DR: Variations to RAID5 and RAID6 organizations are described, including clustered RAID, different methods to update parities, rebuild processing, disk scrubbing to eliminate sector errors, and the intra-disk redundancy (IDR) method to deal with sector errors.
Guest paper: Failure trends in a large disk drive population
Eduardo Pinheiro,Wolf-Dietrich Weber,Luiz Andre Barroso +2 more
- 23 Apr 2007
TL;DR: It is found that temperature and activity levels were much less correlated with drive failures than previously reported, and models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures.
References
•Book
Probability and Statistics With Reliability, Queuing and Computer Science Applications
Kishor S. Trivedi
- 01 Jan 1982
TL;DR: Probability and Statistics with Reliability, Queuing and Computer Science Applications, Second Edition is a comprehensive introduction to probability, stochastic processes, and statistics for students of computer science, electrical and computer engineering, and applied mathematics.
•Proceedings Article
Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?
Bianca Schroeder,Garth A. Gibson +1 more
- 13 Feb 2007
TL;DR: In this article, the authors present and analyze field-gathered disk replacement data from a number of large production systems, including high-performance computing sites and internet services sites, and find that in the field, annual disk replacement rates typically exceed 1%, with 2-4% common and up to 13% observed on some systems.
EVENODD: an efficient scheme for tolerating double disk failures in RAID architectures
TL;DR: A novel method for tolerating up to two disk failures in RAID architectures using two redundant disks and simple exclusive-OR computations, which can also be used in any system requiring large symbols and relatively short codes, for instance, in multitrack magnetic recording.
•Proceedings Article
Row-diagonal parity for double disk failure correction
Peter F. Corbett,Bob English,Atul Goel,Tomislav Grcanac,Steve Kleiman,James Leong,Sunitha Sankar +6 more
- 31 Mar 2004
TL;DR: Implementation results show that RDP performance can be made nearly equal to single parity RAID-4 and RAID-5 performance.
Row-Diagonal Parity for Double Disk Failure Correction (Awarded Best Paper!).
Peter F. Corbett,Robert M. English,Atul Goel,Tomislav Grcanac,Steven R. Kleiman,James Leong,Sunitha Sankar +6 more
- 01 Jan 2004
TL;DR: Implementation results show that RDP performance can be made nearly equal to single parity RAID-4 and RAID-5 performance.