Cores that don't count
Peter Hochschild,Paul Turner,Jeffrey C. Mogul,Rama K. Govindaraju,Parthasarathy Ranganathan,David E. Culler,Amin Vahdat +6 more
- 01 Jun 2021
- pp 9-16
TL;DR: In this article, a call-to-action for software-based approaches to mercurial cores is presented, ranging from better detection and isolating mechanisms to methods for tolerating the silent data corruption they cause.
Abstract: We are accustomed to thinking of computers as fail-stop, especially the cores that execute instructions, and most system software implicitly relies on that assumption. During most of the VLSI era, processors that passed manufacturing tests and were operated within specifications have insulated us from this fiction. As fabrication pushes towards smaller feature sizes and more elaborate computational structures, and as increasingly specialized instruction-silicon pairings are introduced to improve performance, we have observed ephemeral computational errors that were not detected during manufacturing tests. These defects cannot always be mitigated by techniques such as microcode updates, and may be correlated to specific components within the processor, allowing small code changes to effect large shifts in reliability. Worse, these failures are often "silent" - the only symptom is an erroneous computation. We refer to a core that develops such behavior as "mercurial." Mercurial cores are extremely rare, but in a large fleet of servers we can observe the disruption they cause, often enough to see them as a distinct problem - one that will require collaboration between hardware designers, processor vendors, and systems software architects. This paper is a call-to-action for a new focus in systems research; we speculate about several software-based approaches to mercurial cores, ranging from better detection and isolating mechanisms, to methods for tolerating the silent data corruption they cause.
Citations
HPC Forecast
TL;DR: This article examines how the technology landscape has changed, surveys the current state of the art, and considers possible future directions for HPC operations and innovation.
12
Architecting Decentralization and Customizability in DNN Accelerators for Hardware Defect Adaptation
TL;DR: In this article, a one-time training of DNNs with Hardware-Aware Dropout/Dropconnect techniques boosts model decentralization and facilitates accurate neural network inference in degraded computational fabrics.
9
HPC Forecast: Cloudy and Uncertain
TL;DR: Building the next generation of leading edge HPC systems will require rethinking many fundamentals and historical approaches by embracing end-to-end co-design; custom hardware configurations and packaging; large-scale prototyping, as was common thirty years ago; and collaborative partnerships with the dominant computing ecosystem companies.
9
Impact of Voltage Scaling on Soft Errors Susceptibility of Multicore Server CPUs
Dimitris Agiakatsikas,George Papadimitriou,Vasileios Karakostas,Dimitris Gizopoulos,Mihalis Psarakis,C. Bélanger-Champagne,Ewart Blackmore +6 more
- 28 Oct 2023
TL;DR: This work assesses the trade-offs between voltage scaling and soft error rate (SER) on a microprocessor system executing workloads on real hardware with a full software stack, and shows that the SER of SRAM arrays can increase by up to 40.4% when the device operates at reduced supply voltage levels.
8
The Future of Design for Test and Silicon Lifecycle Management
Janusz Rajski,Vivek Chickermane,Jean-François Côté,Stephan Eggersglüß,Nilanjan Mukherjee,Jerzy Tyszer +5 more
TL;DR: The future of design for test (DFT) focuses on integrating DFT across the silicon lifecycle to improve reliability for safety-critical applications.
7
References
Practical Byzantine fault tolerance
Miguel Castro,Barbara Liskov +1 more
- 22 Feb 1999
TL;DR: A new replication algorithm that tolerates Byzantine faults, works in asynchronous environments like the Internet, and incorporates several important optimizations that improve the response time of previous algorithms by more than an order of magnitude.
End-to-end arguments in system design
TL;DR: The end-to-end argument suggests that functions placed at low levels of a distributed computer system may be redundant or of little value when compared with the cost of providing them at that low level.
•Book
End-to-end arguments in system design
Jerome H. Saltzer,David P. Reed,David D. Clark +2 more
- 01 Dec 1988
TL;DR: The end-to-end argument suggests that functions placed at low levels of a distributed computer system may be redundant or of little value when compared with the cost of providing them at that low level.
1.4K
The use of triple-modular redundancy to improve computer reliability
R. E. Lyons,W. Vanderkulk +1 more
TL;DR: One of the proposed techniques for meeting the severe reliability requirements inherent in certain future computer applications is described: triple-modular redundancy, which is essentially the use of the two-out-of-three voting concept at a low level.
840
Spanner: Google’s Globally Distributed Database
James C. Corbett,Jeffrey Dean,Michael James Boyer Epstein,Andrew Fikes,Christopher Frost,J. J. Furman,Sanjay Ghemawat,Andrey Gubarev,Christopher Heiser,Peter Hochschild,Wilson C. Hsieh,Sebastian Kanthak,Eugene Kogan,Hongyi Li,Alexander Lloyd,Sergey Melnik,David Mwaura,David Nagle,Sean Quinlan,Rajesh Rao,Lindsay Rolig,Yasushi Saito,Michal Piotr Szymaniak,Chris Jorgen Taylor,Ruth Wang,Dale Woodford +25 more
TL;DR: Spanner is Google's scalable, multiversion, globally distributed, and synchronously replicated database; it is described as the first system to distribute data at global scale and support externally-consistent distributed transactions.
672
Related Papers (5)
Sumit Ghosh
- 01 Oct 2006
Philip Koopman,John Devale +1 more
- 01 Jan 2001
Erven Rohou,David Guyon +1 more
- 01 Jan 2015