Evaluating operating system vulnerability to memory errors
Kurt B. Ferreira, Kevin Pedretti, Ron Brightwell, Patrick G. Bridges, David Fiala, Frank Mueller
- 29 Jun 2012
- pp 11
TL;DR: The results show the Kitten lightweight operating system may be an easier target to harden against memory errors due to its smaller memory footprint, largely deterministic state, and simpler system structure.
Abstract: Reliability is of great concern to the scalability of extreme-scale systems. Of particular concern are soft errors in main memory, which are a leading cause of failures on current systems and are predicted to be the leading cause on future systems. While great effort has gone into designing algorithms and applications that can continue to make progress in the presence of these errors without restarting, the most critical software running on a node, the operating system (OS), is currently left relatively unprotected. OS resiliency is of particular importance because, though this software typically occupies only a small portion of a compute node's physical memory, recent studies show more memory errors in this region of memory than in the remainder of the system. In this paper, we investigate the soft error vulnerability of two operating systems used in current and future high-performance computing systems: Kitten, the lightweight kernel developed at Sandia National Laboratories, and CLE, a high-performance Linux-based operating system developed by Cray. For each of these platforms, we outline major structures and subsystems that are vulnerable to soft errors and describe methods that could be used to reconstruct damaged state. Our results show the Kitten lightweight operating system may be an easier target to harden against memory errors due to its smaller memory footprint, largely deterministic state, and simpler system structure.
Figures

Figure 1. Comparison of Linux Kernel and Kitten Kernel source lines of code (SLOC).
Figure 2. Physical memory layout of Kitten and Linux.
Figure 3. Comparison of the worst case Kitten static and dynamic kernel size to the average case measured on CLE. The average CLE memory footprint is an order of magnitude larger than the worst case for Kitten.
Citations
Hobbes: composition and virtualization as the foundations of an extreme-scale OS/R
Ron Brightwell, Ron A. Oldfield, Arthur B. Maccabe, David E. Bernholdt
- 10 Jun 2013
TL;DR: Hobbes is an operating system and runtime (OS/R) framework for extreme-scale systems that uses virtualization technologies to provide the flexibility to support the requirements of application components for different node-level operating systems and runtimes, as well as different mappings of the components onto the hardware.
Effectiveness of Fault Detection Mechanisms in Static and Dynamic Operating System Designs
Martin Hoffmann, Christoph Borchert, Christian Dietrich, Horst Schirmeier, Rüdiger Kapitza, Olaf Spinczyk, Daniel Lohmann
- 10 Jun 2014
TL;DR: This work quantifies the difference in vulnerability to soft errors in main memory between a flexible (dynamic) operating system (eCos) and a static system (CiAO), which has an OSEK-compliant structure, and analyzes the additional degree of robustness that is achieved by hardening an operating system with software-based and hardware-based fault-tolerance measures, along with the corresponding costs.
What is a Lightweight Kernel
Rolf Riesen, Arthur Maccabe, Balazs Gerofi, David N. Lombard, John Jack Lange, Kevin Pedretti, Kurt B. Ferreira, Michael Lang, Pardo Keppel, Robert W. Wisniewski, Ron Brightwell, Todd A. Inglett, Yoonho Park, Yutaka Ishikawa
- 16 Jun 2015
TL;DR: Describes what is meant by the term lightweight kernel and what makes LWKs different from other operating system kernels, noting that no single definition of a lightweight kernel exists.
RelaxFault memory repair
Dong Wan Kim, Mattan Erez
- 18 Jun 2016
TL;DR: It is shown that RelaxFault provides better repair capability than prior work of similar cost, improves memory reliability to a greater extent, and significantly reduces the number of maintenance events and memory module replacements.
Patent
Page retirement in a NAND flash memory system
Charles J. Camp, Ioannis Koltsidas, Roman A. Pletka, Andrew D. Walls
- 04 Dec 2013
TL;DR: In this patent, a page is the smallest granularity of the NVRAM array that can be accessed by read and write operations, and a memory block containing multiple pages is the smallest granularity of the NVRAM array that can be erased.
References
Algorithm-Based Fault Tolerance for Matrix Operations
Kuang-Hua Huang, Abraham
TL;DR: Algorithm-based fault tolerance schemes are proposed to detect and correct errors when matrix operations such as addition, multiplication, scalar product, LU-decomposition, and transposition are performed using multiple processor systems.
A Large-Scale Study of Failures in High-Performance Computing Systems
Bianca Schroeder, Garth A. Gibson
TL;DR: Analysis of failure data collected at two large high-performance computing sites finds that average failure rates differ wildly across systems, ranging from 20-1000 failures per year, and that time between failures is modeled well by a Weibull distribution with decreasing hazard rate.
SWIFT: Software Implemented Fault Tolerance
George A. Reis, Jonathan Chang, Neil Vachharajani, Ram Rangan, David I. August
- 20 Mar 2005
TL;DR: A novel, software-only, transient-fault-detection technique, called SWIFT, which efficiently manages redundancy by reclaiming unused instruction-level resources present during the execution of most programs and provides a high level of protection and performance with an enhanced control-flow checking mechanism.
A large-scale study of failures in high-performance computing systems
Bianca Schroeder, Garth A. Gibson
- 25 Jun 2006
TL;DR: Analysis of failure data collected at two large high-performance computing sites finds that average failure rates differ wildly across systems, ranging from 20-1000 failures per year, and that time between failures is modeled well by a Weibull distribution with decreasing hazard rate.
Error detection by duplicated instructions in super-scalar processors
TL;DR: EDDI can provide over 98% fault-coverage without any extra hardware for error detection, which is especially useful when designers cannot change the hardware, but they need dependability in the computer system.