Evaluating operating system vulnerability to memory errors

doi:10.1145/2318916.2318930

Open Access•Proceedings Article•10.1145/2318916.2318930

Kurt B. Ferreira, +5 more

- 29 Jun 2012

- pp 11

15

TL;DR: The results show the Kitten lightweight operating system may be an easier target to harden against memory errors due to its smaller memory footprint, largely deterministic state, and simpler system structure.

Abstract: Reliability is of great concern to the scalability of extreme-scale systems. Of particular concern are soft errors in main memory, which are a leading cause of failures on current systems and are predicted to be the leading cause on future systems. While great effort has gone into designing algorithms and applications that can continue to make progress in the presence of these errors without restarting, the most critical software running on a node, the operating system (OS), is currently left relatively unprotected. OS resiliency is of particular importance because, though this software typically represents a small footprint of a compute node's physical memory, recent studies show more memory errors in this region of memory than the remainder of the system. In this paper, we investigate the soft error vulnerability of two operating systems used in current and future high-performance computing systems: Kitten, the lightweight kernel developed at Sandia National Laboratories, and CLE, a high-performance Linux-based operating system developed by Cray. For each of these platforms, we outline major structures and subsystems that are vulnerable to soft errors and describe methods that could be used to reconstruct damaged state. Our results show the Kitten lightweight operating system may be an easier target to harden against memory errors due to its smaller memory footprint, largely deterministic state, and simpler system structure.

Figures

Figure 1. Comparison of Linux Kernel and Kitten Kernel source lines of code (SLOC).

Figure 3. Comparison of the worst-case Kitten static and dynamic kernel size to the average case measured on CLE. The average CLE memory footprint is an order of magnitude larger than the worst case for Kitten.

Figure 2. Physical memory layout of Kitten and Linux.

Citations

Proceedings Article•10.1145/2491661.2481427

Hobbes: composition and virtualization as the foundations of an extreme-scale OS/R

Ron Brightwell, +3 more

- 10 Jun 2013

TL;DR: Hobbes is an operating system and runtime (OS/R) framework for extreme-scale systems that uses virtualization technologies to provide the flexibility to support the requirements of application components for different node-level operating systems and runtimes, as well as different mappings of the components onto the hardware.

53

Proceedings Article•10.1109/ISORC.2014.26

Effectiveness of Fault Detection Mechanisms in Static and Dynamic Operating System Designs

Martin Hoffmann, +6 more

- 10 Jun 2014

TL;DR: This work quantifies the difference in vulnerability to soft errors in main memory between a flexible (dynamic) operating system (eCos) and a static system (CiAO) that has an OSEK-compliant structure. It also analyzes the additional degree of robustness achieved by hardening an operating system with software-based and hardware-based fault-tolerance measures, and the corresponding costs.

19

Proceedings Article•10.1145/2768405.2768414

What is a Lightweight Kernel

Rolf Riesen, +13 more

- 16 Jun 2015

TL;DR: This paper describes what is meant by the term lightweight kernel and what makes LWKs different from other operating system kernels, noting that no single definition for a lightweight kernel exists.

16

Journal Article•10.1145/3007787.3001205

RelaxFault memory repair

Dong Wan Kim, +1 more

- 18 Jun 2016

TL;DR: It is shown that RelaxFault provides better repair capability than prior work of similar cost, improves memory reliability to a greater extent, and significantly reduces the number of maintenance events and memory module replacements.

16

Patent

Page retirement in a NAND flash memory system

Charles J. Camp, +3 more

- 04 Dec 2013

TL;DR: In this patent, a page is the smallest granularity of the NVRAM array that can be accessed by read and write operations, and a memory block containing multiple pages is the smallest granularity of the NVRAM array that can be erased.

15

References

Journal Article•10.1109/TC.1984.1676475

Algorithm-Based Fault Tolerance for Matrix Operations

Kuang-Hua Huang, +1 more

- 01 Jun 1984

- IEEE Transactions on Computers

TL;DR: Algorithm-based fault tolerance schemes are proposed to detect and correct errors when matrix operations such as addition, multiplication, scalar product, LU-decomposition, and transposition are performed using multiple processor systems.

1.4K

Journal Article•10.1109/TDSC.2009.4

A Large-Scale Study of Failures in High-Performance Computing Systems

Bianca Schroeder, +1 more

- 01 Oct 2010

- IEEE Transactions on Dependable and Secu...

TL;DR: Analysis of failure data collected at two large high-performance computing sites finds that average failure rates differ wildly across systems, ranging from 20-1000 failures per year, and that time between failures is modeled well by a Weibull distribution with decreasing hazard rate.

912

Proceedings Article•10.1109/CGO.2005.34

SWIFT: Software Implemented Fault Tolerance

George A. Reis, +4 more

- 20 Mar 2005

TL;DR: A novel, software-only, transient-fault-detection technique, called SWIFT, which efficiently manages redundancy by reclaiming unused instruction-level resources present during the execution of most programs and provides a high level of protection and performance with an enhanced control-flow checking mechanism.

800

Proceedings Article•10.1109/DSN.2006.5

A large-scale study of failures in high-performance computing systems

Bianca Schroeder, +1 more

- 25 Jun 2006

TL;DR: Analysis of failure data collected at two large high-performance computing sites finds that average failure rates differ wildly across systems, ranging from 20-1000 failures per year, and that time between failures is modeled well by a Weibull distribution with decreasing hazard rate.

716

Journal Article•10.1109/24.994913

Error detection by duplicated instructions in super-scalar processors

Nahmsuk Oh, +2 more

- 07 Aug 2002

- IEEE Transactions on Reliability

TL;DR: EDDI can provide over 98% fault-coverage without any extra hardware for error detection, which is especially useful when designers cannot change the hardware, but they need dependability in the computer system.

667