Stable storage

Papers published on a yearly basis

Papers

Journal Article • DOI: 10.1145/214451.214456

Distributed snapshots: determining global states of distributed systems

K. Mani Chandy¹, Leslie Lamport²

University of Texas at Austin¹, SRI International²

01 Feb 1985 • ACM Transactions on Computer Systems

TL;DR: Presents an algorithm by which a process in a distributed system determines a global state of the system during a computation. The algorithm helps solve an important class of problems: stable property detection.

Abstract: This paper presents an algorithm by which a process in a distributed system determines a global state of the system during a computation. Many problems in distributed systems can be cast in terms of the problem of detecting global states. For instance, the global state detection algorithm helps to solve an important class of problems: stable property detection. A stable property is one that persists: once a stable property becomes true it remains true thereafter. Examples of stable properties are “computation has terminated,” “the system is deadlocked,” and “all tokens in a token ring have disappeared.” The stable property detection problem is that of devising algorithms to detect a given stable property. Global state detection can also be used for checkpointing.
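The global state detection described in this abstract can be illustrated with a small marker-based snapshot simulation. This is a minimal sketch, not the paper's own presentation: the `Process` and `Network` classes, the token-transfer workload, and the fixed delivery order are all assumptions made for illustration. It assumes reliable FIFO channels, and records each process's local state plus the messages caught in transit on its incoming channels.

```python
from collections import deque

MARKER = "MARKER"  # special control message that delimits the snapshot

class Process:
    def __init__(self, name, tokens):
        self.name = name
        self.tokens = tokens
        self.recorded_state = None  # local state captured by the snapshot
        self.channel_state = {}     # sender -> messages recorded in transit
        self.recording = set()      # incoming channels still being recorded

class Network:
    """Hypothetical simulator: every ordered pair of processes has a FIFO channel."""
    def __init__(self, balances):
        self.procs = {n: Process(n, t) for n, t in balances.items()}
        self.chan = {(a, b): deque() for a in balances for b in balances if a != b}

    def send(self, src, dst, amount):
        self.procs[src].tokens -= amount
        self.chan[(src, dst)].append(amount)

    def _record(self, p):
        # Record local state, start recording all incoming channels,
        # and send a marker on every outgoing channel.
        p.recorded_state = p.tokens
        p.recording = {s for (s, d) in self.chan if d == p.name}
        p.channel_state = {s: [] for s in p.recording}
        for (s, d), queue in self.chan.items():
            if s == p.name:
                queue.append(MARKER)

    def start_snapshot(self, initiator):
        self._record(self.procs[initiator])

    def deliver(self, src, dst):
        msg = self.chan[(src, dst)].popleft()
        p = self.procs[dst]
        if msg == MARKER:
            if p.recorded_state is None:
                self._record(p)      # first marker seen: record own state now
            p.recording.discard(src)  # channel src -> dst is fully recorded
        else:
            p.tokens += msg
            if src in p.recording:    # message was in transit at snapshot time
                p.channel_state[src].append(msg)

    def snapshot_total(self):
        return sum(
            p.recorded_state + sum(sum(msgs) for msgs in p.channel_state.values())
            for p in self.procs.values()
        )

net = Network({"p": 10, "q": 5})
net.send("p", "q", 3)
net.send("q", "p", 2)
net.start_snapshot("p")   # p records its state and sends a marker to q
net.send("q", "p", 1)     # q has not seen the marker yet
for _ in range(2):
    net.deliver("p", "q")  # the 3 tokens, then the marker
for _ in range(3):
    net.deliver("q", "p")  # 2 tokens, 1 token, then q's marker
total = net.snapshot_total()
```

In this run the recorded local states and channel contents add up to the 15 tokens the system started with, even though the processes recorded at different moments. That conservation is what makes the recorded state usable for checking a stable property such as "all tokens have disappeared."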

2,861 citations

Journal Article • DOI: 10.1109/TSE.1987.232562

Checkpointing and Rollback-Recovery for Distributed Systems

Richard Koo¹, Sam Toueg¹

Cornell University¹

01 Jan 1987 • IEEE Transactions on Software Engineering

TL;DR: Addresses the problem of bringing a distributed system to a consistent state after transient failures, describing a distributed algorithm that creates consistent checkpoints and a rollback-recovery algorithm that restores the system to a consistent state.

Abstract: We consider the problem of bringing a distributed system to a consistent state after transient failures. We address the two components of this problem by describing a distributed algorithm to create consistent checkpoints, as well as a rollback-recovery algorithm to recover the system to a consistent state. In contrast to previous algorithms, these algorithms tolerate failures that occur during their execution. Furthermore, when a process takes a checkpoint, a minimal number of additional processes are forced to take checkpoints. Similarly, when a process rolls back and restarts after a failure, a minimal number of additional processes are forced to roll back with it. Our algorithms require each process to store at most two checkpoints in stable storage. This storage requirement is shown to be minimal under general assumptions.
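The bound of at most two checkpoints per process can be pictured as one committed checkpoint plus one pending checkpoint that is later either promoted or discarded. The sketch below assumes that reading. The `CheckpointStore` class, its method names, and the in-memory dictionaries standing in for stable storage are hypothetical, and the coordination messages between processes are omitted.

```python
class CheckpointStore:
    """Per-process store holding at most two checkpoints (illustrative only)."""

    def __init__(self, initial_state):
        self.permanent = dict(initial_state)  # last committed checkpoint
        self.tentative = None                 # pending checkpoint, if any

    def take_tentative(self, state):
        # Save a candidate checkpoint without touching the committed one.
        self.tentative = dict(state)

    def commit(self):
        # Promote the pending checkpoint; the old committed one is dropped.
        if self.tentative is not None:
            self.permanent, self.tentative = self.tentative, None

    def abort(self):
        # Discard the pending checkpoint; the committed one is untouched.
        self.tentative = None

    def rollback(self):
        # Recovery always restarts from the committed checkpoint.
        return dict(self.permanent)

    def stored(self):
        return 1 + (self.tentative is not None)


store = CheckpointStore({"step": 0})
store.take_tentative({"step": 5})
count_during = store.stored()   # both checkpoints are held at this point
store.abort()                   # e.g. another process could not checkpoint
after_abort = store.rollback()
store.take_tentative({"step": 8})
store.commit()
after_commit = store.rollback()
```

Keeping the committed checkpoint until the pending one is promoted means a failure during checkpointing never leaves the process without a state to roll back to.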

911 citations

Journal Article • DOI: 10.1109/71.730527

Diskless checkpointing

James S. Plank¹, Kai Li², M.A. Puening

University of Tennessee¹, Princeton University²

01 Oct 1998 • IEEE Transactions on Parallel and Distributed Systems

TL;DR: Concludes that diskless checkpointing is a desirable alternative to disk-based checkpointing that can improve the performance of distributed applications in the face of failures.

Abstract: Diskless checkpointing is a technique for checkpointing the state of a long-running computation on a distributed system without relying on stable storage. As such, it eliminates the performance bottleneck of traditional checkpointing on distributed systems. In this paper, we motivate diskless checkpointing and present the basic diskless checkpointing scheme along with several variants for improved performance. The performance of the basic scheme and its variants is evaluated on a high-performance network of workstations and compared to traditional disk-based checkpointing. We conclude that diskless checkpointing is a desirable alternative to disk-based checkpointing that can improve the performance of distributed applications in the face of failures.
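One way to checkpoint without stable storage is to keep each process's checkpoint in memory and hold an XOR parity of all checkpoints on an extra node, so that a single lost checkpoint can be rebuilt from the survivors. The abstract does not spell out its encoding, so the sketch below is an assumption about how such a scheme might look. The function names are hypothetical, and checkpoints are assumed to be byte strings of equal length.

```python
from functools import reduce

def xor_bytes(a, b):
    # Bytewise XOR of two equal-length byte strings.
    return bytes(x ^ y for x, y in zip(a, b))

def parity(checkpoints):
    # Parity block held in the memory of a dedicated checkpoint node.
    return reduce(xor_bytes, checkpoints)

def recover(survivors, parity_block):
    # XOR of the parity with all surviving checkpoints yields the lost one.
    return reduce(xor_bytes, survivors, parity_block)


checkpoints = [b"state-A1", b"state-B2", b"state-C3"]
parity_block = parity(checkpoints)

# Suppose the node holding the second checkpoint fails.
survivors = [checkpoints[0], checkpoints[2]]
rebuilt = recover(survivors, parity_block)
```

This sketch tolerates only one failure at a time. Surviving more simultaneous failures would need a stronger code than single parity.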

385 citations

Patent

Electronic mail system with methodology providing distributed message store

Nikolai Paul Christenson, Scott Lystig Fritchie, James Stephen Larson

12 Dec 2000

TL;DR: Describes a distributed electronic mail system whose message store splits message data into two parts: a mutable metadata portion and an immutable portion.

Abstract: An electronic mail system with a methodology providing distributed message storage and processing is described. In particular, this methodology breaks up how the individual components of message data are stored. Message data itself is broken up into two parts: a metadata (mutable) portion, and an immutable portion. The metadata portion represents that part of the message data that may change over time. This includes message status flags (e.g., the IMAP “message deleted” flag) and the message's position within a particular message folder, among other information. The immutable portion, which comprises the bulk of electronic mail data (namely, the message itself), is never edited once stored. Immutable data is written f+1 times on as many unique servers, to tolerate f server failures using Lampson's stable storage algorithm. The metadata portion is stored 2f+1 times on as many unique servers, to tolerate f server failures using quorum voting. Once the message has been stored, its location is passed around by reference instead of the message being copied. The system utilizes a two-tier architecture. One tier consists of the servers that store message metadata and immutable data, the Data Servers; the other consists of the servers that operate upon those data, the Access Servers. Message store integrity is maintained in the event of server failure and as the set of Data Servers changes. In the latter case, I/O and storage workloads are dynamically redistributed across Data Servers in an efficient way.
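The replica counts in this abstract follow from simple arithmetic: f+1 copies of immutable data leave at least one copy after f failures, and 2f+1 copies of mutable metadata leave a majority of f+1 after f failures. The sketch below works through that arithmetic. The function names and the versioned-reply format in `quorum_read` are illustrative assumptions, not the patent's interface.

```python
def immutable_replicas(f):
    # One surviving copy is enough, because immutable data never changes.
    return f + 1

def metadata_replicas(f):
    # Mutable data needs a majority to survive f failures.
    return 2 * f + 1

def majority(n):
    return n // 2 + 1

def quorum_read(replies, n):
    """Return the newest value among replies from a majority of n servers.

    `replies` is a list of (version, value) pairs; the versioning scheme
    is a hypothetical stand-in for whatever the real system uses.
    """
    if len(replies) < majority(n):
        raise RuntimeError("not enough replies to form a quorum")
    return max(replies, key=lambda reply: reply[0])[1]


f = 1
n = metadata_replicas(f)              # 3 metadata servers
survivors_after_f = n - f             # 2 servers still up
replies = [(1, "seen"), (2, "deleted")]
flag = quorum_read(replies, n)
```

Because any two majorities of the same server set overlap in at least one server, a read quorum always includes a server that saw the latest quorum write. Immutable data needs no such overlap, so it can use fewer copies.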

275 citations

Patent

Database computer system with application recovery and dependency handling write cache

David B. Lomet¹

Microsoft¹

10 Mar 1997

TL;DR: Describes a database computer system and a method for making applications recoverable from system crashes, in which the application state (i.e., address space) is treated as a single object that can be atomically flushed in a manner akin to flushing individual pages in database recovery techniques.

Abstract: This invention concerns a database computer system and method for making applications recoverable from system crashes. The application state (i.e., address space) is treated as a single object which can be atomically flushed in a manner akin to flushing individual pages in database recovery techniques. To enable this monolithic treatment of the application, executions performed by the application are mapped to logical loggable operations which can be posted to the stable log. Any modifications to the application state are accumulated and the application state is periodically flushed to stable storage using an atomic procedure. The application recovery integrates with database recovery, and effectively eliminates or at least substantially reduces the need for checkpointing applications. In addition, optimization techniques are described to make the read, write, and recovery phases more efficient.
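The abstract does not say how its atomic flush is implemented. A common way to get the same all-or-nothing effect on an ordinary file system is to write the new state to a temporary file, force it to disk, and then rename it over the old file. The sketch below shows that generic technique, not the patent's procedure. The function names and the JSON encoding of the state are assumptions.

```python
import json
import os
import tempfile

def atomic_flush(state, path):
    """Replace the file at `path` with `state` so readers see old or new, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)  # same directory, so rename stays on one file system
    try:
        with os.fdopen(fd, "w") as tmp_file:
            json.dump(state, tmp_file)
            tmp_file.flush()
            os.fsync(tmp_file.fileno())  # push the bytes to the storage device
        os.replace(tmp_path, path)       # atomic swap of old state for new
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)          # leave no partial file behind
        raise

def load_state(path):
    with open(path) as state_file:
        return json.load(state_file)


workdir = tempfile.mkdtemp()
state_path = os.path.join(workdir, "app_state.json")
atomic_flush({"processed": 10}, state_path)
atomic_flush({"processed": 25}, state_path)
restored = load_state(state_path)
```

A crash before the rename leaves the previous state file intact, and a crash after it leaves the new one, which is the property recovery relies on.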

227 citations

Year	Papers
2021	4
2020	5
2019	5
2018	4
2017	11
2016	9
