Checkpointing Algorithms for Parallel Computers

Open Access Dissertation

S Kalaiselvi

- 01 Feb 1997

1

TL;DR: Dedicated to my beloved Parents and my dear Uncle.

Citations

Software Performance Engineering Considerations in Unreliable Computing Environments.

Pierre M. Fiorini, +1 more

- 01 Jan 2005

TL;DR: It is shown that if checkpointing mechanisms are not properly implemented, then under certain conditions the completion times of applications executing on the system exhibit properties of heavy-tail or power-tail distributions, which can lead to unpredictable and long completion times.

1

References

Journal Article • 10.1145/214451.214456

Distributed snapshots: determining global states of distributed systems

K. Mani Chandy, +1 more

- 01 Feb 1985

- ACM Transactions on Computer Systems

TL;DR: An algorithm by which a process in a distributed system determines a global state of the system during a computation, which helps to solve an important class of problems: stable property detection.

2.8K

Journal Article • 10.1145/176979.176981

RAID: high-performance, reliable secondary storage

Peter M. Chen, +4 more

- 01 Jun 1994

- ACM Computing Surveys

TL;DR: A comprehensive overview of disk array technology that also discusses implementation topics such as refining the basic RAID levels to improve performance and designing algorithms to maintain data consistency.

1.4K

Journal Article • 10.1109/TSE.1987.232562

Checkpointing and Rollback-Recovery for Distributed Systems

Richard Koo, +1 more

- 01 Jan 1987

- IEEE Transactions on Software Engineerin...

TL;DR: In this article, the authors consider the problem of bringing a distributed system to a consistent state after transient failures, and propose a distributed algorithm to create consistent checkpoints, as well as a rollback-recovery algorithm to recover the system from transient failures.

911

Journal Article • 10.1016/0020-0190(87)90125-6

On distributed snapshots

Ten H. Lai, +1 more

- 29 May 1987

- Information Processing Letters

TL;DR: An efficient snapshot algorithm is presented that needs no control messages and does not require channels to be first-in-first-out, and it is shown that several stable properties can be detected with uncoordinated distributed snapshots.

262

Journal Article • 10.1109/71.238629

An efficient protocol for checkpointing recovery in distributed systems

J.L. Kim, +1 more

- 01 Aug 1993

- IEEE Transactions on Parallel and Distri...

TL;DR: In this protocol, a process takes a checkpoint when it knows that all processes on which it computationally depends have taken their checkpoints; hence the process need not always wait for the decision made by the checkpointing coordinator, as in conventional synchronized protocols.

137

Related Papers (5)

Network multicomputing using recoverable distributed shared memory

Scalable parallel algorithms for random data accesses and shared-memory simulation

Address tracing for parallel machines

Workstation clusters for parallel computing

The SDAARC architecture