Open Access Dissertation
Checkpointing Algorithms for Parallel Computers
S Kalaiselvi
- 01 Feb 1997
Abstract: Dedicated to my beloved Parents and my dear Uncle.
Citations
Software Performance Engineering Considerations in Unreliable Computing Environments.
Pierre M. Fiorini, Yiping Ding +1 more
- 01 Jan 2005
TL;DR: It is shown that if checkpointing mechanisms are not properly implemented, then under certain conditions the completion times of applications executing on the system exhibit properties of heavy-tail or power-tail distributions, which can lead to unpredictable and long completion times.
References
Distributed snapshots: determining global states of distributed systems
K. Mani Chandy, Leslie Lamport
TL;DR: An algorithm by which a process in a distributed system determines a global state of the system during a computation, which helps to solve an important class of problems: stable property detection.
RAID: high-performance, reliable secondary storage
TL;DR: A comprehensive overview of disk array technology is given, and implementation topics such as refining the basic RAID levels to improve performance and designing algorithms to maintain data consistency are discussed.
Checkpointing and Rollback-Recovery for Distributed Systems
Richard Koo, Sam Toueg
TL;DR: In this article, the authors consider the problem of bringing a distributed system to a consistent state after transient failures, and propose a distributed algorithm to create consistent checkpoints, as well as a rollback-recovery algorithm to recover the system from transient failures.
On distributed snapshots
Ten H. Lai, Tao H. Yang
TL;DR: An efficient snapshot algorithm is presented that needs no control messages and does not require channels to be first-in-first-out; it is also shown that several stable properties can be detected with uncoordinated distributed snapshots.
An efficient protocol for checkpointing recovery in distributed systems
TL;DR: In this protocol, a process takes a checkpoint when it knows that all processes on which it computationally depends took their checkpoints, hence the process need not always wait for the decision made by the checkpointing coordinator as in the conventional synchronized protocols.