Journal Article, DOI: 10.1007/BF01784887
Fault tolerant processes
37
TL;DR: This paper presents a model for message-logging based schemes that support fault-tolerant processes and develops conditions for proper message recovery in asynchronous systems. It shows that requiring messages to be recovered in the same order as they were received before failure is stricter than necessary.
Abstract: A process is said to be fault tolerant if the system provides proper service despite the failure of the process. For supporting fault-tolerant processes, measures have to be provided to recover messages lost due to the failure. One approach for recovering messages is to use message-logging techniques. In this paper, we present a model for message-logging based schemes to support fault-tolerant processes and develop conditions for proper message recovery in asynchronous systems. We show that requiring messages to be recovered in the same order as they were received before failure is a stricter requirement than necessary. We then propose a distributed scheme to support fault-tolerant processes that can also handle multiple process failures.
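The claim that same-order recovery is stricter than necessary can be illustrated with a toy example. The sketch below is not the paper's model or its recovery conditions; it assumes a hypothetical process whose state depends only on the order of messages from each individual sender, and a log that survives the failure. All names (`LoggedProcess`, `recover`, the update rule) are invented for illustration.

```python
from collections import defaultdict


class LoggedProcess:
    """Toy process: one accumulator per sender, every received message is logged."""

    def __init__(self):
        self.totals = defaultdict(int)  # volatile state, lost on failure
        self.log = []                   # assumed to survive the failure

    def receive(self, sender, seq, value):
        # Log first, then apply, so the log covers everything reflected in state.
        self.log.append((sender, seq, value))
        self.apply(sender, value)

    def apply(self, sender, value):
        # Order-sensitive per sender, independent across senders (an assumption).
        self.totals[sender] = self.totals[sender] * 2 + value


def recover(log, order_key):
    """Rebuild state by replaying logged messages in the order given by order_key."""
    fresh = LoggedProcess()
    for sender, _seq, value in sorted(log, key=order_key):
        fresh.apply(sender, value)
    return dict(fresh.totals)


# Failure-free run: messages from A and B arrive interleaved.
p = LoggedProcess()
p.receive("A", 0, 1)
p.receive("B", 0, 5)
p.receive("A", 1, 2)
p.receive("B", 1, 3)
before = dict(p.totals)

# Replay in a different interleaving that still respects each sender's order.
replayed = recover(p.log, order_key=lambda m: (m[0], m[1]))

# Replay that reverses each sender's own order.
scrambled = recover(p.log, order_key=lambda m: (m[0], -m[1]))

print(replayed == before)   # state recovered despite a different global order
print(scrambled == before)  # per-sender order violated, state differs
```

In this toy setting the original receive order is sufficient for correct recovery but not necessary: any replay that preserves each sender's own sequence reaches the same state. Which orderings are actually admissible in the paper's model is defined by the conditions the authors develop, not by this sketch.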
Citations
A survey of rollback-recovery protocols in message-passing systems
TL;DR: This survey covers rollback-recovery techniques that do not require special language constructs and distinguishes between checkpoint-based protocols, which rely solely on checkpointing for system state restoration, and log-based protocols, which combine checkpointing with logging of nondeterministic events.
Software Implemented Fault Tolerance: Technologies and Experience
Yennun Huang, Chandra M. R. Kintala
- 01 Jan 1993
TL;DR: Watchd, libft and nDFS are reusable components that provide up to the 3rd level of software fault tolerance. Experience indicates that these modules provide efficient and economical means to increase the level of fault tolerance in a software product.
188
The Delta-4 extra performance architecture (XPA)
P.A. Barret, A.M. Hilborne, P.G. Bond, DT Seaton, Paulo Veríssimo, Luís Rodrigues, Neil A. Speirs
- 26 Jun 1990
TL;DR: The design of an extra performance architecture for Delta-4, which explicitly supports the requirements of real-time systems with respect to throughput and response, is presented and a solution based on message selection and preemption synchronization messages is proposed.
114
Software Fault Tolerance in the Application Layer
Yennun Huang, Chandra M. R. Kintala
- 01 Jan 1995
TL;DR: This paper describes three reusable software components that provide up to the third level of software fault tolerance in the application layer. They have been ported to a number of UNIX platforms and can be used in any application with minimal programming effort.
Distributed recovery with K-optimistic logging
Yi-Min Wang, Om P. Damani, Vijay K. Garg
- 27 May 1997
TL;DR: This paper introduces the concept of K-optimistic logging where K is the degree of optimism that can be used to fine-tune the tradeoff between failure-free overhead and recovery efficiency, and proves that only dependencies on those states that may be lost upon a failure need to be tracked on-line.
References
System structure for software fault tolerance
Brian Randell
- 01 Jan 1975
TL;DR: In this article, the authors present a method for structuring complex computing systems by the use of what they term "recovery blocks", "conversations", and "fault-tolerant interfaces".
1.1K
Guardians and Actions: Linguistic Support for Robust, Distributed Programs
TL;DR: In this article, the authors present an integrated programming language and system designed to support the construction and maintenance of distributed programs: programs in which modules reside and execute at communicating, but geographically distinct, nodes.
628
A principle for resilient sharing of distributed resources
Peter A. Alsberg, John D. Day
- 13 Oct 1976
TL;DR: The single primary, multiple backup technique for resource sharing is shown to have minimal delay and is superior to multiple primary techniques.