Fault tolerant processes

doi:10.1007/BF01784887

Journal Article•10.1007/BF01784887

Pankaj Jalote

- 01 Dec 1989

- Distributed Computing

- Vol. 3, Iss: 4, pp 187-195

37

TL;DR: This paper presents a model for message-logging schemes that support fault-tolerant processes, develops conditions for proper message recovery in asynchronous systems, and shows that requiring messages to be recovered in the same order in which they were received before the failure is stricter than necessary.

Citations

Journal Article•10.1145/568522.568525

A survey of rollback-recovery protocols in message-passing systems

Elmootazbellah Nabil Elnozahy, +3 more

- 01 Sep 2002

- ACM Computing Surveys

TL;DR: This survey covers rollback-recovery techniques that do not require special language constructs and distinguishes between checkpoint-based protocols, which rely solely on checkpointing for system state restoration, and log-based protocols.

2K

Software Implemented Fault Tolerance: Technologies and Experience

Yennun Huang, +1 more

- 01 Jan 1993

TL;DR: Watchd, libft, and nDFS are reusable components that provide up to the third level of software fault tolerance; experience indicates that these modules provide an efficient and economical means to increase the level of fault tolerance in a software product.

188

Proceedings Article•10.1109/FTCS.1990.89386

The Delta-4 extra performance architecture (XPA)

P.A. Barret, +6 more

- 26 Jun 1990

TL;DR: The design of an extra performance architecture for Delta-4, which explicitly supports the requirements of real-time systems with respect to throughput and response, is presented and a solution based on message selection and preemption synchronization messages is proposed.

114

Software Fault Tolerance in the Application Layer

Yennun Huang, +1 more

- 01 Jan 1995

TL;DR: Three reusable software components that provide up to the third level of software fault tolerance in the application layer are described; they have been ported to a number of UNIX platforms and can be used in any application with minimal programming effort.

83

Proceedings Article•10.1109/ICDCS.1997.597853

Distributed recovery with K-optimistic logging

Yi-Min Wang, +2 more

- 27 May 1997

TL;DR: This paper introduces the concept of K-optimistic logging where K is the degree of optimism that can be used to fine-tune the tradeoff between failure-free overhead and recovery efficiency, and proves that only dependencies on those states that may be lost upon a failure need to be tracked on-line.

73

References

Journal Article•10.1145/390016.808467

System structure for software fault tolerance

Brian Randell

- 01 Apr 1975

- Sigplan Notices

TL;DR: This article presents a method for structuring complex computing systems using what it terms "recovery blocks", "conversations", and "fault-tolerant interfaces".

1.8K

Proceedings Article•10.1145/800027.808467

System structure for software fault tolerance

Brian Randell

- 01 Jan 1975

TL;DR: This paper presents a method for structuring complex computing systems using what it terms "recovery blocks", "conversations", and "fault-tolerant interfaces".

1.1K

Journal Article•10.1145/357369.357371

Fail-stop processors: an approach to designing fault-tolerant computing systems

Richard D. Schlichting, +1 more

- 01 Aug 1983

- ACM Transactions on Computer Systems

758

Journal Article•10.1145/2166.357215

Guardians and Actions: Linguistic Support for Robust, Distributed Programs

Barbara Liskov, +1 more

- 01 Jul 1983

- ACM Transactions on Programming Language...

TL;DR: This article presents an integrated programming language and system designed to support the construction and maintenance of distributed programs, that is, programs whose modules reside and execute at communicating but geographically distinct nodes.

628

Proceedings Article•10.5555/800253.807732

A principle for resilient sharing of distributed resources

Peter A. Alsberg, +1 more

- 13 Oct 1976

TL;DR: The single primary, multiple backup technique for resource sharing is shown to have minimal delay and is superior to multiple primary techniques.

503