Failure transparency

Papers

Failure transparency in remote procedure calls

Kaliappa Ravindran¹, Samuel T. Chanson²

Bell-Northern Research¹, University of British Columbia²

01 Aug 1989, IEEE Transactions on Computers

TL;DR: A model of remote procedure call which reflects certain generic properties of the application layer that can be exploited by the RPC layer during failure recovery is presented and a technique of adopting orphans caused by failures, which is based on the model, is described.

Abstract: A model of remote procedure call (RPC) which reflects certain generic properties of the application layer that can be exploited by the RPC layer during failure recovery is presented. A technique of adopting orphans caused by failures, which is based on the model, is described. The technique minimizes the rollback which may be required in orphan-killing techniques. Algorithmic details of the adoption technique are described, and a quantitative analysis is presented. The model is implemented as a prototype on a local area network. The simplicity and generality of the failure recovery renders the RPC model useful in distributed systems, particularly those that are large and heterogeneous and hence have complex failure modes.

19 citations

Journal Issue, DOI: 10.1002/CPE.V21:8

WS-Naming: location migration, replication, and failure transparency support for Web Services

Andrew S. Grimshaw¹, Mark Morgan¹, Karolina Sarnowska¹

University of Virginia¹

01 Jun 2009, Concurrency and Computation: Practice and Experience

TL;DR: This paper shows how the WS-Naming profile on WS-Addressing Endpoint References can be used for identity, transparent failover, replication, and migration in the Web Services realm.

Abstract: Naming transparencies, i.e. abstracting the name and binding of the entity being used from the endpoints that are actually doing the work, are used in distributed systems to simplify application development by hiding the complexity of the environment. In this paper, we demonstrate how to apply traditional distributed systems naming and binding techniques in the Web Services realm. Specifically, we show how the WS-Naming profile on WS-Addressing Endpoint References can be used for identity, transparent failover, replication, and migration. We begin with a discussion of the traditional distributed systems transparencies. We then present four detailed use cases. Next, we provide a brief background on both WS-Addressing and WS-Naming. Finally, we show how WS-Naming can be used to provide transparent implementations of our use cases. Copyright © 2009 John Wiley & Sons, Ltd.
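The idea of separating a stable name from the endpoint currently doing the work can be illustrated with a minimal Python sketch. It is not the WS-Naming or WS-Addressing API: `EndpointReference`, `Resolver`, `call_with_rebind`, and the `transport` callable are hypothetical names chosen for illustration, and the resolver is assumed to be a simple in-memory table.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


class EndpointUnreachable(Exception):
    """Raised by a transport when the bound address does not answer."""


@dataclass
class EndpointReference:
    abstract_name: str  # stable identity, independent of location (assumed field)
    address: str        # current binding; may become stale after a failure


class Resolver:
    """Hypothetical resolver mapping an abstract name to candidate addresses."""

    def __init__(self, table: Dict[str, List[str]]):
        self._table = table

    def resolve(self, abstract_name: str) -> List[str]:
        return list(self._table.get(abstract_name, []))


def call_with_rebind(
    epr: EndpointReference,
    resolver: Resolver,
    transport: Callable[[str, str], str],
    request: str,
) -> str:
    """Try the current binding first, then rebind to other addresses of the same name."""
    others = [a for a in resolver.resolve(epr.abstract_name) if a != epr.address]
    tried = []
    for address in [epr.address] + others:
        try:
            reply = transport(address, request)
        except EndpointUnreachable:
            tried.append(address)
            continue
        epr.address = address  # remember the binding that worked
        return reply
    raise EndpointUnreachable(
        f"no reachable endpoint for {epr.abstract_name}; tried {tried}"
    )
```

The caller only ever holds the abstract name and a possibly stale address, so a failed replica is hidden behind the rebind loop. Whether retrying a request is safe depends on the operation being idempotent, which this sketch does not check.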

5 citations

Failure and its Recovery in an Object-Oriented Distributed System

S. Crane, Brendan Tangney

1 Jul 1991

TL;DR: This paper describes a method for recovering permanent object state in an object-oriented distributed system and recommends that the user be insulated to the greatest possible degree from failure and its recovery and that the resulting system be as efficient as possible under normal conditions.

Abstract: This paper describes a method for recovering permanent object state in an object-oriented distributed system. Inspiration for this work was derived from observation of the lengths to which programmers have traditionally been forced to go in order to make their programs resilient to failure. This experience led to the decision that such a burden was unacceptable and that the onus of recovery be shifted onto the underlying operating system. Further goals were that the user be insulated to the greatest possible degree from failure and its recovery (failure transparency) and that the resulting system be as efficient as possible under normal conditions.

4 citations

Theory and practice of failure transparency

David E. Lowell, Peter M. Chen

1 Jan 1999

TL;DR: A theory of consistent recovery is constructed that provides the fundamental rules for recovering transparently after a failure, and it is found that failure transparency is feasible, even for the challenging application domain the authors target.

Abstract: System and application failures are all too common. In this dissertation we argue that operating systems should provide the fundamental abstraction we call failure transparency: the illusion that systems and applications do not fail. Systems that provide failure transparency attempt to completely mask failures from users, and failure handling from programmers. We construct a theory of consistent recovery that provides the fundamental rules for recovering transparently after a failure. In addition to aiding our quest for failure transparency, the theory unifies all existing recovery protocols: they are all simply variations on the theme of the theory's central invariant. Using the theory as a launching point, we construct a series of systems that get us closer to providing failure transparency. The first such system is Vista, a lightweight transaction library. Vista is built on reliable memory, and as a result realizes remarkable performance and simplicity. Vista improves transaction performance by three orders of magnitude over a similar disk-based system yet has 1/10th the code. Vista exposes the high cost in complexity of disk's slow performance. We use Vista to construct Vistagrams, a distributed system that can provide distributed recovery with almost no overhead. However, both Vista and Vistagrams depend on the programmer's help in guaranteeing consistent recovery. Therefore, they cannot be said to provide failure transparency. To get us closer to that goal, we construct Discount Checking, a lightweight checkpointing system. Discount Checking can preserve and recover the complete state of a running process, including significant kernel state, despite being itself a user-level system. Using Discount Checking's fast checkpoints, we construct seven recovery protocols and show the performance of each on a wide variety of real, interactive applications. We find that we can provide failure transparency with overhead of 0–2%. We conclude failure transparency is feasible, even for the challenging application domain we target.

4 citations

Book Chapter, DOI: 10.1007/978-0-387-35188-9_17

The persistent object group service—an approach to fault tolerance of open distributed applications

M. Zweiacker

30 May 1997

TL;DR: The Persistent Object Group Service (POGS) described in this paper keeps track of the state of a distributed application as far as global checkpoint consistency is concerned, and is a necessary step on the way to integrating fault tolerance mechanisms in a late stage of the software development process.

Abstract: Fault tolerance is an issue of high importance to distributed systems, a fact that is well identified in the ISO/ITU Reference Model of ODP by the inclusion of failure transparency. The Persistent Object Group Service (POGS) described in this paper keeps track of the state of a distributed application as far as global checkpoint consistency is concerned. Application objects take checkpoints of their own in a non-coordinated fashion, using the POGS to detect global state inconsistencies. As a consequence of consulting POGS, objects take additional checkpoints that would not have occurred otherwise, but which are necessary to ensure global state consistency. The advantage of the POGS approach lies in the fact that global checkpoint consistency control is separated from the objects that actually do the checkpointing. This is a necessary step on the way to integrating fault tolerance mechanisms in a late stage of the software development process. A prototype of the POGS has been implemented in the Swiss Telecom R&D laboratories, using CORBA as a standard distributed systems technology.
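To make the notion of a global state inconsistency concrete, here is a minimal Python sketch of one textbook check: a set of independently taken checkpoints is inconsistent if some checkpoint records a message as received that the sender's checkpoint does not record as sent. This is not the POGS interface or algorithm, which the abstract does not detail. The per-peer message counters and the name `find_orphan_messages` are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Assumed checkpoint shape: per-peer counters of messages sent and received
# at the moment the checkpoint was taken.
Checkpoint = Dict[str, Dict[str, int]]  # {"sent": {peer: n}, "recv": {peer: n}}


def find_orphan_messages(
    checkpoints: Dict[str, Checkpoint],
) -> List[Tuple[str, str, int]]:
    """Return (sender, receiver, count) for messages received but not yet sent.

    A non-empty result means the checkpoints do not form a consistent
    global state.
    """
    orphans = []
    for receiver, cp in checkpoints.items():
        for sender, received in cp["recv"].items():
            # A sender without a checkpoint is treated as having sent nothing.
            sender_cp = checkpoints.get(sender, {"sent": {}})
            sent = sender_cp["sent"].get(receiver, 0)
            if received > sent:
                orphans.append((sender, receiver, received - sent))
    return orphans


# Object A checkpointed after sending 2 messages to B,
# but B's checkpoint already records 3 messages from A.
snapshot = {
    "A": {"sent": {"B": 2}, "recv": {}},
    "B": {"sent": {}, "recv": {"A": 3}},
}
print(find_orphan_messages(snapshot))
```

In this example one remedy would be for A to take an additional checkpoint after its third send, which is in the spirit of the extra checkpoints the abstract mentions. Keeping the check in a separate function mirrors the separation of consistency control from the objects that do the checkpointing.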

2 citations
