TL;DR: A model of remote procedure call which reflects certain generic properties of the application layer that can be exploited by the RPC layer during failure recovery is presented and a technique of adopting orphans caused by failures, which is based on the model, is described.
Abstract: A model of remote procedure call (RPC) which reflects certain generic properties of the application layer that can be exploited by the RPC layer during failure recovery is presented. A technique of adopting orphans caused by failures, which is based on the model, is described. The technique minimizes the rollback which may be required in orphan-killing techniques. Algorithmic details of the adoption technique are described, and a quantitative analysis is presented. The model is implemented as a prototype on a local area network. The simplicity and generality of the failure recovery render the RPC model useful in distributed systems, particularly those that are large and heterogeneous and hence have complex failure modes.
TL;DR: This paper shows how the WS-Naming profile on WS-Addressing Endpoint References can be used for identity, transparent failover, replication, and migration in the Web Services realm.
TL;DR: This paper describes a method for recovering permanent object state in an object-oriented distributed system and recommends that the user be insulated to the greatest possible degree from failure and its recovery and that the resulting system be as efficient as possible under normal conditions.
Abstract: This paper describes a method for recovering permanent object state in an object-oriented distributed system. Inspiration for this work was derived from observation of the lengths to which programmers have traditionally been forced to go in order to make their programs resilient to failure. This experience led to the decision that such a burden was unacceptable and that the onus of recovery be shifted onto the underlying operating system. Further goals were that the user be insulated to the greatest possible degree from failure and its recovery (failure transparency) and that the resulting system be as efficient as possible under normal conditions.
TL;DR: A theory of consistent recovery is constructed that provides the fundamental rules for recovering transparently after a failure, and it is found that failure transparency is feasible, even for the challenging application domain the authors target.
Abstract: System and application failures are all too common. In this dissertation we argue that operating systems should provide the fundamental abstraction we call failure transparency: the illusion that systems and applications do not fail. Systems that provide failure transparency attempt to completely mask failures from users, and failure handling from programmers. We construct a theory of consistent recovery that provides the fundamental rules for recovering transparently after a failure. In addition to aiding our quest for failure transparency, the theory unifies all existing recovery protocols: they are all simply variations on the theme of the theory's central invariant. Using the theory as a launching point, we construct a series of systems that get us closer to providing failure transparency. The first such system is Vista, a lightweight transaction library. Vista is built on reliable memory, and as a result realizes remarkable performance and simplicity. Vista improves transaction performance by three orders of magnitude over a similar disk-based system yet has 1/10th the code. Vista exposes the high cost in complexity of disk's slow performance. We use Vista to construct Vistagrams, a distributed system that can provide distributed recovery with almost no overhead. However, both Vista and Vistagrams depend on the programmer's help in guaranteeing consistent recovery. Therefore, they cannot be said to provide failure transparency. To get us closer to that goal, we construct Discount Checking, a lightweight checkpointing system. Discount Checking can preserve and recover the complete state of a running process, including significant kernel state, despite being itself a user-level system. Using Discount Checking's fast checkpoints, we construct seven recovery protocols and show the performance of each on a wide variety of real, interactive applications. We find that we can provide failure transparency with overhead of 0–2%. We conclude that failure transparency is feasible, even for the challenging application domain we target.
TL;DR: The Persistent Object Group Service (POGS) described in this paper keeps track of the state of a distributed application as far as global checkpoint consistency is concerned, and is a necessary step on the way to integrating fault tolerance mechanisms in a late stage of the software development process.
Abstract: Fault tolerance is an issue of high importance to distributed systems, a fact that is well identified in the ISO/ITU Reference Model of ODP by the inclusion of failure transparency. The Persistent Object Group Service (POGS) described in this paper keeps track of the state of a distributed application as far as global checkpoint consistency is concerned. Application objects take checkpoints of their own in a non-coordinated fashion, using the POGS to detect global state inconsistencies. As a consequence of consulting POGS, objects take additional checkpoints that would not have occurred otherwise, but which are necessary to ensure global state consistency. The advantage of the POGS approach lies in the fact that global checkpoint consistency control is separated from the objects that actually do the checkpointing. This is a necessary step on the way to integrating fault tolerance mechanisms in a late stage of the software development process. A prototype of the POGS has been implemented in the Swiss Telecom R&D laboratories, using CORBA as a standard distributed systems technology.
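The abstract above does not say how POGS detects global state inconsistencies. As a rough illustration of the general idea only, the sketch below applies the textbook "no orphan message" test to a set of uncoordinated checkpoints: a global checkpoint is inconsistent if some process has recorded receiving more messages from a peer than that peer's checkpoint records sending. The `Checkpoint` class, its counters, and the function names are hypothetical and are not taken from the POGS paper.

```python
from typing import Dict, List, Tuple


class Checkpoint:
    """Hypothetical per-process checkpoint record (not the POGS data model).

    sent[p]     = number of messages this process had sent to p at checkpoint time
    received[p] = number of messages this process had received from p at checkpoint time
    """

    def __init__(self, sent: Dict[str, int], received: Dict[str, int]) -> None:
        self.sent = sent
        self.received = received


def orphan_messages(cut: Dict[str, Checkpoint]) -> List[Tuple[str, str]]:
    """Return (sender, receiver) pairs whose checkpoints disagree.

    Assumes every process named in a `received` map also has a checkpoint
    in `cut`; a missing sender is treated as having sent nothing.
    """
    orphans = []
    for receiver, cp in cut.items():
        for sender, n_received in cp.received.items():
            sender_cp = cut.get(sender)
            n_sent = sender_cp.sent.get(receiver, 0) if sender_cp else 0
            if n_received > n_sent:
                # The receiver remembers a message the sender has "not yet sent".
                orphans.append((sender, receiver))
    return orphans


def is_consistent(cut: Dict[str, Checkpoint]) -> bool:
    """A global checkpoint is consistent when it contains no orphan messages."""
    return not orphan_messages(cut)
```

In this sketch, a coordinator that finds an orphan pair could ask the sender to take an additional checkpoint, which is one plausible reading of the "additional checkpoints" the abstract mentions; the paper's actual protocol may differ.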