TL;DR: The optimal task to processor assignment is found by an algorithm based on results in Markov decision theory, which is completely general and applicable to N-processor systems.
Abstract: In a distributed computing system made up of different types of processors each processor in the system may have different performance and reliability characteristics. In order to take advantage of this diversity of processing power, a modular distributed program should have its modules assigned in such a way that the applicable system performance index, such as execution time or cost, is optimized. This paper describes an algorithm for making an optimal module to processor assignment for a given performance criteria. We first propose a computational model to characterize distributed programs, consisting of tasks and an operational precedence relationship. This model alows us to describe probabilistic branching as well as concurrent execution in a distributed program. The computational model along with a set of seven program descriptors completely specifies a model for dynamic execution of a program on a distributed system. The optimal task to processor assignment is found by an algorithm based on results in Markov decision theory. The algorithm given in this paper is completely general and applicable to N-processor systems.
TL;DR: A study of bounded clock synchronization under a more severe fault model than that proposed by Lamport and Melliar-Smith [1985] is initiated, and two randomized self-stabilizing protocols for synchronizing bounded clocks in the presence of Byzantine processor failures are presented.
Abstract: We initiate a study of bounded clock synchronization under a more severe fault model than that proposed by Lamport and Melliar-Smith [1985]. Realistic aspects of the problem of synchronizing clocks in the presence of faults are considered. One aspect is that clock synchronization is an on-going task, thus the assumption that some of the processors never fail is too optimistic. To cope with this reality, we suggest self-stabilizing protocols that stabilize in any (long enough) period in which less than a third of the processors are faulty. Another aspect is that the clock value of each processor is bounded. A single transient fault may cause the clock to reach the upper bound. Therefore, we suggest a bounded clock that wraps around when appropriate.We present two randomized self-stabilizing protocols for synchronizing bounded clocks in the presence of Byzantine processor failures. The first protocol assumes that processors have a common pulse, while the second protocol does not. A new type of distributed counter based on the Chinese remainder theorem is used as part of the first protocol.
TL;DR: In this article, uniform self-stabilizing leader election algorithms are presented for two natural classes of network graphs: complete graphs and rings, under either a local or global fairness condition, whereas the second requires global fairness.
Abstract: This paper considers the self-stabilizing leader-election problem in a model of interacting anonymous finite-state agents. Leader election is a fundamental problem in distributed systems; many distributed problems are easily solved with the help of a central coordinator. Self-stabilizing algorithms do not require initialization in order to operate correctly and can recover from transient faults that obliterate all state information in the system. Anonymous finite-state agents model systems of identical simple computational nodes such as sensor networks and biological computers. Self-stabilizing leader election is easily shown to be impossible in such systems without additional structure.
An eventual leader detector Ω? is an oracle that eventually detects the presence or absence of a leader. With the help of Ω?, uniform self-stabilizing leader election algorithms are presented for two natural classes of network graphs: complete graphs and rings. The first algorithm works under either a local or global fairness condition, whereas the second requires global fairness. With only local fairness, uniform self-stabilizing leader election in rings is impossible, even with the help of Ω?.
TL;DR: It is shown how to use the idea of self-stabilization, which originates in the context of distributed control, to make fault-tolerant iterative solvers, and has promise to become a useful tool for constructing resilient solvers more generally.
Abstract: We show how to use the idea of self-stabilization, which originates in the context of distributed control, to make fault-tolerant iterative solvers. Generally, a self-stabilizing system is one that, starting from an arbitrary state (valid or invalid), reaches a valid state within a finite number of steps. This property imbues the system with a natural means of tolerating transient faults. We give two proof-of-concept examples of self-stabilizing iterative linear solvers: one for steepest descent (SD) and one for conjugate gradients (CG). Our self-stabilized versions of SD and CG require small amounts of fault-detection, e.g., we may check only for NaNs and infinities. We test our approach experimentally by analyzing its convergence and overhead for different types and rates of faults. Beyond the specific findings of this paper, we believe self-stabilization has promise to become a useful tool for constructing resilient solvers more generally.
TL;DR: This paper surveys self-stabilizing algorithms that construct a spanning tree within a network of processing entities to find those that are well-suited for dynamic, failure prone environments.
Abstract: Self-stabilizing systems can automatically recover from ar- bitrary state perturbations in finite time. They are therefore well-suited for dynamic, failure prone environments. Spanning-tree construction in distributed systems is a fundamental task which forms the basis for many other network algorithms (like token circulation or routing). This paper surveys self-stabilizing algorithms that construct a spanning tree within a network of processing entities. Lower bounds and related work are also discussed.