TL;DR: This paper presents a combinatorial method for the reliability analysis of systems subject to competing propagated failures and failure isolation effects. The method is based on the total probability theorem; it is analytical and exact, and it places no restriction on the type of time-to-failure distributions of the system components.
TL;DR: Building software to "fail fast" is a simple technique that dramatically reduces the number of hard-to-diagnose bugs. It does not reduce the overall number of bugs, at least not at first, but it makes most defects much easier to find.
Abstract: The most annoying aspect of software development is debugging. We don't mind the kinds of bugs that yield to a few minutes' inspection. The bugs we hate are the ones that show up only after hours of successful operation, under unusual circumstances, or whose stack traces lead to dead ends. Fortunately, there's a simple technique that dramatically reduces the number of these bugs in our software. It won't reduce the overall number of bugs, at least not at first, but it'll make most defects much easier to find. The technique is to build our software to "fail fast".
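As an illustration of the "fail fast" idea described in this abstract, here is a minimal Python sketch. It is not taken from the paper: the function name `read_max_connections` and the setting name `max_connections` are hypothetical, chosen only for the example. The point is that a missing or malformed setting raises an error at the place where it is read, instead of being replaced by a silent default that causes a confusing failure much later.

```python
def read_max_connections(config: dict) -> int:
    """Return the configured connection limit, failing fast on bad input.

    Hypothetical example: 'max_connections' is an assumed setting name.
    """
    if "max_connections" not in config:
        # Fail fast: report the missing key here, at the point of use,
        # rather than falling back to a default that hides the problem.
        raise KeyError("missing required setting 'max_connections'")

    value = config["max_connections"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
        raise ValueError(
            f"'max_connections' must be a positive integer, got {value!r}"
        )
    return value
```

With this shape, a call such as `read_max_connections({})` stops immediately with an error that names the missing setting, so the stack trace points at the cause rather than at a distant symptom.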
TL;DR: Repeatable failures can be specified as violations of invariant properties (i.e., properties that must always hold) in a system. An algorithm is presented that refines the system model and labels the states of the refined system where the property is violated.
Abstract: In our earlier work, we introduced a state-based approach for the diagnosis of repeatedly occurring failures in discrete event systems (DESs). Since temporal logic provides a simpler way of specifying system properties, in this paper a temporal-logic-based approach for diagnosing the occurrence of a repeated number of failures is developed. Linear-time temporal-logic (LTL) formulae are used to represent the specifications of DESs. Notions of prediagnosability for failures and diagnosability for repeated failures are introduced in the setting of temporal logic. A polynomial algorithm for the test of prediagnosability for failures is provided. The diagnosis problem for repeated failures in the temporal-logic setting is reduced to one in a state-based setting, and so the prior results of a state-based repeated failure diagnosis can be applied. Finally, a simple example is given for illustration. Note to Practitioners: Certain failures in a system are repeatable, such as routing errors in a manufacturing system. A theory for the diagnosis of such failures was presented in an earlier work of Jiang et al. The present paper uses temporal logic to specify such failures. It turns out that repeatable failures can be specified as violations of invariant properties (i.e., properties that must always hold). Given an invariant property that the system must always satisfy, an algorithm is presented to refine the system model and label those states of the refined system where the property is violated. The problem of repeated diagnosis then requires determining, within a bounded delay, each time a "failure-state" is visited. For this analysis, the existing theory developed by Jiang et al. is used.
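The labeling idea in the Note to Practitioners can be illustrated with a deliberately simplified Python sketch. It is not the paper's algorithm. It assumes the invariant is a predicate over individual states, so no model refinement is needed, and it assumes the run is fully observed, whereas the diagnosis problem in the paper concerns detecting such visits within a bounded delay. The function names and the routing example are hypothetical.

```python
from typing import Callable, Hashable, Iterable, List, Set

State = Hashable


def label_failure_states(
    states: Iterable[State], invariant: Callable[[State], bool]
) -> Set[State]:
    """Return the states in which the invariant property is violated."""
    return {s for s in states if not invariant(s)}


def count_failure_visits(run: List[State], failure_states: Set[State]) -> int:
    """Count how many times a fully observed run visits a failure state."""
    return sum(1 for s in run if s in failure_states)


# Hypothetical routing example: a state is (part_type, station), and the
# assumed invariant says that parts of type "A" must never be at station 2.
all_states = [("A", 1), ("A", 2), ("B", 1), ("B", 2)]


def routing_invariant(state: State) -> bool:
    part_type, station = state
    return not (part_type == "A" and station == 2)


failure_states = label_failure_states(all_states, routing_invariant)
run = [("A", 1), ("A", 2), ("A", 1), ("A", 2)]
visits = count_failure_visits(run, failure_states)
```

Here the only labeled failure state is `("A", 2)`, and the sample run visits it twice, which corresponds to two occurrences of the repeatable routing failure.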
TL;DR: Common cause failures occur when several failures have the same origin. They are either common event failures, where the cause is a single external event, or common mode failures, where two systems fail in the same way for the same reason.
Abstract: A common cause failure occurs when several failures have the same origin. Common cause failures are either common event failures, where the cause is a single external event, or common mode failures, where two systems fail in the same way for the same reason. Common mode failures can occur at different times because of a design defect or a repeated external event. Common event failures reduce the reliability of on-line redundant systems but not of systems using off-line spare parts. Common mode failures reduce the dependability of systems using off-line spare parts and on-line redundancy.
TL;DR: A failure detector distinguishes intra-enterprise from inter-enterprise failures in multi-agent cooperative transactions. An intra-enterprise failure handler coupled to the detector performs recovery within the first enterprise, and an inter-enterprise failure handler, also coupled to the detector, performs recovery in a second enterprise to which a failure in the first enterprise has been transferred.
Abstract: A method and system for processing multi-agent cooperative transactions. A failure detector is provided for detecting whether a failure is an intra-enterprise failure or an inter-enterprise failure. An intra-enterprise failure handler is coupled to the failure detector for performing failure recovery for intra-enterprise failures. Failure recovery for intra-enterprise failures can include identifying the scope of failure recovery within a first enterprise. Once the scope of failure recovery has been identified, a top-down undo operation of sub-transactions in the identified scope may be performed within the first enterprise. An inter-enterprise failure handler is also coupled to the failure detector for performing failure recovery for inter-enterprise failures. Failure recovery for inter-enterprise failures can include identifying the scope of failure recovery in a second enterprise to which a failure in a first enterprise has been transferred. Once the scope of failure recovery has been identified, a top-down undo operation of sub-transactions in the identified scope may be performed in the second enterprise.
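The recovery flow described in this abstract can be sketched in Python under several assumptions that are not stated in the source. The scope of recovery is modeled as a subtree of sub-transactions, "top-down" is interpreted as undoing a parent before its children, and a failure counts as inter-enterprise when it is handled in a different enterprise from the one where it originated. All names (`SubTransaction`, `classify_failure`, `undo_top_down`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SubTransaction:
    """Hypothetical node in a tree of cooperating sub-transactions."""
    name: str
    children: List["SubTransaction"] = field(default_factory=list)
    undone: bool = False


def classify_failure(origin_enterprise: str, affected_enterprise: str) -> str:
    """Assumed detector rule: compare where the failure began and where it is handled."""
    if origin_enterprise == affected_enterprise:
        return "intra-enterprise"
    return "inter-enterprise"


def undo_top_down(scope_root: SubTransaction) -> List[str]:
    """Undo every sub-transaction in the scope, parents before children.

    Returns the names in the order they were undone.
    """
    order: List[str] = []
    stack = [scope_root]
    while stack:
        node = stack.pop()
        node.undone = True
        order.append(node.name)
        # Push children in reverse so they are undone left to right.
        stack.extend(reversed(node.children))
    return order


# Hypothetical scope: an order with two sub-transactions, one of them nested.
scope = SubTransaction("order", [
    SubTransaction("reserve", [SubTransaction("pick"), SubTransaction("pack")]),
    SubTransaction("invoice"),
])
undo_order = undo_top_down(scope)
```

For this tree the undo order is `order`, `reserve`, `pick`, `pack`, `invoice`. Depending on how the failure is classified, the same routine would be applied to the scope identified in the first enterprise or in the second.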