TL;DR: Using static vector clocks, this paper can drastically improve the efficiency of existing MHP analyses, without loss of precision: the performance speedup can be up to 1828X, with a much smaller memory footprint (reduced by up to 150X).
Abstract: May-Happen-in-Parallel (MHP) analysis computes whether two statements in a multi-threaded program may execute concurrently or not. It works as a basis for many analyses and optimization techniques of concurrent programs. This paper proposes a novel approach for MHP analysis, by statically computing vector clocks. Static vector clocks extend the classic vector clocks algorithm to handle the complex control flow structures in static analysis, and we have developed an efficient context-sensitive algorithm to compute them. To the best of our knowledge, this is the first attempt to compute vector clocks statically. Using static vector clocks, we can drastically improve the efficiency of existing MHP analyses, without loss of precision: the performance speedup can be up to 1828X, with a much smaller memory footprint (reduced by up to 150X). We have implemented our analysis in a static data race detector, and experimental results show that our MHP analysis can help remove up to 88% of spurious data race pairs.
TL;DR: The experimental findings prove that MC-CChecker not only effectively detects memory consistency errors as MC-Checker did, but also completely eliminates the potential source of false positives, which is a major limitation of MC-Checker, while still retaining acceptable overheads of execution time and memory usage for DN-Analyzer.
Abstract: MPI one-sided communication decouples data movement from synchronization, which eliminates overhead from unneeded synchronization and allows for greater concurrency. On the one hand, this is the great advantage of MPI one-sided communication, but on the other, it poses enormous challenges for programmers in preserving the reliability of programs. Memory consistency errors are notorious for degrading the reliability as well as the performance of MPI one-sided applications. Even an MPI expert can easily make these mistakes. The lockopts bug, which occurred in an RMA test case that is part of the MPICH MPI implementation, is an example of this situation. Hence, detecting memory consistency errors is extremely challenging. MC-Checker is the most cutting-edge debugger to address these errors effectively. MC-Checker tackles memory consistency errors based on the happened-before relation. Taking full advantage of the relation makes it difficult for the DN-Analyzer of MC-Checker to scale well. For that reason, MC-Checker ignores the transitive ordering of the happened-before relation to retain the scalability of DN-Analyzer. Consequently, MC-Checker introduces a potential source of false positives. In order to overcome this issue, we present a novel clock-based approach called MC-CChecker, with the aim of fully preserving the happened-before relation by making use of an encoded vector clock. MC-CChecker inherits distinguishing features from MC-Checker by reusing ST-Analyzer and Profiler while focusing mainly on the optimization of DN-Analyzer. The experimental findings prove that MC-CChecker not only effectively detects memory consistency errors as MC-Checker did, but also completely eliminates the potential source of false positives, which is a major limitation of MC-Checker, while still retaining acceptable overheads of execution time and memory usage for DN-Analyzer. In particular, the DN-Analyzer of MC-CChecker is fairly scalable when processing a large number of trace files generated from running lockopts on up to 8192 processes.
TL;DR: A two-phase method derives vector clock information for a wide range of concurrency features, including all of the message-passing features in Go; the authors build a prototype for the Go programming language and provide empirical evidence of the usefulness of the method.
Abstract: Understanding the runtime behavior of concurrent programs is a challenging task. A popular approach is to establish a happens-before relation via vector clocks. Thus, we can identify bugs and performance bottlenecks, for example, by checking if two conflicting events may happen concurrently. We employ a two-phase method to derive vector clock information for a wide range of concurrency features that includes all of the message-passing features in Go. The first phase (instrumentation and tracing) yields a runtime trace that records all events related to message-passing concurrency that took place. The second phase (trace replay) is carried out offline and replays the recorded traces to infer vector clock information. Trace replay operates on thread-local traces. Thus, we can observe behavior that might result from some alternative schedule. Our approach is not tied to any specific language. We have built a prototype for the Go programming language and provide empirical evidence of the usefulness of our method.
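The trace-replay phase described above can be illustrated with a minimal sketch. It is not the paper's implementation: the event format (`"local"`, `"send"`, `"recv"` with a message name), the asynchronous message model, and the function names `replay`, `happens_before`, and `concurrent` are all assumptions made for illustration. It replays per-thread traces offline, only letting a receive proceed once its matching send has been replayed, and assigns a vector clock to every event.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, str]  # (kind, name); kind is "local", "send", or "recv"
Clock = Dict[str, int]


def replay(traces: Dict[str, List[Event]]) -> Dict[Tuple[str, int], Clock]:
    """Replay thread-local traces; return a vector clock per (thread, index)."""
    threads = list(traces)
    clock = {t: {u: 0 for u in threads} for t in threads}
    pos = {t: 0 for t in threads}
    sent: Dict[str, Clock] = {}  # message name -> sender's clock at the send
    stamps: Dict[Tuple[str, int], Clock] = {}
    progress = True
    while progress:
        progress = False
        for t in threads:
            while pos[t] < len(traces[t]):
                kind, name = traces[t][pos[t]]
                if kind == "recv" and name not in sent:
                    break  # matching send not replayed yet; try other threads
                clock[t][t] += 1  # every event advances the thread's own entry
                if kind == "recv":
                    for u in threads:  # merge the sender's knowledge
                        clock[t][u] = max(clock[t][u], sent[name][u])
                elif kind == "send":
                    sent[name] = dict(clock[t])
                stamps[(t, pos[t])] = dict(clock[t])
                pos[t] += 1
                progress = True
    return stamps


def happens_before(a: Clock, b: Clock) -> bool:
    """True if a is componentwise <= b and the two clocks differ."""
    return a != b and all(a[k] <= b[k] for k in a)


def concurrent(a: Clock, b: Clock) -> bool:
    return not happens_before(a, b) and not happens_before(b, a)


# Hypothetical recorded traces for two threads A and B.
traces = {
    "A": [("local", "x=1"), ("send", "m1"), ("local", "y=2")],
    "B": [("local", "z=3"), ("recv", "m1"), ("local", "w=4")],
}
stamps = replay(traces)
```

With these traces, A's first event is ordered before B's last event through message `m1`, while A's last event and B's last event are unordered, so a conflict between them would be reported as potentially concurrent.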
TL;DR: In this paper, the authors propose a wait-free vector clock algorithm, which is guaranteed to complete its operations within a finite number of steps and to recover after the occurrence of transient faults, such as soft errors and arbitrary violations of the assumptions according to which the system was designed to behave.
Abstract: Vector clock algorithms are basic wait-free building blocks that facilitate causal ordering of events. As wait-free algorithms, they are guaranteed to complete their operations within a finite number of steps. Stabilizing algorithms allow the system to recover after the occurrence of transient faults, such as soft errors and arbitrary violations of the assumptions according to which the system was designed to behave.
TL;DR: This work proposes the encoding of the vector clock using prime numbers to use a single number to represent vector time, and shows how to timestamp global states and how to perform operations on the global states using the EVC.
Abstract: The vector clock is a fundamental tool for tracking causality in distributed applications. Unfortunately, it does not scale well to large systems because each process needs to maintain a vector of size n, where n is the total number of processes in the system. To address this problem, we propose the encoding of the vector clock using prime numbers to use a single number to represent vector time. We propose the operations on the encoded vector clock (EVC). We then show how to timestamp global states and how to perform operations on the global states using the EVC. We also discuss scalability issues of the EVC.
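A small sketch may make the prime-number encoding concrete. The abstract does not spell out the operations, so the ones below are assumptions based on the usual reading of the idea: process i owns a distinct prime, a local tick multiplies the timestamp by that prime, a merge takes the least common multiple, and one timestamp causally precedes another when it properly divides it. The names `PRIMES`, `evc_tick`, `evc_merge`, `evc_happens_before`, and `evc_decode` are hypothetical.

```python
from math import gcd
from typing import List

PRIMES = [2, 3, 5]  # assumed assignment: process i owns PRIMES[i]


def evc_tick(evc: int, pid: int) -> int:
    """Local event at process pid: raise the exponent of its prime by one."""
    return evc * PRIMES[pid]


def evc_merge(local: int, received: int) -> int:
    """Componentwise max of the encoded vectors, i.e. the least common multiple."""
    return local * received // gcd(local, received)


def evc_happens_before(a: int, b: int) -> bool:
    """a precedes b if every exponent in a is <= the one in b and a != b."""
    return a != b and b % a == 0


def evc_decode(evc: int, n: int) -> List[int]:
    """Recover the plain vector clock (prime exponents) for n processes."""
    vec = []
    for p in PRIMES[:n]:
        k = 0
        while evc % p == 0:
            evc //= p
            k += 1
        vec.append(k)
    return vec


# Process 0 performs two events; process 1 performs one, then receives from 0.
t0 = evc_tick(evc_tick(1, 0), 0)       # encodes [2, 0, 0]
t1 = evc_tick(1, 1)                    # encodes [0, 1, 0]
t1_recv = evc_tick(evc_merge(t1, t0), 1)  # merge, then tick: [2, 2, 0]
```

The encoded value grows multiplicatively with every event, which is one way to see why the abstract singles out scalability of the EVC as an issue; Python's arbitrary-precision integers hide that cost in this sketch.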
TL;DR: A distributed unfolding technique is extended in order to annotate the resulting unfolding with time stamps, which allows for distributed construction of distributed testers for distributed systems.
TL;DR: Nemo, a NUMA-aware Transactional Memory (TM) design and implementation optimized for promoting scalability in applications running on top of NUMA architectures, is presented.
Abstract: In this paper we present Nemo, a NUMA-aware Transactional Memory (TM) design and implementation optimized for promoting scalability in applications running on top of NUMA architectures. Nemo deploys a hybrid design where conflicting threads alternate the usage of single timestamps and vector clocks to identify inconsistent executions depending upon the source of conflict. We assessed the performance of Nemo by using both synthetic and well-known OLTP transactional workloads. Our approach offers improvements over the six state-of-the-art competitors we implemented.
TL;DR: In this article, the authors employ a two-phase method to derive vector clock information for a wide range of concurrency features that includes all of the message-passing features in Go.
Abstract: Understanding the run-time behavior of concurrent programs is a challenging task. A popular approach is to establish a happens-before relation via vector clocks. Thus, we can identify bugs and performance bottlenecks, for example, by checking if two conflicting events may happen concurrently. We employ a two-phase method to derive vector clock information for a wide range of concurrency features that includes all of the message-passing features in Go. The first phase (instrumentation and tracing) yields a run-time trace that records all events related to message-passing concurrency that took place. The second phase (trace replay) is carried out offline and replays the recorded traces to infer vector clock information. Trace replay operates on thread-local traces. Thus, we can observe behavior that might result from some alternative schedule. Our approach is not tied to any specific language. We have built a prototype for the Go programming language and provide empirical evidence of the usefulness of our method.
TL;DR: A model for determining the bounds on the size of HVC is developed in terms of a delay differential equation, and it is shown that for many practical applications and deployment environments in Amazon EC2, the size remains only a couple of entries, substantially less than $n$.
Abstract: Hybrid vector clocks (HVC) provide a mechanism to combine the theory and practice of distributed systems. Improving on traditional vector clocks (VC), HVC utilizes synchronized physical clocks to reduce the size by focusing only on causality where the physical time associated with two events is within a given uncertainty window $\epsilon$ and letting the physical clock alone determine the order of events that are outside the uncertainty window. In this paper, we develop a model for determining the bounds on the size of HVC. Our model uses four parameters: $\epsilon$, the uncertainty window; $\delta$, the message delay; $\alpha$, the communication frequency; and $n$, the number of nodes in the system. We derive the size of HVC in terms of a delay differential equation, and show that the size predicted by our model is almost identical to the results obtained by simulation. We also identify closed-form solutions that provide tight lower and upper bounds for useful special cases. We show that for many practical applications and deployment environments in Amazon EC2, the size of HVC remains only a couple of entries, substantially less than $n$. Finally, although the analytical results rely on a specific communication pattern, they are useful in evaluating the size of HVC in different communication scenarios.
TL;DR: Veca is a leaderless consensus algorithm for state machine replication for which all replicas can commit commands concurrently at any time, and each command can be committed after just one round of communication with a majority of replicas in the normal case.
Abstract: Consensus algorithms are fundamental to distributed systems. They require high throughput, low latency and high availability. However, to the best of our knowledge, these requirements cannot be satisfied well at the same time in any existing consensus algorithm. In this paper, we propose Veca, a consensus algorithm for state machine replication that tries to satisfy the requirements at the same time as well as possible. Veca is a leaderless consensus algorithm for which all replicas can commit commands concurrently at any time, and each command can be committed after just one round of communication with a majority of replicas in the normal case. Veca separates agreement from ordering and execution, which allows all replicas to commit commands concurrently without determining their order, while tracking their dependencies using vector clocks. A subsequent replay phase then assigns an order to the commands and executes them in that order. Commands are committed out of order and then executed in the same order by all replicas. A replica can take the initiative to learn the decision for an instance using a failure recovery protocol. The leaderless design lets systems built with Veca provide continuous service as long as more than half of the replicas are available. We show that Veca has higher throughput, lower latency and higher availability than several typical consensus algorithms. The correctness of Veca has been proved theoretically and its advantages are demonstrated through an experimental evaluation in LAN and WAN.
TL;DR: Zipper is an alternative to the naive transitive-closure computation on the happens-before relation; it identifies the presence or absence of data race in asymptotically superior time, is optimized for lightweight threads, and, in the presence of many threads, has superior time complexity to leading vector clock algorithms.
Abstract: Task parallel programs that are free of data race are guaranteed to be deterministic, serializable, and free of deadlock. Techniques for verification of data race freedom vary in both accuracy and asymptotic complexity. One work is particularly well suited to task parallel programs with isolation and lightweight threads. It uses the Java Pathfinder model checker to reason about different schedules and proves the presence or absence of data race in a program on a fixed input. However, it uses a direct and inefficient transitive closure on the happens-before relation to reason about data race. This paper presents Zipper, an alternative to this naive algorithm, which identifies the presence or absence of data race in asymptotically superior time. Zipper is optimized for lightweight threads and, in the presence of many threads, has superior time complexity to leading vector clock algorithms. This paper includes an empirical study of Zipper and a comparison against the naive computation graph algorithm, demonstrating the superior performance it achieves.
TL;DR: This paper discusses the logical clock approach for distributed systems, in which processes exchange information by passing messages over a communication network.
Abstract: A distributed system consists of systems connected by a network. The network provides the facility of information exchange between the systems. Client-server systems are the basic concept for this type of system. In a distributed system, the communication delay is finite but unpredictable. Processes do not share common memory, and communication happens by passing messages over the communication network. Message failure is also a common problem in distributed systems. The network data link layer resolves this problem with various algorithms, such as (1) Stop and Wait, (2) Stop and Wait for noisy channels, (3) Automatic Repeat Request and Positive Acknowledgement with Retransmission (ARQ or PAR), (4) sliding window protocols, (5) the Go-Back-N protocol, and (6) the selective repeat protocol. In this paper, however, we discuss the logical clock approach for distributed systems.
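As an illustration of the logical clock approach in a message-passing setting without shared memory, here is a minimal sketch of a Lamport-style scalar clock. The class name `LamportClock` and its method names are chosen for illustration and do not come from the paper.

```python
class LamportClock:
    """Scalar logical clock for one process (illustrative sketch)."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Local event: advance the clock by one."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Sending is an event; the returned value travels with the message."""
        return self.tick()

    def receive(self, msg_time: int) -> int:
        """On receipt, jump past both the local time and the message's time."""
        self.time = max(self.time, msg_time) + 1
        return self.time


# Two processes p and q; p sends one message to q.
p, q = LamportClock(), LamportClock()
p.tick()                      # p: 1
stamp = p.send()              # p: 2, message carries 2
q.tick()                      # q: 1
q_after = q.receive(stamp)    # q: max(1, 2) + 1 = 3
```

The receive rule guarantees that the receipt is timestamped later than the send, regardless of how long the message was delayed in the network.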
TL;DR: In this paper, a schedulable happens-before (SHB) partial order is proposed for dynamic race detection, where every pair of conflicting data accesses that are identified by SHB can be scheduled.
Abstract: Dynamic race detection is the problem of determining if an observed program execution reveals the presence of a data race in a program. The classical approach to solving this problem is to detect if there is a pair of conflicting memory accesses that are unordered by Lamport’s happens-before (HB) relation. HB based race detection is known to not report false positives, i.e., it is sound. However, the soundness guarantee of HB only promises that the first pair of unordered, conflicting events is a schedulable data race. That is, there can be pairs of HB-unordered conflicting data accesses that are not schedulable races because there is no reordering of the events of the execution, where the events in race can be executed immediately after each other. We introduce a new partial order, called schedulable happens-before (SHB) that exactly characterizes the pairs of schedulable data races — every pair of conflicting data accesses that are identified by SHB can be scheduled, and every HB-race that can be scheduled is identified by SHB. Thus, the SHB partial order is truly sound. We present a linear time, vector clock algorithm to detect schedulable races using SHB. Our experiments demonstrate the value of our algorithm for dynamic race detection — SHB incurs only little performance overhead and can scale to executions from real-world software applications without compromising soundness.
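For context, the classical HB-based check that SHB refines can be sketched with vector clocks as follows. This is a simplified detector for plain happens-before, not the SHB algorithm from the paper; the trace format `(thread, op, target)` with ops `"acq"`, `"rel"`, `"r"`, `"w"` and the function name `hb_races` are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Clock = Dict[str, int]


def leq(a: Clock, b: Clock) -> bool:
    """Componentwise a <= b, treating missing entries as 0."""
    return all(v <= b.get(k, 0) for k, v in a.items())


def join(into: Clock, other: Clock) -> None:
    """In-place componentwise maximum."""
    for k, v in other.items():
        into[k] = max(into.get(k, 0), v)


def hb_races(trace: List[Tuple[str, str, str]]) -> List[Tuple[int, str]]:
    """Return (event index, variable) for accesses unordered with a prior conflict."""
    C: Dict[str, Clock] = {}                 # per-thread clocks
    L: Dict[str, Clock] = defaultdict(dict)  # clock at last release of each lock
    W: Dict[str, Clock] = defaultdict(dict)  # last write per thread, per variable
    R: Dict[str, Clock] = defaultdict(dict)  # last read per thread, per variable
    races = []
    for i, (t, op, target) in enumerate(trace):
        ct = C.setdefault(t, {t: 1})
        if op == "acq":
            join(ct, L[target])          # inherit order from the last release
        elif op == "rel":
            L[target] = dict(ct)
            ct[t] += 1
        elif op == "r":
            if not leq(W[target], ct):   # some earlier write is unordered
                races.append((i, target))
            R[target][t] = ct[t]
        elif op == "w":
            if not (leq(W[target], ct) and leq(R[target], ct)):
                races.append((i, target))
            W[target][t] = ct[t]
    return races


# y is protected by lock l; x is written by both threads outside the lock.
trace = [
    ("T1", "acq", "l"), ("T1", "w", "y"), ("T1", "rel", "l"), ("T1", "w", "x"),
    ("T2", "acq", "l"), ("T2", "r", "y"), ("T2", "rel", "l"), ("T2", "w", "x"),
]
races = hb_races(trace)
```

Here the accesses to `y` are ordered through the lock, while T2's write to `x` is unordered with T1's write and is flagged. As the abstract notes, under plain HB only the first reported pair is guaranteed to be a schedulable race; handling later pairs soundly is what SHB adds.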