Journal Article • DOI: 10.1145/3375633
Visualizing Distributed System Executions
TL;DR: This article presents a novel approach for tackling three tasks frequently performed during analysis of distributed system executions: understanding the relative ordering of events, searching for specific patterns of interaction between hosts, and identifying structural similarities and differences between pairs of executions.
Abstract: Distributed systems pose unique challenges for software developers. Understanding the system’s communication topology and reasoning about concurrent activities of system hosts can be difficult. The standard approach, analyzing system logs, can be a tedious and complex process that involves reconstructing a system log from multiple hosts’ logs, reconciling timestamps among hosts with non-synchronized clocks, and understanding what took place during the execution encoded by the log. This article presents a novel approach for tackling three tasks frequently performed during analysis of distributed system executions: (1) understanding the relative ordering of events, (2) searching for specific patterns of interaction between hosts, and (3) identifying structural similarities and differences between pairs of executions. Our approach consists of XVector, which instruments distributed systems to capture partial ordering information that encodes the happens-before relation between events, and ShiViz, which processes the resulting logs and presents distributed system executions as interactive time-space diagrams. Two user studies with a total of 109 students and a case study with 2 developers showed that our method was effective, helping participants answer statistically significantly more system-comprehension questions correctly, with a very large effect size.
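The partial ordering mentioned in the abstract is commonly captured with vector clocks. The following is a minimal sketch of that general technique, not XVector's actual API: the function names (`tick`, `merge`, `happens_before`, `concurrent`) and the host names are hypothetical and chosen only for illustration.

```python
from typing import Dict

# A vector clock maps each host name to a counter (missing hosts count as 0).
VectorClock = Dict[str, int]


def tick(clock: VectorClock, host: str) -> VectorClock:
    """Return a copy of `clock` with `host`'s own counter incremented.

    Applied for every local event, including sending a message.
    """
    updated = dict(clock)
    updated[host] = updated.get(host, 0) + 1
    return updated


def merge(local: VectorClock, received: VectorClock, host: str) -> VectorClock:
    """Receive rule: component-wise maximum of both clocks, then tick the receiver."""
    hosts = set(local) | set(received)
    merged = {h: max(local.get(h, 0), received.get(h, 0)) for h in hosts}
    return tick(merged, host)


def happens_before(a: VectorClock, b: VectorClock) -> bool:
    """True if a <= b in every component and a < b in at least one."""
    hosts = set(a) | set(b)
    no_greater = all(a.get(h, 0) <= b.get(h, 0) for h in hosts)
    strictly_less = any(a.get(h, 0) < b.get(h, 0) for h in hosts)
    return no_greater and strictly_less


def concurrent(a: VectorClock, b: VectorClock) -> bool:
    """True if neither timestamp happens before the other.

    Assumes `a` and `b` are timestamps of two distinct events.
    """
    return not happens_before(a, b) and not happens_before(b, a)


# Hypothetical three-host execution.
send = tick({}, "alice")           # alice sends a message
recv = merge({}, send, "bob")      # bob receives it
other = tick({}, "carol")          # carol does unrelated local work

print(happens_before(send, recv))  # True: the send precedes the receive
print(concurrent(recv, other))     # True: bob and carol never communicated
```

Comparing timestamps this way orders events without relying on synchronized wall-clock time, which is the property a time-space diagram needs in order to draw events and message edges consistently.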
Citations
Fault Injection Analytics: A Novel Approach to Discover Failure Modes in Cloud-Computing Systems
TL;DR: In this paper, the authors apply unsupervised machine learning to execution traces of the injected system to ease the discovery and interpretation of failure modes, and evaluate the approach in fault injection experiments on the OpenStack cloud computing platform.
Theia: Visual Signatures for Problem Diagnosis in Large Hadoop Clusters.
Elmer Garduno, Soila Kavulya, Jiaqi Tan, Rajeev Gandhi, Priya Narasimhan
- 01 Jan 2013
TL;DR: Theia analyzes application-level logs in a Hadoop cluster and generates visual signatures of each job's performance, providing compact representations of task durations, task status, and data consumption by jobs.
23
Fault Injection Analytics: A Novel Approach to Discover Failure Modes in Cloud-Computing Systems
TL;DR: A new paradigm (fault injection analytics) that applies unsupervised machine learning on execution traces of the injected system, to ease the discovery and interpretation of failure modes with a low computational cost is introduced.
16
An Interview Study of how Developers use Execution Logs in Embedded Software Engineering
Nan Yang, Pieter J. L. Cuijpers, Ramon R. H. Schiffelers, Johan Lukkien, Alexander Serebrenik
- 05 Jan 2021
TL;DR: In this article, the authors explore the types of logs developers analyze, the purposes for which they analyze them, the information they need from logs, and their expectations of tool support. The main finding is that lack of domain knowledge, lack of familiarity with the code base and software design, and the presence of concurrency raise major challenges in log analysis for complex and multidisciplinary systems.
15
Compiling Distributed System Models with PGo
Finn Hackett, Shayan Hosseini, Renato Mascarenhas Costa, Matthew Do, Ivan Beschastnikh
- 27 Jan 2023
TL;DR: This paper introduces Modular PlusCal (MPCal), a language that extends PlusCal by cleanly separating the model of a system from the model of its environment, and presents PGo, a compiler toolchain that automatically translates MPCal models to TLA+ for model checking and also compiles them to runnable Go code.
12
References
Time, clocks, and the ordering of events in a distributed system
TL;DR: In this article, the concept of one event happening before another in a distributed system is examined, and a distributed algorithm is given for synchronizing a system of logical clocks which can be used to totally order the events.
Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems
Antony Rowstron, Peter Druschel
TL;DR: Pastry is a scalable, distributed object location and routing substrate for wide-area peer-to-peer applications; it performs application-level routing and object location in a potentially very large overlay network of nodes connected via the Internet.
Statecharts: A visual formalism for complex systems
TL;DR: It is intended to demonstrate here that statecharts counter many of the objections raised against conventional state diagrams, and thus appear to render specification by diagrams an attractive and plausible approach.
7.5K
•Book
Concurrency Control and Recovery in Database Systems
Philip A. Bernstein, Vassos Hadzilacos, Nathan Goodman
- 01 Feb 1987
TL;DR: This book describes the design and implementation of concurrency control and recovery mechanisms for transaction management in centralized and distributed database systems. But this can lead to interference between queries and updates.
4.2K