Debugging distributed systems
TL;DR: This article presents several key features and debugging challenges that differentiate distributed systems from other kinds of software.
Abstract: Distributed systems pose unique challenges for software developers. Reasoning about concurrent activities of system nodes and even understanding the system’s communication topology can be difficult...
Citations
Fault Analysis and Debugging of Microservice Systems: Industrial Survey, Benchmark System, and Empirical Study
TL;DR: The results show that the current industrial practices of microservice debugging can be improved by employing proper tracing and visualization techniques and strategies, and suggest that there is a strong need for more intelligent trace analysis and visualization for distributed systems.
335
Anomaly Detection and Failure Root Cause Analysis in (Micro) Service-Based Cloud Applications: A Survey
03 Feb 2022
TL;DR: In this paper, the authors provide a structured overview and qualitative analysis of currently available techniques for anomaly detection and root cause analysis in modern multi-service applications, and discuss open challenges and research directions stemming from the analysis.
Simulation for Robotics Test Automation: Developer Perspectives
Afsoon Afzal,Deborah S. Katz,Claire Le Goues,Christopher Steven Timperley +3 more
- 01 Apr 2021
TL;DR: In this paper, the authors conduct a survey of 82 robotics developers from a diversity of backgrounds, addressing the current capabilities and limits of simulation in practice, and find that simulation is used by 84% of their participants for testing, and that many participants want to use simulation as part of their test automation.
63
Promises and challenges of microservices: an exploratory study
TL;DR: In this paper, the authors conduct a broad, mixed-method study that includes in-depth interviews with 21 practitioners and a follow-up online survey with 37 respondents, covering 37 companies in total.
53
Visualizing Distributed System Executions
TL;DR: This article presents a novel approach for tackling three tasks frequently performed during analysis of distributed system executions: understanding the relative ordering of events, searching for specific patterns of interaction between hosts, and identifying structural similarities and differences between pairs of executions.
51
References
•Book
Concurrency Control and Recovery in Database Systems
Philip A. Bernstein,Vassos Hadzilacos,Nathan Goodman +2 more
- 01 Feb 1987
TL;DR: In this book, the design and implementation of concurrency control and recovery mechanisms for transaction management in centralized and distributed database systems are described.
4.2K
Spanner: Google’s Globally Distributed Database
James C. Corbett,Jeffrey Dean,Michael James Boyer Epstein,Andrew Fikes,Christopher Frost,J. J. Furman,Sanjay Ghemawat,Andrey Gubarev,Christopher Heiser,Peter Hochschild,Wilson C. Hsieh,Sebastian Kanthak,Eugene Kogan,Hongyi Li,Alexander Lloyd,Sergey Melnik,David Mwaura,David Nagle,Sean Quinlan,Rajesh Rao,Lindsay Rolig,Yasushi Saito,Michal Piotr Szymaniak,Chris Jorgen Taylor,Ruth Wang,Dale Woodford +25 more
TL;DR: Spanner is Google's scalable, multiversion, globally distributed, and synchronously replicated database; it is the first system to distribute data at global scale and support externally consistent distributed transactions.
672
Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
Benjamin H. Sigelman,Luiz Andre Barroso,Mike Burrows,Pat Stephenson,Manoj Plakal,Donald Beaver,Saul Jaspan,Chandan Shanbhag +7 more
- 01 Jan 2010
TL;DR: This paper introduces the design of Dapper, Google’s production distributed systems tracing infrastructure, and describes how its design goals of low overhead, application-level transparency, and ubiquitous deployment on a very large scale system were met.
Spanner: Google's globally-distributed database
Brian F. Cooper
- 30 Jun 2013
TL;DR: The design and implementation of Spanner are discussed, along with some of the lessons learned along the way and some open challenges in building scalable distributed storage systems.
460
How Amazon Web Services uses formal methods
Chris Newcombe,Timothy Andrew Rath,Fan Zhang,Bogdan Munteanu,Marc John Brooker,Michael Benjamin Deardeuff +5 more
TL;DR: Engineers use TLA+ to prevent serious but subtle bugs from reaching production and find ways to reduce the number of bugs in the final product.
370