TL;DR: A comprehensive overview of a broad spectrum of fault localization techniques, each of which aims to streamline the fault localization process and make it more effective by attacking the problem in a unique way is provided.
Abstract: Software fault localization, the act of identifying the locations of faults in a program, is widely recognized to be one of the most tedious, time consuming, and expensive – yet equally critical – activities in program debugging. Due to the increasing scale and complexity of software today, manually locating faults when failures occur is rapidly becoming infeasible, and consequently, there is a strong demand for techniques that can guide software developers to the locations of faults in a program with minimal human intervention. This demand in turn has fueled the proposal and development of a broad spectrum of fault localization techniques, each of which aims to streamline the fault localization process and make it more effective by attacking the problem in a unique way. In this article, we catalog and provide a comprehensive overview of such techniques and discuss key issues and concerns that are pertinent to software fault localization as a whole.
TL;DR: This work proposes a new technique that utilizes the wealth of bug fixes across projects in their development history to effectively guide and drive a program repair process, and can produce good-quality fixes for many more bugs as compared to the baselines while beingreasonably computationally efficient.
Abstract: Effective automated program repair techniques have great potential to reduce the costs of debugging and maintenance. Previously proposed automated program repair (APR) techniques often follow a generate-and-validate and test-case-driven procedure: They first randomly generate a large pool of fix candidates and then exhaustively validate the quality of the candidates by testing them against existing or provided test suites. Unfortunately, many real-world bugs cannot be repaired by existing techniques even after more than 12 hours of computation in a multi-core cloud environment. More work is needed to advance the capabilities of modern APR techniques. We propose a new technique that utilizes the wealth of bug fixesacross projects in their development history to effectively guide and drive a programrepair process. Our main insight is that recurring bug fixes are common inreal-world applications, and that previously-appearing fix patterns canprovide useful guidance to an automated repair technique. Based on this insight, our technique first automaticallymines bug fix patterns from the history of many projects. We then employ existingmutation operators to generate fix candidates for a given buggy program. Candidates that match frequently occurring historical bug fixes are consideredmore likely to be relevant, and we thus give them priority inthe random search process. Finally, candidates thatpass all the previously failed test cases are recommended as likely fixes. We compare our technique against existinggenerate-and-validate and test-driven APR approaches using 90 bugs from 5 Javaprograms. The experiment results show that our technique can producegood-quality fixes for many more bugs as compared to the baselines, while beingreasonably computationally efficient: it takes less than 20minutes, on average, to correctly fix a bug.
TL;DR: This book is the definitive guide to KeY that lets you explore the full potential of deductive software verification in practice and contains the complete theory behind KeY for active researchers who want to understand it in depth or use it in their own work.
Abstract: Static analysis of software with deductive methods is a highly dynamic field of research on the verge of becoming a mainstream technology in software engineering. It consists of a large portfolio of - mostly fully automated - analyses: formal verification, test generation, security analysis, visualization, and debugging. All of them are realized in the state-of-art deductive verification framework KeY. This book is the definitive guide to KeY that lets you explore the full potential of deductive software verification in practice. It contains the complete theory behind KeY for active researchers who want to understand it in depth or use it in their own work. But the book also features fully self-contained chapters on the Java Modeling Language and on Using KeY that require nothing else than familiarity with Java. All other chapters are accessible for graduate students (M.Sc. level and beyond). The KeY framework is free and open software, downloadable from the book companion website which contains also all code examples mentioned in this book.
TL;DR: To the best of the knowledge, ClickNP is the first FPGA-accelerated platform for NFs, written completely in high-level language and achieving 40 Gbps line rate at any packet size and reducing latency by 10x.
Abstract: Highly flexible software network functions (NFs) are crucial components to enable multi-tenancy in the clouds. However, software packet processing on a commodity server has limited capacity and induces high latency. While software NFs could scale out using more servers, doing so adds significant cost. This paper focuses on accelerating NFs with programmable hardware, i.e., FPGA, which is now a mature technology and inexpensive for datacenters. However, FPGA is predominately programmed using low-level hardware description languages (HDLs), which are hard to code and difficult to debug. More importantly, HDLs are almost inaccessible for most software programmers. This paper presents ClickNP, a FPGA-accelerated platform for highly flexible and high-performance NFs with commodity servers. ClickNP is highly flexible as it is completely programmable using high-level C-like languages, and exposes a modular programming abstraction that resembles Click Modular Router. ClickNP is also high performance. Our prototype NFs show that they can process traffic at up to 200 million packets per second with ultra-low latency ($
TL;DR: An IR-based approach Locus is proposed to locate bugs using software changes, which offer finer granularity than files and provide important contextual clues for bug-fixing, and it is shown that Locus outperforms existing techniques at the source file level localization significantly.
Abstract: Various information retrieval (IR) based techniques have been proposed recently to locate bugs automatically at the file level. However, their usefulness is often compromised by the coarse granularity of files and the lack of contextual information. To address this, we propose to locate bugs using software changes, which offer finer granularity than files and provide important contextual clues for bug-fixing. We observe that bug inducing changes can facilitate the bug fixing process. For example, it helps triage the bug fixing task to the developers who committed the bug inducing changes or enables developers to fix bugs by reverting these changes. Our study further identifies that change logs and the naturally small granularity of changes can help boost the performance of IR-based bug localization. Motivated by these observations, we propose an IR-based approach Locus to locate bugs from software changes, and evaluate it on six large open source projects. The results show that Locus outperforms existing techniques at the source file level localization significantly. MAP and MRR in particular have been improved, on average, by 20.1% and 20.5%, respectively. Locus is also capable of locating the inducing changes within top 5 for 41.0% of the bugs. The results show that Locus can significantly reduce the number of lines needing to be scanned to locate the bug compared with existing techniques.
TL;DR: A set of multi-level GUI Comparison Criteria (GUICC) that provides the selection of multiple abstraction levels for GUI model generation and can alleviate the inherent state explosion problems of existing a single-level GuICC for behavior modeling of real-world Android apps by flexibly manipulating GUICC.
Abstract: Automated Graphical User Interface (GUI) testing is one of the most widely used techniques to detect faults in mobile applications (apps) and to test functionality and usability. GUI testing exercises behaviors of an application under test (AUT) by executing events on GUIs and checking whether the app behaves correctly. In particular, because Android leads in market share of mobile OS platforms, a lot of research on automated Android GUI testing techniques has been performed. Among various techniques, we focus on model-based Android GUI testing that utilizes a GUI model for systematic test generation and effective debugging support. Since test inputs are generated based on the underlying model, accurate GUI modeling of an AUT is the most crucial factor in order to generate effective test inputs. However, most modern Android apps contain a number of dynamically constructed GUIs that make accurate behavior modeling more challenging. To address this problem, we propose a set of multi-level GUI Comparison Criteria (GUICC) that provides the selection of multiple abstraction levels for GUI model generation. By using multilevel GUICC, we conducted empirical experiments to identify the influence of GUICC on testing effectiveness. Results show that our approach, which performs model-based testing with multi-level GUICC, achieved higher effectiveness than activity-based GUI model generation. We also found that multi-level GUICC can alleviate the inherent state explosion problems of existing a single-level GUICC for behavior modeling of real-world Android apps by flexibly manipulating GUICC.
TL;DR: This paper proposes a new approach - Bugram - that leverages n-gram language models instead of rules to detect bugs, and suggests that Bugram is complementary to existing rule-based bug detection approaches.
Abstract: To improve software reliability, many rule-based techniques have been proposed to infer programming rules and detect violations of these rules as bugs. These rule-based approaches often rely on the highly frequent appearances of certain patterns in a project to infer rules. It is known that if a pattern does not appear frequently enough, rules are not learned, thus missing many bugs. In this paper, we propose a new approach—Bugram—that leverages n-gram language models instead of rules to detect bugs. Bugram models program tokens sequentially, using the n-gram language model. Token sequences from the program are then assessed according to their probability in the learned model, and low probability sequences are marked as potential bugs. The assumption is that low probability token sequences in a program are unusual, which may indicate bugs, bad practices, or unusual/special uses of code of which developers may want to be aware. We evaluate Bugram in two ways. First, we apply Bugram on the latest versions of 16 open source Java projects. Results show that Bugram detects 59 bugs, 42 of which are manually verified as correct, 25 of which are true bugs and 17 are code snippets that should be refactored. Among the 25 true bugs, 23 cannot be detected by PR-Miner. We have reported these bugs to developers, 7 of which have already been confirmed by developers (4 of them have already been fixed), while the rest await confirmation. Second, we further compare Bugram with three additional graph- and rule-based bug detection tools, i.e., JADET, Tikanga, and GrouMiner. We apply Bugram on 14 Java projects evaluated in these three studies. Bugram detects 21 true bugs, at least 10 of which cannot be detected by these three tools. Our results suggest that Bugram is complementary to existing rule-based bug detection approaches.
TL;DR: New tools that check for violation of the invariant online and visualize scheduling activity are built that are simple, easily portable across kernel versions, and run with a negligible overhead to help keep this type of bug at bay.
Abstract: As a central part of resource management, the OS thread scheduler must maintain the following, simple, invariant: make sure that ready threads are scheduled on available cores. As simple as it may seem, we found that this invariant is often broken in Linux. Cores may stay idle for seconds while ready threads are waiting in runqueues. In our experiments, these performance bugs caused many-fold performance degradation for synchronization-heavy scientific applications, 13% higher latency for kernel make, and a 14-23% decrease in TPC-H throughput for a widely used commercial database. The main contribution of this work is the discovery and analysis of these bugs and providing the fixes. Conventional testing techniques and debugging tools are ineffective at confirming or understanding this kind of bugs, because their symptoms are often evasive. To drive our investigation, we built new tools that check for violation of the invariant online and visualize scheduling activity. They are simple, easily portable across kernel versions, and run with a negligible overhead. We believe that making these tools part of the kernel developers' tool belt can help keep this type of bug at bay.
TL;DR: This study finds that both the accurate and mediocre spectra-based fault localization tools can help professionals to save their debugging time, and the improvements are statistically significant and substantial.
Abstract: Due to the complexity of software systems, bugs are inevitable. Software debugging is tedious and time consuming. To help developers perform this crucial task, a number of spectra-based fault localization techniques have been proposed. In general, spectra-based fault localization helps developers to find the location of a bug given its symptoms (e.g., program failures). A previous study by Parnin and Orso however implies that several assumptions made by existing work on spectra-based fault localization do not hold in practice, which hinders the practical usage of these tools. Moreover, a recent study by Xie et al. claims that spectra-based fault localization can potentially "weaken programmers' abilities in fault detection".Unfortunately, these studies are performed either using only 2 bugs from small systems (Parnin and Orso's study) or synthetic bugs injected into toy programs (Xie et al.'s study), only involve students, and use dated spectra-based fault localization tools. Thus, the question whether spectra-based fault localization techniques can help professionals to improve their debugging efficiency in a reasonably large project is still insufficiently answered. In this paper, we perform a more realistic investigation of how professionals can use and benefit from spectra-based fault localization techniques. We perform a user study of spectra-based fault localization with a total of 16 real bugs from 4 reasonably large open-source projects, with 36 professionals, amounting to 80 recorded debugging hours. The 36 professionals are divided into 3 groups, i.e., those that use an accurate fault localization tool, use a mediocre fault localization tool, and do not use any fault localization tool. Our study finds that both the accurate and mediocre spectra-based fault localization tools can help professionals to save their debugging time, and the improvements are statistically significant and substantial. We also discuss implications of our findings to future directions of spectra-based fault localization.
TL;DR: This study conducts the first empirical study on the characteristics of the bugs in two main-stream compilers, GCC and LLVM, and shows that even mature production compilers still have many bugs, which may affect development.
Abstract: Compilers are critical, widely-used complex software. Bugs in them have significant impact, and can cause serious damage when they silently miscompile a safety-critical application. An in-depth understanding of compiler bugs can help detect and fix them. To this end, we conduct the first empirical study on the characteristics of the bugs in two main-stream compilers, GCC and LLVM. Our study is significant in scale — it exhaustively examines about 50K bugs and 30K bug fix revisions over more than a decade’s span. This paper details our systematic study. Summary findings include: (1) In both compilers, C++ is the most buggy component, accounting for around 20% of the total bugs and twice as many as the second most buggy component; (2) the bug revealing test cases are typically small, with 80% having fewer than 45 lines of code; (3) most of the bug fixes touch a single source file with small modifications (43 lines for GCC and 38 for LLVM on average); (4) the average lifetime of GCC bugs is 200 days, and 111 days for LLVM; and (5) high priority tends to be assigned to optimizer bugs, most notably 30% of the bugs in GCC’s inter-procedural analysis component are labeled P1 (the highest priority). This study deepens our understanding of compiler bugs. For application developers, it shows that even mature production compilers still have many bugs, which may affect development. For researchers and compiler developers, it sheds light on interesting characteristics of compiler bugs, and highlights challenges and opportunities to more effectively test and debug compilers.
TL;DR: OWL is presented, an Online Watcher for LTE that is able to decode all the resource blocks in more than 99% of the system frames, significantly outperforming existing non-commercial prior decoders.
Abstract: Reliable network measurements are a fundamental component of networking research as they enable network analysis, system debugging, performance evaluation and optimization. In particular, decoding the LTE control channel would give access to the full base station traffic at a 1 ms granularity, thus allowing for traffic profiling and accurate measurements. Although a few open-source implementations of LTE are available, they do not provide tools to reliably decoding the LTE control channel and, thus, accessing the scheduling information. In this paper, we present OWL, an Online Watcher for LTE that is able to decode all the resource blocks in more than 99% of the system frames, significantly outperforming existing non-commercial prior decoders. Compared to previous attempts, OWL grounds the decoding procedure on information obtained from the LTE random access mechanism. This makes it possible to run our software on inexpensive hardware coupled with almost any software defined radio capable of sampling the LTE signal with sufficient accuracy.
TL;DR: Adaptive instructional strategies and materials can be developed for students of different performance levels, to improve associated cognitive activities during debugging, which can foster learning during debugging and programming.
Abstract: This study explores students' cognitive processes while debugging programs by using an eye tracker. Students' eye movements during debugging were recorded by an eye tracker to investigate whether and how high- and low-performance students act differently during debugging. Thirty-eight computer science undergraduates were asked to debug two C programs. The path of students' gaze while following program codes was subjected to sequential analysis to reveal significant sequences of areas examined. These significant gaze path sequences were then compared to those of students with different debugging performances. The results show that, when debugging, high-performance students traced programs in a more logical manner, whereas low-performance students tended to stick to a line-by-line sequence and were unable to quickly derive the program's higher-level logic. Low-performance students also often jumped directly to certain suspected statements to find bugs, without following the program's logic. They also often needed to trace back to prior statements to recall information, and spent more time on manual computation. Based on the research results, adaptive instructional strategies and materials can be developed for students of different performance levels, to improve associated cognitive activities during debugging, which can foster learning during debugging and programming.
TL;DR: Several key features and debugging challenges that differentiate distributed systems from other kinds of software that are presented in this article.
Abstract: Distributed systems pose unique challenges for software developers. Reasoning about concurrent activities of system nodes and even understanding the system’s communication topology can be difficult...
TL;DR: This paper randomly sampled 113 real-world performance bugs from three highly-configurable open-source projects: Apache, MySQL and Firefox to study the challenges that configurability creates for handling performance bugs.
Abstract: Modern computer systems are highly-configurable, complicating the testing and debugging process. The sheer size of the configuration space makes the quality of software even harder to achieve. Performance is one of the key aspects of non-functional qualities, where performance bugs can cause significant performance degradation and lead to poor user experience. However, performance bugs are difficult to expose, primarily because detecting them requires specific inputs, as well as a specific execution environment (e.g., configurations). While researchers have developed techniques to analyze, quantify, detect, and fix performance bugs, we conjecture that many of these techniques may not be effective in highly-configurable systems. In this paper, we study the challenges that configurability creates for handling performance bugs. We study 113 real-world performance bugs, randomly sampled from three highly-configurable open-source projects: Apache, MySQL and Firefox. The findings of this study provide a set of lessons learned and guidance to aid practitioners and researchers to better handle performance bugs in highly-configurable software systems.
TL;DR: This tool demonstration paper presents the execution framework offered by the GEMOC studio, an Eclipse-based language and modeling workbench that provides a generic interface to plug in different execution engines associated to their specific metalanguages used to define the discrete-event operational semantics of DSMLs.
Abstract: The development and evolution of an advanced modeling environment for a Domain-Specific Modeling Language (DSML) is a tedious task, which becomes recurrent with the increasing number of DSMLs involved in the development and management of complex software-intensive systems. Recent efforts in language workbenches result in advanced frameworks that automatically provide syntactic tooling such as advanced editors. However, defining the execution semantics of languages and their tooling remains mostly hand crafted. Similarly to editors that share code completion or syntax highlighting, the development of advanced debuggers, animators, and others execution analysis tools shares common facilities, which should be reused among various DSMLs. In this tool demonstration paper, we present the execution framework offered by the GEMOC studio, an Eclipse-based language and modeling workbench. The framework provides a generic interface to plug in different execution engines associated to their specific metalanguages used to define the discrete-event operational semantics of DSMLs. It also integrates generic runtime services that are shared among the approaches used to implement the execution semantics, such as graphical animation or omniscient debugging.
TL;DR: A new graphical tool targeting at the creation and execution of robotic tasks, called RAFCON, which provides many debugging mechanisms and a GUI with a graphical editor, allowing for intuitive visual programming and fast prototyping.
Abstract: Robotic tasks are becoming increasingly complex, and with this also the robotic systems. This requires new tools to manage this complexity and to orchestrate the systems to fulfill demanding autonomous tasks. For this purpose, we developed a new graphical tool targeting at the creation and execution of robotic tasks, called RAFCON. These tasks are described in hierarchical state machines supporting concurrency. A formal notation of this concept is given. The tool provides many debugging mechanisms and a GUI with a graphical editor, allowing for intuitive visual programming and fast prototyping. The application of RAFCON for an autonomous mobile robot in the SpaceBotCamp competition has already proved to be successful.
TL;DR: This paper shows how techniques from model-based diagnosis can be applied and extended for spreadsheet debugging by translating the relevant parts of a spreadsheet to a constraint satisfaction problem and proposes both problem-specific and generalizable extensions to the classical diagnosis algorithms.
Abstract: Spreadsheet programs are probably the most successful example of end-user software development tools and are used for a variety of purposes. Like any type of software, they are prone to error, in particular as they are usually developed by non-programmers. While various techniques exist to support the developer in finding errors in procedural programs, the tool support for spreadsheet debugging is still limited. In this paper, we show how techniques from model-based diagnosis can be applied and extended for spreadsheet debugging by translating the relevant parts of a spreadsheet to a constraint satisfaction problem. We additionally propose both problem-specific and generalizable extensions to the classical diagnosis algorithms which help to detect potential problems in a spreadsheet based on user-provided test cases more efficiently. The proposed techniques were integrated into a modular framework for spreadsheet debugging and evaluated with respect to scalability based on a number of real-world and artificially created spreadsheets. An additional error detection exercise involving 24 subjects was performed to assess the general applicability of such advanced spreadsheet debugging techniques for end users.
TL;DR: A novel debugging tool for electronic design projects, the Toastboard, that aims to reduce debugging time by improving upon the standard paradigm of point-wise circuit measurements, which allows for immediate visualization of an entire breadboard's state.
Abstract: The recent proliferation of easy to use electronic components and toolkits has introduced a large number of novices to designing and building electronic projects. Nevertheless, debugging circuits remains a difficult and time-consuming task. This paper presents a novel debugging tool for electronic design projects, the Toastboard, that aims to reduce debugging time by improving upon the standard paradigm of point-wise circuit measurements. Ubiquitous instrumentation allows for immediate visualization of an entire breadboard's state, meaning users can diagnose problems based on a wealth of data instead of having to form a single hypothesis and plan before taking a measurement. Basic connectivity information is displayed visually on the circuit itself and quantitative data is displayed on the accompanying web interface. Software-based testing functions further lower the expertise threshold for efficient debugging by diagnosing classes of circuit errors automatically. In an informal study, participants found the detailed, pervasive, and context-rich data from our tool helpful and potentially time-saving.
TL;DR: This paper presents the experience in running a building automation service based on the Salus framework, and demonstrates the usefulness of Salus in systematically debugging the correctness of IoT control systems for building automation.
Abstract: Advances and standards in Internet of Things (IoT) have simplified the realization of building automation. However, non-expert IoT users still lack tools that can help them to ensure the underlying control system correctness: user-programmable logics match the user intention. In fact, non-expert IoT users lack the necessary know-how of domain experts. This paper presents our experience in running a building automation service based on the Salus framework. Complementing efforts that simply verify the IoT control system correctness, Salus takes novel steps to tackle practical challenges in automated debugging of identified policy violations, for non-expert IoT users. First, Salus leverages formal methods to localize faulty user-programmable logics. Second, to debug these identified faults, Salus selectively transforms the control system logics into a set of parameterized equations, which can then be solved by popular model checking tools or SMT (Satisfiability Modulo Theories) solvers. Through office deployments, user studies, and public datasets, we demonstrate the usefulness of Salus in systematically debugging the correctness of IoT control systems for building automation.
TL;DR: VeriDP is a tool that can continuously monitor what the authors call control-data plane consistency, defined as the consistency between control plane policies and data plane forwarding behaviors, and can localize faulty switches when verification fails.
Abstract: How to debug large networks is always a challenging task. Software Defined Network (SDN) offers a centralized con- trol platform where operators can statically verify network policies, instead of checking configuration files device-by-device. While such a static verification is useful, it is still not enough: due to data plane faults, packets may not be forwarded according to control plane policies, resulting in network faults at runtime. To address this issue, we present VeriDP, a tool that can continuously monitor what we call control-data plane consistency, defined as the consistency between control plane policies and data plane forwarding behaviors. We prototype VeriDP with small modifications of both hardware and software SDN switches, and show that it can achieve a verification speed of 3 μs per packet, with a false negative rate as low as 0.1%, for the Stanford backbone and Internet2 topologies. In addition, when verification fails, VeriDP can localize faulty switches with a probability as high as 96% for fat tree topologies.
TL;DR: It is concluded that targeted malware does not use more anti-debugging and anti-VM techniques than generic malware, although targeted malware tend to have a lower antivirus detection rate.
Abstract: Malware is becoming more and more advanced. As part of the sophistication, malware typically deploys various anti-debugging and anti-VM techniques to prevent detection. While defenders use debuggers and virtualized environment to analyze malware, malware authors developed anti-debugging and anti-VM techniques to evade this defense approach. In this paper, we investigate the use of anti-debugging and anti-VM techniques in modern malware, and compare their presence in 16,246 generic and 1,037 targeted malware samples (APTs). As part of this study we found several counter-intuitive trends. In particular, our study concludes that targeted malware does not use more anti-debugging and anti-VM techniques than generic malware, although targeted malware tend to have a lower antivirus detection rate. Moreover, this paper even identifies a decrease over time of the number of anti-VM techniques used in APTs and the Winwebsec malware family.
TL;DR: A controlled experiment to quantify the impact of variability on debugging of preprocessor- based programs and shows that the speed of bug finding decreases linearly with the level of variability, while effectiveness of finding bugs is relatively independent of the degree of variability.
Abstract: Software projects embrace variability to increase adaptability and to lower cost; however, others blame variability for increasing complexity and making reasoning about programs more difficult. We carry out a controlled experiment to quantify the impact of variability on debugging of preprocessor- based programs. We measure speed and precision for bug finding tasks defined at three different degrees of variability on several subject programs derived from real systems. The results show that the speed of bug finding decreases linearly with the degree of variability, while effectiveness of finding bugs is relatively independent of the degree of variability. Still, identifying the set of configurations in which the bug manifests itself is difficult already for a low degree of variability. Surprisingly, identifying the exact set of affected configurations appears to be harder than finding the bug in the first place. The difficulty in reasoning about several configurations is a likely reason why the variability bugs are actually introduced in configurable programs. We hope that the detailed findings presented here will inspire the creation of programmer support tools addressing the challenges faced by developers when reasoning about configurations, contributing to more effective debugging and, ultimately, fewer bugs in highly-configurable systems.
TL;DR: Trigger analysis helps selecting the best VV 666 bug reports from two applications and provides a fine-grain characterization of bug manifestation that is expected to increase the perceived importance of this dimension in testing, debugging, and fault tolerance strategies.
TL;DR: This work proposes a novel automated approach for debugging web sites that is based on image processing and probabilistic techniques and predicts the elements and styling properties most likely to cause the observed failure for the page under test and reports these to the developer.
Abstract: Presentation failures in a website can undermine its success by giving users a negative perception of the trustworthiness of the site and the quality of the services it delivers. Unfortunately, existing techniques for debugging presentation failures do not provide developers with automated and broadly applicable solutions for finding the site's faulty HTML elements and CSS properties. To address this limitation, we propose a novel automated approach for debugging web sites that is based on image processing and probabilistic techniques. Our approach first builds a model that links observable changes in the web site's appearance to faulty elements and styling properties. Then using this model, our approach predicts the elements and styling properties most likely to cause the observed failure for the page under test and reports these to the developer. In evaluation, our approach was more accurate and faster than prior techniques for identifying faulty elements in a website.
TL;DR: This work proposes complete, sound and optimal methods for the interactive debugging of KBs that suggest the one (minimally invasive) error correction of the faulty KB that yields a repaired KB with exactly the intended semantics.
Abstract: Many AI applications rely on knowledge about a relevant real-world domain that is encoded by means of some logical knowledge base (KB). The most essential benefit of logical KBs is the opportunity to perform automatic reasoning to derive implicit knowledge or to answer complex queries about the modeled domain. The feasibility of meaningful reasoning requires KBs to meet some minimal quality criteria such as logical consistency. Without adequate tool assistance, the task of resolving violated quality criteria in KBs can be extremely tough even for domain experts, especially when the problematic KB includes a large number of logical formulas or comprises complicated logical formalisms.
Published non-interactive debugging systems often cannot localize all possible faults (incompleteness), suggest the deletion or modification of unnecessarily large parts of the KB (non-minimality), return incorrect solutions which lead to a repaired KB not satisfying the imposed quality requirements (unsoundness) or suffer from poor scalability due to the inherent complexity of the KB debugging problem. Even if a system is complete and sound and considers only minimal solutions, there are generally exponentially many solution candidates to select one from. However, any two repaired KBs obtained from these candidates differ in their semantics in terms of entailments and non-entailments. Selection of just any of these repaired KBs might result in unexpected entailments, the loss of desired entailments or unwanted changes to the KB.
This work proposes complete, sound and optimal methods for the interactive debugging of KBs that suggest the one (minimally invasive) error correction of the faulty KB that yields a repaired KB with exactly the intended semantics. Users, e.g. domain experts, are involved in the debugging process by answering automatically generated queries about the intended domain.
TL;DR: An interdisciplinary study to analyze the brain activity during code inspection tasks using functional magnetic resonance imaging (fMRI), which is a well-established tool in cognitive neuroscience research, and confirmed that brain areas associated with language processing and mathematics are highly active during code reviewing.
Abstract: We propose that understanding functional patterns of activity in mapped brain regions associated with code comprehension tasks and, more specifically, to the activity of finding bugs in traditional code inspections could reveal useful insights to improve software reliability and to improve the software development process in general. This includes helping to select the best professionals for the debugging effort, improving the conditions for code inspections, and identify new directions to follow for training code reviewers. This paper presents an interdisciplinary study to analyze the brain activity during code inspection tasks using functional magnetic resonance imaging (fMRI), which is a well-established tool in cognitive neuroscience research. We used several programs where realistic bugs representing the most frequent types of software faults found in the field were injected. The code inspectors involved in the research include programmers with different levels of expertise and experience in real code reviews. The goal is to understand brain activity patterns associated with code comprehension tasks and, more specifically, the brain activity when the code reviewer identifies a bug in the code ('eureka' moment), which can be a true positive or a false positive. Our results confirmed that brain areas associated with language processing and mathematics are highly active during code reviewing and shows that there are specific brain activity patterns that can be related to the decision-making moment of suspicion/bug detection. Importantly, the activity at the anterior insula region that we find to play a relevant role in the process of identifying software bugs is positively correlated to the precision of bug detection by the inspectors. This finding provides a new perspective on the role of this region on error awareness and monitoring and of its potential predictive value in predicting the quality of bug removing.
TL;DR: CREDAL is developed, an automatic tool that employs the source code of a crashing program to enhance core dump analysis and turns a core dump to an informative aid in tracking down memory corruption vulnerabilities.
Abstract: After a program has crashed and terminated abnormally, it typically leaves behind a snapshot of its crashing state in the form of a core dump. While a core dump carries a large amount of information, which has long been used for software debugging, it barely serves as informative debugging aids in locating software faults, particularly memory corruption vulnerabilities. A memory corruption vulnerability is a special type of software faults that an attacker can exploit to manipulate the content at a certain memory. As such, a core dump may contain a certain amount of corrupted data, which increases the difficulty in identifying useful debugging information (e.g. , a crash point and stack traces). Without a proper mechanism to deal with this problem, a core dump can be practically useless for software failure diagnosis. In this work, we develop CREDAL, an automatic tool that employs the source code of a crashing program to enhance core dump analysis and turns a core dump to an informative aid in tracking down memory corruption vulnerabilities. Specifically, CREDAL systematically analyzes a core dump potentially corrupted and identifies the crash point and stack frames. For a core dump carrying corrupted data, it goes beyond the crash point and stack trace. In particular, CREDAL further pinpoints the variables holding corrupted data using the source code of the crashing program along with the stack frames. To assist software developers (or security analysts) in tracking down a memory corruption vulnerability, CREDAL also performs analysis and highlights the code fragments corresponding to data corruption. To demonstrate the utility of CREDAL, we use it to analyze 80 crashes corresponding to 73 memory corruption vulnerabilities archived in Offensive Security Exploit Database. We show that, CREDAL can accurately pinpoint the crash point and (fully or partially) restore a stack trace even though a crashing program stack carries corrupted data. In addition, we demonstrate CREDAL can potentially reduce the manual effort of finding the code fragment that is likely to contain memory corruption vulnerabilities.
TL;DR: This work embarked on a two year journey to create a production quality time-traveling debugger in Microsoft's open-source ChakraCore JavaScript engine and the popular Node.js application framework.
Abstract: Time-traveling in the execution history of a program during debugging enables a developer to precisely track and understand the sequence of statements and program values leading to an error. To provide this functionality to real world developers, we embarked on a two year journey to create a production quality time-traveling debugger in Microsoft's open-source ChakraCore JavaScript engine and the popular Node.js application framework.
TL;DR: This paper presents a new methodology for testing distributed systems that applies advanced systematic testing techniques to thoroughly check that the executable code adheres to its high-level specifications, which significantly improves coverage of important system behaviors.
Abstract: Testing distributed systems is challenging due to multiple sources of nondeterminism. Conventional testing techniques, such as unit, integration and stress testing, are ineffective in preventing serious but subtle bugs from reaching production. Formal techniques, such as TLA+, can only verify high-level specifications of systems at the level of logic-based models, and fall short of checking the actual executable code. In this paper, we present a new methodology for testing distributed systems. Our approach applies advanced systematic testing techniques to thoroughly check that the executable code adheres to its high-level specifications, which significantly improves coverage of important system behaviors.
Our methodology has been applied to three distributed storage systems in the Microsoft Azure cloud computing platform. In the process, numerous bugs were identified, reproduced, confirmed and fixed. These bugs required a subtle combination of concurrency and failures, making them extremely difficult to find with conventional testing techniques. An important advantage of our approach is that a bug is uncovered in a small setting and witnessed by a full system trace, which dramatically increases the productivity of debugging.
TL;DR: PerfCompass is presented, an online performance anomaly fault debugging tool that can quantify whether a production-run performance anomaly has a global impact or local impact and provides useful diagnosis hints within several minutes and imposes negligible runtime overhead to the production system during normal execution time.
Abstract: Infrastructure-as-a-service clouds are becoming widely adopted. However, resource sharing and multi-tenancy have made performance anomalies a top concern for users. Timely debugging those anomalies is paramount for minimizing the performance penalty for users. Unfortunately, this debugging often takes a long time due to the inherent complexity and sharing nature of cloud infrastructures. When an application experiences a performance anomaly, it is important to distinguish between faults with a global impact and faults with a local impact as the diagnosis and recovery steps for faults with a global impact or local impact are quite different. In this paper, we present PerfCompass, an online performance anomaly fault debugging tool that can quantify whether a production-run performance anomaly has a global impact or local impact. PerfCompass can use this information to suggest the root cause as either an external fault (e.g., environment-based) or an internal fault (e.g., software bugs). Furthermore, PerfCompass can identify top affected system calls to provide useful diagnostic hints for detailed performance debugging. PerfCompass does not require source code or runtime application instrumentation, which makes it practical for production systems. We have tested PerfCompass by running five common open source systems (e.g., Apache, MySQL, Tomcat, Hadoop, Cassandra) inside a virtualized cloud testbed. Our experiments use a range of common infrastructure sharing issues and real software bugs. The results show that PerfCompass accurately classifies 23 out of the 24 tested cases without calibration and achieves 100 percent accuracy with calibration. PerfCompass provides useful diagnosis hints within several minutes and imposes negligible runtime overhead to the production system during normal execution time.