TL;DR: ARJA as discussed by the authors is a genetic programming based approach for automated repair of Java programs, which decouples the search subspaces of likely-buggy locations, operation types and potential fix ingredients, enabling GP to explore the search space more effectively.
Abstract: Automated program repair is the problem of automatically fixing bugs in programs in order to significantly reduce the debugging costs and improve the software quality. To address this problem, test-suite based repair techniques regard a given test suite as an oracle and modify the input buggy program to make the entire test suite pass. GenProg is well recognized as a prominent repair approach of this kind, which uses genetic programming (GP) to rearrange the statements already extant in the buggy program. However, recent empirical studies show that the performance of GenProg is not fully satisfactory, particularly for Java. In this paper, we propose ARJA, a new GP based repair approach for automated repair of Java programs. To be specific, we present a novel lower-granularity patch representation that properly decouples the search subspaces of likely-buggy locations, operation types and potential fix ingredients, enabling GP to explore the search space more effectively. Based on this new representation, we formulate automated program repair as a multi-objective search problem and use NSGA-II to look for simpler repairs. To reduce the computational effort and search space, we introduce a test filtering procedure that can speed up the fitness evaluation of GP and three types of rules that can be applied to avoid unnecessary manipulations of the code. Moreover, we also propose a type matching strategy that can create new potential fix ingredients by exploiting the syntactic patterns of existing statements. We conduct a large-scale empirical evaluation of ARJA along with its variants on both seeded bugs and real-world bugs in comparison with several state-of-the-art repair approaches. Our results verify the effectiveness and efficiency of the search mechanisms employed in ARJA and also show its superiority over the other approaches. In particular, compared to jGenProg (an implementation of GenProg for Java), an ARJA version fully following the redundancy assumption can generate a test-suite adequate patch for more than twice the number of bugs (from 27 to 59), and a correct patch for nearly four times of the number (from 5 to 18), on 224 real-world bugs considered in Defects4J. Furthermore, ARJA is able to correctly fix several real multi-location bugs that are hard to be repaired by most of the existing repair approaches.
TL;DR: In this paper, the authors propose a systematic and automated approach to mine relevant and actionable fix patterns based on an iterative clustering strategy applied to atomic changes within patches, which can be leveraged to extract generic fix actions.
Abstract: Patching is a common activity in software development. It is generally performed on a source code base to address bugs or add new functionalities. In this context, given the recurrence of bugs across projects, the associated similar patches can be leveraged to extract generic fix actions. While the literature includes various approaches leveraging similarity among patches to guide program repair, these approaches often do not yield fix patterns that are tractable and reusable as actionable input to APR systems. In this paper, we propose a systematic and automated approach to mining relevant and actionable fix patterns based on an iterative clustering strategy applied to atomic changes within patches. The goal of FixMiner is thus to infer separate and reusable fix patterns that can be leveraged in other patch generation systems. Our technique, FixMiner, leverages Rich Edit Script which is a specialized tree structure of the edit scripts that captures the AST-level context of the code changes. FixMiner uses different tree representations of Rich Edit Scripts for each round of clustering to identify similar changes. These are abstract syntax trees, edit actions trees, and code context trees. We have evaluated FixMiner on thousands of software patches collected from open source projects. Preliminary results show that we are able to mine accurate patterns, efficiently exploiting change information in Rich Edit Scripts. We further integrated the mined patterns to an automated program repair prototype, PARFixMiner, with which we are able to correctly fix 26 bugs of the Defects4J benchmark. Beyond this quantitative performance, we show that the mined fix patterns are sufficiently relevant to produce patches with a high probability of correctness: 81% of PARFixMiner’s generated plausible patches are correct.
TL;DR: Proq, a runtime assertion scheme for testing and debugging quantum programs on a quantum computer, is proposed and it is rigorously proved that checking projection-based assertions can help locate bugs or statistically assure that the semantic function of the tested program is close to what the authors expect, for both exact and approximate quantum programs.
Abstract: In this paper, we propose Proq, a runtime assertion scheme for testing and debugging quantum programs on a quantum computer. The predicates in Proq are represented by projections (or equivalently, closed subspaces of the state space), following Birkhoff-von Neumann quantum logic. The satisfaction of a projection by a quantum state can be directly checked upon a small number of projective measurements rather than a large number of repeated executions. On the theory side, we rigorously prove that checking projection-based assertions can help locate bugs or statistically assure that the semantic function of the tested program is close to what we expect, for both exact and approximate quantum programs. On the practice side, we consider hardware constraints and introduce several techniques to transform the assertions, making them directly executable on the measurement-restricted quantum computers. We also propose to achieve simplified assertion implementation using local projection technique with soundness guaranteed. We compare Proq with existing quantum program assertions and demonstrate the effectiveness and efficiency of Proq by its applications to assert two sophisticated quantum algorithms, the Harrow-Hassidim-Lloyd algorithm and Shor’s algorithm.
TL;DR: SmartBugs is presented, an extensible and easy-to-use execution framework that simplifies the execution of analysis tools on smart contracts written in Solidity, the primary language used in Ethereum.
Abstract: Over the last few years, there has been substantial research on automated analysis, testing, and debugging of Ethereum smart contracts. However, it is not trivial to compare and reproduce that research. To address this, we present SmartBugs, an extensible and easy-to-use execution framework that simplifies the execution of analysis tools on smart contracts written in Solidity, the primary language used in Ethereum. SmartBugs is currently distributed with support for 10 tools and two datasets of Solidity contracts. The first dataset can be used to evaluate the precision of analysis tools, as it contains 143 annotated vulnerable contracts with 208 tagged vulnerabilities. The second dataset contains 47,518 unique contracts collected through Etherscan. We discuss how SmartBugs supported the largest experimental setup to date both in the number of tools and in execution time. Moreover, we show how it enables easy integration and comparison of analysis tools by presenting a new extension to the tool SmartCheck that improves substantially the detection of vulnerabilities related to the DASP10 categories Bad Randomness, Time Manipulation, and Access Control (identified vulnerabilities increased from 11% to 24%).
TL;DR: This paper presents the first comprehensive empirical study on program failures of deep learning jobs collected from a deep learning platform in Microsoft, and identifies the common root causes and bug-fix solutions on a sample of 400 failures.
Abstract: Deep learning has made significant achievements in many application areas. To train and test models more efficiently, enterprise developers submit and run their deep learning programs on a shared, multi-tenant platform. However, some of the programs fail after a long execution time due to code/script defects, which reduces the development productivity and wastes expensive resources such as GPU, storage, and network I/O. This paper presents the first comprehensive empirical study on program failures of deep learning jobs. 4960 real failures are collected from a deep learning platform in Microsoft. We manually examine their failure messages and classify them into 20 categories. In addition, we identify the common root causes and bug-fix solutions on a sample of 400 failures. To better understand the current testing and debugging practices for deep learning, we also conduct developer interviews. Our major findings include: (1) 48.0% of the failures occur in the interaction with the platform rather than in the execution of code logic, mostly due to the discrepancies between local and platform execution environments; (2) Deep learning specific failures (13.5%) are mainly caused by inappropriate model parameters/structures and framework API misunderstanding; (3) Current debugging practices are not efficient for fault localization in many cases, and developers need more deep learning specific tools. Based on our findings, we further suggest possible research topics and tooling support that could facilitate future deep learning development.
TL;DR: This project creates another benchmark database and tool that contain 493 real bugs from 17 real-world Python programs, inspired by Defects4J, that can help catalyze future work on testing and debugging tools that work on Python programs.
Abstract: The 2019 edition of Stack Overflow developer survey highlights that, for the first time, Python outperformed Java in terms of popularity. The gap between Python and Java further widened in the 2020 edition of the survey. Unfortunately, despite the rapid increase in Python's popularity, there are not many testing and debugging tools that are designed for Python. This is in stark contrast with the abundance of testing and debugging tools for Java. Thus, there is a need to push research on tools that can help Python developers. One factor that contributed to the rapid growth of Java testing and debugging tools is the availability of benchmarks. A popular benchmark is the Defects4J benchmark; its initial version contained 357 real bugs from 5 real-world Java programs. Each bug comes with a test suite that can expose the bug. Defects4J has been used by hundreds of testing and debugging studies and has helped to push the frontier of research in these directions. In this project, inspired by Defects4J, we create another benchmark database and tool that contain 493 real bugs from 17 real-world Python programs. We hope our benchmark can help catalyze future work on testing and debugging tools that work on Python programs.
TL;DR: The echo-debugger, a tool to debug two different executions in parallel, and the Convergence Divergence Mapping (CDM) algorithm to locate all the control-flow divergences and convergences of these executions are proposed.
Abstract: As applications get developed, bugs inevitably get introduced. Often, it is unclear why a given code change introduced a given bug. To find this causal relation and more effectively debug, developers can leverage the existence of a previous version of the code, without the bug. But traditional debug-ging tools are not designed for this type of work, making this operation tedious. In this article, we propose as exploratory work the echo-debugger, a tool to debug two different executions in parallel, and the Convergence Divergence Mapping (CDM) algorithm to locate all the control-flow divergences and convergences of these executions. In this exploratory work, we present the architecture of the tool and a scenario to solve a non trivial bug.
TL;DR: TACTO is a step towards the widespread adoption of touch sensing in robotic applications, and to enable machine learning practitioners interested in multi-modal learning and control, and is provided a proof-of-concept that TACTO can be successfully used for Sim2Real applications.
Abstract: Simulators perform an important role in prototyping, debugging and benchmarking new advances in robotics and learning for control. Although many physics engines exist, some aspects of the real-world are harder than others to simulate. One of the aspects that have so far eluded accurate simulation is touch sensing. To address this gap, we present TACTO -- a fast, flexible and open-source simulator for vision-based tactile sensors. This simulator allows to render realistic high-resolution touch readings at hundreds of frames per second, and can be easily configured to simulate different vision-based tactile sensors, including GelSight, DIGIT and OmniTact. In this paper, we detail the principles that drove the implementation of TACTO and how they are reflected in its architecture. We demonstrate TACTO on a perceptual task, by learning to predict grasp stability using touch from 1 million grasps, and on a marble manipulation control task. We believe that TACTO is a step towards the widespread adoption of touch sensing in robotic applications, and to enable machine learning practitioners interested in multi-modal learning and control. TACTO is open-source at this https URL.
TL;DR: SmartBugs as discussed by the authors is an extensible and easy-to-use execution framework that simplifies the execution of analysis tools on smart contracts written in Solidity, the primary language used in Ethereum.
Abstract: Over the last few years, there has been substantial research on automated analysis, testing, and debugging of Ethereum smart contracts. However, it is not trivial to compare and reproduce that research. To address this, we present SmartBugs, an extensible and easy-to-use execution framework that simplifies the execution of analysis tools on smart contracts written in Solidity, the primary language used in Ethereum. SmartBugs is currently distributed with support for 10 tools and two datasets of Solidity contracts. The first dataset can be used to evaluate the precision of analysis tools, as it contains 143 annotated vulnerable contracts with 208 tagged vulnerabilities. The second dataset contains 47,518 unique contracts collected through Etherscan. We discuss how SmartBugs supported the largest experimental setup to date both in the number of tools and in execution time. Moreover, we show how it enables easy integration and comparison of analysis tools by presenting a new extension to the tool SmartCheck that improves substantially the detection of vulnerabilities related to the DASP10 categories Bad Randomness, Time Manipulation, and Access Control (identified vulnerabilities increased from 11% to 24%).
TL;DR: The analysis results that major achievements and future potentials of AI are the automation of lengthy routine jobs in software development and testing using algorithms and the structured analysis of big data pools to discover patterns and novel information clusters and the systematic evaluation of these data in neural networks.
Abstract: Although Artificial Intelligence (AI) has become a buzzword for self-organizing IT applications, its relevance to software engineering has hardly been analyzed systematically. This study combines a systematic review of previous research in the field and five qualitative interviews with software developers who use or want to use AI tools in their daily work routines, to assess the status of development, future development potentials and equally the risks of AI application to software engineering. The study classifies the insights in the software development life cycle. The analysis results that major achievements and future potentials of AI are a) the automation of lengthy routine jobs in software development and testing using algorithms, e.g. for debugging and documentation, b) the structured analysis of big data pools to discover patterns and novel information clusters and c) the systematic evaluation of these data in neural networks. AI thus contributes to speed up development processes, realize development cost reductions and efficiency gains. AI to date depends on man-made structures and is mainly reproductive, but the automation of software engineering routines entails a major advantage: Human developers multiply their creative potential when using AI tools effectively.
TL;DR: This paper designs quantum circuits to assert classical states, entanglement, and superposition states, and shows that they are effective in debugging as well as improving the success rate for various quantum algorithms on IBM Q quantum computers.
Abstract: In this paper, we propose quantum circuits for runtime assertions, which can be used for both software debugging and error detection. Runtime assertion is challenging in quantum computing for two key reasons. First, a quantum bit (qubit) cannot be copied, which is known as the non-cloning theorem. Second, when a qubit is measured, its superposition state collapses into a classical state, losing the inherent parallel information. In this paper, we overcome these challenges with runtime computation through ancilla qubits, which are used to indirectly collect the information of the qubits of interest. We design quantum circuits to assert classical states, entanglement, and superposition states. Our experimental results show that they are effective in debugging as well as improving the success rate for various quantum algorithms on IBM Q quantum computers.
TL;DR: The unified debugging approach is proposed to unify the two areas in the other direction for the first time, i.e., can program repair in turn help with fault localization, and designed ProFL to leverage patch-execution results (from program repair) as the feedback information for fault localization.
Abstract: A large body of research efforts have been dedicated to automated software debugging, including both automated fault localization and program repair. However, existing fault localization techniques have limited effectiveness on real-world software systems while even the most advanced program repair techniques can only fix a small ratio of real-world bugs. Although fault localization and program repair are inherently connected, their only existing connection in the literature is that program repair techniques usually use off-the-shelf fault localization techniques (e.g., Ochiai) to determine the potential candidate statements/elements for patching. In this work, we propose the unified debugging approach to unify the two areas in the other direction for the first time, i.e., can program repair in turn help with fault localization? In this way, we not only open a new dimension for more powerful fault localization, but also extend the application scope of program repair to all possible bugs (not only the bugs that can be directly automatically fixed). We have designed ProFL to leverage patch-execution results (from program repair) as the feedback information for fault localization. The experimental results on the widely used Defects4J benchmark show that the basic ProFL can already at least localize 37.61% more bugs within Top-1 than state-of-the-art spectrum and mutation based fault localization. Furthermore, ProFL can boost state-of-the-art fault localization via both unsupervised and supervised learning. Meanwhile, we have demonstrated ProFL's effectiveness under different settings and through a case study within Alipay, a popular online payment system with over 1 billion global users.
TL;DR: Although MFL is becoming of grave concern, existing solutions in the field are less mature and concrete solutions to reduce MFL debugging time and cost by adopting an approach such as MBA debugging approach are also less, which require more attention from the research community.
Abstract: Context Multiple fault localization (MFL) is the act of identifying the locations of multiple faults (more than one fault) in a faulty software program. This is known to be more complicated, tedious, and costly in comparison to the traditional practice of presuming that a software contains a single fault. Due to the increasing interest in MFL by the research community, a broad spectrum of MFL debugging approaches and solutions have been proposed and developed. Objective The aim of this study is to systematically review existing research on MFL in the software fault localization (SFL) domain. This study also aims to identify, categorize, and synthesize relevant studies in the research domain. Method Consequently, using an evidence-based systematic methodology, we identified 55 studies relevant to four research questions. The methodology provides a systematic selection and evaluation process with rigorous and repeatable evidence-based studies selection process. Result The result of the systematic review shows that research on MFL is gaining momentum with stable growth in the last 5 years. Three prominent MFL debugging approaches were identified, i.e. One-bug-at-a-time debugging approach (OBA), parallel debugging approach, and multiple-bug-at-a-time debugging approach (MBA), with OBA debugging approach being utilized the most. Conclusion The study concludes with some identified research challenges and suggestions for future research. Although MFL is becoming of grave concern, existing solutions in the field are less mature. Studies utilizing real faults in their experiments are scarce. Concrete solutions to reduce MFL debugging time and cost by adopting an approach such as MBA debugging approach are also less, which require more attention from the research community.
TL;DR: A sequential language model that uses an attention-mechanism-based long short-term memory (LSTM) neural network to assess and classify source code based on the estimated error probability and combined the attention mechanism with LSTM to enhance the results of error assessment and detection as well as source code classification is proposed.
Abstract: The rate of software development has increased dramatically. Conventional compilers cannot assess and detect all source code errors. Software may thus contain errors, negatively affecting end-users. It is also difficult to assess and detect source code logic errors using traditional compilers, resulting in software that contains errors. A method that utilizes artificial intelligence for assessing and detecting errors and classifying source code as correct (error-free) or incorrect is thus required. Here, we propose a sequential language model that uses an attention-mechanism-based long short-term memory (LSTM) neural network to assess and classify source code based on the estimated error probability. The attentive mechanism enhances the accuracy of the proposed language model for error assessment and classification. We trained the proposed model using correct source code and then evaluated its performance. The experimental results show that the proposed model has logic and syntax error detection accuracies of 92.2% and 94.8%, respectively, outperforming state-of-the-art models. We also applied the proposed model to the classification of source code with logic and syntax errors. The average precision, recall, and F-measure values for such classification are much better than those of benchmark models. To strengthen the proposed model, we combined the attention mechanism with LSTM to enhance the results of error assessment and detection as well as source code classification. Finally, our proposed model can be effective in programming education and software engineering by improving code writing, debugging, error-correction, and reasoning.
TL;DR: It is found that developers face a variety of challenges for all nine configuration engineering activities, starting from the creation of options, which generally is not planned beforehand and increases the complexity of a software system, to the non-trivial comprehension and debugging of configurations, and ending with the risky maintenance of configuration options, since developers avoid touching and changing configuration options in a mature system.
Abstract: Modern software applications are adapted to different situations (e.g., memory limits, enabling/disabling features, database credentials) by changing the values of configuration options, without any source code modifications. According to several studies, this flexibility is expensive as configuration failures represent one of the most common types of software failures. They are also hard to debug and resolve as they require a lot of effort to detect which options are misconfigured among a large number of configuration options and values, while comprehension of the code also is hampered by sprinkling conditional checks of the values of configuration options. Although researchers have proposed various approaches to help debug or prevent configuration failures, especially from the end users’ perspective, this paper takes a step back to understand the process required by practitioners to engineer the run-time configuration options in their source code, the challenges they experience as well as best practices that they have or could adopt. By interviewing 14 software engineering experts, followed by a large survey on 229 Java software engineers, we identified 9 major activities related to configuration engineering, 22 challenges faced by developers, and 24 expert recommendations to improve software configuration quality. We complemented this study by a systematic literature review to enrich the experts’ recommendations, and to identify possible solutions discussed and evaluated by the research community for the developers’ problems and challenges. We find that developers face a variety of challenges for all nine configuration engineering activities, starting from the creation of options, which generally is not planned beforehand and increases the complexity of a software system, to the non-trivial comprehension and debugging of configurations, and ending with the risky maintenance of configuration options, since developers avoid touching and changing configuration options in a mature system. We also find that researchers thus far focus primarily on testing and debugging configuration failures, leaving a large range of opportunities for future work.
TL;DR: E9Patch develops a suite of binary rewriting methodologies that can insert jumps to trampolines without the need to move other instructions, and is robust by design, and can scale to very large (>100MB) stripped binaries including the Google Chrome and FireFox web browsers.
Abstract: Static binary rewriting has many important applications in software security and systems, such as hardening, repair, patching, instrumentation, and debugging. While many different static binary rewriting tools have been proposed, most rely on recovering control flow information from the input binary. The recovery step is necessary since the rewriting process may move instructions, meaning that the set of jump targets in the rewritten binary needs to be adjusted accordingly. Since the static recovery of control flow information is a hard problem in general, most tools rely on a set of simplifying heuristics or assumptions, such as specific compilers, specific source languages, or binary file meta information. However, the reliance on assumptions or heuristics tends to scale poorly in practice, and most state-of-the-art static binary rewriting tools cannot handle very large/complex programs such as web browsers. In this paper we present E9Patch, a tool that can statically rewrite x86_64 binaries without any knowledge of control flow information. To do so, E9Patch develops a suite of binary rewriting methodologies---such as instruction punning, padding, and eviction---that can insert jumps to trampolines without the need to move other instructions. Since this preserves the set of jump targets, the need for control flow recovery and related heuristics is eliminated. As such, E9Patch is robust by design, and can scale to very large (>100MB) stripped binaries including the Google Chrome and FireFox web browsers. We also evaluate the effectiveness of E9Patch against realistic applications such as binary instrumentation, hardening and repair.
TL;DR: This work provides XFDetector, a tool that detects cross-failure bugs by automatically injecting failures into the pre-failures execution, and checking for cross- failure races and semantic bugs in the post-failURE continuation.
Abstract: Persistent memory (PM) technologies, such as Intel's Optane memory, deliver high performance, byte-addressability, and persistence, allowing programs to directly manipulate persistent data in memory without any OS intermediaries. An important requirement of these programs is that persistent data must remain consistent across a failure, which we refer to as the crash consistency guarantee. However, maintaining crash consistency is not trivial. We identify that a consistent recovery critically depends not only on the execution before the failure, but also on the recovery and resumption after failure. We refer to these stages as the pre- and post-failure execution stages. In order to holistically detect crash consistency bugs, we categorize the underlying causes behind inconsistent recovery due to incorrect interactions between the pre- and post-failure execution. First, a program is not crash-consistent if the post-failure stage reads from locations that are not guaranteed to be persisted in all possible access interleavings during the pre-failure stage -- a type of programming error that leads to a race that we refer to as a cross-failure race. Second, a program is not crash-consistent if the post-failure stage reads persistent data that has been left semantically inconsistent during the pre-failure stage, such as a stale log or uncommitted data. We refer to this type of bugs as a cross-failure semantic bug. Together, they form the cross-failure bugs in PM programs. In this work, we provide XFDetector, a tool that detects cross-failure bugs by automatically injecting failures into the pre-failure execution, and checking for cross-failure races and semantic bugs in the post-failure continuation. XFDetector has detected four new bugs in three pieces of PM software: one of PMDK's examples, a PM-optimized Redis database, and a PMDK library function.
TL;DR: It is observed that the exceptions thrown from Android framework (termed “framework-specific exceptions”) account for the majority, and DroidDefects, the first comprehensive and largest benchmark of Android app exception bugs, is constructed, which contains 33 reproducible exceptions.
Abstract: Mobile apps have become ubiquitous. Ensuring their correctness and reliability is important. However, many apps still suffer from occasional to frequent crashes, weakening their competitive edge. Large-scale, deep analyses of the characteristics of real-world app crashes can provide useful insights to both developers and researchers. However, such studies are difficult and yet to be carried out --- this work fills this gap. We collected 16,245 and 8,760 unique exceptions from 2,486 open-source and 3,230 commercial Android apps, respectively, and observed that the exceptions thrown from Android framework (termed "framework-specific exceptions") account for the majority. With one-year effort, we (1) extensively investigated these framework-specific exceptions, and (2) further conducted an online survey of 135 professional app developers about how they analyze, test, reproduce and fix these exceptions. Specifically, we aim to understand the framework-specific exceptions from several perspectives: (i) their characteristics (e.g., manifestation locations, fault taxonomy), (ii) the developers' testing practices, (iii) existing bug detection techniques' effectiveness, (iv) their reproducibility and (v) bug fixes. To enable follow-up research (e.g., bug understanding, detection, localization and repairing), we further systematically constructed, DroidDefects, the first comprehensive and largest benchmark of Android app exception bugs. This benchmark contains 33 reproducible exceptions (with test cases, stack traces, faulty and fixed app versions, bug types, etc.), and 3,696 ground-truth exceptions (real faults manifested by automated testing tools), which cover the apps with different complexities and diverse exception types. Based on our findings, we also built two prototype tools: Stoat+, an optimized dynamic testing tool, which quickly uncovered three previously-unknown, fixed crashes in Gmail and Google+; ExLocator, an exception localization tool, which can locate the root causes of specific exception types. Our dataset, benchmark and tools are publicly available on https://github.com/tingsu/droiddefects.
TL;DR: In this article, the authors demonstrate a unified approach to rigorous design of safetycritical autonomous systems using the VerifAI toolkit for formal analysis of AI-based systems, including modeling, falsification, debugging, and ML component retraining.
Abstract: We demonstrate a unified approach to rigorous design of safety-critical autonomous systems using the VerifAI toolkit for formal analysis of AI-based systems. VerifAI provides an integrated toolchain for tasks spanning the design process, including modeling, falsification, debugging, and ML component retraining. We evaluate all of these applications in an industrial case study on an experimental autonomous aircraft taxiing system developed by Boeing, which uses a neural network to track the centerline of a runway. We define runway scenarios using the Scenic probabilistic programming language, and use them to drive tests in the X-Plane flight simulator. We first perform falsification, automatically finding environment conditions causing the system to violate its specification by deviating significantly from the centerline (or even leaving the runway entirely). Next, we use counterexample analysis to identify distinct failure cases, and confirm their root causes with specialized testing. Finally, we use the results of falsification and debugging to retrain the network, eliminating several failure cases and improving the overall performance of the closed-loop system.
TL;DR: This work performs an extensive study of the unified-debugging approach on 16 state-of-the-art program repair systems for the first time and proposes an advanced unified debugging technique, UniDebug++, which can localize over 20% more bugs within Top-1 positions than state- of theart unified debugging technique, ProFL.
Abstract: Automated debugging techniques, including fault localization and program repair, have been studied for over a decade. However, the only existing connection between fault localization and program repair is that fault localization computes the potential buggy elements for program repair to patch. Recently, a pioneering work, ProFL, explored the idea of unified debugging to unify fault localization and program repair in the other direction for the first time to boost both areas. More specifically, ProFL utilizes the patch execution results from one state-of-the-art repair system, PraPR, to help improve state-of-the-art fault localization. In this way, ProFL not only improves fault localization for manual repair, but also extends the application scope of automated repair to all possible bugs (not only the small ratio of bugs that can be automaticallyfi xed). However, ProFL only considers one APR system (i.e., PraPR), and it is not clear how other existing APR systems based on different designs contribute to unified debugging. In this work, we perform an extensive study of the unified-debugging approach on 16 state-of-the-art program repair systems for the first time. Our experimental results on the widely studied Defects4J benchmark suite reveal various practical guidelines for unified debugging, such as (1) nearly all the studied 16 repair systems can positively contribute to unified debugging despite their varying repairing capabilities, (2) repair systems targeting multi-edit patches can bring extraneous noise into unified debugging, (3) repair systems with more executed/plausible patches tend to perform better for unified debugging, and (4) unified debugging effectiveness does not rely on the availability of correct patches in automated repair. Based on our results, we further propose an advanced unified debugging technique, UniDebug++, which can localize over 20% more bugs within Top-1 positions than state-of-the-art unified debugging technique, ProFL.
TL;DR: SystemDS is introduced, an open source ML system for the end-to-end data science lifecycle from data integration, cleaning, and preparation, over local, distributed, and federated ML model training, to debugging and serving, and preliminary results that show the potential of end- to-end lifecycle optimization.
Abstract: Machine learning (ML) applications become increasingly common in many domains. ML systems to execute these workloads include numerical computing frameworks and libraries, ML algorithm libraries, and specialized systems for deep neural networks and distributed ML. These systems focus primarily on efficient model training and scoring. However, the data science process is exploratory, and deals with underspecified objectives and a wide variety of heterogeneous data sources. Therefore, additional tools are employed for data engineering and debugging, which requires boundary crossing, unnecessary manual effort, and lacks optimization across the lifecycle. In this paper, we introduce SystemDS, an open source ML system for the end-to-end data science lifecycle from data integration, cleaning, and preparation, over local, distributed, and federated ML model training, to debugging and serving. To this end, we aim to provide a stack of declarative language abstractions for the different lifecycle tasks, and users with different expertise. We describe the overall system architecture, explain major design decisions (motivated by lessons learned from Apache SystemML), and discuss key features and research directions. Finally, we provide preliminary results that show the potential of end-to-end lifecycle optimization.
TL;DR: The effects of presenting novices with compiler error messages designed using the most recent collection of published guidelines are explored, resulting in significantly shorter debugging times and higher self-reported scores of message usefulness for students in the very early stages of learning a new language.
Abstract: It is well known that programming error messages can be notoriously difficult for novices to understand, hampering progress and leading to frustration. In response, researchers have explored various approaches for enhancing such messages, yet results from this active strand of research are currently mixed. Direct comparisons of results between studies is challenging as these typically investigate different kinds of message enhancements and report results using different metrics. In addition, many prior studies have involved code writing tasks. In such cases, not all students encounter the same errors and messages, and it is difficult to isolate the time spent interpreting messages and resolving errors from the time spent writing code. In this research, we explore the effects of presenting novices with compiler error messages designed using the most recent collection of published guidelines - specifically, more easily readable, short, positive messages containing resolution hints. To accurately determine the time and effort required to read and respond to the messages, we utilise a debugging task where all students are presented the same code and therefore encounter the same errors. We present results of a randomised controlled experiment (n > 700) which shows that, compared to standard error messages, the messages we tested resulted in significantly shorter debugging times and higher self-reported scores of message usefulness for students in the very early stages of learning a new language.
TL;DR: A tool to automate the specification synthesis task by translating natural language comments into formal program specifications guided by specification syntax and the context of the target method, substantially outperforming the state-of-the-art.
Abstract: Formal program specifications are essential for various software engineering tasks, such as program verification, program synthesis, code debugging and software testing. However, manually inferring formal program specifications is not only time-consuming but also error-prone. In addition, it requires substantial expertise. Natural language comments contain rich semantics about behaviors of code, making it feasible to infer program specifications from comments. Inspired by this, we develop a tool, named C2S, to automate the specification synthesis task by translating natural language comments into formal program specifications. Our approach firstly constructs alignments between natural language word and specification tokens from existing comments and their corresponding specifications. Then for a given method comment, our approach assembles tokens that are associated with words in the comment from the alignments into specifications guided by specification syntax and the context of the target method. Our tool successfully synthesizes 1,145 specifications for 511 methods of 64 classes in 5 different projects, substantially outperforming the state-of-the-art. The generated specifications are also used to improve a number of software engineering tasks like static taint analysis, which demonstrates the high quality of the specifications.
TL;DR: A unified approach to rigorous design of safety-critical autonomous systems using the VerifAI toolkit for formal analysis of AI-based systems and the results of falsification and debugging are used to retrain the network, eliminating several failure cases and improving the overall performance of the closed-loop system.
Abstract: We demonstrate a unified approach to rigorous design of safety-critical autonomous systems using the VerifAI toolkit for formal analysis of AI-based systems. VerifAI provides an integrated toolchain for tasks spanning the design process, including modeling, falsification, debugging, and ML component retraining. We evaluate all of these applications in an industrial case study on an experimental autonomous aircraft taxiing system developed by Boeing, which uses a neural network to track the centerline of a runway. We define runway scenarios using the Scenic probabilistic programming language, and use them to drive tests in the X-Plane flight simulator. We first perform falsification, automatically finding environment conditions causing the system to violate its specification by deviating significantly from the centerline (or even leaving the runway entirely). Next, we use counterexample analysis to identify distinct failure cases, and confirm their root causes with specialized testing. Finally, we use the results of falsification and debugging to retrain the network, eliminating several failure cases and improving the overall performance of the closed-loop system.
TL;DR: A significant redesign of the core R2U2 algorithms is presented to adapt to severe resource and certification constraints and prove their correctness, time complexity, and space requirements and is deployed on Robonaut2, demonstrating successful real-time fault disambiguation and mitigation triggering of R2's knee-joint faults without false positives.
Abstract: Robonaut2 (R2) is a humanoid robot onboard the International Space Station (ISS), performing specialized tasks in collaboration with astronauts. After deployment, R2 developed an unexpected emergent behavior. R2’s inability to distinguish between knee-joint faults (e.g., due to sensor drift versus violated environmental assumptions) began triggering safety-preserving freezes-in-place in the confined space of the ISS, preventing further motion until a ground-control operator determines the root-cause and initiates proper corrective action. Runtime verification (RV) algorithms can efficiently disambiguate the temporal signatures of different faults in real-time. However, no previous RV engine can operate within the limited available resources and specialized platform constraints of R2’s hardware architecture. An attempt to deploy the only runtime verification engine designed for embedded flight systems, R2U2, failed due to resource constraints. We present a significant redesign of the core R2U2 algorithms to adapt to severe resource and certification constraints and prove their correctness. We further define an optimization enabled by our new algorithms and implement the new version of R2U2. We encode specifications describing real-life faults occurring onboard Robonaut2 using Mission-time Linear Temporal Logic (MLTL) and detail our process of specification debugging, validation, and refinement. We deployed this new version of R2U2 on Robonaut2, demonstrating successful real-time fault disambiguation and mitigation triggering of R2’s knee-joint faults without false positives.
TL;DR: This paper rigorously explores the effectiveness of four well-known repair techniques, GenProg, Par, SimFix, and TrpAutoRepair, on defects made by the projects’ developers during their regular development process, and develops JaRFly, an open-source framework for implementing techniques for automatic search-based improvement of Java programs.
Abstract: Automated program repair is a promising approach to reducing the costs of manual debugging and increasing software quality. However, recent studies have shown that automated program repair techniques can be prone to producing patches of low quality, overfitting to the set of tests provided to the repair technique, and failing to generalize to the intended specification. This paper rigorously explores this phenomenon on real-world Java programs, analyzing the effectiveness of four well-known repair techniques, GenProg, Par, SimFix, and TrpAutoRepair, on defects made by the projects' developers during their regular development process. We find that: (1) When applied to real-world Java code, automated program repair techniques produce patches for between 10.6% and 19.0% of the defects, which is less frequent than when applied to C code. (2) The produced patches often overfit to the provided test suite, with only between 13.8% and 46.1% of the patches passing an independent set of tests. (3) Test suite size has an extremely small but significant effect on the quality of the patches, with larger test suites producing higher-quality patches, though, surprisingly, higher-coverage test suites correlate with lower-quality patches. (4) The number of tests that a buggy program fails has a small but statistically significant positive effect on the quality of the produced patches. (5) Test suite provenance, whether the test suite is written by a human or automatically generated, has a significant effect on the quality of the patches, with developer-written tests typically producing higher-quality patches. And (6) the patches exhibit insufficient diversity to improve quality through some method of combining multiple patches. We develop JaRFly, an open-source framework for implementing techniques for automatic search-based improvement of Java programs. Our study uses JaRFly to faithfully reimplement GenProg and TrpAutoRepair to work on Java code, and makes the first public release of an implementation of Par. Unlike prior work, our study carefully controls for confounding factors and produces a methodology, as well as a dataset of automatically-generated test suites, for objectively evaluating the quality of Java repair techniques on real-world defects.
TL;DR: This paper provides a review of contemporary methodologies and APIs for parallel programming, with representative technologies selected in terms of target system type, communication patterns, and programming abstraction level to identify trends in high-performance computing and of the challenges to be addressed in the near future.
Abstract: This paper provides a review of contemporary methodologies and APIs for parallel programming, with representative technologies selected in terms of target system type (shared memory, distributed, and hybrid), communication patterns (one-sided and two-sided), and programming abstraction level. We analyze representatives in terms of many aspects including programming model, languages, supported platforms, license, optimization goals, ease of programming, debugging, deployment, portability, level of parallelism, constructs enabling parallelism and synchronization, features introduced in recent versions indicating trends, support for hybridity in parallel execution, and disadvantages. Such detailed analysis has led us to the identification of trends in high-performance computing and of the challenges to be addressed in the near future. It can help to shape future versions of programming standards, select technologies best matching programmers’ needs, and avoid potential difficulties while using high-performance computing systems.
TL;DR: JCrashPack is proposed, an extensible benchmark for Java crash reproduction, together with ExRunner, a tool to simply and systematically run evaluations, on which an extensive evaluation of EvoCrash, the state-of-the-art tool for search-based crash reproduction is run.
Abstract: Crash reproduction approaches help developers during debugging by generating a test case that reproduces a given crash. Several solutions have been proposed to automate this task. However, the proposed solutions have been evaluated on a limited number of projects, making comparison difficult. In this paper, we enhance this line of research by proposing JCrashPack, an extensible benchmark for Java crash reproduction, together with ExRunner, a tool to simply and systematically run evaluations. JCrashPack contains 200 stack traces from various Java projects, including industrial open source ones, on which we run an extensive evaluation of EvoCrash, the state-of-the-art tool for search-based crash reproduction. EvoCrash successfully reproduced 43% of the crashes. Furthermore, we observed that reproducing NullPointerException, IllegalArgumentException, and IllegalStateException is relatively easier than reproducing ClassCastException, ArrayIndexOutOfBoundsException and StringIndexOutOfBoundsException. Our results include a detailed manual analysis of EvoCrash outputs, from which we derive 14 current challenges for crash reproduction, among which the generation of input data and the handling of abstract and anonymous classes are the most frequents. Finally, based on those challenges, we discuss future research directions for search-based crash reproduction for Java.
TL;DR: This paper synthesizes input formulas that are satisfiable or unsatisfiable by construction and automatically applies satisfiability-preserving transformations to generate increasingly-complex formulas, which allows to detect many errors with simple inputs and, thus, facilitates debugging.
Abstract: SMT solvers are at the basis of many applications, such as program verification, program synthesis, and test case generation. For all these applications to provide reliable results, SMT solvers must answer queries correctly. However, since they are complex, highly-optimized software systems, ensuring their correctness is challenging. In particular, state-of-the-art testing techniques do not reliably detect when an SMT solver is unsound. In this paper, we present an automatic approach for generating test cases that reveal soundness errors in the implementations of string solvers, as well as potential completeness and performance issues. We synthesize input formulas that are satisfiable or unsatisfiable by construction and use this ground truth as test oracle. We automatically apply satisfiability-preserving transformations to generate increasingly-complex formulas, which allows us to detect many errors with simple inputs and, thus, facilitates debugging. The experimental evaluation shows that our technique effectively reveals bugs in the implementation of widely-used SMT solvers and applies also to other types of solvers, such as automata-based solvers. We focus on strings here, but our approach carries over to other theories and their combinations.
TL;DR: This work enhances Node-RED with new features that improve the user's system development, debugging, and understanding tasks, and concludes that such enhancements reduce the development time and the number of failed attempts to deploy the system.
Abstract: The continuous spreading of the Internet-of-Things across application domains, aided by the continuous growth on the number of devices and systems that are Internet-connected, created both a rise in the complexity of these systems and made noticeable a lack of human resources with the expertise to design, develop and maintain them. Recent works try to mitigate these issues by creating solutions that abstract the complexity of the systems, such as using visual programming languages. Node-RED, as one of the most common solutions for the visual development IoT systems, stills has several limitations, such as the lack of observability and inadequate debugging mechanisms. In this work, we address some of these limitations by enhancing Node-RED with new features that improve the user's system development, debugging, and understanding tasks. We proceed to empirically evaluate the impact of these enhancements, concluding that, overall, such enhancements reduce the development time and the number of failed attempts to deploy the system.