TL;DR: In this article, the authors explore the use of Chat GPT in solving programming bugs and highlight its potential as one part of a comprehensive debugging toolkit, combining its strengths with those of other debugging tools to identify and fix bugs.
Abstract: This research paper explores the use of Chat GPT in solving programming bugs. The paper examines the characteristics of Chat GPT and how they can be leveraged to provide debugging assistance, bug prediction, and bug explanation to help solve programming problems. The paper also explores the limitations of Chat GPT in solving programming bugs and the importance of using other debugging tools and techniques to validate its predictions and explanations. The paper concludes by highlighting the potential of Chat GPT as one part of a comprehensive debugging toolkit, and the benefits of combining its strengths with the strengths of other debugging tools to identify and fix bugs more effectively.
TL;DR: This article presents a parallel platform solution for high-precision machining equipment based on the Stewart six-degrees-of-freedom parallel platform, motivated by the lack of a common physical platform for testing the effectiveness of a variety of control algorithms.
Abstract: With the rapid development of the manufacturing industry, industrial automation equipment represented by computer numerical control (CNC) machine tools has put forward higher and higher requirements for the machining accuracy of parts. Compared with the multi-axis serial platform solution, the parallel platform solution is theoretically more suitable for high-precision machining equipment. There are many parallel platform solutions, but not one can provide a common physical platform to test the effectiveness of a variety of control algorithms. To achieve the goals, this paper is based on the Stewart six degrees of freedom parallel platform, and it mainly studies the platform construction. This study completed the mechanical structure design of the parallel platform. Based on the microprogrammed control unit (MCU) + pre-driver chip + three-phase full bridge solution, we have completed the circuit design of the motor driver. We wrote the program of MCU to drive six parallel robotic arms as well as the program of the parallel platform control center on the PC, and we completed the system joint debugging. The closed-loop control effect of the parallel platform workspace pose is realized.
TL;DR: Self-Debugging teaches a large language model to debug its predicted program via few-shot demonstrations; even without any feedback on code correctness or error messages, the model can identify its mistakes by explaining the generated code in natural language.
Abstract: Large language models (LLMs) have achieved impressive performance on code generation. However, for complex programming tasks, generating the correct solution in one go becomes challenging, thus some prior works have designed program repair approaches to improve code generation performance. In this work, we propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations. In particular, we demonstrate that Self-Debugging can teach the large language model to perform rubber duck debugging; i.e., without any feedback on the code correctness or error messages, the model is able to identify its mistakes by explaining the generated code in natural language. Self-Debugging achieves the state-of-the-art performance on several code generation benchmarks, including the Spider dataset for text-to-SQL generation, TransCoder for C++-to-Python translation, and MBPP for text-to-Python generation. On the Spider benchmark where there are no unit tests to verify the correctness of predictions, Self-Debugging with code explanation consistently improves the baseline by 2-3%, and improves the prediction accuracy on problems of the hardest label by 9%. On TransCoder and MBPP where unit tests are available, Self-Debugging improves the baseline accuracy by up to 12%. Meanwhile, by leveraging feedback messages and reusing failed predictions, Self-Debugging notably improves sample efficiency, and can match or outperform baseline models that generate more than 10x candidate programs.
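The feedback loop described in this abstract (generate, run unit tests, feed the failure message back, regenerate) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: `Model` is any callable that maps a prompt to source code, `toy_model` is a hypothetical stand-in for an LLM call, and the prompt wording is an assumption.

```python
from typing import Callable, List, Optional, Tuple

# Assumption: a "model" is any callable from prompt text to program source.
Model = Callable[[str], str]


def run_unit_tests(source: str, tests: List[Tuple[tuple, object]], func_name: str) -> Optional[str]:
    """Run `source` and return a feedback message for the first failure, or None."""
    namespace: dict = {}
    try:
        exec(source, namespace)
        func = namespace[func_name]
        for args, expected in tests:
            actual = func(*args)
            if actual != expected:
                return f"{func_name}{args} returned {actual!r}, expected {expected!r}"
    except Exception as exc:  # syntax errors, missing function, runtime errors
        return f"{type(exc).__name__}: {exc}"
    return None


def self_debug(model: Model, task: str, tests: List[Tuple[tuple, object]],
               func_name: str, max_turns: int = 3) -> Tuple[str, bool]:
    """Generate a program, then repeatedly repair it using unit-test feedback."""
    source = model(task)
    for _ in range(max_turns):
        feedback = run_unit_tests(source, tests, func_name)
        if feedback is None:
            return source, True
        # Hypothetical prompt layout: task, failed attempt, and feedback message.
        prompt = (f"{task}\n\nPrevious attempt:\n{source}\n"
                  f"Feedback:\n{feedback}\n\nExplain the code, then fix it.")
        source = model(prompt)
    return source, run_unit_tests(source, tests, func_name) is None


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: wrong at first, right after feedback."""
    if "Feedback:" in prompt:
        return "def add(a, b):\n    return a + b\n"
    return "def add(a, b):\n    return a - b\n"
```

Calling `self_debug(toy_model, "Write add(a, b).", [((1, 2), 3)], "add")` repairs the toy program after one round of feedback. The setting without unit tests (explanation only) would replace `run_unit_tests` with a model-generated explanation, which this sketch does not attempt.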
TL;DR: This work highlights, with worked examples, some advantages and limitations of using generative artificial intelligence for scientific coding and argues that if you are willing to debug, you can get a head start on more challenging tasks.
TL;DR: In this paper, the authors conducted a systematic review of 42 empirical studies focused on teaching and assessing computational thinking (CT) in early childhood education (ECE) and proposed a CT curriculum framework for ECE that covers CT concepts (i.e., control flow/structures, representation, and hardware/software), CT practices, and CT perspectives (e.g., expressing and creating, connecting, perseverance, and choices of conduct).
TL;DR: In this article, explanations are hypothesized to improve human understanding of machine learning models and achieve a variety of desirable outcomes, ranging from model debugging to enhancing human decision-making, but empirical studies have found mixed results.
Abstract: Explanations are hypothesized to improve human understanding of machine learning models and achieve a variety of desirable outcomes, ranging from model debugging to enhancing human decision making. However, empirical studies have found mixed and even ...
TL;DR: This paper studies the Synthesize, Execute, Debug (SED) approach, in which a draft of the solution is generated first, followed by a program repair phase addressing the failed tests, and compares replace-focused, repair-focused, and hybrid debug strategies.
Abstract: Current approaches to program synthesis with Large Language Models (LLMs) exhibit a "near miss syndrome": they tend to generate programs that semantically resemble the correct answer (as measured by text similarity metrics or human evaluation), but achieve a low or even zero accuracy as measured by unit tests due to small imperfections, such as the wrong input or output format. This calls for an approach known as Synthesize, Execute, Debug (SED), whereby a draft of the solution is generated first, followed by a program repair phase addressing the failed tests. To effectively apply this approach to instruction-driven LLMs, one needs to determine which prompts perform best as instructions for LLMs, as well as strike a balance between repairing unsuccessful programs and replacing them with newly generated ones. We explore these trade-offs empirically, comparing replace-focused, repair-focused, and hybrid debug strategies, as well as different template-based and model-based prompt-generation techniques. We use OpenAI Codex as the LLM and Program Synthesis Benchmark 2 as a database of problem descriptions and tests for evaluation. The resulting framework outperforms both conventional usage of Codex without the repair phase and traditional genetic programming approaches.
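The "near miss syndrome" can be made concrete with a small sketch: a candidate that is textually almost identical to a reference solution yet fails every unit test because of its output format. The programs, the tests, and the use of `difflib` as the text-similarity metric are all assumptions chosen for illustration and are not taken from the paper's benchmark.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def text_similarity(candidate: str, reference: str) -> float:
    """Character-level similarity in [0, 1] (one possible text metric)."""
    return SequenceMatcher(None, candidate, reference).ratio()


def unit_test_accuracy(source: str, func_name: str,
                       tests: List[Tuple[tuple, object]]) -> float:
    """Fraction of unit tests passed by the program in `source`."""
    namespace: dict = {}
    try:
        exec(source, namespace)
        func = namespace[func_name]
    except Exception:
        return 0.0
    passed = 0
    for args, expected in tests:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing call counts as a failed test
    return passed / len(tests)


# Hypothetical reference solution and a near miss with the wrong output format.
REFERENCE = "def mean(xs):\n    return sum(xs) / len(xs)\n"
NEAR_MISS = "def mean(xs):\n    return str(sum(xs) / len(xs))\n"
TESTS = [(([1, 2, 3],), 2.0), (([4],), 4.0)]
```

Here `text_similarity(NEAR_MISS, REFERENCE)` is above 0.9 while `unit_test_accuracy(NEAR_MISS, "mean", TESTS)` is 0.0, which is the gap a repair phase is meant to close.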
Benjamin Steenhoek, Md. Mahbubur Rahman, Richard Jiles, Wei Le
1 May 2023
TL;DR: An empirical study of deep learning models for vulnerability detection reveals the variability between different runs of a model, low agreement among different models' outputs, and the challenges associated with training deep learning models for vulnerability detection.
Abstract: Deep learning (DL) models of code have recently reported great progress for vulnerability detection. In some cases, DL-based models have outperformed static analysis tools. Although many great models have been proposed, we do not yet have a good understanding of these models. This limits the further advancement of model robustness, debugging, and deployment for the vulnerability detection. In this paper, we surveyed and reproduced 9 state-of-the-art (SOTA) deep learning models on 2 widely used vulnerability detection datasets: Devign and MSR. We investigated 6 research questions in three areas, namely model capabilities, training data, and model interpretation. We experimentally demonstrated the variability between different runs of a model and the low agreement among different models' outputs. We investigated models trained for specific types of vulnerabilities compared to a model that is trained on all the vulnerabilities at once. We explored the types of programs DL may consider “hard” to handle. We investigated the relations of training data sizes and training data composition with model performance. Finally, we studied model interpretations and analyzed important features that the models used to make predictions. We believe that our findings can help better understand model results, provide guidance on preparing training data, and improve the robustness of the models. All of our datasets, code, and results are available at https://doi.org/10.6084/m9.figshare.20791240.
TL;DR: This article proposes CRITIC, a framework that allows LLMs to validate and progressively amend their own outputs in a manner similar to human interaction with tools: starting with an initial output, it interacts with appropriate tools to evaluate the output and then revises it based on the feedback obtained.
Abstract: Recent developments in large language models (LLMs) have been impressive. However, these models sometimes show inconsistencies and problematic behavior, such as hallucinating facts, generating flawed code, or creating offensive and toxic content. Unlike these models, humans typically utilize external tools to cross-check and refine their initial content, like using a search engine for fact-checking, or a code interpreter for debugging. Inspired by this observation, we introduce a framework called CRITIC that allows LLMs, which are essentially "black boxes", to validate and progressively amend their own outputs in a manner similar to human interaction with tools. More specifically, starting with an initial output, CRITIC interacts with appropriate tools to evaluate certain aspects of the text, and then revises the output based on the feedback obtained during this validation process. Comprehensive evaluations involving free-form question answering, mathematical program synthesis, and toxicity reduction demonstrate that CRITIC consistently enhances the performance of LLMs. Meanwhile, our research highlights the crucial importance of external feedback in promoting the ongoing self-improvement of LLMs.
TL;DR: Flakify is a black-box, language model-based predictor for flaky test cases that relies exclusively on the source code of test cases and requires neither access to production code nor pre-defined features.
Abstract: Software testing assures that code changes do not adversely affect existing functionality. However, a test case can be flaky, i.e., passing and failing across executions, even for the same version of the source code. Flaky test cases introduce overhead to software development as they can lead to unnecessary attempts to debug production or testing code. Besides rerunning test cases multiple times, which is time-consuming and computationally expensive, flaky test cases can be predicted using machine learning (ML) models, thus reducing the wasted cost of re-running and debugging these test cases. However, the state-of-the-art ML-based flaky test case predictors rely on pre-defined sets of features that are either project-specific, i.e., inapplicable to other projects, or require access to production code, which is not always available to software test engineers. Moreover, given the non-deterministic behavior of flaky test cases, it can be challenging to determine a complete set of features that could potentially be associated with test flakiness. Therefore, in this article, we propose Flakify, a black-box, language model-based predictor for flaky test cases. Flakify relies exclusively on the source code of test cases, thus not requiring (a) access to production code (black-box), (b) rerunning test cases, or (c) pre-defining features. To this end, we employed CodeBERT, a pre-trained language model, and fine-tuned it to predict flaky test cases using the source code of test cases. We evaluated Flakify on two publicly available datasets (FlakeFlagger and IDoFT) for flaky test cases and compared our technique with the FlakeFlagger approach, the best state-of-the-art ML-based, white-box predictor for flaky test cases, using two different evaluation procedures: (1) cross-validation and (2) per-project validation, i.e., prediction on new projects.
Flakify achieved F1-scores of 79% and 73% on the FlakeFlagger dataset using cross-validation and per-project validation, respectively. Similarly, Flakify achieved F1-scores of 98% and 89% on the IDoFT dataset using the two validation procedures, respectively. Further, Flakify surpassed FlakeFlagger by 10 and 18 percentage points (pp) in terms of precision and recall, respectively, when evaluated on the FlakeFlagger dataset, thus reducing the cost bound to be wasted on unnecessarily debugging test cases and production code by the same percentages (corresponding to reduction rates of 25% and 64%). Flakify also achieved significantly higher prediction results when used to predict test cases on new projects, suggesting better generalizability over FlakeFlagger. Our results further show that a black-box version of FlakeFlagger is not a viable option for predicting flaky test cases.
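For contrast with the prediction-based approach, the rerun baseline mentioned in the abstract can be sketched in a few lines: execute the same test repeatedly on unchanged code and label it flaky if it both passes and fails. This is not Flakify's method (Flakify avoids reruns entirely); the function names and the simulated flaky test are hypothetical.

```python
from typing import Callable


def classify_by_rerun(test: Callable[[], None], runs: int = 20) -> str:
    """Label a test by rerunning it `runs` times on the same code version."""
    outcomes = set()
    for _ in range(runs):
        try:
            test()
            outcomes.add("pass")
        except AssertionError:
            outcomes.add("fail")
    if outcomes == {"pass", "fail"}:
        return "flaky"  # passed and failed across executions
    return "stable-pass" if outcomes == {"pass"} else "stable-fail"


def make_alternating_test() -> Callable[[], None]:
    """Hypothetical flaky test: leaked shared state makes every second run fail."""
    state = {"calls": 0}

    def test() -> None:
        state["calls"] += 1
        if state["calls"] % 2 == 0:
            raise AssertionError("stale state left over from the previous run")

    return test


def always_passes() -> None:
    """A deterministic test that never fails."""


def always_fails() -> None:
    """A deterministic test that fails on every run."""
    raise AssertionError("deterministic failure")
```

The cost of this baseline grows linearly with `runs` for every test in the suite, which is the overhead a source-code-only predictor is meant to avoid.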
TL;DR: This article explores ChatGPT as a tool for debugging software code, covering its capabilities, its advantages and limitations, and best practices for integrating it into the software development workflow.
Abstract: ChatGPT is a cutting-edge language model that has been making waves in the field of natural language processing. However, its capabilities extend far beyond language-based applications. ChatGPT can also be used as a powerful tool for debugging software code. As software applications become increasingly complex, the need for efficient and accurate debugging tools has become more pressing. ChatGPT's ability to analyze and understand code makes it a promising solution to this challenge. Debugging is a critical part of the software development process. Bugs, or errors in code, can have serious consequences for the functionality and security of software applications. Identifying and fixing bugs can be a time-consuming and labor-intensive process, requiring the expertise of experienced developers. ChatGPT has the potential to streamline this process and make it more accessible to a wider range of developers, regardless of their experience level. In this article, we will explore the capabilities of ChatGPT as a debugging tool, the advantages and limitations of using it, and best practices for integrating it into the software development workflow.
Verya Monjezi, Ashutosh Trivedi, Gang Tan, Saeid Tizpaz-Niari
1 May 2023
Abstract: Deep feedforward neural networks (DNNs) are increasingly deployed in socioeconomic critical decision support software systems. DNNs are exceptionally good at finding minimal, sufficient statistical patterns within their training data. Consequently, DNNs may learn to encode decisions (amplifying existing biases or introducing new ones) that may disadvantage protected individuals/groups and may stand to violate legal protections. While the existing search based software testing approaches have been effective in discovering fairness defects, they do not supplement these defects with debugging aids, such as severity and causal explanations, that are crucial to help developers triage and decide on the next course of action. Can we measure the severity of fairness defects in DNNs? Are these defects symptomatic of improper training or do they merely reflect biases present in the training data? To answer such questions, we present Dice: an information-theoretic testing and debugging framework to discover and localize fairness defects in DNNs. The key goal of Dice is to assist software developers in triaging fairness defects by ordering them by their severity. Towards this goal, we quantify fairness in terms of protected information (in bits) used in decision making. A quantitative view of fairness defects not only helps in ordering these defects; our empirical evaluation also shows that it improves the search efficiency due to the resulting smoothness of the search space. Guided by the quantitative fairness, we present a causal debugging framework to localize inadequately trained layers and neurons responsible for fairness defects. Our experiments over ten DNNs, developed for socially critical tasks, show that Dice efficiently characterizes the amounts of discrimination, effectively generates discriminatory instances (vis-a-vis the state-of-the-art techniques), and localizes layers/neurons with significant biases.
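To give a feel for what "protected information (in bits)" could mean, the sketch below computes the empirical mutual information between a protected attribute and a model's decisions. This is only one possible information-theoretic quantity, picked here as an assumption; the abstract does not state which measure Dice uses, and the function name is hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def mutual_information_bits(protected: Sequence[Hashable],
                            decisions: Sequence[Hashable]) -> float:
    """Empirical mutual information I(protected; decision) in bits.

    Assumption: this stands in for "bits of protected information used in
    a decision"; it is not claimed to be the measure defined by Dice.
    """
    if len(protected) != len(decisions) or not protected:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(protected)
    joint = Counter(zip(protected, decisions))
    count_a = Counter(protected)
    count_y = Counter(decisions)
    bits = 0.0
    for (a, y), count in joint.items():
        p_joint = count / n
        p_independent = (count_a[a] / n) * (count_y[y] / n)
        bits += p_joint * math.log2(p_joint / p_independent)
    return bits
```

With a balanced binary attribute, decisions that copy the attribute exactly yield 1 bit, and decisions that are statistically independent of it yield 0 bits, so larger values would rank as more severe under this illustrative measure.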
TL;DR: In this article, the authors explore ChatGPT's capability for DL program repair by asking three research questions: (1) Can ChatGPT debug DL programs effectively? (2) How can ChatGPT's repair performance be improved by prompting? (3) In which way can dialogue help facilitate the repair?
Abstract: ChatGPT has revolutionized many research and industrial fields. ChatGPT has shown great potential in software engineering to boost various traditional tasks such as program repair, code understanding, and code generation. However, whether automatic program repair (APR) applies to deep learning (DL) programs is still unknown. DL programs, whose decision logic is not explicitly encoded in the source code, have posed unique challenges to APR. To repair DL programs, an APR approach needs not only to parse the source code syntactically but also to understand the code intention. With the best prior work, the performance of fault localization is still far less than satisfactory (only about 30%). Therefore, in this paper, we explore ChatGPT's capability for DL program repair by asking three research questions. (1) Can ChatGPT debug DL programs effectively? (2) How can ChatGPT's repair performance be improved by prompting? (3) In which way can dialogue help facilitate the repair? On top of that, we categorize the common aspects useful for prompt design for DL program repair. Also, we propose various prompt templates to facilitate the performance and summarize the advantages and disadvantages of ChatGPT's abilities such as detecting bad code smell, code refactoring, and detecting API misuse/deprecation.
TL;DR: Lucid is a non-intrusive deep learning workload scheduler based on interpretable models. It combines a two-dimensional optimized profiler for efficient job metric collection and timely debugging feedback, an indolent packing strategy to circumvent interference, and resource orchestration based on estimated job priority values and sharing scores.
Abstract: While recent deep learning workload schedulers exhibit excellent performance, it is arduous to deploy them in practice due to some substantial defects, including inflexible intrusive manner, exorbitant integration and maintenance cost, limited scalability, as well as opaque decision processes. Motivated by these issues, we design and implement Lucid, a non-intrusive deep learning workload scheduler based on interpretable models. It consists of three innovative modules. First, a two-dimensional optimized profiler is introduced for efficient job metric collection and timely debugging job feedback. Second, Lucid utilizes an indolent packing strategy to circumvent interference. Third, Lucid orchestrates resources based on estimated job priority values and sharing scores to achieve efficient scheduling. Additionally, Lucid promotes model performance maintenance and system transparent adjustment via a well-designed system optimizer. Our evaluation shows that Lucid reduces the average job completion time by up to 1.3× compared with state-of-the-art preemptive scheduler Tiresias. Furthermore, it provides explicit system interpretations and excellent scalability for practical deployment.
Weishi Wang, Yue Wang, Shafiq Joty, Steven C. H. Hoi
30 Nov 2023
TL;DR: RAP-Gen leverages retrieval-augmented learning to generate patches using CodeT5, improving the performance of APR models by incorporating diverse code contexts.
Abstract: Automatic program repair (APR) is crucial to reduce manual debugging efforts for developers and improve software reliability. While conventional search-based techniques typically rely on heuristic rules or a redundancy assumption to mine fix patterns, recent years have witnessed the surge of deep learning (DL) based approaches to automate the program repair process in a data-driven manner. However, their performance is often limited by a fixed set of parameters to model the highly complex search space of APR.
TL;DR: In this article, the authors examine the effects of computer supported project-based learning (CSPBL) on students' computational thinking (CT) and learning engagement by comparing students' CT and engagement in two courses taught by the same instructor (one taught with a traditional method, the other with CSPBL).
TL;DR: In this paper, the authors investigated the debugging process of 526 preschool children (aged 4-6) when programming a tangible robot; the main finding concerns the construction and development of syntactic and semantic knowledge.
TL;DR: This research paper explores advanced debugging techniques for 5G multi-processor communication, including distributed tracing, time-travel debugging, AI-assisted anomaly detection, and hardware-assisted methods, to address complexity and optimize system performance and security.
Abstract: This comprehensive research paper explores cutting-edge debugging techniques for multi-processor communication in 5G systems. As 5G networks continue to evolve and expand, the complexity of multi-processor communication introduces unique challenges in system debugging and optimization. This study examines various advanced debugging methodologies, including distributed tracing, time-travel debugging, AI-assisted anomaly detection, and hardware-assisted techniques. The research also delves into real-time debugging protocols, security considerations, and performance analysis of these debugging solutions. By synthesizing current literature and industry practices, this paper provides valuable insights into the state-of-the-art debugging approaches for 5G systems and outlines future research directions in this critical field.
TL;DR: GitHub Actions is a tool for automating workflows on GitHub repositories, with thousands of Actions available on the GitHub Marketplace; this survey study of 90 Action users and developers investigates the previously unknown motivations and best practices for using, developing, and debugging Actions.
Abstract: GitHub Actions is a powerful tool for automating workflows on GitHub repositories, with thousands of Actions currently available on the GitHub Marketplace. So far, the research community has conducted mining studies on Actions, with much of the focus on CI/CD. However, the motivation and best practices of developers for using, developing, and debugging Actions are unknown. To address this gap, we conducted a survey study with 90 Action users and developers. Our findings indicate that developers prefer Actions with verified creators and more stars when choosing between similar Actions, and often switch to alternative Actions when faced with bugs or a lack of documentation. We also found that developers find the composition of YAML files, which are essential for Action integration, challenging and error-prone. They primarily rely on Q&A forums to fix issues with these YAML files. Finally, we observed that developers would not likely adopt Actions when there are concerns around complexity and security risks. Our study summarizes developers’ perceptions, decision-making process, and challenges in using, developing, and debugging Actions. We provide recommendations for improving the visibility, re-usability, documentation, and support surrounding GitHub Actions.
TL;DR: Gamma revisits template-based APR by leveraging large pre-trained language models for donor code generation, improving patch quality and addressing dataset overfitting issues.
Abstract: Automated program repair (APR) aims to fix software bugs without manual debugging efforts and plays a crucial role in software development and maintenance. Template-based APR has been widely investigated and shown promising results. However, it is challenging for template-based APR to select the appropriate donor code, which is an important repair ingredient for generating candidate patches. Inappropriate donor code may cause plausible but incorrect patch generation even with correct fix patterns, limiting the repair performance. In this paper, we aim to revisit template-based APR, and propose Gamma, to directly leverage large pre-trained language models for donor code generation. Our main insight is that instead of retrieving donor code in the local buggy file, we can directly predict the correct code tokens based on the context code snippets and repair patterns by a cloze task. Specifically, (1) Gamma revises a variety of fix templates from state-of-the-art template-based APR techniques (i.e., TBar) and transforms them into mask patterns. (2) Gamma adopts a pre-trained language model to predict the correct code for masked code as a fill-in-the-blank task. Although our idea is general and can be built on various existing pre-trained language models, we have implemented Gamma as a practical APR tool based on the recent UniXcoder model. The experimental results demonstrate that Gamma correctly repairs 82 bugs on Defects4J-v1.2, which achieves 20.59% (14 bugs) and 26.15% (17 bugs) improvement over the previous state-of-the-art template-based approach TBar and learning-based one Recoder. Furthermore, Gamma repairs 45 bugs and 22 bugs from the additional Defects4J-v2.0 and QuixBugs, indicating the generalizability of Gamma in addressing the dataset overfitting issue. 
We also prove that adopting other pre-trained language models can provide substantial advancement, e.g., CodeBERT-based and ChatGPT-based Gamma is able to fix 80 and 67 bugs on Defects4J-v1.2, indicating the scalability of Gamma. Overall, our study highlights the promising future of adopting pre-trained models to generate correct patches on top of fix patterns in practice.
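The idea of turning a fix template into a mask pattern and filling it as a cloze task can be sketched as follows. The single template shown (masking the condition of an `if` statement), the `<mask>` token, and the `fill_mask` callable are all assumptions for illustration; they are not Gamma's actual templates or model interface.

```python
import re
from typing import Callable, List

MASK = "<mask>"  # assumed mask token; real models define their own


def mask_condition(line: str) -> str:
    """Replace the condition of a single-line `if (...)` with one mask token."""
    match = re.match(r"^(\s*if\s*\()(.*)(\)\s*\{?\s*)$", line)
    if match is None:
        raise ValueError("line is not a single-line if statement")
    return f"{match.group(1)}{MASK}{match.group(3)}"


def candidate_patches(line: str, fill_mask: Callable[[str], List[str]]) -> List[str]:
    """Build candidate patches by filling the masked line with predictions.

    `fill_mask` is a hypothetical stand-in for a pre-trained language model
    that returns ranked replacements for the mask token.
    """
    masked = mask_condition(line)
    return [masked.replace(MASK, prediction) for prediction in fill_mask(masked)]
```

For example, `mask_condition("    if (index <= size) {")` gives `"    if (<mask>) {"`, and a stub predictor returning `["index < size"]` produces the candidate patch `"    if (index < size) {"`. In a full pipeline each candidate would still have to be validated against the test suite.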
TL;DR: Angler is an interactive visual analytics tool that helps practitioners prioritize improvements to machine translation models; in a user study, participants formed more interesting and user-focused hypotheses for prioritization by analyzing quantitative summary statistics and qualitatively assessing data by reading sentences.
Abstract: Machine learning (ML) models can fail in unexpected ways in the real world, but not all model failures are equal. With finite time and resources, ML practitioners are forced to prioritize their model debugging and improvement efforts. Through interviews with 13 ML practitioners at Apple, we found that practitioners construct small targeted test sets to estimate an error’s nature, scope, and impact on users. We built on this insight in a case study with machine translation models, and developed Angler, an interactive visual analytics tool to help practitioners prioritize model improvements. In a user study with 7 machine translation experts, we used Angler to understand prioritization practices when the input space is infinite, and obtaining reliable signals of model quality is expensive. Our study revealed that participants could form more interesting and user-focused hypotheses for prioritization by analyzing quantitative summary statistics and qualitatively assessing data by reading sentences.
TL;DR: In this article, the authors examine how a middle school science teacher, new to programming, supports students in learning to debug physical computing systems consisting of programmable sensors and data displays.
Abstract:
Purpose
The purpose of this paper is to examine how a middle school science teacher, new to programming, supports students in learning to debug physical computing systems consisting of programmable sensors and data displays.
Design/methodology/approach
This case study draws on data collected during an inquiry-oriented instructional unit in which students learn to collect, display and interpret data from their surrounding environment by wiring and programming a physical computing system. Using interaction analysis, the authors analyzed video recordings of one teacher’s (Gabrielle) pedagogical moves as she supported students in debugging their systems as they drew upon a variety of embodied, material and social resources.
Findings
This study presents Gabrielle’s debugging interactional grammar, highlighting the pedagogical possibilities for supporting students in systematic ways, providing affective support (e.g. showing them care and encouragement) and positioning herself as a learner with the students. Gabrielle’s practice, and therefore her pedagogy, has the potential to support students in becoming better debuggers on their own in the future.
Originality/value
While much of the prior work on learning to debug focuses on learner actions and possible errors, this case focuses on an educator’s debugging pedagogy centered on the educator debugging with the learners. This case study illustrates the need for educators to exhibit deft facilitation, vulnerability and orchestration skills to support student development of their own process for and agency in debugging.
TL;DR: Somnus is a pipeline that provides an overview of the creation and evolution of data tables using a provenance graph, allows detailed investigation of individual transformations, and draws on a collection of 23 glyphs that visualize the semantics of transformations.
Abstract: Data workers use various scripting languages for data transformation, such as SAS, R, and Python. However, understanding intricate code pieces requires advanced programming skills, which hinders data workers from grasping the idea of data transformation at ease. Program visualization is beneficial for debugging and education and has the potential to illustrate transformations intuitively and interactively. In this paper, we explore visualization design for demonstrating the semantics of code pieces in the context of data transformation. First, to depict individual data transformations, we structure a design space by two primary dimensions, i.e., key parameters to encode and possible visual channels to be mapped. Then, we derive a collection of 23 glyphs that visualize the semantics of transformations. Next, we design a pipeline, named Somnus, that provides an overview of the creation and evolution of data tables using a provenance graph. At the same time, it allows detailed investigation of individual transformations. User feedback on Somnus is positive. Our study participants achieved better accuracy with less time using Somnus, and preferred it over carefully-crafted textual description. Further, we provide two example applications to demonstrate the utility and versatility of Somnus.
TL;DR: The Differential Unit Tests Based Smart Industrial Automation Software Debugging Tool combines differential unit tests, a debugging engine, and a graphical user interface (GUI) to help developers quickly locate and fix bugs in their software.
Abstract: The Differential Unit Tests Based Smart Industrial Automation Software Debugging Tool is a new and powerful tool designed to make debugging industrial automation software easier and faster. It uses a combination of differential unit tests, a debugging engine, and a graphical user interface (GUI) to help developers quickly locate and fix bugs in their software. Differential unit tests are a type of test that compares the differences between two versions of the same software. They can help identify any subtle changes that may have occurred since the last version was tested. This is especially useful when debugging industrial automation software, which often contains complex logic and calculations that can be difficult to trace. The debugging engine is the core of the tool and is responsible for analyzing the results of the differential unit tests. It uses a variety of algorithms to identify any potential errors or discrepancies. The debugging engine is also responsible for providing feedback to the developers so they can quickly locate and fix any issues. The GUI is the user interface of the tool and provides developers with an easy way to view and interact with the results of the differential tests. It allows developers to quickly identify any problems in their software and make the necessary changes.
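A differential unit test, as described in this abstract, runs two versions of the same routine on identical inputs and reports where they disagree. The sketch below shows the idea under stated assumptions: `scale_v1` and `scale_v2` are hypothetical versions of a sensor-scaling routine, and the harness is not the tool's actual debugging engine.

```python
from typing import Any, Callable, Iterable, List, Tuple


def _outcome(func: Callable[[Any], Any], value: Any) -> Tuple[str, Any]:
    """Capture either the return value or the exception type of one call."""
    try:
        return ("ok", func(value))
    except Exception as exc:
        return ("error", type(exc).__name__)


def differential_test(old: Callable[[Any], Any], new: Callable[[Any], Any],
                      inputs: Iterable[Any]) -> List[Tuple[Any, Any, Any]]:
    """Return (input, old outcome, new outcome) for every input that differs."""
    differences = []
    for value in inputs:
        old_result = _outcome(old, value)
        new_result = _outcome(new, value)
        if old_result != new_result:
            differences.append((value, old_result, new_result))
    return differences


def scale_v1(raw: int) -> int:
    """Hypothetical old version: integer scaling by 2.5 with truncation."""
    return raw * 10 // 4


def scale_v2(raw: int) -> int:
    """Hypothetical new version: float scaling by 2.5 with rounding."""
    return round(raw * 2.5)
```

Running `differential_test(scale_v1, scale_v2, range(5))` flags only the input 3, where truncation yields 7 and rounding yields 8, so the harness narrows the behavioural change down to a specific input for the developer to inspect.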
TL;DR: In this paper, the authors report a case study in which they used a think-aloud protocol to gain insight into the behaviour of three students engaged in debugging tasks and observe that comprehension, evidence-based activities, and workflow practices all contribute to novice debugging success.
Abstract: Debugging is a core skill required by programmers, yet we know little about how to effectively teach the process of debugging. The challenges of learning debugging are compounded for novices who lack experience and are still learning the tools they need to program effectively. In this work, we report a case study in which we used a think-aloud protocol to gain insight into the behaviour of three students engaged in debugging tasks. Our qualitative analysis reveals a variety of helpful practices and barriers that limit the effectiveness of debugging. We observe that comprehension, evidence-based activities, and workflow practices all contribute to novice debugging success. Lack of sustained effort, precision, and methodical processes negatively impact debugging effectiveness. We anticipate that understanding how students engage in debugging tasks will aid future work to address ineffective behaviours and promote effective debugging activities.
TL;DR: In this article, the authors propose an automated scientific debugging (AutoSD) technique that, given buggy code and a bug-revealing test, prompts large language models to automatically generate hypotheses, uses debuggers to actively interact with buggy code, and thus automatically reaches conclusions prior to patch generation.
Abstract: Automated debugging techniques have the potential to reduce developer effort in debugging, and have matured enough to be adopted by industry. However, one critical issue with existing techniques is that, while developers want rationales for the provided automatic debugging results, existing techniques are ill-suited to provide them, as their deduction process differs significantly from that of human developers. Inspired by the way developers interact with code when debugging, we propose Automated Scientific Debugging (AutoSD), a technique that given buggy code and a bug-revealing test, prompts large language models to automatically generate hypotheses, uses debuggers to actively interact with buggy code, and thus automatically reach conclusions prior to patch generation. By aligning the reasoning of automated debugging more closely with that of human developers, we aim to produce intelligible explanations of how a specific patch has been generated, with the hope that the explanation will lead to more efficient and accurate developer decisions. Our empirical analysis on three program repair benchmarks shows that AutoSD performs competitively with other program repair baselines, and that it can indicate when it is confident in its results. Furthermore, we perform a human study with 20 participants, including six professional developers, to evaluate the utility of explanations from AutoSD. Participants with access to explanations could judge patch correctness in roughly the same time as those without, but their accuracy improved for five out of six real-world bugs studied: 70% of participants answered that they wanted explanations when using repair tools, while 55% answered that they were satisfied with the Scientific Debugging presentation.
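The hypothesize, experiment, observe, conclude cycle that scientific debugging follows can be sketched in a few lines. This is only an illustration of the loop, not AutoSD itself: the hypotheses are hand-written rather than generated by a language model, the "experiments" are plain Python checks rather than debugger commands, and `buggy_mean`, `Hypothesis`, and `run_scientific_debugging` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


def buggy_mean(values: List[float]) -> float:
    # Deliberate bug for the example: divides by a fixed count, not len(values).
    return sum(values) / 3


@dataclass
class Hypothesis:
    statement: str
    # The experiment returns True when the observation supports the hypothesis.
    experiment: Callable[[], bool]


def run_scientific_debugging(hypotheses: List[Hypothesis]) -> List[Tuple[str, str]]:
    """Test hypotheses in order and stop at the first one that is supported."""
    log = []
    for h in hypotheses:
        supported = h.experiment()
        log.append((h.statement, "supported" if supported else "rejected"))
        if supported:
            break
    return log


hypotheses = [
    Hypothesis(
        "The sum is wrong for negative inputs",
        lambda: sum([-1, -2, -3]) != -6,
    ),
    Hypothesis(
        "The divisor ignores the list length",
        lambda: buggy_mean([2, 2]) != 2,
    ),
]

log = run_scientific_debugging(hypotheses)
for statement, verdict in log:
    print(f"{verdict}: {statement}")
```

The resulting log doubles as a rationale: it records which explanations were ruled out before the supported one was reached, which is the sort of trace the abstract argues developers want alongside a patch.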
TL;DR: In this article, an in-kernel tracer is proposed to enable the measurement of the Operating System Noise as observed by a workload, and the tracing of the sources of the noise, in an integrated manner, facilitating the analysis and debugging of the system.
Abstract: As modern network infrastructure moves from hardware-based to software-based using Network Function Virtualization, a new set of requirements is raised for operating system developers. By using the real-time kernel options and advanced CPU isolation features common to the HPC use-cases, Linux is becoming a central building block for this new architecture that aims to enable a new set of low latency networked services. Tuning Linux for these applications is not an easy task, as it requires a deep understanding of the Linux execution model and the mix of user-space tooling and tracing features. This paper discusses the internal aspects of Linux that influence the Operating System Noise from a timing perspective. It also presents Linux's osnoise tracer, an in-kernel tracer that enables the measurement of the Operating System Noise as observed by a workload, and the tracing of the sources of the noise, in an integrated manner, facilitating the analysis and debugging of the system. Finally, this paper presents a series of experiments demonstrating both Linux's ability to deliver low OS noise (in the single-digit $\mu$s order), and the ability of the proposed tool to provide precise information about the root cause of timing-related OS noise problems.
TL;DR: The results showed that GNet4FL successfully located 160 out of 262 faults, outperforming the three state-of-the-art methods by 94, 42, and 14% in Top-1 accuracy, and having close results to Grace with less cost.
TL;DR: In this article, a flipped systematic debugging approach combined with a systematic debugging process (SDP) and the modeling method was developed to address the neglect of debugging teaching in K-12 contexts and the lack of relevant empirical studies in the literature.
Abstract: Reintroducing computer science (CS) education in K–12 schools to promote computational thinking (CT) has attracted significant attention among scholars and educators. Among the several essential components included in CS and CT education, program debugging is an indispensable skill. However, debugging teaching has often been overlooked in K–12 contexts, and relevant empirical studies are lacking in the literature. Moreover, novices generally have poor performance in domain knowledge and strategic knowledge concerning debugging. They also consistently experience a high cognitive burden in debugging learning. To address these gaps, we developed a flipped systematic debugging approach combined with a systematic debugging process (SDP) and the modeling method. A quasi-experimental study was conducted to explore the effectiveness of this flipped systematic debugging approach, in which 83 fifth-grade students attended the flipped debugging training lessons with the SDP–modeling method, and 75 fifth-grade students attended the unassisted flipped debugging training lessons without the SDP–modeling method. The results indicated that flipped debugging training using the SDP–modeling method improved students’ debugging skills. The results from the questionnaire showed that the proposed teaching approach increased the students’ investment in germane cognitive load by promoting schema construction. It also helped reduce students’ intrinsic and extraneous cognitive load in learning.