TL;DR: This study investigates integrating AI tools like ChatGPT into CS education to enhance learning while preventing over-reliance, proposing a framework to encourage critical thinking, problem-solving, and creativity through modified assignments that challenge AI's capabilities.
Abstract: This full research paper investigates how Artificial Intelligence (AI) tools like ChatGPT can be thoughtfully integrated into computer science education to enhance learning while preventing over-reliance that could hinder students' development of critical thinking and problem-solving skills. Although AI can assist students in tasks such as debugging and code generation, its ease of use may inadvertently reduce their engagement with the underlying problem-solving process. To address this challenge, our research focuses on the design of CS1 assignments that make productive use of AI while encouraging deeper learning and creativity. We evaluate ChatGPT-4's ability to solve various programming problems and identify its limitations with AI-generated outputs. Based on this analysis, we propose a framework that encourages students to deconstruct complex problems, create innovative solutions beyond what AI typically generates, iteratively improve their code, and clearly explain their reasoning. Our methods include modifying standard lab assignments by adding non-textual elements such as flowcharts, real-world scenarios, and diagrams, making it more difficult for AI tools to directly solve them. These changes prompt students to engage in higher-order thinking as they interpret and solve the problems. Through this study, we offer practical recommendations for educators aiming to incorporate AI into computer science curricula without compromising the development of essential cognitive skills. By designing assignments that challenge students intellectually while leveraging the benefits of AI, we aim to promote responsible and effective use of emerging technologies in education.
Liia Butler, Charlotte Kiesel, Dipayan Mukherjee, Mohammed Hassan, Mattox Beckman, Geoffrey Herman
2 Nov 2025
TL;DR: ILDBug, a novel debugging exercise, improves students' debugging skills by predicting, experiencing, and reflecting on bug fixes, with students fixing bugs faster and more accurately after completing multiple exercises, particularly for moderately difficult bugs.
Abstract: This innovative practice category full paper describes a novel approach to teaching debugging. Debugging is an essential skill in programming, yet there are few evidence-based techniques to improve students' debugging abilities. We developed a debugging exercise inspired by the pedagogical technique, Interactive Lecture Demonstrations (ILDs). During ILDs, students predict a demonstration's result, experience the demonstration, and then reflect on their experience. We adapted this process for debugging (ILDBug) by having students look at example program outputs and predict the bug, then they experience debugging by tracing the code and trying to fix the bug, and finally students reflect on their debugging process. We engaged students in a series of three ILDBug exercises during a lab section of an introductory programming course for non-CS engineering majors. To evaluate whether our exercises were improving students' debugging skills, we introduced a cross-over study design with three populations: each group of students completed the same three exercises but in a different order. We then compared how students performed on each exercise when it was their first exercise or their last exercise. We graded students' predictions for accuracy using a binary score and measured how long students took to fix the bugs using clickstream logs. Students generally fixed bugs faster after completing two ILDBug exercises. Students improved at predicting bugs that were moderately difficult to identify. The ILDBug exercises are a promising, light-weight debugging exercise that can be adapted to many contexts and merit future research.
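The cross-over design described above can be illustrated with a small sketch. The abstract does not say which orders the three groups received, so the cyclic rotation below is an assumption, and the exercise labels `E1` to `E3` and the function name `rotated_orders` are hypothetical. A rotation is one simple way to ensure that every exercise is some group's first exercise and another group's last, which is what the first-versus-last comparison requires.

```python
def rotated_orders(exercises):
    """Return one exercise order per group by cyclically rotating the list.

    Assumption: the paper's actual orders are not given; a cyclic rotation
    (a Latin square) is used here purely for illustration.
    """
    n = len(exercises)
    return [exercises[i:] + exercises[:i] for i in range(n)]


# Hypothetical labels for the three ILDBug exercises.
orders = rotated_orders(["E1", "E2", "E3"])

for group, order in enumerate(orders, start=1):
    print(f"Group {group}: {' -> '.join(order)}")

# Each exercise appears exactly once in the first and once in the last slot.
first_slot = sorted(order[0] for order in orders)
last_slot = sorted(order[-1] for order in orders)
```

With this layout, performance on a given exercise can be compared between the group that saw it first and the group that saw it last.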
TL;DR: This research presents an IoT-centric project-based learning module to teach computer networks, enhancing student learning outcomes in a senior-level course by combining lectures, guided tutorials, and team-based projects with real-world IoT applications and hands-on experience.
Abstract: This research-to-practice full paper presents an innovative project-based learning (PBL) module designed to teach computer networks through the lens of Internet of Things (IoT) applications. As IoT continues to reshape technology and society, computer networks education must evolve to provide students with both practical skills and real-world insights. Traditional teaching methods often struggle to offer the hands-on experience and contextual understanding necessary to meet these demands. To address this gap, we developed a PBL module centered around IoT projects to enhance student learning outcomes in a computer networks course. The module aims to teach students: 1) modern application-layer protocols commonly used in IoT environments; 2) foundational IoT concepts, including IoT devices and the performance requirements of IoT applications; 3) practical skills in building and debugging network applications on IoT devices, covering topics such as IP addressing, private and public networks, network debugging tools, Raspberry Pi operations, and sensors with GPIO ports; and 4) problem-solving and collaboration skills through the design and implementation of IoT projects. Spanning four weeks, the module combines lectures, guided tutorials, and team-based projects culminating in a hands-on IoT application. The module was implemented in a senior-level Computer Networks course with 47 students in Fall 2024. A mixed-methods evaluation, including student reflections and instructor observations, revealed significant improvements in students' understanding of networking concepts and their ability to apply them in IoT contexts. Furthermore, student feedback highlighted the importance of real-world relevance and teamwork in enhancing their learning experience. These findings provide valuable insights for designing PBL modules that effectively bridge the gap between theoretical knowledge and practical application in computer science education.
TL;DR: This artifact package replicates the results of Defects4REST, a benchmark for REST API defects, containing code and data for three research questions: defect types, file types and resolution time, and testing tool evaluation.
Abstract: This artifact contains the code and data to replicate the results presented in the paper titled "Defects4REST: A Benchmark of Real-World Defects to Enable Controlled Testing and Debugging Studies for REST APIs," in Proceedings of the 48th International Conference on Software Engineering (ICSE), 2026, by Rahil P. Mehta, Pushpak Kathkhede, and Manish Motwani. The artifact is organized around the three research questions addressed in the paper: RQ1: Common REST API Defect Types: This involves code and data related to mining issues from GitHub projects (issue mining), classifying issues into REST API defects and non-REST API defects (issue classification), and deriving a defect taxonomy (clustering and topic modeling). RQ2: File Types Modified and Time to Resolve REST API Defects: This involves the code and data to analyze the file types of developer-modified files to fix the REST API defects (patch file analysis), and the time it took developers to fix these defects (time to fix analysis). RQ3: Evaluating Current REST API Testing Tools Against Real-World Defects: This involves code and data to execute and analyze the results of four REST API testing techniques (EvoMaster, Schemathesis, RESTler, and AutoRestTest) on a 30-defect subset of Defects4REST.
TL;DR: PySPN is a Python library for stochastic Petri net modeling, simulation, and event log generation, enabling researchers and practitioners to model, analyze, and optimize complex systems in various domains with a user-friendly and efficient toolset.
Abstract: Stochastic Petri Nets (SPNs) are a powerful formalism, widely used for modeling complex systems in various domains, ranging from manufacturing and logistics to healthcare and computer networks. In this paper, we introduce PySPN, a flexible and easily extendable Python library for modeling and simulation of SPNs. Besides the simulation of SPNs, we further extended PySPN with the functionality of generating synthetic data in the form of event logs from SPNs’ simulations. Event logs in simulation models are essential for ensuring model accuracy, evaluating performance, debugging, and facilitating decision-making processes. Event logs offer a comprehensive record of simulated events, which can be analyzed to gain insights into systems’ behaviors and performance. PySPN aims to provide researchers, engineers, and simulation practitioners with a user-friendly and efficient toolset to model, simulate, and analyze SPNs, facilitating the understanding and optimization of stochastic processes in dynamic systems.
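To make the idea of simulating an SPN and emitting an event log concrete, here is a minimal sketch in plain Python. It does not use PySPN and does not reflect its API; the function `simulate_spn`, the dictionary layout for transitions, and the two-transition queue model are all hypothetical. It assumes exponentially distributed firing delays with race semantics: every enabled transition samples a delay and the smallest one fires, which is valid to resample at each step because the exponential distribution is memoryless.

```python
import random


def simulate_spn(initial_marking, transitions, horizon, seed=0):
    """Simulate a stochastic Petri net with exponential firing delays.

    transitions: list of dicts with keys "name", "rate", "inputs", "outputs",
    where inputs/outputs map place names to arc weights (hypothetical layout).
    Returns (event_log, final_marking).
    """
    rng = random.Random(seed)
    marking = dict(initial_marking)
    clock, log = 0.0, []
    while True:
        # A transition is enabled when every input place holds enough tokens.
        enabled = [t for t in transitions
                   if all(marking.get(p, 0) >= w for p, w in t["inputs"].items())]
        if not enabled:
            break
        # Race semantics: sample a delay per enabled transition, fire the minimum.
        delay, fired = min(((rng.expovariate(t["rate"]), t) for t in enabled),
                           key=lambda pair: pair[0])
        if clock + delay > horizon:
            break
        clock += delay
        for place, weight in fired["inputs"].items():
            marking[place] -= weight
        for place, weight in fired["outputs"].items():
            marking[place] = marking.get(place, 0) + weight
        log.append({"time": clock, "transition": fired["name"],
                    "marking": dict(marking)})
    return log, marking


# Hypothetical model: jobs arrive into a queue and are served one at a time.
model = [
    {"name": "arrive", "rate": 1.0, "inputs": {}, "outputs": {"queue": 1}},
    {"name": "serve", "rate": 1.5, "inputs": {"queue": 1}, "outputs": {"done": 1}},
]
log, final = simulate_spn({"queue": 0, "done": 0}, model, horizon=20.0, seed=42)
print(len(log), "events; final marking:", final)
```

Each log entry records the firing time, the transition, and the resulting marking, which is the kind of record an event log analysis would consume.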
TL;DR: This study proposes Code Insight, an active learning framework to develop code review skills in novice programmers, improving their learning experience and confidence in writing and debugging programs through in-class activities and classroom response systems.
Abstract: This full research paper describes an active learning based in-class intervention to encourage the development of code review and debugging skills in students from the CS program. The rapid proliferation of AI-based applications has heightened the importance of code review and debugging skills among software developers. Organizations leverage code review processes not only to ensure software quality and accuracy but also to facilitate knowledge transfer, promote productive collaboration, maintain consistency, and share ideas. However, incorporating code review practices into classroom settings presents challenges due to the associated heavy workload, often resulting in low student motivation and participation. To address these issues, we propose an in-class active learning framework, Code Insight, which integrates seamlessly into class lectures without adding extra workload for students or additional grading for instructors. Recognizing that beginner programmers have limited knowledge and experience with programming concepts, the goal of this study is to provide strategic guidance to help students effectively and systematically review, analyze, and evaluate existing code written by others, thereby gaining a deeper understanding of the code. In our study, students participated in the Code Insight activity during class, using a classroom response system to report the correctness of provided code through multiple-choice questions. The impact of Code Insight on learning is evaluated by collecting student feedback at the end of the semester. Our findings indicate that the Code Insight activity presents a sufficient challenge for students, thereby improving their learning experience and highlighting the necessity of integrating more extensive code review practices in entry-level programming courses. Students expressed a preference for these activities over traditional lectures, noting that the activities helped them debug their programs by identifying and fixing errors, and that they felt confident in writing a program and addressing any issues through complex debugging tasks. The proposed intervention provides a platform for students to discuss misconceptions, visualize multiple problem-solving approaches, and thoroughly understand existing code.
TL;DR: ClusterXplain is a clustering-based tool for DNN component debugging, offering 36 pipelines that leverage transfer learning and advanced algorithms to extract features, cluster images, and identify distinct failure scenarios with high accuracy.
Abstract: We introduce ClusterXplain, a tool offering 36 distinct pipelines for debugging deep neural network (DNN) failures. Building on our previous approach, SAFE, ClusterXplain leverages transfer learning models to extract features from failure-inducing images. These features are used to cluster corresponding images using advanced algorithms, including density-based methods, to identify the distinct scenarios in which the DNN may fail. While SAFE focused on a single pipeline configuration, ClusterXplain explores diverse combinations of feature extraction methods, clustering algorithms, and dimensionality reduction techniques to improve DNN debugging. The results show that ClusterXplain achieves high accuracy in scenario identification, offering actionable insights to engineers to diagnose and mitigate issues in safety-critical applications. A demo video of ClusterXplain is available at https://youtu.be/DU94c-lIwys.
TL;DR: This work-in-progress research explores the efficacy of a debugging education intervention in an introductory microelectronics course, finding that a mini-lecture and cheat sheet improved students' debugging speed and success rate, while also shifting their mindset towards debugging.
Abstract: This work-in-progress research paper explores the efficacy of a small-scale microelectronics debugging education intervention utilizing quasi-experimental design in an introductory microelectronics course for third-year electrical and computer engineering (ECE) students. In the first semester of research, the experimental group attended a debugging “mini lecture” covering two common sources of circuit error and received a debugging cheat sheet with recommendations for testing and hypothesis formation. Across three debugging problems, students in the experimental group were faster by an average of 1:43 and had a 7% higher success rate than the control group. Both groups demonstrated a strong general growth mindset while the experimental group also displayed a shift in their debugging mindset by perceiving a greater value towards debugging. Though these differences are not yet statistically significant, the pilot results indicate that a mini-lecture and debugging cheat sheet are steps in the right direction toward improving students' readiness for debugging in the workplace.
TL;DR: A Machine Learning approach, DNN-IFA, is proposed to localize performance bugs in microprocessors by acting as a classifier for performance benchmarking of new microarchitectures, detecting 91.5% of core performance defects with an IPC impact of more than 1% and reducing debugging time.
Abstract: Processor design evaluation and debugging are challenging and complex tasks that take up the majority of the design process and require significant engineering resources. Bugs that reduce overall processor performance without affecting functionality are especially challenging to debug, owing to the lack of a universal standard for bug-free performance. This is because, unlike with functional errors, the optimum processor performance of novel microarchitectures on complicated, long-running benchmarks is usually not predictably determined. To solve these problems, we present a Machine Learning (ML) approach, a Deep Neural Network (DNN) with an Improved Firefly Algorithm (IFA), called DNN-IFA, as a classifier for performance benchmarking of new microarchitectures. The performance of a novel microarchitecture may be assumed to be appropriate if the present microarchitecture outperforms the preceding generation, despite considerable performance regressions in the initial implementation. Moreover, the proposal tunes hyperparameters and modifies the fitness function of the firefly algorithm to improve fitness and avoid bias, which gives it the capability to detect persistent performance bugs in microprocessors. The findings reveal that the most effective technique detects 91.5% of microprocessor core performance defects with a standard IPC impact of more than 1% across the analyzed applications, compared to a bug-free design, with no instances of false positives. In the simulated scenario, the suggested system takes less time to execute a bug location inference, which minimizes the debugging time.
Florian Rupprecht, Jason Kai, B. Shrestha, Steven Giavasis, Ting Xu, Tristan Glatard, Michael P. Milham, Gregory Kiar
30 Jul 2025
TL;DR: Styx, a compiler, generates language-native wrapper functions from tool metadata, enabling seamless integration of command-line tools in data science ecosystems, with NiWrap providing a proof-of-concept implementation for neuroimaging tools in Python, R, and TypeScript.
Abstract: In numerous scientific domains, established tools have often been developed with complex command-line interfaces. Such is the case for brain imaging and bioinformatics, making the use of powerful legacy tools in modern workflow paradigms challenging. We present (i) Styx, a compiler for generating language-native wrapper functions from static tool metadata, leading to seamless integration of command-line tools within the data science ecosystem. Alongside Styx, we have created (ii) NiWrap, a collection of more than 1900 neuroimaging command-line function descriptions as a proof-of-concept implementation. These interfaces, available in Python, R, and TypeScript (at https://github.com/styx-api), significantly reduce the complexity of writing and interpreting software pipelines, particularly when composing workflows across packages with distinct API standards. The compiler architecture of Styx facilitates maintainability and portability across computing environments. As with all metadata-dependent infrastructure, creating sufficient metadata annotations remains a barrier to adoption. Accordingly, NiWrap demonstrates approaches that lower this barrier through direct source code extraction and LLM-assisted documentation parsing. Together, Styx and NiWrap offer a sustainable solution for interfacing diverse command-line tools with modern data science ecosystems. This modular approach enhances reproducibility and efficiency in pipeline development while ensuring portability across computing environments and programming languages.
TL;DR: SemBIC identifies bug-inducing commits by statically tracking semantic changes in execution paths across historical commit versions, ranking the bug-inducing commit as top 1 for 88 of 199 real-world bugs from 12 open-source projects (MRR 0.520).
Abstract: Debugging can be much facilitated if one can identify the evolution commit that introduced the bug leading to a detected failure (aka. bug-inducing commit, BIC). Although one may, in theory, locate BICs by executing the detected failing test on various historical commit versions, it is impractical when the test cannot be executed on some of those versions. On the other hand, existing static techniques often assume the availability of additional information such as patches and bug reports, or the applicability of predefined heuristics like commit chronology. However, these approaches are ineffective when such assumptions do not hold, which is often the case in practice. To address these limitations, we propose SEMBIC to identify the BIC of a bug by statically tracking the semantic changes in the execution path prescribed by the failing test across successive historical commit versions. Our insight is that the greater the semantic changes a commit introduces concerning the failing execution path of a target bug, the more likely it is to be the BIC. To distill semantic changes relevant to the failure, we focus on three fine-grained semantic properties. We evaluate the performance of SEMBIC on a benchmark containing 199 real-world bugs from 12 open-source projects. We found that SEMBIC can identify BICs with high accuracy – it ranks the BIC as top 1 for 88 out of 199 bugs, and achieves an MRR of 0.520, outperforming the state-of-the-art technique by 29.4% and 13.6%, respectively.
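The two reported metrics, the top-1 count and the mean reciprocal rank (MRR), can be computed from the rank each technique assigns to the true BIC. The sketch below uses the standard definitions; the ranks in `example_ranks` are made up for illustration and are not SemBIC's results, and the convention that an unranked BIC contributes a reciprocal rank of 0 is an assumption.

```python
def top_k_count(ranks, k=1):
    """Count bugs whose true BIC is ranked within the top k (1-based ranks)."""
    return sum(1 for r in ranks if r is not None and r <= k)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all bugs; an unranked BIC (None) contributes 0."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Hypothetical ranks of the true BIC for five bugs (None = BIC not ranked).
example_ranks = [1, 3, 2, 1, None]

top1 = top_k_count(example_ranks, k=1)
mrr = mean_reciprocal_rank(example_ranks)
print(f"top-1: {top1}/{len(example_ranks)}, MRR: {mrr:.3f}")

# The abstract's top-1 figure expressed as a proportion of the benchmark.
reported_top1_rate = 88 / 199
print(f"reported top-1 rate: {reported_top1_rate:.1%}")
```

Note that 88 out of 199 bugs corresponds to a top-1 rate of roughly 44%, so the count should not be read as a percentage.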
TL;DR: IntraJ, a Java code analysis framework, provides interactive analysis results directly in the editor, leveraging Reference Attribute Grammars for on-demand evaluation, and achieves a response time of under 0.1 seconds for most compilation units.
Abstract: Static analysis tools play a crucial role in software development by detecting bugs and vulnerabilities. However, running these tools separately from the code editing process often causes developers to switch contexts, which can reduce productivity. Previous work has shown how Reference Attribute Grammars (RAGs) can be used for declarative implementation of competitive tooling for intraprocedural control-flow and dataflow analysis of Java source code, embodied in the tool IntraJ. In this paper, we demonstrate how IntraJ can be leveraged to provide interactive analysis results directly in the editor, similar to compile-time error detection, relying on automatic on-demand evaluation of RAGs. We discuss the architecture of IntraJ, and demonstrate how it can be integrated into the development process in three different ways: in the command line, in an editor integration based on the Language Server Protocol, and in an integration with the debugging tool CodeProber. We showcase the extensibility of IntraJ by illustrating how new client analyses and language constructs can be added to the framework through RAG specifications. Finally, we evaluate the interactive performance of IntraJ on a set of real-world Java benchmarks, demonstrating that IntraJ can provide interactive feedback to developers, achieving a response time of under 0.1 seconds for most compilation units.
TL;DR: This study analyzes Claude 3 Opus's self-debugging capabilities through a meta-experimental approach, where the AI generates and resolves intentionally buggy code, employing a systematic methodology mirroring human debugging strategies with 100% success in a Python-based Task Management System.
Abstract: This paper presents a novel meta-experimental approach to analyzing the debugging capabilities of large language models (LLMs), specifically Claude 3 Opus. Through a carefully designed experiment where the AI system first generates intentionally buggy code and subsequently debugs it without prior knowledge, we document and analyze the systematic debugging methodology employed by modern AI systems. Our experiment involved a Python-based Task Management System containing 12 distinct bug categories, ranging from syntax errors to complex runtime issues. The AI successfully identified and resolved all bugs using a methodical, error-driven approach that mirrors human debugging strategies. Key findings include the AI’s ability to: (1) prioritize syntax errors before runtime issues, (2) leverage Python’s error messages effectively, (3) implement comprehensive fixes with proper error handling, and (4) validate solutions through automated testing. This research contributes to understanding AI’s role in automated software debugging and has implications for the future of AI-assisted software development, code review processes, and programming education.
TL;DR: This paper develops a Small Language Model (SLM) for efficient coding assistance, leveraging a transformer-based decoder-only architecture (100-300M parameters) to provide real-time code completion and function generation in Python, with a focus on low-latency and low-resource demands.
Abstract: The proliferation of Large Language Models (LLMs) has significantly impacted software development, yet their substantial computational and resource demands create barriers to widespread accessibility. This paper details the development and evaluation of a Small Language Model (SLM) designed as an efficient, practical alternative for coding assistance. The primary goal is to create a lightweight, low-latency model specialized in Python, capable of performing real-time code completion and generating functions from natural language prompts. The methodology employs a transformer-based decoder-only architecture (100-300M parameters) trained on a filtered, high-quality dataset of open-source code. Model performance is assessed using the pass@k metric from the HumanEval benchmark for functional correctness, alongside measurements of inference speed and memory footprint to validate its efficiency. This research will deliver a proof-of-concept prototype, demonstrating that specialized SLMs can offer a sustainable and effective solution that enhances developer productivity while democratizing access to advanced AI-powered coding tools.
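For readers unfamiliar with pass@k, the sketch below computes the unbiased estimator commonly used with HumanEval: for a problem with n generated samples of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. The abstract does not state which estimator or sample counts the authors use, so treating this as their exact procedure is an assumption, and the per-problem counts in `results` are invented for illustration.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed all tests, k: budget (k <= n).
    Equals the probability that at least one of k samples drawn without
    replacement from the n generated samples is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical (n, c) counts for three problems; not results from the paper.
results = [(10, 3), (10, 0), (10, 10)]

for k in (1, 5):
    score = sum(pass_at_k(n, c, k) for n, c in results) / len(results)
    print(f"pass@{k} = {score:.3f}")
```

Generating more than k samples per problem and applying this formula gives a lower-variance estimate than drawing exactly k samples and checking whether any of them passes.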
TL;DR: ChatDBG, an AI-powered debugging assistant, integrates large language models to enhance conventional debuggers, enabling programmers to engage in collaborative dialogue, diagnose complex issues, and generate accurate fixes for real-world errors with high success rates.
Abstract: Debugging is a critical but challenging task for programmers. This paper proposes ChatDBG, an AI-powered debugging assistant. ChatDBG integrates large language models (LLMs) to significantly enhance the capabilities and user-friendliness of conventional debuggers. ChatDBG lets programmers engage in a collaborative dialogue with the debugger, allowing them to pose complex questions about program state, perform root cause analysis for crashes or assertion failures, and explore open-ended queries like "why is x null?". To handle these queries, ChatDBG grants the LLM autonomy to "take the wheel": it can act as an independent agent capable of querying and controlling the debugger to navigate through stacks and inspect program state. It then reports its findings and yields back control to the programmer. By leveraging the real-world knowledge embedded in LLMs, ChatDBG can diagnose issues identifiable only through the use of domain-specific reasoning. Our ChatDBG prototype integrates with standard debuggers including LLDB and GDB for native code and Pdb for Python. Our evaluation across a diverse set of code, including C/C++ code with known bugs and a suite of Python code including standalone scripts and Jupyter notebooks, demonstrates that ChatDBG can successfully analyze root causes, explain bugs, and generate accurate fixes for a wide range of real-world errors. For the Python programs, a single query led to an actionable bug fix 67% of the time; one additional follow-up query increased the success rate to 85%. ChatDBG has seen rapid uptake; it has already been downloaded more than 75,000 times.
Zhu Kunlun, Liu Zijia, Li Bing-Xuan, Yang Yingxuan, Zhang Jiaxun, Xie Qipeng, Zhang Wei-jia, Ma, Xiaoteng, Yu Xiaodong, Ramesh, Gowtham, WU Jialian, Liu, Zicheng, Lu Pan, Zou, James, You, Jiaxuan
1 Oct 2025
Abstract: Large Language Model (LLM) agents, which integrate planning, memory, reflection, and tool-use modules, have shown promise in solving complex, multi-step tasks. Yet their sophisticated architectures amplify vulnerability to cascading failures, where a single root-cause error propagates through subsequent decisions, leading to task failure. Current systems lack a framework that can comprehensively understand agent error in a modular and systemic way, and therefore fail to detect these errors accordingly. We address this gap with three contributions. First, we introduce the AgentErrorTaxonomy, a modular classification of failure modes spanning memory, reflection, planning, action, and system-level operations. Second, we construct AgentErrorBench, the first dataset of systematically annotated failure trajectories from ALFWorld, GAIA, and WebShop, grounding error analysis in real-world agent rollouts. Third, we propose AgentDebug, a debugging framework that isolates root-cause failures and provides corrective feedback, enabling agents to recover and iteratively improve. Experiments on AgentErrorBench show that AgentDebug achieves 24% higher all-correct accuracy and 17% higher step accuracy compared to the strongest baseline. Beyond detection, the targeted feedback generated by AgentDebug enables LLM agents to iteratively recover from failures, yielding up to 26% relative improvements in task success across ALFWorld, GAIA, and WebShop. These results establish principled debugging as a pathway to more reliable and adaptive LLM agents. The code and data will be available at https://github.com/ulab-uiuc/AgentDebug
Abstract: A verification method based on an AHB bus DMA controller is proposed to address the problems of traditional chip verification platforms: poor reusability of verification cases and long verification time. The verification platform modified and optimized through this method is highly portable, can simulate the generation of request, reply, and interrupt signals through IP linkage, and supports random injection of errors into the AHB bus during transmission. To demonstrate the effectiveness of the method, the architecture of the DMA controller UVM verification platform was modified and optimized, and the platform was reused to construct the Image2D controller UVM verification platform. The code coverage of the two verification platforms was collected separately. The experimental results show that this method can improve the debugging speed of the verification platform, enhance the compatibility of the verification platform, save more than 15% of the verification development time, and ensure 100% code coverage of the verification platform.
Abstract: This article introduces a novel methodology, Network Simulator-centric Compositional Testing (NSCT), to enhance the verification of network protocols with a particular focus on time-varying network properties. NSCT follows a Model-Based Testing (MBT) approach. These approaches usually struggle to test and represent time-varying network properties. NSCT also aims to achieve more accurate and reproducible protocol testing. It is implemented using the Ivy tool and the Shadow network simulator. This enables online debugging of real protocol implementations. A case study on an implementation of QUIC (picoquic) is presented, revealing an error in its compliance with a time-varying specification. This error has subsequently been rectified, highlighting NSCT's effectiveness in uncovering and addressing real-world protocol implementation issues. The article underscores NSCT's potential in advancing protocol testing methodologies, offering a notable contribution to the field of network protocol verification.