Journal Article · DOI: 10.1016/j.infsof.2022.106956
Cleaning ground truth data in software task assignment
TL;DR: This article presents a debiasing method that detects potentially problematic samples in task assignment datasets and cleans the ground truth by removing them, on the assumption that removing such samples reduces systematic labeling bias in the dataset and improves performance.
Abstract: In the context of collaborative software development, there are many application areas of task assignment such as assigning a developer to fix a bug, or assigning a code reviewer to a pull request. Most task assignment techniques in the literature build and evaluate their models based on datasets collected from real projects. The techniques invariably presume that these datasets reliably represent the “ground truth”. In a project dataset used to build an automated task assignment system, the recommended assignee for the task is usually assumed to be the best assignee for that task. However, in practice, the task assignee may not be the best possible task assignee, or even a sufficiently qualified one.

We aim to clean up the ground truth by removing the samples that are potentially problematic or suspect with the assumption that removing such samples would reduce any systematic labeling bias in the dataset and lead to performance improvements.

We devised a debiasing method to detect potentially problematic samples in task assignment datasets. We then evaluated the method’s impact on the performance of seven task assignment techniques by comparing the Mean Reciprocal Rank (MRR) scores before and after debiasing. We used two different task assignment applications for this purpose: Code Reviewer Recommendation (CRR) and Bug Assignment (BA).

In the CRR application, we achieved an average MRR improvement of 18.17% for the three learning-based techniques tested on two datasets. No significant improvements were observed for the two optimization-based techniques tested on the same datasets. In the BA application, we achieved a similar average MRR improvement of 18.40% for the two learning-based techniques tested on four different datasets.

Debiasing the ground truth data by removing suspect samples can help improve the performance of learning-based techniques in software task assignment applications.

• Devised a debiasing method to clean task assignment datasets.
• Conducted experiments in two task assignment applications.
• Debiasing the ground truth data improves learning-based techniques’ performance.
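The evaluation described in the abstract compares Mean Reciprocal Rank (MRR) scores before and after debiasing. As a minimal sketch of how such a comparison could be computed, the Python below implements the standard MRR definition (the mean of 1/rank of the first correct candidate in each ranked recommendation list) and a relative percentage change. The function names, the treatment of a missing assignee as a reciprocal rank of 0, and the reading of "improvement" as a relative change are all assumptions made for illustration; the abstract does not specify these details.

```python
from typing import Hashable, Sequence, Set


def reciprocal_rank(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Return 1/rank of the first relevant candidate, or 0.0 if none is ranked."""
    for position, candidate in enumerate(ranked, start=1):
        if candidate in relevant:
            return 1.0 / position
    return 0.0  # assumption: a miss contributes zero


def mean_reciprocal_rank(
    recommendations: Sequence[Sequence[Hashable]],
    ground_truth: Sequence[Set[Hashable]],
) -> float:
    """Average the reciprocal rank over all tasks (one ranked list per task)."""
    if len(recommendations) != len(ground_truth):
        raise ValueError("each task needs one ranked list and one ground-truth set")
    if not recommendations:
        return 0.0
    total = sum(
        reciprocal_rank(ranked, relevant)
        for ranked, relevant in zip(recommendations, ground_truth)
    )
    return total / len(recommendations)


def relative_improvement(before: float, after: float) -> float:
    """Percentage change from `before` to `after` (assumed relative, not absolute)."""
    if before == 0:
        raise ValueError("relative improvement is undefined when the baseline is 0")
    return (after - before) / before * 100.0


# Hypothetical toy data: three tasks, each with a ranked list of assignees.
recommendations = [
    ["alice", "bob", "carol"],  # correct assignee at rank 2
    ["bob", "alice"],           # correct assignee at rank 1
    ["carol", "dave"],          # correct assignee not recommended
]
ground_truth = [{"bob"}, {"bob"}, {"erin"}]

mrr = mean_reciprocal_rank(recommendations, ground_truth)
print(f"MRR = {mrr:.3f}")
```

In a before/after comparison, the same function would be applied once to a model trained on the original dataset and once to a model trained on the cleaned dataset, with `relative_improvement` summarizing the difference between the two scores.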