PHANTOM: Curating GitHub for engineered software projects using time-series clustering
TL;DR: PHANTOM extracts five measures from Git logs, transforms each measure into a time-series, and represents each series as a feature vector for clustering with the k-means algorithm.
Abstract: Within the field of Mining Software Repositories, there are numerous methods employed to filter datasets in order to avoid analysing low-quality projects. Unfortunately, the existing filtering methods have not kept up with the growth of existing data sources, such as GitHub, and researchers often rely on quick and dirty techniques to curate datasets. The objective of this study is to develop a method capable of filtering large quantities of software projects in a resource-efficient way. This study follows the Design Science Research (DSR) methodology. The proposed method, PHANTOM, extracts five measures from Git logs. Each measure is transformed into a time-series, which is represented as a feature vector for clustering using the k-means algorithm. Using the ground truth from a previous study, PHANTOM was shown to be able to rediscover the ground truth on the training dataset, and was able to identify “engineered” projects with up to 0.87 Precision and 0.94 Recall on the validation dataset. PHANTOM downloaded and processed the metadata of 1,786,601 GitHub repositories in 21.5 days using a single personal computer, which is over 33% faster than the previous study which used a computer cluster of 200 nodes. The possibility of applying the method outside of the open-source community was investigated by curating 100 repositories owned by two companies. It is possible to use an unsupervised approach to identify engineered projects. PHANTOM was shown to be competitive compared to the existing supervised approaches while reducing the hardware requirements by two orders of magnitude.
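The pipeline described in the abstract (measure, time-series, feature vector, k-means, then precision and recall against a ground truth) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the abstract does not name the five measures, so the sketch uses a single hypothetical measure (commits per week), three hypothetical summary features, toy commit dates, a simple deterministic farthest-first initialization, and a hand-written k-means in place of whatever implementation PHANTOM uses.

```python
import math
from collections import Counter
from datetime import date, timedelta


def weekly_series(commit_dates, start, weeks):
    """Turn raw commit dates into a time-series of commits per week."""
    counts = Counter((d - start).days // 7 for d in commit_dates)
    return [counts.get(w, 0) for w in range(weeks)]


def feature_vector(series):
    """Summarize one time-series as a fixed-length vector (hypothetical features)."""
    total = sum(series)
    active_fraction = sum(1 for v in series if v > 0) / len(series)
    return [math.log1p(total), active_fraction, math.log1p(max(series))]


def init_centroids(points, k):
    """Deterministic farthest-first initialization (an assumption, not k-means++)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iters=100):
    """Plain Lloyd's algorithm: assign to nearest centroid, then recompute means."""
    centroids = init_centroids(points, k)
    labels = []
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        updated = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


def precision_recall(predicted, truth):
    """Precision and recall of boolean predictions against boolean ground truth."""
    tp = sum(1 for p, t in zip(predicted, truth) if p and t)
    fp = sum(1 for p, t in zip(predicted, truth) if p and not t)
    fn = sum(1 for p, t in zip(predicted, truth) if not p and t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data (invented): three steadily maintained repos and three short-lived ones.
start = date(2020, 1, 1)


def commits(day_offsets):
    return [start + timedelta(days=d) for d in day_offsets]


repos = {
    "steady-a": commits(range(0, 364, 3)),
    "steady-b": commits(range(0, 364, 2)),
    "steady-c": commits(range(0, 364, 4)),
    "burst-a": commits([0, 1, 2, 3, 4]),
    "burst-b": commits([0, 1, 2, 70]),
    "burst-c": commits([0, 1]),
}
truth = [True, True, True, False, False, False]  # toy "engineered" labels

points = [feature_vector(weekly_series(c, start, 52)) for c in repos.values()]
labels, centroids = kmeans(points, k=2)

# Assumption: treat the cluster with the higher mean active fraction as "engineered".
engineered_cluster = max(range(2), key=lambda c: centroids[c][1])
predicted = [l == engineered_cluster for l in labels]
precision, recall = precision_recall(predicted, truth)
print(labels, precision, recall)
```

Because clustering is unsupervised, the cluster indices carry no meaning on their own; the sketch maps a cluster to the "engineered" label with a simple heuristic, whereas the study evaluates its clusters against the ground truth of a previous study.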
Citations
A comprehensive empirical study on bug characteristics of deep learning frameworks
TL;DR: The authors mined 1,127 bug reports from eight popular DL frameworks, labeled the bug types, root causes, and symptoms, and analyzed the bug-fixing changes.
22
QScored: A Large Dataset of Code Smells and Quality Metrics
Tushar Sharma, Marouane Kessentini
- 26 Jan 2021
TL;DR: The QScored dataset contains code quality information for more than 86 thousand C# and Java GitHub repositories comprising more than 1.1 billion lines of code. It covers seven kinds of detected architecture smells, 20 kinds of design smells, eleven kinds of implementation smells, and 27 commonly used code quality metrics computed at the project, package, class, and method levels.
22
Systematic Mapping: Artificial Intelligence Techniques in Software Engineering
01 Jan 2022
TL;DR: In this paper, a systematic mapping study was conducted to characterize the publication landscape of AI techniques in software engineering; gaps were identified and discussed by mapping these AI techniques against the SE phases to which they contributed.
Just-in-time software defect prediction using deep temporal convolutional networks
TL;DR: In this article, the authors proposed a new approach based on a large feature set containing product and process software metrics extracted from commits of software projects along with their evolution, and introduced a variant of deep temporal convolutional networks based on hierarchical attention layers to perform the fault prediction.
12
What is an app store? The software engineering perspective
Wenhan Zhu, Sebastian Proksch, Daniel M. German, Michael W. Godfrey, Li Li, Shane McIntosh
TL;DR: The goal of this paper is to survey and characterize the broader dimensionality of app stores, and to explore how and why they influence software development practices, such as system design and release management.
6
References
•Journal Article
Scikit-learn: Machine Learning in Python
Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, Edouard Duchesnay
TL;DR: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems, focusing on bringing machine learning to non-specialists using a general-purpose high-level language.
Design science in information systems research
TL;DR: The objective is to describe the performance of design-science research in Information Systems via a concise conceptual framework and clear guidelines for understanding, executing, and evaluating the research.
Time-series clustering - A decade review
TL;DR: This review exposes four main components of time-series clustering. It aims to present an updated investigation of the trend of improvements in the efficiency, quality, and complexity of time-series clustering approaches during the last decade, and to highlight new paths for future work.
1.7K
•Book
Design Science Methodology for Information Systems and Software Engineering
Roel Wieringa
- 19 Nov 2014
TL;DR: This book provides guidelines for practicing design science in information systems and software engineering research: how to effectively structure research goals, how to analyze research problems concerning design goals and knowledge questions, how to validate artifact designs, how to empirically investigate artifacts in context, and finally how to present the results of the design cycle as a whole.