PHANTOM: Curating GitHub for engineered software projects using time-series clustering

doi:10.1007/S10664-020-09825-8

Open Access•Journal Article•10.1007/S10664-020-09825-8


Peter Pickerill, +4 more

- 01 Jul 2020

- Empirical Software Engineering

- Vol. 25, Iss: 4, pp 2897-2929

18

TL;DR: PHANTOM extracts five measures from Git logs and transforms each measure into a time-series, which is represented as a feature vector for clustering with the k-means algorithm.

Abstract: Within the field of Mining Software Repositories, there are numerous methods employed to filter datasets in order to avoid analysing low-quality projects. Unfortunately, the existing filtering methods have not kept up with the growth of existing data sources, such as GitHub, and researchers often rely on quick and dirty techniques to curate datasets. The objective of this study is to develop a method capable of filtering large quantities of software projects in a resource-efficient way. This study follows the Design Science Research (DSR) methodology. The proposed method, PHANTOM, extracts five measures from Git logs. Each measure is transformed into a time-series, which is represented as a feature vector for clustering using the k-means algorithm. Using the ground truth from a previous study, PHANTOM was shown to be able to rediscover the ground truth on the training dataset, and was able to identify “engineered” projects with up to 0.87 Precision and 0.94 Recall on the validation dataset. PHANTOM downloaded and processed the metadata of 1,786,601 GitHub repositories in 21.5 days using a single personal computer, which is over 33% faster than the previous study which used a computer cluster of 200 nodes. The possibility of applying the method outside of the open-source community was investigated by curating 100 repositories owned by two companies. It is possible to use an unsupervised approach to identify engineered projects. PHANTOM was shown to be competitive compared to the existing supervised approaches while reducing the hardware requirements by two orders of magnitude.
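The pipeline described in the abstract (measure from a Git log, then time-series, then feature vector, then k-means) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the abstract does not name the five measures or the features, so the sketch uses a single measure (commits per week) and three summary features (duration, total, peak). The function names and the toy repositories are hypothetical, and the k-means loop is a plain standard-library version, not the authors' implementation.

```python
import random
from collections import Counter
from datetime import datetime, timedelta


def weekly_series(commit_dates):
    """Turn commit timestamps into a commits-per-week time-series."""
    if not commit_dates:
        return []
    start = min(commit_dates)
    counts = Counter((d - start).days // 7 for d in commit_dates)
    return [counts.get(week, 0) for week in range(max(counts) + 1)]


def features(series):
    """Summarise a series as [duration in weeks, total, peak] (assumed features)."""
    if not series:
        return [0.0, 0.0, 0.0]
    return [float(len(series)), float(sum(series)), float(max(series))]


def scale(vectors):
    """Min-max scale each feature column to [0, 1]."""
    cols = list(zip(*vectors))
    lo = [min(c) for c in cols]
    span = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(v - l) / s for v, l, s in zip(vec, lo, span)] for vec in vectors]


def kmeans(points, k, iters=50, seed=0):
    """Plain k-means: assign to nearest centroid, recompute means, repeat."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centroids[c])  # keep the centroid of an empty cluster
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


def cluster_repositories(repos, k=2):
    """Map each repository name to a cluster label."""
    names = sorted(repos)
    vectors = scale([features(weekly_series(repos[n])) for n in names])
    labels, _ = kmeans(vectors, k)
    return dict(zip(names, labels))


# Toy data (hypothetical): two long-lived repos and two short bursts of activity.
start = datetime(2020, 1, 1)
repos = {
    "steady-a": [start + timedelta(days=7 * w + d) for w in range(60) for d in (0, 2)],
    "steady-b": [start + timedelta(days=7 * w + d) for w in range(80) for d in (0, 3)],
    "burst-a": [start + timedelta(days=d) for d in range(5)],
    "burst-b": [start + timedelta(days=d) for d in (0, 1, 2, 9)],
}
print(cluster_repositories(repos))
```

On this toy data the two steady repositories should share one label and the two bursty ones the other. The numeric label values are arbitrary, so deciding which cluster corresponds to "engineered" projects would still need a labelled reference set such as the ground truth the abstract mentions.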


Citations

Journal Article•10.1016/j.infsof.2022.107004

A comprehensive empirical study on bug characteristics of deep learning frameworks

Yilin Yang, +3 more

- 01 Jul 2022

- Information & Software Technology

TL;DR: The authors mined 1,127 DL framework bug reports from eight popular DL frameworks, labeled the bug types, root causes, and symptoms, and analyzed the bug-fixing changes.


22

Proceedings Article•10.1109/MSR52588.2021.00080

QScored: A Large Dataset of Code Smells and Quality Metrics

Tushar Sharma, +1 more

- 26 Jan 2021

TL;DR: The QScored dataset contains code quality information for more than 86 thousand C# and Java GitHub repositories comprising more than 1.1 billion lines of code. It covers seven kinds of detected architecture smells, 20 kinds of design smells, eleven kinds of implementation smells, and 27 commonly used code quality metrics computed at project, package, class, and method levels.


22

Journal Article•10.1109/access.2022.3174115

Systematic Mapping: Artificial Intelligence Techniques in Software Engineering

- 01 Jan 2022

- IEEE Access

TL;DR: In this paper, a systematic mapping study was conducted to characterize the publication landscape of AI techniques in software engineering, and gaps were identified and discussed by mapping these AI techniques against the SE phases to which they contributed.


17

Journal Article•10.1007/S00521-021-06659-3

Just-in-time software defect prediction using deep temporal convolutional networks

Pasquale Ardimento, +4 more

- 14 Nov 2021

- Neural Computing and Applications

TL;DR: In this article, the authors proposed a new approach based on a large feature set containing product and process software metrics extracted from commits of software projects along with their evolution, and introduced a deep temporal convolutional networks variant based on hierarchical attention layers to perform the fault prediction.


12

Journal Article•10.1007/s10664-023-10362-3

What is an app store? The software engineering perspective

Wenhan Zhu, +5 more

- 02 Jan 2024

- Empirical Software Engineering

TL;DR: The goal of this paper is to survey and characterize the broader dimensionality of app stores, and to explore how and why they influence software development practices, such as system design and release management.


6


References

Journal Article•10.5555/2017212.2017217

Design science in information systems research

Alan R. Hevner, +3 more

- 01 Mar 2004

- Management Information Systems Quarterly

TL;DR: The objective is to describe the performance of design-science research in Information Systems via a concise conceptual framework and clear guidelines for understanding, executing, and evaluating the research.


11.3K

Journal Article•10.2307/25148625

Design Science in Information Systems Research

Hevner, +3 more

- 01 Jan 2004

- Management Information Systems Quarterly

8K

Journal Article•10.1016/J.IS.2015.04.007

Time-series clustering - A decade review

Saeed Aghabozorgi, +2 more

- 01 Oct 2015

- Information Systems

TL;DR: This review exposes four main components of time-series clustering and aims to present an updated investigation of the trend of improvements in the efficiency, quality and complexity of time-series clustering approaches during the last decade, and to highlight new paths for future work.


1.7K

Book

Design Science Methodology for Information Systems and Software Engineering

Roel Wieringa

- 19 Nov 2014

TL;DR: This book provides guidelines for practicing design science in information systems and software engineering research: how to effectively structure research goals, how to analyze research problems concerning design goals and knowledge questions, how to validate artifact designs, how to empirically investigate artifacts in context, and finally how to present the results of the design cycle as a whole.


1.1K