Top 83 papers presented at Mining Software Repositories in 2020

Proceedings Article•10.1145/3379597.3387501•

A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries

Jiahao Fan¹, Yi Li¹, Shaohua Wang¹, Tien N. Nguyen²•Institutions (2)

New Jersey Institute of Technology¹, University of Texas at Dallas²

29 Jun 2020

TL;DR: A large C/C++ code vulnerability dataset from open-source Github projects, namely Big-Vul, which contains 3,754 code vulnerabilities spanning 91 different vulnerability types and can be used for various research topics, e.g., detecting and fixing vulnerabilities, analyzing the vulnerability related code changes.

Abstract: We collected a large C/C++ code vulnerability dataset from open-source Github projects, namely Big-Vul. We crawled the public Common Vulnerabilities and Exposures (CVE) database and CVE-related source code repositories. Specifically, we collected the descriptive information of the vulnerabilities from the CVE database, e.g., CVE IDs, CVE severity scores, and CVE summaries. With the CVE information and its related published Github code repository links, we downloaded all of the code repositories and extracted vulnerability related code changes. In total, Big-Vul contains 3,754 code vulnerabilities spanning 91 different vulnerability types. All these code vulnerabilities are extracted from 348 Github projects. All information is stored in the CSV format. We linked the code changes with the CVE descriptive information. Thus, our Big-Vul can be used for various research topics, e.g., detecting and fixing vulnerabilities, analyzing the vulnerability related code changes. Big-Vul is publicly available on Github.

321 citations

Proceedings Article•10.1145/3379597.3387472•

Challenges in Chatbot Development: A Study of Stack Overflow Posts

Ahmad Abdellatif¹, Diego Costa¹, Khaled Badran¹, Rabe Abdalkareem², Emad Shihab¹•Institutions (2)

Concordia University¹, Queen's University²

29 Jun 2020

TL;DR: This study examines the Q&A website, Stack Overflow, to provide insights on the topics that chatbot developers are interested in and the challenges they face, and guides future research to propose techniques and tools to help the community at its early stages to overcome the most popular and difficult topics that practitioners face when developing chatbots.

Abstract: Chatbots are becoming increasingly popular due to their benefits in saving costs, time, and effort. This is due to the fact that they allow users to communicate and control different services easily through natural language. Chatbot development requires special expertise (e.g., machine learning and conversation design) that differs from the development of traditional software systems. At the same time, the challenges that chatbot developers face remain mostly unknown since most of the existing studies focus on proposing chatbots to perform particular tasks rather than their development. Therefore, in this paper, we examine the Q&A website, Stack Overflow, to provide insights on the topics that chatbot developers are interested in and the challenges they face. In particular, we leverage topic modeling to understand the topics that are being discussed by chatbot developers on Stack Overflow. Then, we examine the popularity and difficulty of those topics. Our results show that most of the chatbot developers are using Stack Overflow to ask about implementation guidelines. We determine 12 topics that developers discuss (e.g., Model Training) that fall into five main categories. Most of the posts belong to chatbot development, integration, and the natural language understanding (NLU) model categories. On the other hand, we find that developers consider posts on the topics of building and integrating chatbots more helpful than posts on other topics. Specifically, developers face challenges in the training of the chatbot's model. We believe that our study guides future research to propose techniques and tools to help the community at its early stages to overcome the most popular and difficult topics that practitioners face when developing chatbots.

141 citations

Proceedings Article•10.1145/3379597.3387491•

How Often Do Single-Statement Bugs Occur?: The ManySStuBs4J Dataset

Rafael-Michael Karampatsis¹, Charles Sutton¹•Institutions (1)

University of Edinburgh¹

29 Jun 2020

Abstract: Program repair is an important but difficult software engineering problem. One way to achieve acceptable performance is to focus on classes of simple bugs, such as bugs with single statement fixes, or that match a small set of bug templates. However, it is very difficult to estimate the recall of repair techniques for simple bugs, as there are no datasets about how often the associated bugs occur in code. To fill this gap, we provide a dataset of 153,652 single statement bug-fix changes mined from 1,000 popular open-source Java projects, annotated by whether they match any of a set of 16 bug templates, inspired by state-of-the-art program repair techniques. In an initial analysis, we find that about 33% of the simple bug fixes match the templates, indicating that a remarkable number of single-statement bugs can be repaired with a relatively small set of templates. Further, we find that template fitting bugs appear with a frequency of about one bug per 1,600-2,500 lines of code (as measured by the size of the project's latest version). We hope that the dataset will prove a resource for both future work in program repair and studies in empirical software engineering.

97 citations

Proceedings Article•10.1145/3379597.3387473•

The State of the ML-universe: 10 Years of Artificial Intelligence & Machine Learning Software Development on GitHub

Danielle Gonzalez¹, Thomas Zimmermann², Nachiappan Nagappan²•Institutions (2)

Rochester Institute of Technology¹, Microsoft²

29 Jun 2020

TL;DR: A large-scale empirical study of AI & ML Tool and Application repositories hosted on GitHub to identify unique properties, development patterns, and trends, enhanced with an elaborate study of developer workflow that measures collaboration and autonomy within a repository.

Abstract: In the last few years, artificial intelligence (AI) and machine learning (ML) have become ubiquitous terms. These powerful techniques have escaped obscurity in academic communities with the recent onslaught of AI & ML tools, frameworks, and libraries that make these techniques accessible to a wider audience of developers. As a result, applying AI & ML to solve existing and emergent problems is an increasingly popular practice. However, little is known about this domain from the software engineering perspective. Many AI & ML tools and applications are open source, hosted on platforms such as GitHub that provide rich tools for large-scale distributed software development. Despite widespread use and popularity, these repositories have never been examined as a community to identify unique properties, development patterns, and trends. In this paper, we conducted a large-scale empirical study of AI & ML Tool (700) and Application (4,524) repositories hosted on GitHub to develop such a characterization. While not the only platform hosting AI & ML development, GitHub facilitates collecting a rich data set for each repository with high traceability between issues, commits, pull requests and users. To compare the AI & ML community to the wider population of repositories, we also analyzed a set of 4,101 unrelated repositories. We enhance this characterization with an elaborate study of developer workflow that measures collaboration and autonomy within a repository. We've captured key insights of this community's 10-year history such as its primary language (Python) and most popular repositories (Tensorflow, Tesseract). Our findings show the AI & ML community has unique characteristics that should be accounted for in future research.

80 citations

Proceedings Article•10.1145/3379597.3387457•

Developer-Driven Code Smell Prioritization

Fabiano Pecorelli¹, Fabio Palomba¹, Foutse Khomh², Andrea De Lucia¹•Institutions (2)

University of Salerno¹, École Polytechnique de Montréal²

29 Jun 2020

TL;DR: This paper proposes an approach based on machine learning able to rank code smells according to the perceived criticality that developers assign to them and performs a first step toward the concept of developer-driven code smell prioritization.

Abstract: Code smells are symptoms of poor implementation choices applied during software evolution. While previous research has devoted effort to the definition of automated solutions to detect them, still little is known on how to support developers when prioritizing them. Some works attempted to deliver solutions that can rank smell instances based on their severity, computed on the basis of software metrics. However, this may not be enough since it has been shown that the recommendations provided by current approaches do not take the developer's perception of design issues into account. In this paper, we perform a first step toward the concept of developer-driven code smell prioritization and propose an approach based on machine learning able to rank code smells according to the perceived criticality that developers assign to them. We evaluate our technique in an empirical study to investigate its accuracy and the features that are more relevant for classifying the developer's perception. Finally, we compare our approach with a state-of-the-art technique. Key findings show that our solution has an F-Measure up to 85% and outperforms the baseline approach.

62 citations

Proceedings Article•10.1145/3379597.3387453•

Investigating Severity Thresholds for Test Smells

Davide Spadini¹, Martin Schvarcbacher, Ana-Maria Oprescu², Magiel Bruntink, Alberto Bacchelli³•Institutions (3)

Delft University of Technology¹, University of Amsterdam², University of Zurich³

29 Jun 2020

TL;DR: This work investigates the severity rating for four test smells and finds that current detection rules for certain test smells are considered as too strict by the developers and their newly defined severity thresholds are in line with the participants' perception of how test smells have an impact on the maintainability of a test suite.

Abstract: Test smells are poor design decisions implemented in test code, which can have an impact on the effectiveness and maintainability of unit tests. Even though test smell detection tools exist, how to rank the severity of the detected smells is an open research topic. In this work, we aim at investigating the severity rating for four test smells and investigate their perceived impact on test suite maintainability by the developers. To accomplish this, we first analyzed some 1,500 open-source projects to elicit severity thresholds for commonly found test smells. Then, we conducted a study with developers to evaluate our thresholds. We found that (1) current detection rules for certain test smells are considered as too strict by the developers and (2) our newly defined severity thresholds are in line with the participants' perception of how test smells have an impact on the maintainability of a test suite. Preprint [https://doi.org/10.5281/zenodo.3744281], data and material [https://doi.org/10.5281/zenodo.3611111].

57 citations

Proceedings Article•10.1145/3379597.3387467•

On the Prevalence, Impact, and Evolution of SQL Code Smells in Data-Intensive Systems

Biruk Asmare Muse¹, Mohammad Masudur Rahman¹, Csaba Nagy², Anthony Cleve³, Foutse Khomh¹, Giuliano Antoniol¹•Institutions (3)

École Polytechnique de Montréal¹, University of Lugano², Université de Namur³

29 Jun 2020

TL;DR: The results show that SQL code smells are indeed prevalent and persistent in the studied data-intensive software systems and have a weaker association with bugs than that of traditional code smells.

Abstract: Code smells indicate software design problems that harm software quality. Data-intensive systems that frequently access databases often suffer from SQL code smells besides the traditional smells. While there have been extensive studies on traditional code smells, recently, there has been a growing interest in SQL code smells. In this paper, we conduct an empirical study to investigate the prevalence and evolution of SQL code smells in open-source, data-intensive systems. We collected 150 projects and examined both traditional and SQL code smells in these projects. Our investigation delivers several important findings. First, SQL code smells are indeed prevalent in data-intensive software systems. Second, SQL code smells have a weak co-occurrence with traditional code smells. Third, SQL code smells have a weaker association with bugs than that of traditional code smells. Fourth, SQL code smells are more likely to be introduced at the beginning of the project lifetime and likely to be left in the code without a fix, compared to traditional code smells. Overall, our results show that SQL code smells are indeed prevalent and persistent in the studied data-intensive software systems. Developers should be aware of these smells and consider detecting and refactoring SQL code smells and traditional code smells separately, using dedicated tools.

47 citations

Proceedings Article•10.1145/3379597.3387478•

Detecting and Characterizing Bots that Commit Code

Tapajit Dey¹, Sara Mousavi¹, Eduardo Ponce¹, Tanner Fry¹, Bogdan Vasilescu², Anna Filippova, Audris Mockus¹•Institutions (2)

University of Tennessee¹, Carnegie Mellon University²

29 Jun 2020

TL;DR: BIMAN is a systematic approach to detect bots using author names, commit messages, files modified by the commit, and projects associated with the commits; the authors also characterize these bots based on the time patterns of their code commits and the types of files modified.

Abstract: Background: Some developer activity traditionally performed manually, such as making code commits, opening, managing, or closing issues is increasingly subject to automation in many OSS projects. Specifically, such activity is often performed by tools that react to events or run at specific times. We refer to such automation tools as bots and, in many software mining scenarios related to developer productivity or code quality, it is desirable to identify bots in order to separate their actions from actions of individuals. Aim: Find an automated way of identifying bots and code committed by these bots, and to characterize the types of bots based on their activity patterns. Method and Result: We propose BIMAN, a systematic approach to detect bots using author names, commit messages, files modified by the commit, and projects associated with the commits. For our test data, the value for AUC-ROC was 0.9. We also characterized these bots based on the time patterns of their code commits and the types of files modified, and found that they primarily work with documentation files and web pages, and these files are most prevalent in HTML and JavaScript ecosystems. We have compiled a shareable dataset containing detailed information about 461 bots we found (all of which have more than 1000 commits) and 13,762,430 commits they created.

46 citations

Proceedings Article•10.1145/3379597.3387503•

AndroZooOpen: Collecting Large-scale Open Source Android Apps for the Research Community

Pei Liu¹, Li Li¹, Yanjie Zhao¹, Xiaoyu Sun¹, John Grundy¹•Institutions (1)

Monash University¹

29 Jun 2020

TL;DR: A collection of open-source Android apps gathered from several sources, including Github, that currently contains over 45,000 app artefacts, a representative picture of Github-hosted Android apps.

Abstract: It is critical for research to have an open, well-curated, representative set of apps for analysis. We present a collection of open-source Android apps collected from several sources, including Github. Our dataset, AndroZooOpen, currently contains over 45,000 app artefacts, a representative picture of Github-hosted Android apps. For apps released on Google Play, metadata, including categories, ratings and user reviews, is also stored. We share this new dataset as part of our ongoing research to better support and enable new research topics involving Android app artefact analysis, and as a supplementary dataset for AndroZoo, a well-known app collection of closed-source Android apps.

45 citations

Proceedings Article•10.1145/3379597.3387493•

Software-related Slack Chats with Disentangled Conversations

Preetha Chatterjee¹, Kostadin Damevski², Nicholas A. Kraft, Lori Pollock¹•Institutions (2)

University of Delaware¹, Virginia Commonwealth University²

29 Jun 2020

TL;DR: This paper presents a dataset of software-related Q&A chat conversations, curated for two years from three open Slack communities (python, clojure, elm), and shares the code for a customized machine-learning based algorithm that automatically extracts conversations from the downloaded chat transcripts.

Abstract: More than ever, developers are participating in public chat communities to ask and answer software development questions. With over ten million daily active users, Slack is one of the most popular chat platforms, hosting many active channels focused on software development technologies, e.g., python, react. Prior studies have shown that public Slack chat transcripts contain valuable information, which could provide support for improving automatic software maintenance tools or help researchers understand developer struggles or concerns. In this paper, we present a dataset of software-related Q&A chat conversations, curated for two years from three open Slack communities (python, clojure, elm). Our dataset consists of 38,955 conversations, 437,893 utterances, contributed by 12,171 users. We also share the code for a customized machine-learning based algorithm that automatically extracts (or disentangles) conversations from the downloaded chat transcripts.

43 citations

Proceedings Article•10.1145/3379597.3387449•

Improved Automatic Summarization of Subroutines via Attention to File Context

Sakib Haque¹, Alexander LeClair¹, Lingfei Wu², Collin McMillan¹•Institutions (2)

University of Notre Dame¹, IBM²

29 Jun 2020

TL;DR: This paper presents an approach that models the file context of subroutines and uses an attention mechanism to find words and concepts to use in summaries and shows in an experiment that this approach extends and improves several recent baselines.

Abstract: Software documentation largely consists of short, natural language summaries of the subroutines in the software. These summaries help programmers quickly understand what a subroutine does without having to read the source code themselves. The task of writing these descriptions is called "source code summarization" and has been a target of research for several years. Recently, AI-based approaches have superseded older, heuristic-based approaches. Yet, to date these AI-based approaches assume that all the content needed to predict summaries is inside the subroutine itself. This assumption limits performance because many subroutines cannot be understood without surrounding context. In this paper, we present an approach that models the file context of subroutines (i.e. other subroutines in the same file) and uses an attention mechanism to find words and concepts to use in summaries. We show in an experiment that our approach extends and improves several recent baselines.

Proceedings Article•10.1145/3379597.3387459•

Beyond the Code: Mining Self-Admitted Technical Debt in Issue Tracker Systems

Laerte Xavier¹, Fabio Ferreira, Rodrigo Brito¹, Marco Tulio Valente¹•Institutions (1)

Universidade Federal de Minas Gerais¹

29 Jun 2020

TL;DR: The findings, drawn from a study of issue-based SATD (or just SATD-I), suggest that there is space for designing novel tools to support technical debt management, particularly tools that encourage developers to create and label issues containing TD concerns.

Abstract: Self-admitted technical debt (SATD) is a particular case of Technical Debt (TD) where developers explicitly acknowledge their sub-optimal implementation decisions. Previous studies mine SATD by searching for specific TD-related terms in source code comments. By contrast, in this paper we argue that developers can admit technical debt by other means, e.g., by creating issues in tracking systems and labelling them as referring to TD. We refer to this type of SATD as issue-based SATD or just SATD-I. We study a sample of 286 SATD-I instances collected from five open source projects, including Microsoft Visual Studio and GitLab Community Edition. We show that only 29% of the studied SATD-I instances can be tracked to source code comments. We also show that SATD-I issues take more time to be closed, compared to other issues, although they are not more complex in terms of code churn. Besides, in 45% of the studied issues TD was introduced to ship earlier, and in almost 60% it refers to DESIGN flaws. Finally, we report that most developers pay SATD-I to reduce its costs or interests (66%). Our findings suggest that there is space for designing novel tools to support technical debt management, particularly tools that encourage developers to create and label issues containing TD concerns.

Proceedings Article•10.1145/3379597.3387479•

The Scent of Deep Learning Code: An Empirical Study

Hadhemi Jebnoun¹, Houssem Ben Braiek¹, Mohammad Masudur Rahman¹, Foutse Khomh¹•Institutions (1)

École Polytechnique de Montréal¹

29 Jun 2020

TL;DR: This paper performs a comparative analysis between deep learning and traditional open-source applications collected from GitHub and finds that there is a co-existence between code smells and software bugs in the studied deep learning code, which confirms the conjecture on the degraded code quality of deep learning applications.

Abstract: Deep learning practitioners are often interested in improving their model accuracy rather than the interpretability of their models. As a result, deep learning applications are inherently complex in their structures. They also need to continuously evolve in terms of code changes and model updates. Given these confounding factors, there is a great chance of violating the recommended programming practices by the developers in their deep learning applications. In particular, the code quality might be negatively affected due to their drive for the higher model performance. Unfortunately, the code quality of deep learning applications has rarely been studied to date. In this paper, we conduct an empirical study to investigate the distribution of code smells in deep learning applications. To this end, we perform a comparative analysis between deep learning and traditional open-source applications collected from GitHub. We have several major findings. First, long lambda expression, long ternary conditional expression, and complex container comprehension smells are frequently found in deep learning projects. That is, deep learning code involves more complex or longer expressions than the traditional code does. Second, the number of code smells increases across the releases of deep learning applications. Third, we found that there is a co-existence between code smells and software bugs in the studied deep learning code, which confirms our conjecture on the degraded code quality of deep learning applications.

Proceedings Article•10.1145/3379597.3387477•

Characterizing and Identifying Composite Refactorings: Concepts, Heuristics and Patterns

Leonardo Sousa¹, Diego Cedrim², Alessandro Garcia³, Willian Oizumi³, Ana Carla Bibiano³, Daniel Oliveira³, Miryung Kim⁴, Anderson Oliveira³•Institutions (4)

Carnegie Mellon University¹, Amazon.com², Pontifical Catholic University of Rio de Janeiro³, University of California, Los Angeles⁴

29 Jun 2020

TL;DR: This study is the first to reveal that many smells are introduced in a program due to "incomplete" composite refactorings, and is also the first to reveal 111 patterns of composite refactorings that frequently introduce or remove certain smell types.

Abstract: Refactoring consists of a transformation applied to improve the program internal structure, for instance, by contributing to remove code smells. Developers often apply multiple interrelated refactorings called composite refactoring. Even though composite refactoring is a common practice, an investigation from different points of view on how composite refactoring manifests in practice is missing. Previous empirical studies also neglect how different kinds of composite refactorings affect the removal, prevalence or introduction of smells. To address these matters, we provide a conceptual framework and two heuristics to respectively characterize and identify composite refactorings within and across commits. Then, we mined the commit history of 48 GitHub software projects. We identified and analyzed 24,911 composite refactorings involving 104,505 single refactorings. Amongst several findings, we observed that most composite refactorings occur in the same commit and have the same refactoring type. We found that several refactorings are semantically related to each other, which occur in different parts of the system but are still related to the same task. Our study is the first to reveal that many smells are introduced in a program due to "incomplete" composite refactorings. Our study is also the first to reveal 111 patterns of composite refactorings that frequently introduce or remove certain smell types. These patterns can be used as guidelines for developers to improve their refactoring practices as well as for designers of recommender systems.

Proceedings Article•10.1145/3379597.3387500•

A Dataset and an Approach for Identity Resolution of 38 Million Author IDs extracted from 2B Git Commits

Tanner Fry¹, Tapajit Dey¹, Andrey Karnauch¹, Audris Mockus¹•Institutions (1)

University of Tennessee¹

29 Jun 2020

TL;DR: In this paper, the authors propose a method that finds all author IDs belonging to a single developer in this entire dataset, and share the list of author IDs that were found to have aliases.

Abstract: The data collected from open source projects provide means to model large software ecosystems, but often suffer from data quality issues, specifically, multiple author identification strings in code commits might actually be associated with one developer. While many methods have been proposed for addressing this problem, they are either heuristics requiring manual tweaking, or require too much calculation time to do pairwise comparisons for 38M author IDs in, for example, the World of Code collection. In this paper, we propose a method that finds all author IDs belonging to a single developer in this entire dataset, and share the list of all author IDs that were found to have aliases. To do this, we first create blocks of potentially connected author IDs and then use a machine learning model to predict which of these potentially related IDs belong to the same developer. We processed around 38 million author IDs and found around 14.8 million IDs to have an alias, which belong to 5.4 million different developers, with the median number of aliases being 2 per developer. This dataset can be used to create more accurate models of developer behaviour at the entire OSS ecosystem level and can be used to provide a service to rapidly resolve new author IDs.
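
The first step described in this abstract, grouping potentially connected author IDs into blocks before any pairwise comparison, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Name <email>` ID format, the choice of blocking keys, and every function name below are assumptions made for illustration, and the machine learning step that classifies candidate pairs is left out.

```python
import re
from collections import defaultdict
from itertools import combinations


def parse_author_id(author_id):
    """Split an ID of the assumed form 'Name <email>' into (name, email), lowercased."""
    match = re.match(r"^\s*(.*?)\s*<([^>]*)>\s*$", author_id)
    if match:
        return match.group(1).lower(), match.group(2).lower()
    # No angle brackets: treat the whole string as a name with no email.
    return author_id.strip().lower(), ""


def blocking_keys(author_id):
    """Return the (hypothetical) keys under which this ID is placed into blocks."""
    name, email = parse_author_id(author_id)
    keys = set()
    if name:
        keys.add(("name", name))
    if email:
        keys.add(("email", email))
        user = email.split("@")[0]
        if user:
            keys.add(("user", user))
    return keys


def candidate_pairs(author_ids):
    """Pair up only IDs that share at least one blocking key."""
    blocks = defaultdict(set)
    for author_id in author_ids:
        for key in blocking_keys(author_id):
            blocks[key].add(author_id)
    pairs = set()
    for members in blocks.values():
        # Sorting gives each pair one canonical order, so the set deduplicates.
        for first, second in combinations(sorted(members), 2):
            pairs.add((first, second))
    return pairs


if __name__ == "__main__":
    ids = [
        "Jane Doe <jane@example.com>",
        "jane doe <jdoe@corp.example>",
        "J. Doe <jane@example.com>",
        "Bob Roe <bob@example.org>",
    ]
    for pair in sorted(candidate_pairs(ids)):
        print(pair)
```

Blocking avoids comparing every ID with every other one: only IDs that land in the same block become candidate pairs, and a classifier would then decide which of those pairs belong to the same developer.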

Proceedings Article•10.1145/3379597.3387474•

A Large-Scale Comparative Evaluation of IR-Based Tools for Bug Localization

Shayan A. Akbar¹, Avinash C. Kak¹•Institutions (1)

Purdue University¹

29 Jun 2020

TL;DR: The results show that the third-generation tools are significantly superior to the older tools and that the word embeddings generated using code files written in one language are effective for retrieval from code libraries in other languages.

Abstract: This paper reports on a large-scale comparative evaluation of IR-based tools for automatic bug localization. We have divided the tools in our evaluation into the following three generations: (1) The first-generation tools, now over a decade old, that are based purely on the Bag-of-Words (BoW) modeling of software libraries. (2) The somewhat more recent second-generation tools that augment BoW-based modeling with two additional pieces of information: historical data, such as change history, and structured information such as class names, method names, etc. And, finally, (3) The third-generation tools that are currently the focus of much research and that also exploit proximity, order, and semantic relationships between the terms. It is important to realize that the original authors of all these three generations of tools have mostly tested them on relatively small-sized datasets that typically consisted of no more than a few thousand bug reports. Additionally, those evaluations only involved Java code libraries. The goal of the present paper is to present a comprehensive large-scale evaluation of all three generations of bug-localization tools with code libraries in multiple languages. Our study involves over 20,000 bug reports drawn from a diverse collection of Java, C/C++, and Python projects. Our results show that the third-generation tools are significantly superior to the older tools. We also show that the word embeddings generated using code files written in one language are effective for retrieval from code libraries in other languages.

Proceedings Article•10.1145/3379597.3387476•

Using Others' Tests to Identify Breaking Updates

Suhaib Mujahid¹, Rabe Abdalkareem², Emad Shihab¹, Shane McIntosh³•Institutions (3)

Concordia University¹, Queen's University², McGill University³

29 Jun 2020

TL;DR: A technique to detect breakage-inducing versions of third-party dependencies by leveraging the automated test suites of other projects that depend upon the same dependency to test newly released versions and finds that this proposed technique can detect six of the ten studied breakages.

Abstract: The reuse of third-party packages has become a common practice in contemporary software development. Software dependencies are constantly evolving with newly added features and patches that fix bugs in older versions. However, updating dependencies could introduce new bugs or break backward compatibility. In this work, we propose a technique to detect breakage-inducing versions of third-party dependencies. The key insight behind our approach is to leverage the automated test suites of other projects that depend upon the same dependency to test newly released versions. We conjecture that this crowd-based approach will help to detect breakage-inducing versions because it broadens the set of realistic usage scenarios to which a package version has been exposed. To evaluate our conjecture, we perform an empirical study of 391,553 npm packages. We use the dependency network from these packages to identify candidate tests of third-party packages. Moreover, to evaluate our proposed technique, we mine the history of this dependency network to identify ten breakage-inducing versions. We find that our proposed technique can detect six of the ten studied breakage-inducing versions. Our findings can help developers to make more informed decisions when they update their dependencies.
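
The key insight in this abstract, using the test suites of other dependents to check a newly released version, can be sketched in a few lines of Python. This is an illustration only, not the paper's technique as implemented: the pass/fail dictionaries, the function names, and the `min_regressions` threshold are all hypothetical, and actually installing a version and running each dependent's tests is outside the sketch.

```python
def regressed_dependents(passed_on_old, passed_on_new):
    """Return dependents whose tests passed with the old version but fail with the new one.

    Both arguments map a dependent's name to True (tests passed) or False (tests failed).
    Dependents with no result for the new version are ignored rather than counted.
    """
    regressed = []
    for dependent, ok_before in passed_on_old.items():
        ok_after = passed_on_new.get(dependent)
        if ok_before and ok_after is False:
            regressed.append(dependent)
    return sorted(regressed)


def is_breakage_candidate(passed_on_old, passed_on_new, min_regressions=1):
    """Flag the new version when enough dependents regress (threshold is an assumption)."""
    return len(regressed_dependents(passed_on_old, passed_on_new)) >= min_regressions


if __name__ == "__main__":
    old_results = {"app-a": True, "app-b": True, "app-c": False}
    new_results = {"app-a": True, "app-b": False, "app-c": False}
    print(regressed_dependents(old_results, new_results))
    print(is_breakage_candidate(old_results, new_results))
```

Comparing against the previous version matters here: a dependent whose tests already failed before the update (such as `app-c` above) says nothing about whether the new release introduced the breakage.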

Proceedings Article•10.1145/3379597.3387494•

GitterCom: A Dataset of Open Source Developer Communications in Gitter

[...]

Esteban Parra¹, Ashley Ellis¹, Sonia Haiduc¹•Institutions (1)

Florida State University¹

29 Jun 2020

TL;DR: A new dataset, called GitterCom, is presented, which aims to enable research in this direction and represents the largest manually labeled and curated dataset of Gitter developer messages.

Abstract: Team communication is essential for the development of modern software systems. For distributed software development teams, such as those found in many open source projects, this communication usually takes place using electronic tools. Among these, modern chat platforms such as Gitter are becoming the de facto choice for many software projects due to their advanced features geared towards software development and effective team communication. Gitter channels contain numerous messages exchanged by developers regarding the state of the project, issues and features of the system, team logistics, etc. These messages can contain important information to researchers studying open source software systems, developers new to a particular project and trying to get familiar with the software, etc. Therefore, uncovering what developers are communicating about through Gitter is an essential first step towards successfully understanding and leveraging this information. We present a new dataset, called GitterCom, which aims to enable research in this direction and represents the largest manually labeled and curated dataset of Gitter developer messages. The dataset is comprised of 10,000 messages collected from 10 Gitter communities associated with the development of open source software. Each message was manually annotated and verified by two of the authors, capturing the purpose of the communication expressed by the message. While the dataset has not yet been used in any publication, we discuss how it can enable interesting research opportunities.

Proceedings Article•10.1145/3379597.3387469•

Automatically Granted Permissions in Android apps: An Empirical Study on their Prevalence and on the Potential Threats for Privacy

[...]

Paolo Calciati¹, Konstantin Kuznetsov, Alessandra Gorla¹, Andreas Zeller•Institutions (1)

IMDEA¹

29 Jun 2020

TL;DR: This paper runs an empirical study on 2,865,553 app releases and shows that in a representative app store more than ~17% of apps request at least once in their lifetime new dangerous permissions that the operating system grants without any user's approval.

Abstract: Developers continuously update their Android apps to keep up with competitors in the market. Such constant updates do not bother end users, since by default the Android platform automatically pushes the most recent compatible release on the device, unless there are major changes in the list of requested permissions that users have to explicitly grant. The lack of explicit user's approval for each application update, however, may lead to significant risks for the end user, as the new release may include new subtle behaviors which may be privacy-invasive. The introduction of permission groups in the Android permission model makes this problem even worse: if a user gives a single permission within a group, the application can silently request further permissions in this group with each update---without having to ask the user. In this paper, we explain the threat that permission groups may pose for the privacy of Android users. We run an empirical study on 2,865,553 app releases, and we show that in a representative app store more than ~17% of apps request at least once in their lifetime new dangerous permissions that the operating system grants without any user's approval. Our analyses show that apps actually use over 56% of such automatically granted permissions, although most of their descriptions do not explicitly explain for what purposes. Finally, our manual inspection reveals clear abuses of apps that leak sensitive data such as user's accurate location, list of contacts, history of phone calls, and emails which are protected by permissions that the user never explicitly acknowledges.

Proceedings Article•10.1145/3379597.3387496•

A Dataset for GitHub Repository Deduplication

[...]

Diomidis Spinellis¹, Zoe Kotti¹, Audris Mockus²•Institutions (2)

Athens University of Economics and Business¹, University of Tennessee²

29 Jun 2020

TL;DR: In this article, the authors provide a dataset of 10.6 million GitHub projects that are copies of others, and link each record with the project's ultimate parent, calculated as the connected components of an 18.2 million node and 12 million edge denoised graph created by directing edges to ultimate parents.

Abstract: GitHub projects can be easily replicated through the site's fork process or through a Git clone-push sequence. This is a problem for empirical software engineering, because it can lead to skewed results or mistrained machine learning models. We provide a dataset of 10.6 million GitHub projects that are copies of others, and link each record with the project's ultimate parent. The ultimate parents were derived from a ranking along six metrics. The related projects were calculated as the connected components of an 18.2 million node and 12 million edge denoised graph created by directing edges to ultimate parents. The graph was created by filtering out more than 30 hand-picked and 2.3 million pattern-matched clumping projects. Projects that introduced unwanted clumping were identified by repeatedly visualizing shortest path distances between unrelated important projects. Our dataset identified 30 thousand duplicate projects in an existing popular reference dataset of 1.8 million projects. An evaluation of our dataset against another created independently with different methods found a significant overlap, but also differences attributed to the operational definition of what projects are considered as related.
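The step of grouping related projects as connected components and linking each one to an ultimate parent can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the repository names and edges are made up, and a single hypothetical `stars` score replaces the six-metric ranking the abstract mentions, which is not specified here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def find(parent: Dict[str, str], x: str) -> str:
    """Return the representative of x's set, compressing the path."""
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root

def components(edges: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group nodes into connected components with a union-find structure."""
    parent: Dict[str, str] = {}
    for a, b in edges:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[ra] = rb
    groups: Dict[str, List[str]] = defaultdict(list)
    for node in parent:
        groups[find(parent, node)].append(node)
    return sorted(sorted(group) for group in groups.values())

def ultimate_parent(group: List[str], score: Dict[str, int]) -> str:
    """Pick the highest-scoring project; ties are broken by name."""
    return max(group, key=lambda p: (score.get(p, 0), p))

# Hypothetical fork/clone relations and a hypothetical ranking metric.
edges = [
    ("alice/app", "bob/app"),
    ("bob/app", "carol/app"),
    ("dave/lib", "erin/lib"),
]
stars = {"alice/app": 120, "bob/app": 3, "carol/app": 0,
         "dave/lib": 1, "erin/lib": 40}

# Map every project to the ultimate parent of its component.
mapping = {
    project: ultimate_parent(group, stars)
    for group in components(edges)
    for project in group
}
```

Union-find is used because it handles edge lists incrementally and scales to large graphs; any other connected-components routine would serve the same purpose.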

Proceedings Article•10.1145/3379597.3387446•

Can We Use SE-specific Sentiment Analysis Tools in a Cross-Platform Setting?

[...]

Nicole Novielli¹, Fabio Calefato¹, Davide Dongiovanni¹, Daniela Girardi¹, Filippo Lanubile¹•Institutions (1)

University of Bari¹

29 Jun 2020

TL;DR: In this article, the authors evaluate the performance of four SE-specific tools in a cross-platform setting, i.e., on a test set collected from data sources different from the one used for training.

Abstract: In this paper, we address the problem of using sentiment analysis tools 'off-the-shelf', that is when a gold standard is not available for retraining. We evaluate the performance of four SE-specific tools in a cross-platform setting, i.e., on a test set collected from data sources different from the one used for training. We find that (i) the lexicon-based tools outperform the supervised approaches retrained in a cross-platform setting and (ii) retraining can be beneficial in within-platform settings in the presence of robust gold standard datasets, even using a minimal training set. Based on our empirical findings, we derive guidelines for reliable use of sentiment analysis tools in software engineering.

Proceedings Article•10.1145/3379597.3387458•

RTPTorrent: An Open-source Dataset for Evaluating Regression Test Prioritization

[...]

Toni Mattis¹, Patrick Rein¹, Falco Dürsch¹, Robert Hirschfeld¹•Institutions (1)

Hasso Plattner Institute¹

29 Jun 2020

TL;DR: A new dataset, named RTPTorrent, based on 20 open-source Java programs is described, which allows researchers to evaluate prioritization heuristics based on version control meta-data, source code, and test results from fine-grained, automated builds over 9 years of development history.

Abstract: The software engineering practice of automated testing helps programmers find defects earlier during development. With growing software projects and longer-running test suites, frequency and immediacy of feedback decline, thereby making defects harder to repair. Regression test prioritization (RTP) is concerned with running relevant tests earlier to lower the costs of defect localization and to improve feedback. Finding representative data to evaluate RTP techniques is non-trivial, as most software is published without failing tests. In this work, we systematically survey a wide range of RTP literature regarding whether their dataset uses real or synthetic defects or tests, whether they are publicly available, and whether datasets are reused. We observed that some datasets are reused; however, many studies examine only a few projects, and these rarely resemble real-world development activity. In light of these threats to ecological validity, we describe the construction and characteristics of a new dataset, named RTPTorrent, based on 20 open-source Java programs. Our dataset allows researchers to evaluate prioritization heuristics based on version control meta-data, source code, and test results from fine-grained, automated builds over 9 years of development history. We provide reproducible baselines for initial comparisons and make all data publicly available. We see this as a step towards better reproducibility, ecological validity, and long-term availability of studied software in the field of test prioritization.

Proceedings Article•10.1145/3379597.3387499•

A Complete Set of Related Git Repositories Identified via Community Detection Approaches Based on Shared Commits

[...]

Audris Mockus¹, Diomidis Spinellis², Zoe Kotti², Gabriel John Dusing¹•Institutions (2)

University of Tennessee¹, Athens University of Economics and Business²

29 Jun 2020

TL;DR: The Louvain community detection algorithm is applied to a very large graph consisting of links between commits and projects, which successfully reduces the size of the megacluster, with the largest group of highly interconnected projects containing under 400K repositories.

Abstract: In order to understand the state and evolution of the entirety of open source software we need to get a handle on the set of distinct software projects. Most open source projects presently utilize Git, which is a distributed version control system allowing easy creation of clones and resulting in numerous repositories that are almost entirely based on some parent repository from which they were cloned. Git commits are unlikely to be produced independently in unrelated repositories and thus represent a way to group cloned repositories. We use the World of Code infrastructure containing approximately 2B commits and 100M repositories to create and share such a map. We discover that the largest group contains almost 14M repositories, most of which are unrelated to each other. As it turns out, developers can push Git objects to an arbitrary repository or pull objects from unrelated repositories, thus linking unrelated repositories. To address this, we apply the Louvain community detection algorithm to this very large graph consisting of links between commits and projects. The approach successfully reduces the size of the megacluster, with the largest group of highly interconnected projects containing under 400K repositories. We expect that the resulting map of related projects as well as tools and methods to handle the very large graph will serve as a reference set for mining software projects and other applications. Further work is needed to determine different types of relationships among projects induced by shared commits and other relationships, for example, by shared source code or similar filenames.

Proceedings Article•10.1145/3379597.3387448•

AIMMX: Artificial Intelligence Model Metadata Extractor

[...]

Jason Tsay¹, Alan Braz¹, Martin Hirzel¹, Avraham Shinnar¹, Todd W. Mummert¹•Institutions (1)

IBM¹

29 Jun 2020

TL;DR: An exploratory analysis for data and method reproducibility over the models in the evaluation dataset and a catalog tool for discovering and managing models that enables simplified AI Model Metadata eXtraction from software repositories are presented.

Abstract: Despite all of the power that machine learning and artificial intelligence (AI) models bring to applications, much of AI development is currently a fairly ad hoc process. Software engineering and AI development share many of the same languages and tools, but AI development as an engineering practice is still in early stages. Mining software repositories of AI models enables insight into the current state of AI development. However, much of the relevant metadata around models are not easily extractable directly from repositories and require deduction or domain knowledge. This paper presents a library called AIMMX that enables simplified AI Model Metadata eXtraction from software repositories. The extractors have five modules for extracting AI model-specific metadata: model name, associated datasets, references, AI frameworks used, and model domain. We evaluated AIMMX against 7,998 open-source models from three sources: model zoos, arXiv AI papers, and state-of-the-art AI papers. Our platform extracted metadata with 87% precision and 83% recall. As preliminary examples of how AI model metadata extraction enables studies and tools to advance engineering support for AI development, this paper presents an exploratory analysis for data and method reproducibility over the models in the evaluation dataset and a catalog tool for discovering and managing models. Our analysis suggests that while data reproducibility may be relatively poor with 42% of models in our sample citing their datasets, method reproducibility is more common at 72% of models in our sample, particularly state-of-the-art models. Our collected models are searchable in a catalog that uses existing metadata to enable advanced discovery features for efficiently finding models.

Proceedings Article•10.1145/3379597.3387468•

A Study on the Accuracy of OCR Engines for Source Code Transcription from Programming Screencasts

[...]

Abdulkarim Malkadi¹, Mohammad Alahmadi¹, Sonia Haiduc¹•Institutions (1)

Florida State University¹

29 Jun 2020

TL;DR: This paper presents an empirical study on the accuracy of six OCR engines for the extraction of source code from screencasts and code images, and offers guidelines for programming screencast creators on which fonts to use to enable a better OCR recognition of their source code.

Abstract: Programming screencasts can be a rich source of documentation for developers. However, despite the availability of such videos, the information available in them, and especially the source code being displayed, is not easy to find, search, or reuse by programmers. Recent work has identified this challenge and proposed solutions that identify and extract source code from video tutorials in order to make it readily available to developers or other tools. A crucial component in these approaches is the Optical Character Recognition (OCR) engine used to transcribe the source code shown on screen. Previous work has simply chosen one OCR engine, without consideration for its accuracy or that of other engines on source code recognition. In this paper, we present an empirical study on the accuracy of six OCR engines for the extraction of source code from screencasts and code images. Our results show that the transcription accuracy varies greatly from one OCR engine to another and that the most widely chosen OCR engine in previous studies is far from the best choice. We also show how other factors, such as font type and size, can impact the results of some of the engines. We conclude by offering guidelines for programming screencast creators on which fonts to use to enable better OCR recognition of their source code, as well as advice on OCR choice for researchers aiming to analyze source code in screencasts.

Proceedings Article•10.1145/3379597.3387510•

The Software Heritage Graph Dataset: Large-scale Analysis of Public Software Development History

[...]

Antoine Pietri¹, Diomidis Spinellis², Stefano Zacchiroli³•Institutions (3)

French Institute for Research in Computer Science and Automation¹, Athens University of Economics and Business², University of Paris³

29 Jun 2020

TL;DR: Software Heritage is the largest existing public archive of software source code and accompanying development history, spanning more than five billion unique source code files and one billion unique commits, coming from more than 80 million software projects.

Abstract: Software Heritage is the largest existing public archive of software source code and accompanying development history. It spans more than five billion unique source code files and one billion unique commits, coming from more than 80 million software projects. These software artifacts were retrieved from major collaborative development platforms (e.g., GitHub, GitLab) and package repositories (e.g., PyPI, Debian, NPM), and stored in a uniform representation linking together source code files, directories, commits, and full snapshots of version control system (VCS) repositories as observed by Software Heritage during periodic crawls. This dataset is unique in terms of accessibility and scale, and allows researchers to explore a number of research questions on the long tail of public software development, instead of focusing solely on "most starred" repositories, as often happens.

Proceedings Article•10.1145/3379597.3387489•

On the Shoulders of Giants: A New Dataset for Pull-based Development Research

[...]

Xunhui Zhang¹, Ayushi Rastogi², Yue Yu¹•Institutions (2)

National University of Defense Technology¹, Delft University of Technology²

29 Jun 2020

TL;DR: A new dataset containing 96 features collected from 11,230 projects and 3,347,937 pull requests is presented, which is the most comprehensive and largest one toward a complete picture for pull-based development research.

Abstract: Pull-based development is a widely adopted paradigm for collaboration in distributed software development, attracting attention from both academia and industry. To better study the pull-based development model, this paper presents a new dataset containing 96 features collected from 11,230 projects and 3,347,937 pull requests. We describe the creation process and explain the features in detail. To the best of our knowledge, our dataset is the largest and most comprehensive one, working toward a complete picture for pull-based development research.

Proceedings Article•10.1145/3379597.3387464•

An Empirical Study on Regular Expression Bugs

[...]

Peipei Wang¹, Christopher S. Brown¹, Jamie A. Jennings¹, Kathryn T. Stolee¹•Institutions (1)

North Carolina State University¹

29 Jun 2020

TL;DR: By studying the code changes of regex-related pull requests, this paper observes that fixing regular expression bugs is nontrivial as it takes more time and more lines of code to fix them compared to the general pull requests.

Abstract: Understanding the nature of regular expression (regex) issues is important to tackle practical issues developers face in regular expression usage. Knowledge about the nature and frequency of various types of regular expression issues, such as those related to performance, API misuse, and code smells, can guide testing, inform documentation writers, and motivate refactoring efforts. However, beyond ReDoS (Regular expression Denial of Service), little is known about to what extent regular expression issues affect software development and how these issues are addressed in practice. This paper presents a comprehensive empirical study of 350 merged regex-related pull requests from Apache, Mozilla, Facebook, and Google GitHub repositories. Through classifying the root causes and manifestations of those bugs, we show that incorrect regular expression behavior is the dominant root cause of regular expression bugs (165/356, 46.3%). The remaining root causes are incorrect API usage (9.3%) and other code issues that require regular expression changes in the fix (29.5%). By studying the code changes of regex-related pull requests, we observe that fixing regular expression bugs is nontrivial as it takes more time and more lines of code to fix them compared to the general pull requests. The results of this study contribute to a broader understanding of the practical problems faced by developers when using regular expressions.

Proceedings Article•10.1145/3379597.3387441•

An Empirical Study of Method Chaining in Java

[...]

Tomoki Nakamaru¹, Tomomasa Matsunaga¹, Tetsuro Yamazaki¹, Soramichi Akiyama¹, Shigeru Chiba¹•Institutions (1)

University of Tokyo¹

29 Jun 2020

TL;DR: Whether method chaining is a programming style accepted by real-world programmers is investigated, and language features that are helpful to the method-chaining style but have not been supported yet in Java are explored.

Abstract: While some promote method chaining as a good practice for improving code readability, others refer to it as a bad practice that worsens code quality. In this paper, we first investigate whether method chaining is a programming style accepted by real-world programmers. To answer this question, we collected 2,814 Java repositories on GitHub and analyzed historical trends in the frequency of method chaining. The results of our analysis revealed the increasing use of method chaining; 23.1% of method invocations were part of method chains in 2018, whereas only 16.0% were such invocations in 2010. We then explore language features that are helpful to the method-chaining style but are not yet supported in Java. To this end, we conducted manual inspections of method chains randomly sampled from the collected repositories. We also estimated how effective these features would be in encouraging the method-chaining style if they were adopted in Java.

Proceedings Article•10.1145/3379597.3387443•

PUMiner: Mining Security Posts from Developer Question and Answer Websites with PU Learning

[...]

Triet H. M. Le¹, David Hin¹, Roland Croft¹, M. Ali Babar¹•Institutions (1)

University of Adelaide¹

29 Jun 2020

TL;DR: PUMiner builds a context-aware embedding model to extract features of the posts, and then develops a two-stage PU model to identify security content using the labelled Positive and Un-labelled posts to provide the largest and up-to-date security content on Q&A websites for practitioners and researchers.

Abstract: Security is an increasing concern in software development. Developer Question and Answer (Q&A) [...] however, the required negative (non-security) class is too expensive to obtain. We propose a novel learning framework, PUMiner, to automatically mine security posts from Q&A websites. PUMiner builds a context-aware embedding model to extract features of the posts, and then develops a two-stage PU model to identify security content using the labelled Positive and Un-labelled posts. We evaluate PUMiner on more than 17.2 million posts on Stack Overflow and 52,611 posts on Security StackExchange. We show that PUMiner is effective, with a validation performance of at least 0.85 across all model configurations. Moreover, the Matthews Correlation Coefficient (MCC) of PUMiner is 0.906, 0.534 and 0.084 points higher than one-class SVM, positive-similarity filtering, and one-stage PU models on unseen testing posts, respectively. PUMiner also performs well, with an MCC of 0.745, in scenarios where string matching totally fails. Even when the ratio of the labelled positive posts to the un-labelled ones is only 1:100, PUMiner still achieves a strong MCC of 0.65, which is 160% better than fully-supervised learning. Using PUMiner, we provide the largest and most up-to-date security content on Q&A websites for practitioners and researchers.
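For readers unfamiliar with the Matthews Correlation Coefficient reported in this abstract, the following Python sketch computes it from a binary confusion matrix using the standard definition. The counts in the example are made up for illustration and are not results from the paper; returning 0.0 for a zero denominator is a common convention adopted here as an assumption.

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient for a binary confusion matrix.

    Ranges from -1 (total disagreement) through 0 (no better than chance)
    to +1 (perfect prediction).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention (assumed here): an undefined MCC is reported as 0.0.
    return numerator / denominator if denominator else 0.0

# Hypothetical counts: security posts are the positive class.
example = mcc(tp=90, tn=85, fp=15, fn=10)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which makes it a reasonable choice when the positive (security) class is much rarer than the negative one.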
