TL;DR: To understand good and bad practices used in the development of real notebooks, 1.4 million notebooks from GitHub are studied and a detailed analysis of their characteristics that impact reproducibility is presented.
Abstract: Jupyter Notebooks have been widely adopted by many different communities, both in science and industry. They support the creation of literate programming documents that combine code, text, and execution results with visualizations and all sorts of rich media. The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourage poor coding practices, and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we studied 1.4 million notebooks from GitHub. We present a detailed analysis of their characteristics that impact reproducibility. We also propose a set of best practices that can improve the rate of reproducibility and discuss open challenges that require further research and development.
TL;DR: This paper proposes an end-to-end deep learning framework, named DeepJIT, that automatically extracts features from commit messages and code changes and use them to identify defects.
Abstract: Software quality assurance efforts often focus on identifying defective code. To find likely defective code early, change-level defect prediction - aka. Just-In-Time (JIT) defect prediction - has been proposed. JIT defect prediction models identify likely defective changes and they are trained using machine learning techniques with the assumption that historical changes are similar to future ones. Most existing JIT defect prediction approaches make use of manually engineered features. Unlike those approaches, in this paper, we propose an end-to-end deep learning framework, named DeepJIT, that automatically extracts features from commit messages and code changes and use them to identify defects. Experiments on two popular software projects (i.e., QT and OPENSTACK) on three evaluation settings (i.e., cross-validation, short-period, and long-period) show that the best variant of DeepJIT (DeepJIT-Combined), compared with the best performing state-of-the-art approach, achieves improvements of 10.36--11.02% for the project QT and 9.51--13.69% for the project OPENSTACK in terms of the Area Under the Curve (AUC).
TL;DR: This paper reports on the experience of deploying a new deep learning tree-based defect prediction model built upon the tree-structured Long Short Term Memory network which directly matches with the Abstract Syntax Tree representation of source code.
Abstract: Defects are common in software systems and cause many problems for software users. Different methods have been developed to make early prediction about the most likely defective modules in large codebases. Most focus on designing features (e.g. complexity metrics) that correlate with potentially defective code. Those approaches however do not sufficiently capture the syntax and multiple levels of semantics of source code, a potentially important capability for building accurate prediction models. In this paper, we report on our experience of deploying a new deep learning tree-based defect prediction model in practice. This model is built upon the tree-structured Long Short Term Memory network which directly matches with the Abstract Syntax Tree representation of source code. We discuss a number of lessons learned from developing the model and evaluating it on two datasets, one from open source projects contributed by our industry partner Samsung and the other from the public PROMISE repository.
TL;DR: In this paper, the authors present a dataset that maps 624 publicly disclosed vulnerabilities affecting 205 distinct open-source Java projects, used in SAP products or internal tools, onto the 1282 commits that fix them, out of which 29 do not have a CVE (Common Vulnerability and Exposure) identifier at all, and 46 which do have such identifier assigned by a numbering authority, are not available in the NVD yet.
Abstract: Advancing our understanding of software vulnerabilities, automating their identification, the analysis of their impact, and ultimately their mitigation is necessary to enable the development of software that is more secure. While operating a vulnerability assessment tool, which we developed, and that is currently used by hundreds of development units at SAP, we manually collected and curated a dataset of vulnerabilities of open-source software, and the commits fixing them. The data were obtained both from the National Vulnerability Database (NVD), and from project-specific web resources, which we monitor on a continuous basis. From that data, we extracted a dataset that maps 624 publicly disclosed vulnerabilities affecting 205 distinct open-source Java projects, used in SAP products or internal tools, onto the 1282 commits that fix them. Out of 624 vulnerabilities, 29 do not have a CVE (Common Vulnerability and Exposure) identifier at all, and 46, which do have such identifier assigned by a numbering authority, are not available in the NVD yet. The dataset is released under an open-source license, together with supporting scripts that allow researchers to automatically retrieve the actual content of the commits from the corresponding repositories, and to augment the attributes available for each instance. Moreover, these scripts allow to complement the dataset with additional instances that are not security fixes (which is useful, for example, in machine learning applications). Our dataset has been successfully used to train classifiers that could automatically identify security-relevant commits in code repositories. The release of this dataset and the supporting code as open-source will allow future research to be based on data of industrial relevance; it also represents a concrete step towards making the maintenance of this dataset a shared effort involving open-source communities, academia, and the industry.
TL;DR: There is no evidence that projects switch to semantic versioning on a large scale, and it is found that many package managers support — and the respective community adapts — flexible versioning practices, but this does not always work.
Abstract: Many modern software systems are built on top of existing packages (modules, components, libraries). The increasing number and complexity of dependencies has given rise to automated dependency management where package managers resolve symbolic dependencies against a central repository. When declaring dependencies, developers face various choices, such as whether or not to declare a fixed version or a range of versions. The former results in runtime behaviour that is easier to predict, whilst the latter enables flexibility in resolution that can, for example, prevent different versions of the same package being included and facilitates the automated deployment of bug fixes. We study the choices developers make across 17 different package managers, investigating over 70 million dependencies. This is complemented by a survey of 170 developers. We find that many package managers support --- and the respective community adapts --- flexible versioning practices. This does not always work: developers struggle to find the sweet spot between the predictability of fixed version dependencies, and the agility of flexible ones, and depending on their experience, adjust practices. We see some uptake of semantic versioning in some package managers, supported by tools. However, there is no evidence that projects switch to semantic versioning on a large scale. The results of this study can guide further research into better practices for automated dependency management, and aid the adaptation of semantic versioning.
TL;DR: This paper has created a reliable Android malware dataset containing 9,133 samples that belong to 56 malware families with high confidence that will boost a series of research studies including Android malware detection and classification, mining apps for anomalies, and app store mining, etc.
Abstract: A large number of research studies have been focused on detecting Android malware in recent years. As a result, a reliable and large-scale malware dataset is essential to build effective malware classifiers and evaluate the performance of different detection techniques. Although several Android malware benchmarks have been widely used in our research community, these benchmarks face several major limitations. First, most of the existing datasets are outdated and cannot reflect current malware evolution trends. Second, most of them only rely on VirusTotal to label the ground truth of malware, while some anti-virus engines on VirusTotal may not always report reliable results. Third, all of them only contain the apps themselves (apks), while other important app information (e.g., app description, user rating, and app installs) is missing, which greatly limits the usage scenarios of these datasets. In this paper, we have created a reliable Android malware dataset based on Google Play's app maintenance results over several years. We first created four snapshots of Google Play in 2014, 2015, 2017 and 2018 respectively. Then we use VirusTotal to label apps with possible sensitive behaviors, and monitor these apps on Google Play to see whether Google has removed them or not. Based on this approach, we have created a malware dataset containing 9,133 samples that belong to 56 malware families with high confidence. We believe this dataset will boost a series of research studies including Android malware detection and classification, mining apps for anomalies, and app store mining, etc.
TL;DR: PtrGNCMsg, a novel approach which is based on an improved sequence-to-sequence model with the pointer-generator network to translate code diffs into commit messages outperforms recent approaches based on neural machine translation, and first enables the prediction of OOV words.
Abstract: The commit messages in source code repositories are valuable but not easy to be generated manually in time for tracking issues, reporting bugs, and understanding codes. Recently published works indicated that the deep neural machine translation approaches have drawn considerable attentions on automatic generation of commit messages. However, they could not deal with out-of-vocabulary (OOV) words, which are essential context-specific identifiers such as class names and method names in code diffs. In this paper, we propose PtrGNCMsg, a novel approach which is based on an improved sequence-to-sequence model with the pointer-generator network to translate code diffs into commit messages. By searching the smallest identifier set with the highest probability, PtrGNCMsg outperforms recent approaches based on neural machine translation, and first enables the prediction of OOV words. The experimental results based on the corpus of diffs and manual commit messages from the top 2,000 Java projects in GitHub show that PtrGNCMsg outperforms the state-of-the-art approach with improved BLEU by 1.02, ROUGE-1 by 4.00 and ROUGE-L by 3.78, respectively.
TL;DR: The study is designed to investigate the availability of information that has been successfully mined from other developer communications, particularly Stack Overflow, and the prevalence of useful information, including API mentions and code snippets with descriptions, and several hurdles that need to be overcome to automate mining that information.
Abstract: Modern software development communities are increasingly social. Popular chat platforms such as Slack host public chat communities that focus on specific development topics such as Python or Ruby-on-Rails. Conversations in these public chats often follow a Q&A format, with someone seeking information and others providing answers in chat form. In this paper, we describe an exploratory study into the potential use-fulness and challenges of mining developer Q&A conversations for supporting software maintenance and evolution tools. We designed the study to investigate the availability of information that has been successfully mined from other developer communications, particularly Stack Overflow. We also analyze characteristics of chat conversations that might inhibit accurate automated analysis. Our results indicate the prevalence of useful information, including API mentions and code snippets with descriptions, and several hurdles that need to be overcome to automate mining that information.
TL;DR: This paper presents a clone detection method based on semi-supervised machine learning designed to detect clones across programming languages with similar syntax, using an unsupervised learning approach to learn token-level vector representations and an LSTM-based neural network to predict whether two code fragments are clones.
Abstract: Clone detection across programs written in the same programming language has been studied extensively in the literature. On the contrary, the task of detecting clones across multiple programming languages has not been studied as much, and approaches based on comparison cannot be directly applied. In this paper, we present a clone detection method based on semi-supervised machine learning designed to detect clones across programming languages with similar syntax. Our method uses an unsupervised learning approach to learn token-level vector representations and an LSTM-based neural network to predict whether two code fragments are clones. To train our network, we present a cross-language code clone dataset --- which is to the best of our knowledge the first of its kind --- containing around 45,000 code fragments written in Java and Python. We evaluate our approach on the dataset we created and show that our method gives promising results when detecting similarities between code fragments written in Java and Python.
TL;DR: It is found that topic generation with Latent Dirichlet Allocation can suggest more appropriate tags that can make a machine learning post more visible and thus can help in receiving immediate feedback from sites like SO.
Abstract: Machine learning, a branch of Artificial Intelligence, is now popular in software engineering community and is successfully used for problems like bug prediction, and software development effort estimation. Developers' understanding of machine learning, however, is not clear, and we require investigation to understand what educators should focus on, and how different online programming discussion communities can be more helpful. We conduct a study on Stack Overflow (SO) machine learning related posts using the SOTorrent dataset. We found that some machine learning topics are significantly more discussed than others, and others need more attention. We also found that topic generation with Latent Dirichlet Allocation (LDA) can suggest more appropriate tags that can make a machine learning post more visible and thus can help in receiving immediate feedback from sites like SO.
TL;DR: In this article, a broad study of parameters to arrive at good local optima for GitHub and Stack Overflow text corpora, an a-posteriori characterisation of text corpus related to eight programming languages, and an analysis of corpus feature importance via per-corpus LDA configuration.
Abstract: Software repositories contain large amounts of textual data, ranging from source code comments and issue descriptions to questions, answers, and comments on Stack Overflow. To make sense of this textual data, topic modelling is frequently used as a text-mining tool for the discovery of hidden semantic structures in text bodies. Latent Dirichlet allocation (LDA) is a commonly used topic model that aims to explain the structure of a corpus by grouping texts. LDA requires multiple parameters to work well, and there are only rough and sometimes conflicting guidelines available on how these parameters should be set. In this paper, we contribute (i) a broad study of parameters to arrive at good local optima for GitHub and Stack Overflow text corpora, (ii) an a-posteriori characterisation of text corpora related to eight programming languages, and (iii) an analysis of corpus feature importance via per-corpus LDA configuration. We find that (1) popular rules of thumb for topic modelling parameter configuration are not applicable to the corpora used in our experiments, (2) corpora sampled from GitHub and Stack Overflow have different characteristics and require different configurations to achieve good model fit, and (3) we can predict good configurations for unseen corpora reliably. These findings support researchers and practitioners in efficiently determining suitable configurations for topic modelling when analysing textual data contained in software repositories.
TL;DR: The proposed systematic approach can effectively tackle the concept drift issue of the SVs' descriptions reported from 2000 to 2018 in NVD even without retraining the model and performs competitively compared to the existing word-only method.
Abstract: Software Engineering researchers are increasingly using Natural Language Processing (NLP) techniques to automate Software Vulnerabilities (SVs) assessment using the descriptions in public repositories. However, the existing NLP-based approaches suffer from concept drift. This problem is caused by a lack of proper treatment of new (out-of-vocabulary) terms for the evaluation of unseen SVs over time. To perform automated SVs assessment with concept drift using SVs' descriptions, we propose a systematic approach that combines both character and word features. The proposed approach is used to predict seven Vulnerability Characteristics (VCs). The optimal model of each VC is selected using our customized time-based cross-validation method from a list of eight NLP representations and six well-known Machine Learning models. We have used the proposed approach to conduct large-scale experiments on more than 100,000 SVs in the National Vulnerability Database (NVD). The results show that our approach can effectively tackle the concept drift issue of the SVs' descriptions reported from 2000 to 2018 in NVD even without retraining the model. In addition, our approach performs competitively compared to the existing word-only method. We also investigate how to build compact concept-drift-aware models with much fewer features and give some recommendations on the choice of classifiers and NLP representations for SVs assessment.
TL;DR: ACRYL learns from changes implemented in other apps in response to API changes, which allows not only to detect compatibility issues, but also to suggest a fix, and points to the future possibility of combining the two approaches, trying to learn detection/fixing rules on both the API and the client side.
Abstract: Android apps are inextricably linked to the official Android APIs. Such a strong form of dependency implies that changes introduced in new versions of the Android APIs can severely impact the apps' code, for example because of deprecated or removed APIs. In reaction to those changes, mobile app developers are expected to adapt their code and avoid compatibility issues. To support developers, approaches have been proposed to automatically identify API compatibility issues in Android apps. The state-of-the-art approach, named CiD, is a data-driven solution learning how to detect those issues by analyzing the changes in the history of Android APIs ("API side" learning). While it can successfully identify compatibility issues, it cannot recommend coding solutions. We devised an alternative data-driven approach, named ACRyL. ACRyL learns from changes implemented in other apps in response to API changes ("client side" learning). This allows not only to detect compatibility issues, but also to suggest a fix. When empirically comparing the two tools, we found that there is no clear winner, since the two approaches are highly complementary, in that they identify almost disjointed sets of API compatibility issues. Our results point to the future possibility of combining the two approaches, trying to learn detection/fixing rules on both the API and the client side.
TL;DR: This work proposes to mine a large dataset of updates within about 40 000 real-world app lineages to infer API usage rules, and yields negative results on the assumption that API usage updates tend to correct misuses.
Abstract: Android app developers recurrently use crypto-APIs to provide data security to app users. Unfortunately, misuse of APIs only creates an illusion of security and even exposes apps to systematic attacks. It is thus necessary to provide developers with a statically-enforceable list of specifications of crypto-API usage rules. On the one hand, such rules cannot be manually written as the process does not scale to all available APIs. On the other hand, a classical mining approach based on common usage patterns is not relevant in Android, given that a large share of usages include mistakes. In this work, building on the assumption that "developers update API usage instances to fix misuses", we propose to mine a large dataset of updates within about 40 000 real-world app lineages to infer API usage rules. Eventually, our investigations yield negative results on our assumption that API usage updates tend to correct misuses. Actually, it appears that updates that fix misuses may be unintentional: the same misuses patterns are quickly re-introduced by subsequent updates.
TL;DR: MUDetect is designed, an API-misuse detector that builds on the strengths of existing detectors and tries to mitigate their weaknesses, and uses a new graph representation of API usages that captures different types of API misuses and a systematically designed ranking strategy that effectively improves precision.
Abstract: Application Programming Interfaces (APIs) often impose constraints such as call order or preconditions. API misuses, i.e., usages violating these constraints, may cause software crashes, data-loss, and vulnerabilities. Researchers developed several approaches to detect API misuses, typically still resulting in low recall and precision. In this work, we investigate ways to improve API-misuse detection. We design MuDetect, an API-misuse detector that builds on the strengths of existing detectors and tries to mitigate their weaknesses. MuDetect uses a new graph representation of API usages that captures different types of API misuses and a systematically designed ranking strategy that effectively improves precision. Evaluation shows that MuDetect identifies real-world API misuses with twice the recall of previous detectors and 2.5x higher precision. It even achieves almost 4x higher precision and recall, when mining patterns across projects, rather than from only the target project.
TL;DR: This work proposes an approach for energy-aware development that combines the construction of application-independent energy profiles of Java collections and static analysis to produce an estimate of in which ways and how intensively a system employs these collections.
Abstract: Over the last years, increasing attention has been given to creating energy-efficient software systems. However, developers still lack the knowledge and the tools to support them in that task. In this work, we explore our vision that energy consumption non-specialists can build software that consumes less energy by alternating, at development time, between third-party, readily available, diversely-designed pieces of software, without increasing the development complexity. To support our vision, we propose an approach for energy-aware development that combines the construction of application-independent energy profiles of Java collections and static analysis to produce an estimate of in which ways and how intensively a system employs these collections. By combining these two pieces of information, it is possible to produce energy-saving recommendations for alternative collection implementations to be used in different parts of the system. We implement this approach in a tool named CT+ that works with both desktop and mobile Java systems, and is capable of analyzing 40 different collection implementations of lists, maps, and sets. We applied CT+ to twelve software systems: two mobile-based, seven desktop-based, and three that can run in both environments. Our evaluation infrastructure involved a high-end server, a notebook, and three mobile devices. When applying the (mostly trivial) recommendations, we achieved up to 17.34% reduction in energy consumption just by replacing collection implementations. Even for a real world, mature, highly-optimized system such as Xalan, CT+ could achieve a 5.81% reduction in energy consumption. Our results indicate that some widely used collections, e.g., ArrayList, HashMap, and HashTable, are not energy-efficient and sometimes should be avoided when energy consumption is a major concern.
TL;DR: Two empirical studies conducted for the combination of Scala and ScalaTest show that test smells have a low diffusion across test classes, that the most frequently occurring test smells are Lazy Test, Eager Test, and Assertion Roulette, and that many developers were able to perceive but not to identify the smells.
Abstract: Test smells are, analogously to code smells, defined as the characteristics exhibited by poorly designed unit tests. Their negative impact on test effectiveness, understanding, and maintenance has been demonstrated by several empirical studies. However, the scope of these studies has been limited mostly to JAVA in combination with the JUNIT testing framework. Results for other language and framework combinations are ---despite their prevalence in practice--- few and far between, which might skew our understanding of test smells. The combination of Scala and ScalaTest, for instance, offers more comprehensive means for defining and reusing test fixtures, thereby possibly reducing the diffusion and perception of fixture-related test smells. This paper therefore reports on two empirical studies conducted for this combination. In the first study, we analyse the tests of 164 open-source Scala projects hosted on GitHub for the diffusion of test smells. This required the transposition of their original definition to this new context, and the implementation of a tool (SoCRATES) for their automated detection. In the second study, we assess the perception and the ability of 14 Scala developers to identify test smells. For this context, our results show (i) that test smells have a low diffusion across test classes, (ii) that the most frequently occurring test smells are Lazy Test, Eager Test, and Assertion Roulette, and (iii) that many developers were able to perceive but not to identify the smells.
TL;DR: The Maven Dependency Graph as discussed by the authors is a dataset of 2.8M artifacts from the Maven Central Repository with metadata such as exact version, date of upload and list of dependencies towards other artifacts.
Abstract: The Maven Central Repository provides an extraordinary source of data to understand complex architecture and evolution phenomena among Java applications. As of September 6, 2018, this repository includes 2.8M artifacts (compiled piece of code implemented in a JVM-based language), each of which is characterized with metadata such as exact version, date of upload and list of dependencies towards other artifacts. Today, one who wants to analyze the complete ecosystem of Maven artifacts and their dependencies faces two key challenges: (i) this is a huge data set; and (ii) dependency relationships among artifacts are not modeled explicitly and cannot be queried. In this paper, we present the Maven Dependency Graph. This open source data set provides two contributions: a snapshot of the whole Maven Central taken on September 6, 2018, stored in a graph database in which we explicitly model all dependencies; an open source infrastructure to query this huge dataset.
TL;DR: In this paper, the authors hypothesize that the immutability of Maven artifacts and the ability to choose any version naturally support the emergence of software diversity within Maven Central.
Abstract: Maven artifacts are immutable: an artifact that is uploaded on Maven Central cannot be removed nor modified. The only way for developers to upgrade their library is to release a new version. Consequently, Maven Central accumulates all the versions of all the libraries that are published there, and applications that declare a dependency towards a library can pick any version. In this work, we hypothesize that the immutability of Maven artifacts and the ability to choose any version naturally support the emergence of software diversity within Maven Central. We analyze 1,487,956 artifacts that represent all the versions of 73,653 libraries. We observe that more than 30% of libraries have multiple versions that are actively used by latest artifacts. In the case of popular libraries, more than 50% of their versions are used. We also observe that more than 17% of libraries have several versions that are significantly more used than the other versions. Our results indicate that the immutability of artifacts in Maven Central does support a sustained level of diversity among versions of libraries in the repository.
TL;DR: A new dataset that includes 1,558 mature Github projects that develop Python software for Data Science tasks, which use a diverse set of machine learning libraries and managed by a variety of users and organizations is created.
Abstract: The popularity of Python programming language has surged in recent years due to its increasing usage in Data Science. The availability of Python repositories in Github presents an opportunity for mining software repository research, e.g., suggesting the best practices in developing Data Science applications, identifying bug-patterns, recommending code enhancements, etc. To enable this research, we have created a new dataset that includes 1,558 mature Github projects that develop Python software for Data Science tasks. By analyzing the metadata and code, we have included the projects in our dataset which use a diverse set of machine learning libraries and managed by a variety of users and organizations. The dataset is made publicly available through Boa infrastructure both as a collection of raw projects as well as in a processed form that could be used for performing large scale analysis using Boa language. We also present two initial applications to demonstrate the potential of the dataset that could be leveraged by the community.
TL;DR: The impact of two machine learning techniques, oversampling and undersampling of data, on the training of a sentiment classifier for handling small SE datasets with a skewed distribution are investigated.
Abstract: Sentiment analysis (SA) of text-based software artifacts is increasingly used to extract information for various tasks including providing code suggestions, improving development team productivity, giving recommendations of software packages and libraries, and recommending comments on defects in source code, code quality, possibilities for improvement of applications. Studies of state-of-the-art sentiment analysis tools applied to software-related texts have shown varying results based on the techniques and training approaches. In this paper, we investigate the impact of two potential opportunities to improve the training for sentiment analysis of SE artifacts in the context of the use of neural networks customized using the Stack Overflow data developed by Lin et al. We customize the process of sentiment analysis to the software domain, using software domain-specific word embeddings learned from Stack Overflow (SO) posts, and study the impact of software domain-specific word embeddings on the performance of the sentiment analysis tool, as compared to generic word embeddings learned from Google News. We find that the word embeddings learned from the Google News data performs mostly similar and in some cases better than the word embeddings learned from SO posts. We also study the impact of two machine learning techniques, oversampling and undersampling of data, on the training of a sentiment classifier for handling small SE datasets with a skewed distribution. We find that oversampling alone, as well as the combination of oversampling and undersampling together, helps in improving the performance of a sentiment classifier.
TL;DR: An empirical study using 529,054 code blocks collected from Python-related 44,966 answers posted on Stack Overflow finds user reputation not to relate with the presence of insecure code blocks, suggesting that both high and low-reputed users are likely to introduce secure code blocks.
Abstract: Despite being the most popular question and answer website for software developers, answers posted on Stack Overflow (SO) are susceptible to contain Python-related insecure coding practices. A systematic analysis on how frequently insecure coding practices appear in SO answers can help the SO community assess the prevalence of insecure Python code blocks in SO. An insecure coding practice is recurrent use of insecure coding patterns in Python. We conduct an empirical study using 529,054 code blocks collected from Python-related 44,966 answers posted on SO. We observe 7.1% of the 44,966 Python-related answers to include at least one insecure coding practice. The most frequently occurring insecure coding practice is code injection. We observe 9.8% of the 7,444 accepted answers to include at least one insecure code block. We also find user reputation not to relate with the presence of insecure code blocks, suggesting that both high and low-reputed users are likely to introduce insecure code blocks.
TL;DR: This work demonstrates that by combining word2vec with the power of MRF, it is possible to achieve improvements between 6% and 30% in retrieval accuracy over the best results that can be obtained with the more traditional applications of MRf to representations based on term and term-term frequencies.
Abstract: Word embeddings produced by the word2vec algorithm provide us with a strong mechanism to discover relationships between the words based on the degree to which they are contextually related to one another. In and of itself, algorithms like word2vec do not give us a mechanism to impose ordering constraints on the embedded word representations. Our main goal in this paper is to exploit the semantic word vectors obtained from word2vec in such a way that allows for the ordering constraints to be invoked on them when comparing a sequence of words in a query with a sequence of words in a file for source code retrieval. These ordering constraints employ the logic of Markov Random Fields (MRF), a framework used previously to enhance the precision of the source-code retrieval engines based on the Bag-of-Words (BoW) assumption. The work we present here demonstrates that by combining word2vec with the power of MRF, it is possible to achieve improvements between 6% and 30% in retrieval accuracy over the best results that can be obtained with the more traditional applications of MRF to representations based on term and term-term frequencies. The performance improvement was 30% for the Java AspectJ repository using only the titles of the bug reports provided by iBUGS, and 6% for the case of the Eclipse repository using titles as well as descriptions of the bug reports provided by BUGLinks.
TL;DR: An exploratory study on the reproducibility of the issues discussed in 400 questions of Stack Overflow finds that 68% of the code segments require minor and major modifications in order to reproduce the issues reported by the developers.
Abstract: Software developers often look for solutions to their code level problems at Stack Overflow. Hence, they frequently submit their questions with sample code segments and issue descriptions. Unfortunately, it is not always possible to reproduce their reported issues from such code segments. This phenomenon might prevent their questions from getting prompt and appropriate solutions. In this paper, we report an exploratory study on the reproducibility of the issues discussed in 400 questions of Stack Overflow. In particular, we parse, compile, execute and even carefully examine the code segments from these questions, spent a total of 200 man hours, and then attempt to reproduce their programming issues. The outcomes of our study are two-fold. First, we find that 68% of the code segments require minor and major modifications in order to reproduce the issues reported by the developers. On the contrary, 22% code segments completely fail to reproduce the issues. We also carefully investigate why these issues could not be reproduced and then provide evidence-based guidelines for writing effective code examples for Stack Overflow questions. Second, we investigate the correlation between issue reproducibility status (of questions) and corresponding answer meta-data such as the presence of an accepted answer. According to our analysis, a question with reproducible issues has at least three times higher chance of receiving an accepted answer than the question with irreproducible issues.
TL;DR: It is discovered that there is no single JavaScript code snippet without a rule violation, and rules related to stylistic issues are by far the most violated ones.
Abstract: Programming code snippets readily available on platforms such as StackOverflow are undoubtedly useful for software engineers. Unfortunately, these code snippets might contain issues such as deprecated, misused, or even buggy code. These issues could pass unattended, if developers do not have adequate knowledge, time, or tool support to catch them. In this work we expand the understanding of such issues (or the so called "violations") hidden in code snippets written in JavaScript, the programming language with the highest number of questions on StackOverflow. To characterize the violations, we extracted 336k code snippets from answers to JavaScript questions on StackOverflow and statically analyzed them using ESLinter, a JavaScript linter. We discovered that there is no single JavaScript code snippet without a rule violation. On average, our studied code snippets have 11 violations, but we found instances of more than 200 violations. In particular, rules related to stylistic issues are by far the most violated ones (82.9% of the violations pertain to this category). Possible errors, which developers might be more interested in, represent only 0.1% of the violations. Finally, we found a small fraction of code snippets flagged with possible errors being reused on actual GitHub software projects. Indeed, one single code snippet with possible errors was reused 1,261 times.
TL;DR: Git2Net as mentioned in this paper is a scalable python software that facilitates the extraction of fine-grained co-editing networks in large git repositories, where a link signifies that one developer has edited a block of source code originally written by another developer.
Abstract: Data from software repositories have become an important foundation for the empirical study of software engineering processes. A recurring theme in the repository mining literature is the inference of developer networks capturing e.g. collaboration, coordination, or communication, from the commit history of projects. Most of the studied networks are based on the co-authorship of software artefacts defined at the level of files, modules, or packages. While this approach has led to insights into the social aspects of software development, it neglects detailed information on code changes and code ownership, e.g. which exact lines of code have been authored by which developers, that is contained in the commit log of software projects. Addressing this issue, we introduce git2net, a scalable python software that facilitates the extraction of fine-grained co-editing networks in large git repositories. It uses text mining techniques to analyse the detailed history of textual modifications within files. This information allows us to construct directed, weighted, and time-stamped networks, where a link signifies that one developer has edited a block of source code originally written by another developer. Our tool is applied in case studies of an Open Source and a commercial software project. We argue that it opens up a massive new source of high-resolution data on human collaboration patterns.
TL;DR: The magnitude of the sleeping defects and of the snoring classes are analyzed, on data from 282 releases of six open source projects from the Apache ecosystem, to indicate that on all projects, most of the defects in a project slept for more than 20% of the existing releases.
Abstract: In order to develop and train defect prediction models, researchers rely on datasets in which a defect is often attributed to a release where the defect itself is discovered. However, in many circumstances, it can happen that a defect is only discovered several releases after its introduction. This might introduce a bias in the dataset, i.e., treating the intermediate releases as defect-free and the latter as defect-prone. We call this phenomenon as "sleeping defects". We call "snoring" the phenomenon where classes are affected by sleeping defects only, that would be treated as defect-free until the defect is discovered. In this paper we analyze, on data from 282 releases of six open source projects from the Apache ecosystem, the magnitude of the sleeping defects and of the snoring classes. Our results indicate that 1) on all projects, most of the defects in a project slept for more than 20% of the existing releases, and 2) in the majority of the projects the missing rate is more than 25% even if we remove the last 50% of releases.
TL;DR: NFBugs, a data set of 133 non-functional bug fixes collected from 65 open-source projects written in Java and Python, is introduced, which can be used to support code recommender systems focusing on non- functional properties.
Abstract: While several researchers have published bug data sets in the past, there has been less focus on bugs related to non-functional requirements. Non-functional requirements describe the quality attributes of a program. In this work, we introduce NFBugs, a data set of 133 non-functional bug fixes collected from 65 open-source projects written in Java and Python. NFBugs can be used to support code recommender systems focusing on non-functional properties.
TL;DR: This paper revises an initial case study comparing automatic and manually generated test suites using current tools as well as complementing their research method by evaluating these tools' ability in finding regressions.
Abstract: Good unit tests play a paramount role when it comes to foster and evaluate software quality. However, writing effective tests is an extremely costly and time consuming practice. To reduce such a burden for developers, researchers devised ingenious techniques to automatically generate test suite for existing code bases. Nevertheless, how automatically generated test cases fare against manually written ones is an open research question. In 2008, Bacchelli et.al. conducted an initial case study comparing automatic and manually generated test suites. Since in the last ten years we have witnessed a huge amount of work on novel approaches and tools for automatic test generation, in this paper we revise their study using current tools as well as complementing their research method by evaluating these tools' ability in finding regressions. Preprint [\url{https://doi.org/10.5281/zenodo.2595232}], dataset [\url{https://doi.org/10.6084/m9.figshare.7628642}].
TL;DR: This research proposes to generate easier-to-understand ESs by using not only structures of AST but also information of line differences, and confirmed that ESs generated by this methodology are more helpful to understand the differences of source code than GumTree.
Abstract: On development using a version control system, understanding differences of source code is important. Edit scripts (in short, ES) represent differences between two versions of source code. One of the tools generating ESs is GumTree. GumTree takes two versions of source code as input and generates an ES consisting of insert, delete, update and move nodes of abstract syntax tree (in short, AST). However, the accuracy of move and update actions generated by GumTree is insufficient, which makes ESs more difficult to understand. A reason why the accuracy is insufficient is that GumTree generates ESs from only information of AST. Thus, in this research, we propose to generate easier-to-understand ESs by using not only structures of AST but also information of line differences. To evaluate our methodology, we applied it to some open source software, and we confirmed that ESs generated by our methodology are more helpful to understand the differences of source code than GumTree.