TL;DR: This paper begins with a discussion of the infrastructure (including a novel use of Scientific Workflow software) and then discusses the approach to mining the email archives, and presents some preliminary results from the data analysis.
Abstract: Communication & Co-ordination activities are central to large software projects, but are difficult to observe and study in traditional (closed-source, commercial) settings because of the prevalence of informal, direct communication modes. OSS projects, on the other hand, use the internet as the communication medium, and typically conduct discussions in an open, public manner. As a result, the email archives of OSS projects provide a useful trace of the communication and co-ordination activities of the participants. However, there are various challenges that must be addressed before this data can be effectively mined. Once this is done, we can construct social networks of email correspondents, and begin to address some interesting questions. These include questions relating to participation in the email; the social status of different types of OSS participants; the relationship of email activity and commit activity (in the CVS repositories) and the relationship of social status with commit activity. In this paper, we begin with a discussion of our infrastructure (including a novel use of Scientific Workflow software) and then discuss our approach to mining the email archives; and finally we present some preliminary results from our data analysis.
TL;DR: This paper presents an API usage mining framework and its supporting tool, MAPO, which leverages existing source code search engines to gather relevant source files and then mines them; the preliminary results show that the framework is practical for providing informative and succinct API usage patterns.
Abstract: To improve software productivity, when constructing new software systems, developers often reuse existing class libraries or frameworks by invoking their APIs. Those APIs, however, are often complex and not well documented, posing barriers for developers to use them in new client code. To get familiar with how those APIs are used, developers may search the Web using a general search engine to find relevant documents or code examples. Developers can also use a source code search engine to search open source repositories for source files that use the same APIs. Nevertheless, the number of returned source files is often large. It is difficult for developers to learn API usages from a large number of returned results. In order to help developers understand API usages and write API client code more effectively, we have developed an API usage mining framework and its supporting tool called MAPO (for Mining API usages from Open source repositories). Given a query that describes a method, class, or package for an API, MAPO leverages the existing source code search engines to gather relevant source files and conducts data mining. The mining leads to a short list of frequent API usages for developers to inspect. MAPO currently consists of five components: a code search engine, a source code analyzer, a sequence preprocessor, a frequent sequence miner, and a frequent sequence post processor. We have examined the effectiveness of MAPO using a set of various queries. The preliminary results show that the framework is practical for providing informative and succinct API usage patterns.
TL;DR: This report computes the bug-fix time of files in ArgoUML and PostgreSQL by identifying when bugs are introduced and when they are fixed, and lists the top 20 bug-fix time files of the two projects.
Abstract: The number of bugs (or fixes) is a common factor used to measure the quality of software and assist bug related analysis. For example, if software files have many bugs, they may be unstable. In comparison, the bug-fix time--the time to fix a bug after the bug was introduced--is neglected. We believe that the bug-fix time is an important factor for bug related analysis, such as measuring software quality. For example, if bugs in a file take a relatively long time to be fixed, the file may have some structural problems that make it difficult to make changes. In this report, we compute the bug-fix time of files in ArgoUML and PostgreSQL by identifying when bugs are introduced and when the bugs are fixed. This report includes bug-fix time statistics such as average bug-fix time, and distributions of bug-fix time. We also list the top 20 bug-fix time files of two projects.
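A minimal sketch of the statistics described above, assuming the introduction and fix dates of each bug are already known per file. The record layout, the function names, and the sample data are hypothetical and are not taken from the report.

```python
from collections import defaultdict
from datetime import date


def bug_fix_days(records):
    """Group bug-fix times (in days) by file.

    records: iterable of (file, introduced, fixed) with datetime.date values.
    """
    per_file = defaultdict(list)
    for path, introduced, fixed in records:
        per_file[path].append((fixed - introduced).days)
    return per_file


def top_files_by_average(per_file, n=20):
    """Return the n files with the longest average bug-fix time, longest first."""
    averages = {path: sum(days) / len(days) for path, days in per_file.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)[:n]


# Hypothetical records, not data from ArgoUML or PostgreSQL.
records = [
    ("parser.c", date(2004, 1, 1), date(2004, 1, 11)),
    ("parser.c", date(2004, 3, 1), date(2004, 3, 31)),
    ("lexer.c", date(2004, 2, 1), date(2004, 2, 6)),
]
ranking = top_files_by_average(bug_fix_days(records))
```

The hard part in practice is deciding when a bug was introduced; this sketch takes that date as given.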
TL;DR: This work focuses on defect density prediction and presents an approach that applies a decision tree learner on evolution data extracted from the Mozilla open source web browser project, which includes different source code, modification, and defect measures computed from seven recent Mozilla releases.
Abstract: With the advent of open source software repositories the data available for defect prediction in source files increased tremendously. Although traditional statistics turned out to derive reasonable results, the sheer amount of data and the problem context of defect prediction demand sophisticated analysis such as provided by current data mining and machine learning techniques. In this work we focus on defect density prediction and present an approach that applies a decision tree learner on evolution data extracted from the Mozilla open source web browser project. The evolution data includes different source code, modification, and defect measures computed from seven recent Mozilla releases. Among the modification measures we also take into account the change coupling, a measure for the number of change-dependencies between source files. The main reason for choosing decision tree learners, instead of for example neural nets, was the goal of finding underlying rules which can be easily interpreted by humans. To find these rules, we set up a number of experiments to test common hypotheses regarding defects in software entities. Our experiments showed that a simple tree learner can produce good results with various sets of input data.
TL;DR: Initial results of the technique indicate that it is indeed useful to identify similar Java classes, and it successfully identifies the ex ante and ex post versions of refactored classes and provides some interesting insights into within-version and between-version dependencies of classes within a Java project.
Abstract: Similarity analysis of source code is helpful during development to provide, for instance, better support for code reuse. Consider a development environment that analyzes code while typing and that suggests similar code examples or existing implementations from a source code repository. Mining software repositories by means of similarity measures enables and enforces reusing existing code and reduces the development effort needed by creating a shared knowledge base of code fragments. In information retrieval similarity measures are often used to find documents similar to a given query document. This paper extends this idea to source code repositories. It introduces our approach to detect similar Java classes in software projects using tree similarity algorithms. We show how our approach makes it possible to find similar Java classes based on an evaluation of three tree-based similarity measures in the context of five user-defined test cases as well as a preliminary software evolution analysis of a medium-sized Java project. Initial results of our technique indicate that it (1) is indeed useful to identify similar Java classes, (2) successfully identifies the ex ante and ex post versions of refactored classes, and (3) provides some interesting insights into within-version and between-version dependencies of classes within a Java project.
TL;DR: Using data recovered from CVS, this study reveals that over time the percentage of commented functions remains constant except for early fluctuation due to the commenting style of a particular active developer.
Abstract: It is common, especially in large software systems, for developers to change code without updating its associated comments due to their unfamiliarity with the code or due to time constraints. This is a potential problem since outdated comments may confuse or mislead developers who perform future development. Using data recovered from CVS, we study the evolution of code comments in the PostgreSQL project. Our study reveals that over time the percentage of commented functions remains constant except for early fluctuation due to the commenting style of a particular active developer.
TL;DR: Two different techniques the authors have implemented in FindBugs for tracking defects across versions are discussed, their relative merits and how they can be incorporated into the software development process, and the results of tracking defect warnings across Sun's Java runtime library are discussed.
Abstract: Various static analysis tools will analyze a software artifact in order to identify potential defects, such as misused APIs, race conditions and deadlocks, and security vulnerabilities. For a number of reasons, it is important to be able to track the occurrence of each potential defect over multiple versions of a software artifact under study: in other words, to determine when warnings reported in multiple versions of the software all correspond to the same underlying issue. One motivation for this capability is to remember decisions about code that has been reviewed and found to be safe despite the occurrence of a warning. Another motivation is constructing warning deltas between versions, showing which warnings are new, which have persisted, and which have disappeared. This allows reviewers to focus their efforts on inspecting new warnings. Finally, tracking warnings through a series of software versions reveals where potential defects are introduced and fixed, and how long they persist, exposing interesting trends and patterns. We will discuss two different techniques we have implemented in FindBugs (a static analysis tool to find bugs in Java programs) for tracking defects across versions, discuss their relative merits and how they can be incorporated into the software development process, and discuss the results of tracking defect warnings across Sun's Java runtime library.
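The idea of a warning delta can be illustrated with a small sketch. It assumes that two warnings denote the same issue when their (warning type, class, method) keys are equal. This matching rule is a simplifying assumption made here for illustration and is not one of the two techniques implemented in FindBugs; the warning type names and class names are hypothetical.

```python
def warning_delta(old_warnings, new_warnings):
    """Classify warnings between two versions as new, persisted, or disappeared.

    Each warning is a hashable key; here (warning_type, class_name, method_name).
    """
    old_keys = set(old_warnings)
    new_keys = set(new_warnings)
    return {
        "new": new_keys - old_keys,
        "persisted": new_keys & old_keys,
        "disappeared": old_keys - new_keys,
    }


# Hypothetical warnings reported for two consecutive versions.
version_1 = [("NULL_DEREF", "Parser", "parse"), ("DEADLOCK", "Cache", "get")]
version_2 = [("NULL_DEREF", "Parser", "parse"), ("MISUSED_API", "Loader", "open")]
delta = warning_delta(version_1, version_2)
```

A key this simple breaks when a method is renamed or moved, which is one reason matching warnings across versions calls for more careful techniques.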
TL;DR: The annotation graph provides more fine-grained software evolution information such as life cycles of each line and related changes: "Whenever a developer changed line 1 of version.txt she also changed line 25 of Library.java."
Abstract: Files, classes, or methods have frequently been investigated in recent research on co-change. In this paper, we present a first study at the level of lines. To identify line changes across several versions, we define the annotation graph which captures how lines evolve over time. The annotation graph provides more fine-grained software evolution information such as life cycles of each line and related changes: "Whenever a developer changed line 1 of version.txt she also changed line 25 of Library.java."
TL;DR: This paper has taken the database of users registered at SourceForge, the largest libre software development web-based platform, and inferred their geographical locations, and shows a snapshot of the regional distribution of SourceForge users, which may be a good proxy of the actual distribution of libre software developers.
Abstract: The development of libre (free/open source) software is usually performed by geographically distributed teams. Participation in most cases is voluntary, sometimes sporadic, and often not framed by a pre-defined management structure. This means that anybody can contribute, and in principle no national origin has advantages over others, except for the differences in availability and quality of Internet connections and language. However, differences in participation across regions do exist, although there are few studies about them. In this paper we present some data which can be the basis for some of those studies. We have taken the database of users registered at SourceForge, the largest libre software development web-based platform, and have inferred their geographical locations. For this, we have applied several techniques and heuristics on the available data (mainly e-mail addresses and time zones), which are presented and discussed in detail. The results show a snapshot of the regional distribution of SourceForge users, which may be a good proxy of the actual distribution of libre software developers. In addition, the methodology may be of interest for similar studies in other domains, when the available data is similar (as is the case of mailing lists related to software projects).
TL;DR: An open framework for visual mining of CVS software repositories is presented, along with a new technique to enrich the raw data with information about artifacts showing similar evolution.
Abstract: We present an open framework for visual mining of CVS software repositories. We address three aspects: data extraction, analysis and visualization. We first discuss the challenges of CVS data extraction and storage, and propose a flexible way to deal with CVS implementation inconsistencies. We next present a new technique to enrich the raw data with information about artifacts showing similar evolution. Finally, we propose a visualization backend and show its applicability on industry-size repositories.
TL;DR: This work performs micro-pattern evolution analysis on three open source projects, ArgoUML, Columba, and jEdit to identify micro pattern frequencies, common kinds of pattern evolution, and bug-prone patterns.
Abstract: When analyzing the evolution history of a software project, we wish to develop results that generalize across projects. One approach is to analyze design patterns, permitting characteristics of the evolution to be associated with patterns, instead of source code. Traditional design patterns are generally not amenable to reliable automatic extraction from source code, yet automation is crucial for scalable evolution analysis. Instead, we analyze "micro pattern" evolution; patterns whose abstraction level is closer to source code, and designed to be automatically extractable from Java source code or bytecode. We perform micro-pattern evolution analysis on three open source projects, ArgoUML, Columba, and jEdit to identify micro pattern frequencies, common kinds of pattern evolution, and bug-prone patterns. In all analyzed projects, we found that the micro patterns of Java classes do not change often. Common bug-prone pattern evolution kinds are 'Pool → Pool', 'Implementor → NONE', and 'Sampler → Sampler'. Among all pattern evolution kinds, 'Box', 'CompoundBox', 'Pool', 'CommonState', and 'Outline' micro patterns have high bug rates, but they have low frequencies and a small number of changes. The pattern evolution kinds that are bug-prone are somewhat similar across projects. The bug-prone pattern evolution kinds of two different periods of the same project are almost identical.
TL;DR: The TA-RE corpus is presented, which collects extracted data from software repositories in order to build a collection of projects that will simplify the extraction process, and an exchange language capable of making sharing and reusing data as simple as possible is proposed.
Abstract: Software repositories have been getting a lot of attention from researchers in recent years. In order to analyze software repositories, it is necessary to first extract raw data from the version control and problem tracking systems. This poses two challenges: (1) extraction requires a non-trivial effort, and (2) the results depend on the heuristics used during extraction. These challenges burden researchers that are new to the community and make it difficult to benchmark software repository mining since it is almost impossible to reproduce experiments done by another team. In this paper we present the TA-RE corpus. TA-RE collects extracted data from software repositories in order to build a collection of projects that will simplify the extraction process. Additionally, the collection can be used for benchmarking. As the first step we propose an exchange language capable of making sharing and reusing data as simple as possible.
TL;DR: This paper analyzes the data extracted from several open source software repositories and develops three probabilistic models to predict which files will have changes or bugs, and evaluates the performance of different prediction models empirically using the proposed information-theoretic approach.
Abstract: In this paper, we analyze the data extracted from several open source software repositories. We observe that the change data follows a Zipf distribution. Based on the extracted data, we then develop three probabilistic models to predict which files will have changes or bugs. The first model is Maximum Likelihood Estimation (MLE), which simply counts the number of events, i.e., changes or bugs, that happen to each file and normalizes the counts to compute a probability distribution. The second model is Reflexive Exponential Decay (RED) in which we postulate that the predictive rate of modification in a file is incremented by any modification to that file and decays exponentially. The third model is called RED-Co-Change. With each modification to a given file, the RED-Co-Change model not only increments its predictive rate, but also increments the rate for other files that are related to the given file through previous co-changes. We then present an information-theoretic approach to evaluate the performance of different prediction models. In this approach, the closeness of model distribution to the actual unknown probability distribution of the system is measured using cross entropy. We evaluate our prediction models empirically using the proposed information-theoretic approach for six large open source systems. Based on this evaluation, we observe that of our three prediction models, the RED-Co-Change model predicts the distribution that is closest to the actual distribution for all the studied systems.
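The MLE and RED models and the cross-entropy evaluation can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the unit increment per modification, the `half_life` parameter, the small `SMOOTHING` constant (added so that no file gets probability zero), and the sample history are all choices made here. The RED-Co-Change model is not covered.

```python
import math
from collections import Counter

SMOOTHING = 1e-6  # assumption: keeps every file's probability above zero


def normalize(weights):
    """Turn non-negative per-file weights into a probability distribution."""
    total = sum(w + SMOOTHING for w in weights.values())
    return {f: (w + SMOOTHING) / total for f, w in weights.items()}


def mle_distribution(history, files):
    """MLE: count the events of each file and normalize the counts."""
    counts = Counter(f for _, f in history)
    return normalize({f: counts[f] for f in files})


def red_distribution(history, files, now, half_life):
    """RED-style rate: each modification adds 1, which then decays exponentially."""
    decay = math.log(2) / half_life
    rates = {f: 0.0 for f in files}
    for t, f in history:
        rates[f] += math.exp(-decay * (now - t))
    return normalize(rates)


def cross_entropy(model, observed):
    """Average number of bits the model needs per observed event (lower is better)."""
    return -sum(math.log2(model[f]) for f in observed) / len(observed)


# Hypothetical history of (time, file) modifications and later observed events.
files = ["a.c", "b.c", "c.c"]
history = [(0, "a.c"), (1, "a.c"), (2, "b.c"), (90, "c.c")]
future = ["c.c", "c.c"]
mle = mle_distribution(history, files)
red = red_distribution(history, files, now=100, half_life=30)
```

In this toy history the most recent change is to `c.c`, so the decaying model assigns it more probability than the plain counts do and obtains a lower cross entropy on the later events.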
TL;DR: To effectively implement full-text search in the absence of hyperlinks, this work proposes detecting textual allusions to software artifacts in natural-language prose.
Abstract: Much of what is written about a software project is soon forgotten. Software repositories are full of valuable information about the project: Bug descriptions, check-in messages, email and newsgroup archives, specifications, design documents, product documentation, and product support logs contain a wealth of information that can potentially help software developers resolve crucial questions about the history, rationale, and future plans for source code. For a variety of reasons, developers rarely turn to these resources when trying to answer these questions. We are building a full-text search that encompasses multiple repositories. To effectively implement full-text search in the absence of hyperlinks we propose detecting textual allusions to software artifacts in natural-language prose. Allusions are shown to contribute a significant portion of the relationships represented in the graph.
TL;DR: This report describes some characteristics of the development team of PostgreSQL that were uncovered by analyzing the history of its software artifacts as recorded by the project's CVS repository.
Abstract: This report describes some characteristics of the development team of PostgreSQL that were uncovered by analyzing the history of its software artifacts as recorded by the project's CVS repository.
TL;DR: The CVSgrab tool is used to acquire the data and interactively visualize the evolution of ArgoUML and PostgreSQL, in order to answer three relevant questions about the process and team analysis categories of the MSR Mining Challenge 2006.
Abstract: In this paper we address the process and team analysis categories of the MSR Mining Challenge 2006. We use our CVSgrab tool to acquire the data and interactively visualize the evolution of ArgoUML and PostgreSQL, in order to answer three relevant questions. We conclude by summarizing the strong and weak points of using CVSgrab for mining large software repositories.
TL;DR: This work used mailing lists (MLs) archives of Postgres to identify developers’ working time and found that the ML of hackers had many more messages than other MLs.
Abstract: 2. INPUT DATA We used mailing list (ML) archives of PostgreSQL, downloaded from http://www.postgresql.org/community/lists/. The MLs mainly consist of user lists and developer lists. We used the developer list archives since we needed developers’ working time. Table 1 explains the details of each ML. Figure 1 shows the number of messages in each ML in the developer lists. The number of messages increased year by year. The hackers ML had many more messages than the other MLs. We extracted ML archives up to December 2005. Note that most of the committers’ messages were automatically generated when source code was checked into the software configuration management repository. We picked up the “mail sent time” to identify developers’ working time. Getting the mail sent time from the ML archives consists of the following two steps: First, we downloaded the MLs archives with
TL;DR: This work suggests augmenting revision histories with the interaction history of programmers, where all historical artifacts associated with the program are included and enables the development of several interesting applications including an influence-recommendation system and a task-mining system.
Abstract: Revision history provides a rich source of information to improve the understanding of changes made to programs, but it yields only limited insight into how these changes occurred. We explore an additional source of information - program viewing and editing history - where all historical artifacts associated with the program are included. In particular, we suggest augmenting revision histories with the interaction history of programmers. Using this additional information source enables the development of several interesting applications including an influence-recommendation system and a task-mining system. We present some results from a case study in which interaction histories from professional programmers were obtained and analyzed.
TL;DR: This paper experimentally analyzes a software project repository (SEC repository) consisting of 253 enterprise software development projects in Japanese companies, established by Software Engineering Center (SEC), Information-technology Promotion Agency, Japan.
Abstract: To clarify the relation between controllable attributes of a software development and its productivity, this paper experimentally analyzed a software project repository (SEC repository), consisting of 253 enterprise software development projects in Japanese companies, established by Software Engineering Center (SEC), Information-technology Promotion Agency, Japan. In the experiment, as controllable attributes, we focused on the outsourcing ratio of a software project, defined as the effort outsourced to subcontract companies divided by the whole development effort, and on the effort allocation balance among development phases. Our major findings include that both a larger outsourcing ratio and a smaller upstream process effort lead to worse productivity.
TL;DR: This paper analyzes the similarity of birthmarks for all pairs of classes in ArgoUML and visualizes them using Multi-Dimensional Scaling (MDS), identifying three pairs of very similar class files that seem to be made by copy-and-paste programming.
Abstract: Software birthmarks are unique and native characteristics of every software component. Two components having similar birthmarks indicate that they are similar in functionality, structure and implementation. Questions addressed in this paper include: Which are similar class files? Can they be gathered into one class file? What are major functionalities among class files? To answer these questions, this paper analyzed the similarity of birthmarks for all pairs of classes in ArgoUML, and visualized them using Multi-Dimensional Scaling (MDS). As a result, three pairs of very similar class files, which seem to be made by copy-and-paste programming, were identified. Also, four major functionalities were identified in the MDS space.
TL;DR: The database populating process, performed in batch mode, consists of doing a checkout of the system, parsing it and storing the structure information in the database, and parsing the CVS logs and storing all the commit-related information.
Abstract: 2. INPUT DATA To analyze the target system, i.e., PostgreSQL, we use its whole history, as recorded by the CVS version control system, stored in a database called Release History Database (RHDB) [1, 3]. The database populating process, performed in batch mode, consists of (i) doing a checkout of the system, parsing it and storing the structure information in the database, and (ii) parsing the CVS logs and storing all the commit-related information. The RHDB includes information about all the files in the system, i.e., source code, documentation, make-files, etc. For our analysis we consider only the source code data, i.e., .c and .h files (since PostgreSQL is written in C). We decompose the system using the top-most directories in the src directory tree, i.e., we define a module as all the files belonging to a directory subtree.
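The module decomposition described above can be sketched as follows. The function names are hypothetical, the file paths are illustrative rather than taken from the RHDB, and the treatment of files that sit directly in `src` (they are skipped) is an assumption of this sketch.

```python
from collections import defaultdict
from pathlib import PurePosixPath

SOURCE_SUFFIXES = {".c", ".h"}


def module_of(path, root="src"):
    """Return the top-most directory under root for a .c/.h file, else None."""
    p = PurePosixPath(path)
    parts = p.parts
    # Skip files outside root, files directly in root, and non-source files.
    if len(parts) < 3 or parts[0] != root or p.suffix not in SOURCE_SUFFIXES:
        return None
    return parts[1]


def group_by_module(paths):
    """Map each module name to the source files in its directory subtree."""
    modules = defaultdict(list)
    for path in paths:
        module = module_of(path)
        if module is not None:
            modules[module].append(path)
    return dict(modules)


# Illustrative paths only.
paths = [
    "src/backend/parser/gram.c",
    "src/backend/utils/cache.c",
    "src/include/postgres.h",
    "src/Makefile",
    "doc/README",
]
modules = group_by_module(paths)
```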
TL;DR: A method to automatically create evolutionary annotations from change logs, defect tracking systems and mailing lists is described and the design of a prototype for Eclipse that can filter and present these annotations alongside their corresponding source code and in workbench views is described.
Abstract: Evolutionary annotations are descriptions of how source code evolves over time. Typical source comments, given their static nature, are usually inadequate for describing how a program has evolved over time; instead, source code comments are typically a description of what a program currently does. We propose the use of evolutionary annotations as a way of describing the rationale behind changes applied to a given program (for example "These lines were added to ..."). Evolutionary annotations can assist a software developer in the understanding of how a given portion of source code works by showing him how the source has evolved into its current form. In this paper we describe a method to automatically create evolutionary annotations from change logs, defect tracking systems and mailing lists. We describe the design of a prototype for Eclipse that can filter and present these annotations alongside their corresponding source code and in workbench views. We use Apache as a test case to demonstrate the feasibility of this approach.
TL;DR: This paper combines the results of the refactoring reconstruction technique with bug, mail and release information to perform process and bug analyses of the ARGOUML CVS archive.
Abstract: In this paper we combine the results of our refactoring reconstruction technique with bug, mail and release information to perform process and bug analyses of the ARGOUML CVS archive.
TL;DR: In this paper, the authors refine the classical co-change to the addition of method calls and use this concept to find usage patterns and to identify cross-cutting concerns for ArgoUML.
Abstract: In this paper we refine the classical co-change to the addition of method calls. We use this concept to find usage patterns and to identify cross-cutting concerns for ArgoUML.
TL;DR: This position paper introduces the latest activities on architecture evolution analysis through software repository mining, and introduces a meta-model covering the design and implementation spaces, and defines a set of scenarios that demonstrate the architecturally significant analysis that can be conducted by mining the software repository.
Abstract: In this position paper, we introduce our latest activities on architecture evolution analysis through software repository mining. The traditional approaches for software repository mining provide means for analyzing source-level information. However, we believe that software repository mining can also provide valuable results for analyzing the system evolution at the architectural level. There are two challenges for analyzing the architecture evolution. The first one is to have in place a process for recovering the architectural models of the various releases. Architecture evolution is often visible only in the evolution of the implementation and this complicates the monitoring process. The second one is to have access to the past design models that were created by the architects during the design phase. A practical solution for versioning the architectural models is not in use yet, and this complicates access to the past design decisions. Analyzing architecture evolution through software repository mining represents the most promising choice. In order to conduct the analysis through software repository mining, we introduce our meta-model covering the design and implementation spaces. Then, we define a set of scenarios that demonstrate the architecturally significant analysis that we can conduct by mining the software repository.
TL;DR: This paper analyzes the ArgoUML software repositories with a tool and shows which Bugzilla fields best predict the code entities impacted by a new bug report, that is, where knowledge about bug resolution is stored.
Abstract: ArgoUML has used both CVS and Bugzilla to keep track of bug-fixing activities since 1998. A common practice is to reference source code changes resolving a bug stored in Bugzilla by inserting the id number of the bug in the CVS commit notes. This relationship proves useful for predicting the code entities impacted by a new bug report. In this paper we analyze the ArgoUML software repositories with a tool that we have implemented, showing which Bugzilla fields best predict such an impact relationship, that is, where knowledge about bug resolution is stored.
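The linking practice described above, where a bug id appears in the CVS commit notes, can be sketched with a regular expression. The accepted phrasings ("bug 123", "issue #123") are an assumption of this sketch and not the documented ArgoUML convention; the commit notes and file names are hypothetical.

```python
import re
from collections import defaultdict

# Assumed phrasings: "bug 123", "Bug #123", "issue 123". Real notes may differ.
BUG_REF = re.compile(r"\b(?:bug|issue)\s*#?\s*(\d+)", re.IGNORECASE)


def link_commits_to_bugs(commits):
    """Map each referenced bug id to the set of files changed in its commits.

    commits: iterable of (commit_note, changed_files).
    """
    links = defaultdict(set)
    for note, changed_files in commits:
        for bug_id in BUG_REF.findall(note):
            links[int(bug_id)].update(changed_files)
    return dict(links)


# Hypothetical commit notes and files.
commits = [
    ("Fix for issue 1234: NPE in parser", ["Parser.java"]),
    ("Bug #1234 and bug 77", ["Lexer.java"]),
    ("cleanup, added debug 5 output", ["Util.java"]),
]
links = link_commits_to_bugs(commits)
```

The word boundary keeps "debug 5" from being read as a reference to bug 5. Commits that mention no bug id are simply left unlinked.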
TL;DR: This work presents several applications of Logic file system to software engineering: multi-criteria indexation of software components, multi-concern browsing of source files, and bug finding in test traces.
Abstract: Logic information systems use formal concept analysis in a novel way to manage information. A file system implementation has been designed under the name of Logic file system. It offers a flexible management of non-hierarchical data. We present several applications of Logic file system to software engineering: multi-criteria indexation of software components, multi-concern browsing of source files, and bug finding in test traces. We detail multi-criteria indexing of software components. Three independent indexing frameworks are developed and merged in a single multi-criteria framework. The three indexing frameworks capture formal criteria like type isomorphisms and inheritance relations, semi-formal criteria like naming conventions, and informal criteria like keywords of comments. We show how the logical orientation of Logic file system helps in capturing all these criteria in a single framework.
TL;DR: This work proposes a conceptual framework for a concern-oriented query language for software repositories, and a pattern-based implementation scheme is discussed, exploiting existing tools.
Abstract: In the current trend of software engineering, software systems are viewed as clusters of overlapping structures representing various concerns, covering heterogeneous artifacts like models, code, resource files etc. In those cases, adequate search mechanisms for software repositories should be based on such fragmented nature of software systems, allowing concern-oriented queries on the system data. For this purpose, we propose a conceptual framework for a concern-oriented query language for software repositories. A pattern-based implementation scheme is discussed, exploiting existing tools. The applicability of the approach is studied in the context of an industrial case study.
TL;DR: Following the success of the first two iterations of the MSR workshop in 2004 and 2005, MSR 2006 attracted even more submissions: 45 papers from 15 different countries, of which 16 full and 12 short papers were accepted for presentation at the workshop and inclusion in the proceedings.
Abstract: Software repositories such as source control systems, defect tracking systems, or archived communications between project personnel are used to help manage the progress of software projects. Software practitioners and researchers are beginning to recognize the potential benefit of mining this information to support the maintenance of software systems, improve software design/reuse, and empirically validate novel ideas and techniques. Research is now proceeding to uncover the ways in which mining these repositories can help to understand software development, to support predictions about software development, and to plan various aspects of software projects. Following the success of the first two iterations of the MSR workshop in 2004 and 2005, MSR 2006 attracted even more submissions: We received 45 papers from 15 different countries. The international program committee accepted 16 full and 12 short papers for presentation at the workshop and inclusion in the proceedings. We are grateful for the excellent and professional review job done by the reviewers on such a tight schedule.