Top 23 papers presented at Mining Software Repositories in 2005

Journal Article•10.1145/1082983.1083147•

When do changes induce fixes

Jacek Śliwerski, Thomas Zimmermann¹, Andreas Zeller¹•Institutions (1)

17 May 2005

TL;DR: In a first investigation of the MOZILLA and ECLIPSE history, it turns out that fix-inducing changes show distinct patterns with respect to their size and the day of week they were applied.

Abstract: As a software system evolves, programmers make changes that sometimes cause problems. We analyze CVS archives for fix-inducing changes---changes that lead to problems, indicated by fixes. We show how to automatically locate fix-inducing changes by linking a version archive (such as CVS) to a bug database (such as BUGZILLA). In a first investigation of the MOZILLA and ECLIPSE history, it turns out that fix-inducing changes show distinct patterns with respect to their size and the day of week they were applied.

1,067 citations
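
The linking step described in the abstract above (connecting a version archive to a bug database) can be illustrated with a minimal sketch. It is not the authors' implementation: the regular expression, the `link_commits_to_bugs` helper, and the sample log messages are assumptions made for illustration, and real projects follow many different conventions for referencing bug reports.

```python
import re
from typing import Dict, List, Set

# Hypothetical pattern: matches "bug 4711", "Bug #4711", or "fixes #42".
# This convention is an assumption; it is not taken from the paper.
BUG_REF = re.compile(r"\b(?:bug|fix(?:es|ed)?)\s*#?\s*(\d+)\b", re.IGNORECASE)


def link_commits_to_bugs(commits: Dict[str, str], known_bugs: Set[int]) -> Dict[str, List[int]]:
    """Map each commit id to the known bug ids mentioned in its log message."""
    links: Dict[str, List[int]] = {}
    for commit_id, message in commits.items():
        mentioned = [int(number) for number in BUG_REF.findall(message)]
        # Keep only numbers that exist in the bug database, to drop false hits.
        matched = [bug for bug in mentioned if bug in known_bugs]
        if matched:
            links[commit_id] = matched
    return links


# Made-up log messages, for illustration only.
commits = {
    "r1": "Fixed bug 4711: null pointer in parser",
    "r2": "Refactor build scripts",
    "r3": "fixes #42 and updates docs",
}
links = link_commits_to_bugs(commits, known_bugs={4711, 42})
print(links)
```

Checking candidate numbers against the set of known bug ids is one simple way to reduce false links; locating the earlier change that induced each fix would be a separate step that this sketch does not attempt.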

Journal Article•10.1145/1082983.1083143•

Understanding source code evolution using abstract syntax tree matching

Iulian Neamtiu¹, Jeffrey S. Foster¹, Michael Hicks¹•Institutions (1)

University of Maryland, College Park¹

17 May 2005

TL;DR: A tool for quickly comparing the source code of different versions of a C program based on partial abstract syntax tree matching is presented, and can track simple changes to global variables, types and functions.

Abstract: Mining software repositories at the source code level can provide a greater understanding of how software evolves. We present a tool for quickly comparing the source code of different versions of a C program. The approach is based on partial abstract syntax tree matching, and can track simple changes to global variables, types and functions. These changes can characterize aspects of software evolution useful for answering higher level questions. In particular, we consider how they could be used to inform the design of a dynamic software updating system. We report results based on measurements of various versions of popular open source programs, including BIND, OpenSSH, Apache, Vsftpd and the Linux kernel.

292 citations

Proceedings Article•

When do changes induce fixes? On Fridays

Jacek Sliwerski, Thomas Zimmermann, Andreas Zeller

1 Jan 2005

96 citations

Journal Article•10.1145/1082983.1083150•

Mining student CVS repositories for performance indicators

Keir Mierle¹, Kevin Laven¹, Sam T. Roweis¹, Greg Wilson¹•Institutions (1)

University of Toronto¹

17 May 2005

TL;DR: Despite examining 166 features, it is found that grade performance cannot be accurately predicted; certainly no predictors stronger than simple lines-of-code were found.

Abstract: Over 200 CVS repositories representing the assignments of students in a second year undergraduate computer science course have been assembled. This unique data set represents many individuals working separately on identical projects, presenting the opportunity to evaluate the effects of the work habits captured by CVS on performance. This paper outlines our experiences mining and analyzing these repositories. We extracted various quantitative measures of student behaviour and code quality, and attempted to correlate these features with grades. Despite examining 166 features, we find that grade performance cannot be accurately predicted; certainly no predictors stronger than simple lines-of-code were found.

64 citations

Journal Article•10.1145/1082983.1083146•

Using a clone genealogy extractor for understanding and supporting evolution of code clones

Miryung Kim¹, David Notkin¹•Institutions (1)

University of Washington¹

17 May 2005

TL;DR: The initial results suggest that aggressive refactoring may not be the best solution for all code clones; thus, this work proposes alternative tool solutions that assist in maintaining code clones using clone genealogy information.

Abstract: Programmers often create similar code snippets or reuse existing code snippets by copying and pasting. Code clones---syntactically and semantically similar code snippets---can cause problems during software maintenance because programmers may need to locate code clones and change them consistently. In this work, we investigate (1) how code clones evolve, (2) how many code clones impose maintenance challenges, and (3) what kind of tool or engineering process would be useful for maintaining code clones. Based on a formal definition of clone evolution, we built a clone genealogy tool that automatically extracts the history of code clones from a source code repository (CVS). Our clone genealogy tool enables several analyses that reveal evolutionary characteristics of code clones. Our initial results suggest that aggressive refactoring may not be the best solution for all code clones; thus, we propose alternative tool solutions that assist in maintaining code clones using clone genealogy information.

62 citations

Journal Article•10.1145/1082983.1083149•

Software repository mining with Marmoset: an automated programming project snapshot and testing system

Jaime Spacco¹, Jaymie Strecker¹, David Hovemeyer¹, William Pugh¹•Institutions (1)

University of Maryland, College Park¹

17 May 2005

TL;DR: To gain insight into students' programming habits, Marmoset is developed, a project snapshot and submission system which allows students to submit versions of their projects to a central server, which automatically tests them and records the results.

Abstract: Most computer science educators hold strong opinions about the "right" approach to teaching introductory level programming. Unfortunately, we have comparatively little hard evidence about the effectiveness of these various approaches because we generally lack the infrastructure to obtain sufficiently detailed data about novices' programming habits. To gain insight into students' programming habits, we developed Marmoset, a project snapshot and submission system. Like existing project submission systems, Marmoset allows students to submit versions of their projects to a central server, which automatically tests them and records the results. Unlike existing systems, Marmoset also collects fine-grained code snapshots as students work on projects: each time a student saves her work, it is automatically committed to a CVS repository. We believe the data collected by Marmoset will be a rich source of insight about learning to program and software evolution in general. To validate the effectiveness of our tool, we performed an experiment which found a statistically significant correlation between warnings reported by a static analysis tool and failed unit tests. To make fine-grained code evolution data more useful, we present a data schema which allows a variety of useful queries to be more easily formulated and answered.

61 citations

Journal Article•10.1145/1082983.1083163•

Accelerating cross-project knowledge collaboration using collaborative filtering and social networks

Masao Ohira¹, Naoki Ohsugi¹, Tetsuya Ohoka¹, Kenichi Matsumoto¹•Institutions (1)

Nara Institute of Science and Technology¹

17 May 2005

TL;DR: A case study of applying the tools to F/OSS project data collected from SourceForge, and of how effectively the tools can be used to help cross-project knowledge collaboration, is reported.

Abstract: Vast numbers of free/open source software (F/OSS) development projects use hosting sites such as Java.net and SourceForge.net. These sites provide each project with a variety of software repositories (e.g. repositories for source code sharing, bug tracking, discussions, etc.) as a medium for communication and collaboration. They tend to focus on supporting rich collaboration among members in each project. However, a majority of hosted projects are relatively small projects consisting of few developers and often need more resources for solving problems. In order to support cross-project knowledge collaboration in F/OSS development, we have been developing tools to collect data of projects and developers at SourceForge, and to visualize the relationship among them using the techniques of collaborative filtering and social networks. The tools help a developer identify "who should I ask?" and "what can I ask?" and so on. In this paper, we report a case study of applying the tools to F/OSS projects data collected from SourceForge and how effectively the tools can be used for helping cross-project knowledge collaboration.

61 citations

Journal Article•10.1145/1082983.1083158•

Mining version histories to verify the learning process of Legitimate Peripheral Participants

Shih-Kun Huang¹, Kang-min Liu²•Institutions (2)

Academia Sinica¹, National Chiao Tung University²

17 May 2005

TL;DR: A developer-module relationship model is proposed to analyze the grouping structures between developers and modules and shows some process cases of relative importance on the constructed graph of project development.

Abstract: Since code revisions reflect the extent of human involvement in the software development process, revision histories reveal the interactions and interfaces between developers and modules. We therefore divide developers and modules into groups according to the revision histories of the open source software repository, for example, sourceforge.net. To describe the interactions in the open source development process, we use a representative model, Legitimate Peripheral Participation (LPP) [6], to divide developers into groups such as core and peripheral teams, based on the evolutionary process of learning behavior. With the conventional module relationship, we divide modules into kernel and non-kernel types (such as UI). In the past, groups of developers and modules have been partitioned naturally with informal criteria. In this work, however, we propose a developer-module relationship model to analyze the grouping structures between developers and modules. Our results show some process cases of relative importance on the constructed graph of project development. The graph reveals certain subtle relationships in the interactions between core and non-core team developers, and the interfaces between kernel and non-kernel modules.

52 citations

Journal Article•10.1145/1082983.1083164•

Collaboration using OSSmole: a repository of FLOSS data and analyses

Megan Conklin¹, James Howison², Kevin Crowston²•Institutions (2)

Elon University¹, Syracuse University²

17 May 2005

TL;DR: Current difficulties with the typical quantitative FLOSS research process are outlined and used to develop requirements for a collaborative data repository, and the design of the OSSmole system is presented.

Abstract: This paper introduces a collaborative project OSSmole which collects, shares, and stores comparable data and analyses of free, libre and open source software (FLOSS) development for research purposes. The project is a clearinghouse for data from the ongoing collection and analysis efforts of many disparate research groups. A collaborative data repository reduces duplication and promotes compatibility both across sources of FLOSS data and across research groups and analyses. The primary objective of OSSmole is to mine FLOSS source code repositories and provide the resulting data and summary analyses as open source products. However, the OSSmole data model additionally supports donated raw and summary data from a variety of open source researchers and other software repositories. The paper first outlines current difficulties with the typical quantitative FLOSS research process and uses these to develop requirements for such a collaborative data repository. Finally, the design of the OSSmole system is presented, as well as examples of current research and analyses using OSSmole.

46 citations

Journal Article•10.1145/1082983.1083153•

Text mining for software engineering: how analyst feedback impacts final results

Jane Huffman Hayes¹, Alex Dekhtyar¹, Senthil Karthikeyan Sundaram¹•Institutions (1)

University of Kentucky¹

17 May 2005

TL;DR: In this paper, a pilot study is undertaken to examine the impact of analyst decisions on the final outcome of a task.

Abstract: The mining of textual artifacts is requisite for many important activities in software engineering: tracing of requirements; retrieval of components from a repository; location of manpage text for an area of question, etc. Many such activities leave the "final word" to the analyst --- have the relevant items been retrieved? are there other items that should have been retrieved? When analysts become a part of the text mining process, their decisions on the relevance of retrieved elements impact the final outcome of the activity. In this paper, we undertook a pilot study to examine the impact of analyst decisions on the final outcome of a task.

45 citations

Journal Article•10.1145/1082983.1083148•

Error detection by refactoring reconstruction

Carsten Görg¹, Peter Weißgerber²•Institutions (2)

Saarland University¹, Catholic University of Eichstätt-Ingolstadt²

17 May 2005

TL;DR: This paper shows how to detect incomplete refactorings - which can cause long standing bugs because some of them do not cause compiler errors - by analyzing software archives by reconstructing the class inheritance hierarchies.

Abstract: In many cases it is not sufficient to perform a refactoring only at one location of a software project. For example, refactorings may have to be performed consistently to several classes in the inheritance hierarchy, e.g. subclasses or implementing classes, to preserve equal behavior. In this paper we show how to detect incomplete refactorings - which can cause long standing bugs because some of them do not cause compiler errors - by analyzing software archives. To this end we reconstruct the class inheritance hierarchies, as well as refactorings on the level of methods. Then, we relate these refactorings to the corresponding hierarchy in order to find missing refactorings and thus, errors and inconsistencies that have been introduced in a software project at some point of the history. Finally, we demonstrate our approach by case studies on two open source projects.

Journal Article•10.1145/1082983.1083161•

SCQL: a formal model and a query language for source control repositories

Abram Hindle¹, Daniel M. German¹•Institutions (1)

University of Victoria¹

17 May 2005

TL;DR: A generalized formal model of source control repositories is described, which is a graph in which the different entities stored in the repository become vertices and their relationships become edges, and SCQL, a first order, and temporal logic based query language for source control repositories, is defined.

Abstract: Source Control Repositories are used in most software projects to store revisions to source code files. These repositories operate at the file level and support multiple users. A generalized formal model of source control repositories is described herein. The model is a graph in which the different entities stored in the repository become vertices and their relationships become edges. We then define SCQL, a first order, and temporal logic based query language for source control repositories. We demonstrate how SCQL can be used to specify some questions and then evaluate them using the source control repositories of five different large software projects.

Journal Article•10.1145/1082983.1083151•

Toward mining "concept keywords" from identifiers in large software projects

Masaru Ohba¹, Katsuhiko Gondow¹•Institutions (1)

Tokyo Institute of Technology¹

17 May 2005

TL;DR: The proposed ckTF/IDF method is applied to the educational operating system udos, which suggests that the approach is useful for mining concept keywords from identifiers, although more research and experience are needed.

Abstract: We propose the Concept Keyword Term Frequency/Inverse Document Frequency (ckTF/IDF) method as a novel technique to efficiently mine concept keywords from identifiers in large software projects. ckTF/IDF is suitable for mining concept keywords, since the ckTF/IDF is more lightweight than the TF/IDF method, and the ckTF/IDF's heuristics are tuned for identifiers in programs. We then experimentally apply the ckTF/IDF to our educational operating system udos, consisting of around 5,000 lines in C code, which produced promising results; the udos's source code was processed in 1.4 seconds with an accuracy of around 57%. This preliminary result suggests that our approach is useful for mining concept keywords from identifiers, although we need more research and experience.
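
For orientation, the sketch below shows the plain TF/IDF baseline that the abstract compares against, applied to words split out of identifiers. It is not ckTF/IDF, whose heuristics are not described in this listing. The identifier-splitting rule, the function names, and the sample file contents are all assumptions made for illustration.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def split_identifier(identifier: str) -> List[str]:
    """Split snake_case and camelCase identifiers into lowercase words."""
    words: List[str] = []
    for part in identifier.split("_"):
        # Acronym runs ("HTTP"), capitalized or lowercase words, and digit runs.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [word.lower() for word in words]


def tf_idf(documents: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Score each word per document as term frequency times log(N / document frequency)."""
    bags = {
        name: Counter(word for ident in identifiers for word in split_identifier(ident))
        for name, identifiers in documents.items()
    }
    # Number of documents in which each word occurs at least once.
    doc_freq = Counter(word for bag in bags.values() for word in bag)
    n_docs = len(bags)
    scores: Dict[str, Dict[str, float]] = {}
    for name, bag in bags.items():
        total = sum(bag.values())
        scores[name] = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in bag.items()
        }
    return scores


# Made-up identifiers per file, for illustration only.
files = {
    "mm.c": ["alloc_page", "free_page", "pageTable"],
    "fs.c": ["read_inode", "write_inode", "alloc_inode"],
}
scores = tf_idf(files)
top_mm = max(scores["mm.c"], key=scores["mm.c"].get)
top_fs = max(scores["fs.c"], key=scores["fs.c"].get)
print(top_mm, top_fs)
```

In this toy input, a word that appears in every file (here `alloc`) receives a score of zero, which is the usual TF/IDF behaviour for terms that do not distinguish one document from another.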

Journal Article•10.1145/1082983.1083144•

Recovering system specific rules from software repositories

Chadd C. Williams¹, Jeffrey K. Hollingsworth¹•Institutions (1)

University of Maryland, College Park¹

17 May 2005

TL;DR: A method to automatically recover a subset of system-specific rules, function usage patterns, by mining the software repository is discussed and a preliminary study is presented that applies the work to a large open source software project.

Abstract: One of the most successful applications of static analysis based bug finding tools is to search the source code for violations of system-specific rules. These rules may describe how functions interact in the code, how data is to be validated or how an API is to be used. To apply these tools, the developer must encode a rule that must be followed in the source code. The difficulty is that many of these system-specific rules are undocumented and "grow" over time as the source code changes. Most research in this area relies on expert programmers to document these little-known rules. In this paper we discuss a method to automatically recover a subset of these rules, function usage patterns, by mining the software repository. We present a preliminary study that applies our work to a large open source software project.

Journal Article•10.1145/1082983.1083145•

Mining evolution data of a product family

Michael Fischer¹, Johann Oberleitner¹, Jacek Ratzinger¹, Harald C. Gall²•Institutions (2)

Vienna University of Technology¹, University of Zurich²

17 May 2005

TL;DR: This work studies the evolution and commonalities of three variants of the BSD (Berkeley Software Distribution), a large open source operating system, and extends the previously developed approach for storing release history information to support the analysis of product families.

Abstract: Diversification of software assets through changing requirements imposes a constant challenge on the developers and maintainers of large software systems. Recent research has addressed the mining for data in software repositories of single products ranging from fine- to coarse-grained analyses. But so far, little attention has been paid to mining data about the evolution of product families. In this work, we study the evolution and commonalities of three variants of the BSD (Berkeley Software Distribution), a large open source operating system. The research questions we tackle are concerned with how to generate high level views of the system discovering and indicating evolutionary highlights. To process the large amount of data, we extended our previously developed approach for storing release history information to support the analysis of product families. In a case study we apply our approach on data from three different code repositories representing about 85GB of data and 10 years of active development.

Journal Article•10.1145/1082983.1083160•

A framework for describing and understanding mining tools in software development

Daniel M. German¹, Davor Cubranic¹, Margaret-Anne Storey¹•Institutions (1)

University of Victoria¹

17 May 2005

TL;DR: This framework has the following purposes: to help tool designers in the understanding and comparison of different tools, to assist users in the assessment of a potential tool; and to identify new research areas.

Abstract: We propose a framework for describing, comparing and understanding tools for the mining of software repositories. The fundamental premise of this framework is that mining should be done by considering the specific needs of the users and the tasks to be supported by the mined information. First, different types of users have distinct needs, and these needs should be taken into account by tool designers. Second, the data sources available, and mined, will determine if those needs can be satisfied. Our framework is based upon three main principles: the type of user, the objective of the user, and the mined information. This framework has the following purposes: to help tool designers in the understanding and comparison of different tools, to assist users in the assessment of a potential tool; and to identify new research areas. We use this framework to describe several mining tools and to suggest future research directions.

Journal Article•10.1145/1082983.1083159•

Towards a taxonomy of approaches for mining of source code repositories

Huzefa Kagdi¹, Michael L. Collard¹, Jonathan I. Maletic¹•Institutions (1)

Kent State University¹

17 May 2005

TL;DR: This work forms the basis for a taxonomic description of MSR approaches and discusses these MSR techniques in light of what changes are identified, how they are expressed, the adopted methodology, evaluation, and results.

Abstract: Source code version repositories provide a treasure of information encompassing the changes introduced in the system throughout its evolution. These repositories are typically managed by tools such as CVS. However, these tools identify and express changes in terms of physical attributes, i.e., file and line numbers. Recently, to help support the mining of software repositories (MSR), researchers have proposed methods to derive and express changes from source code repositories in a more source-code "aware" manner (i.e., syntax and semantics). Here, we discuss these MSR techniques in light of what changes are identified, how they are expressed, the adopted methodology, evaluation, and results. This work forms the basis for a taxonomic description of MSR approaches.

Journal Article•10.1145/1082983.1083157•

Repository mining and Six Sigma for process improvement

Michael VanHilst¹, Pankaj K. Garg, Christopher Lo¹•Institutions (1)

Florida Atlantic University¹

17 May 2005

TL;DR: This paper proposes to apply artifact mining in a global development environment to support measurement based process management and improvement, such as SEI/CMMI's GQ(I)M and Six Sigma's DMAIC.

Abstract: In this paper, we propose to apply artifact mining in a global development environment to support measurement based process management and improvement, such as SEI/CMMI's GQ(I)M and Six Sigma's DMAIC. CMM has its origins in managing large software projects for the government and emphasizes achieving expected outcomes. In GQM, organizational goals are identified. The appropriate questions with corresponding measurements are defined and collected. Six Sigma has its origins in manufacturing and emphasizes reducing cost and defects. In DMAIC, a major component of a Six Sigma approach, sources of waste are identified. Then changes are made in the process to reduce effort and increase the quality of the product produced. GQM and Six Sigma are complementary. Both approaches rely heavily on the measurement of input and output metrics. Mining development artifacts can provide usable metrics for the application of DMAIC and GQM in the software domain.

Journal Article•10.1145/1082983.1083156•

Linear predictive coding and cepstrum coefficients for mining time variant information from software repositories

Giuliano Antoniol¹, V.F. Rollo¹, Gabriele Venturi¹•Institutions (1)

University of Sannio¹

17 May 2005

TL;DR: Inspired by time-frequency duality, this paper proposes the use of Linear Predictive Coding (LPC) and Cepstrum coefficients to model time varying software artifact histories to recover time variant information from software repositories.

Abstract: This paper presents an approach to recover time variant information from software repositories. It is widely accepted that software evolves due to factors such as defect removal, market opportunity or adding new features. Software evolution details are stored in software repositories which often contain the change history. On the other hand there is a lack of approaches, technologies and methods to efficiently extract and represent time dependent information. Disciplines such as signal and image processing or speech recognition adopt frequency domain representations to mitigate differences of signals evolving in time. Inspired by time-frequency duality, this paper proposes the use of Linear Predictive Coding (LPC) and Cepstrum coefficients to model time varying software artifact histories. LPC or Cepstrum allow obtaining very compact representations with linear complexity. These representations can be used to highlight components and artifacts evolved in the same way or with very similar evolution patterns. To assess the proposed approach we applied LPC and Cepstral analysis to 211 Linux kernel releases (i.e., from 1.0 to 1.3.100), to identify files with very similar size histories. The approach, the preliminary results and the lessons learned are presented in this paper.

Journal Article•10.1145/1082983.1083154•

Analysis of signature change patterns

Sunghun Kim¹, E. James Whitehead¹, Jennifer Bevan¹•Institutions (1)

University of California, Santa Cruz¹

17 May 2005

TL;DR: A taxonomy of signature change kinds to categorize observed changes is introduced based on an analysis of eight prominent open source projects including the Apache HTTP server, GCC, and Linux 2.5 kernel.

Abstract: Software continually changes due to performance improvements, new requirements, bug fixes, and adaptation to a changing operational environment. Common changes include modifications to data definitions, control flow, method/function signatures, and class/file relationships. Signature changes are notable because they require changes at all sites calling the modified function, and hence as a class they have more impact than other change kinds. We performed signature change analysis over software project histories to reveal multiple properties of signature changes, including their kind, frequency, and evolution patterns. These signature properties can be used to alleviate the impact of signature changes. In this paper we introduce a taxonomy of signature change kinds to categorize observed changes. We report multiple properties of signature changes based on an analysis of eight prominent open source projects including the Apache HTTP server, GCC, and Linux 2.5 kernel.

Journal Article•10.1145/1082983.1083155•

Improving evolvability through refactoring

Jacek Ratzinger¹, Michael Fischer¹, Harald C. Gall²•Institutions (2)

Vienna University of Technology¹, University of Zurich²

17 May 2005

TL;DR: The approach enables the detection of bad smells allowing an engineer to apply refactoring on these parts of the source code to improve the evolvability of the software.

Abstract: Refactoring is one means of improving the structure of existing software. Locations for the application of refactoring are often based on subjective perceptions such as "bad smells", which are vague suspicions of design shortcomings. We exploit historical data extracted from repositories such as CVS and focus on change couplings: if some software parts change at the same time very often over several releases, this data can be used to point to candidates for refactoring. We adopt the concept of bad smells and provide additional change smells. Such a smell is hardly visible in the code, but easy to spot when viewing the change history. Our approach enables the detection of such smells allowing an engineer to apply refactoring on these parts of the source code to improve the evolvability of the software. For that, we analyzed the history of a large industrial system for a period of 15 months, proposed spots for refactorings based on change couplings, and performed them with the developers. After observing the system for another 15 months we finally analyzed the effectiveness of our approach. Our results support our hypothesis that the combination of change dependency analysis and refactoring is applicable and effective.
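
The notion of change coupling in the abstract above (parts that often change at the same time) can be sketched by counting how often pairs of files appear in the same change transaction. This is an illustrative sketch rather than the authors' analysis: the `change_couplings` function, the `min_support` threshold, and the sample history are assumptions.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Set, Tuple

Coupling = Tuple[Tuple[str, str], int]


def change_couplings(transactions: Iterable[Set[str]], min_support: int = 2) -> List[Coupling]:
    """Count file pairs changed together; keep pairs seen at least min_support times."""
    pair_counts: Counter = Counter()
    for files in transactions:
        # Sort so that each unordered pair is always counted under the same key.
        for pair in combinations(sorted(files), 2):
            pair_counts[pair] += 1
    frequent = [(pair, count) for pair, count in pair_counts.items() if count >= min_support]
    # Strongest couplings first, ties broken alphabetically.
    return sorted(frequent, key=lambda item: (-item[1], item[0]))


# Made-up change transactions (sets of files committed together).
history = [
    {"Parser.java", "Lexer.java"},
    {"Parser.java", "Lexer.java", "Main.java"},
    {"Main.java"},
    {"Parser.java", "Lexer.java"},
]
couplings = change_couplings(history)
print(couplings)
```

Pairs with a high co-change count and no obvious structural dependency would be the kind of candidates a reviewer might inspect for refactoring; choosing the threshold is a judgment call that depends on the size of the history.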

Journal Article•10.1145/1082983.1083152•

Source code that talks: an exploration of Eclipse task comments and their implication to repository mining

Annie T. T. Ying¹, James L. Wright¹, Steven Abrams¹•Institutions (1)

IBM¹

17 May 2005

TL;DR: It is found that programmers not only use comments for describing the actual source code, but also use them for many other purposes, such as "talking" to colleagues through the source code using a comment "Joan, please fix this method."

Abstract: A programmer performing a change task to a system can benefit from accurate comments on the source code. As part of good programming practice described by Kernighan and Pike in the book The Practice of Programming, comments should "aid the understanding of a program by briefly pointing out salient details or by providing a larger-scale view of the proceedings." In this paper, we explore the widely varying uses of comments in source code. We find that programmers not only use comments for describing the actual source code, but also use comments for many other purposes, such as "talking" to colleagues through the source code using a comment "Joan, please fix this method." This kind of comment can complicate the mining of project information because such team communication is often perceived to reside in separate archives, such as emails or newsgroup postings, rather than in the source code. Nevertheless, these and other types of comments can be very useful inputs for mining project information.

Journal Article•10.1145/1082983.1083162•

Developer identification methods for integrated data from various sources

Gregorio Robles¹, Jesus M. Gonzalez-Barahona¹•Institutions (1)

King Juan Carlos University¹

17 May 2005

TL;DR: This paper proposes an approach, based on the application of heuristics, to identify the many identities of developers in such cases, and a data structure for allowing both the anonymized distribution of information, and the tracking of identities for verification purposes.

Abstract: Studying a software project by mining data from a single repository has been a very active research field in software engineering in recent years. However, few efforts have been devoted to performing studies that integrate data from various repositories, with different kinds of information, which would, for instance, track the different activities of developers. One of the main problems of these multi-repository studies is the different identities that developers use when they interact with different tools in different contexts. This makes them appear as different entities when data is mined from different repositories (and in some cases, even from a single one). In this paper we propose an approach, based on the application of heuristics, to identify the many identities of developers in such cases, and a data structure for allowing both the anonymized distribution of information, and the tracking of identities for verification purposes. The methodology will be presented in general, and applied to the GNOME project as a case example. Privacy issues and partial merging with new data sources will also be considered and discussed.
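
As a rough illustration of heuristic identity matching, the sketch below groups (name, email) pairs that share either a normalized name or an email address, following such links transitively. The two matching rules, the `merge_identities` function, and the sample identities are assumptions chosen for illustration; the paper's own heuristics and its anonymizing data structure are not reproduced here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Identity = Tuple[str, str]  # (display name, email address)


def normalize_name(name: str) -> str:
    """Lowercase, drop dots, and collapse whitespace."""
    return " ".join(name.lower().replace(".", " ").split())


def merge_identities(identities: List[Identity]) -> List[List[Identity]]:
    """Group identities that share a normalized name or an email address."""
    parent = list(range(len(identities)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen: Dict[str, int] = {}
    for index, (name, email) in enumerate(identities):
        keys = []
        if normalize_name(name):
            keys.append("name:" + normalize_name(name))
        if email.strip():
            keys.append("mail:" + email.strip().lower())
        for key in keys:
            if key in first_seen:
                # Same key seen before: merge the two groups.
                parent[find(index)] = find(first_seen[key])
            else:
                first_seen[key] = index

    groups: Dict[int, List[Identity]] = defaultdict(list)
    for index, identity in enumerate(identities):
        groups[find(index)].append(identity)
    return list(groups.values())


# Made-up identities, for illustration only.
identities = [
    ("Jane Doe", "jane@example.org"),
    ("jane doe", "jdoe@example.com"),
    ("J. Doe", "jdoe@example.com"),
    ("John Smith", "john@example.org"),
]
clusters = merge_identities(identities)
print(clusters)
```

Exact matching on names is deliberately conservative and will still produce false merges for common names, so any grouping of this kind would need the manual verification step that the abstract mentions.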
