Top 33 papers presented at Mining Software Repositories in 2007

Proceedings Article•10.1109/MSR.2007.13•

How Long Will It Take to Fix This Bug

Cathrin Weiss¹, Rahul Premraj¹, Thomas Zimmermann¹, Andreas Zeller¹•Institutions (1)

20 May 2007

TL;DR: This work presents an approach that automatically predicts the fixing effort, i.e., the person-hours spent on fixing an issue, using the Lucene framework to search for similar, earlier reports and use their average time as a prediction.

Abstract: Predicting the time and effort for a software problem has long been a difficult task. We present an approach that automatically predicts the fixing effort, i.e., the person-hours spent on fixing an issue. Our technique leverages existing issue tracking systems: given a new issue report, we use the Lucene framework to search for similar, earlier reports and use their average time as a prediction. Our approach thus allows for early effort estimation, helping in assigning issues and scheduling stable releases. We evaluated our approach using effort data from the JBoss project. Given a sufficient number of issue reports, our automatic predictions are close to the actual effort; for issues that are bugs, we are off by only one hour, beating naive predictions by a factor of four.
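
The nearest-neighbor idea in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it swaps Lucene for a plain Jaccard word-overlap score, and the function names, the `k` and `min_similarity` parameters, and the sample reports are all assumptions made for the example.

```python
import re
from statistics import mean

def tokens(text):
    """Lowercased word set; a crude stand-in for Lucene's text analysis."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Set-overlap similarity in [0, 1]."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def predict_effort(new_report, history, k=3, min_similarity=0.1):
    """Average the effort of the k most similar earlier reports.

    history is a list of (report_text, person_hours) pairs (made-up data).
    Returns None when no earlier report is similar enough.
    """
    query = tokens(new_report)
    scored = [(jaccard(query, tokens(text)), hours) for text, hours in history]
    scored = [(s, h) for s, h in scored if s >= min_similarity]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    nearest = scored[:k]
    if not nearest:
        return None
    return mean(hours for _, hours in nearest)

# Hypothetical issue reports with the hours spent fixing them.
history = [
    ("NullPointerException when deploying EAR file", 2.0),
    ("Deploying EAR file fails with NullPointerException on startup", 4.0),
    ("Update documentation for clustering setup", 1.0),
]
estimate = predict_effort("NullPointerException while deploying EAR", history, k=2)
```

Returning `None` below the similarity threshold mirrors the abstract's caveat that predictions need a sufficient number of comparable reports.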

431 citations

Proceedings Article•10.1109/MSR.2007.14•

Identifying Changed Source Code Lines from Version Repositories

Gerardo Canfora¹, Luigi Cerulo¹, M. Di Penta¹•Institutions (1)

University of Sannio¹

20 May 2007

TL;DR: This paper shows how the evolution of changes at source code line level can be inferred from CVS repositories, by combining information retrieval techniques and the Levenshtein edit distance.

Abstract: Observing the evolution of software systems at different levels of granularity has been a key issue for a number of studies, aiming at predicting defects or at studying certain phenomena, such as the presence of clones or of crosscutting concerns. Versioning systems such as CVS and SVN, however, only provide information about lines added or deleted by a contributor: any change is shown as a sequence of additions and deletions. This provides an erroneous estimate of the amount of code changed. This paper shows how the evolution of changes at source code line level can be inferred from CVS repositories, by combining information retrieval techniques and the Levenshtein edit distance. The application of the proposed approach to the ArgoUML case study indicates a high precision and recall.
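
As a rough illustration of the edit-distance half of this idea, the sketch below pairs deleted lines with added lines when they are close under a normalized Levenshtein distance. It omits the information retrieval step the abstract mentions, and the greedy pairing, the 0.4 threshold, and the sample lines are assumptions for the example rather than the paper's procedure.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def normalized_distance(a, b):
    """Edit distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0

def classify_lines(deleted, added, threshold=0.4):
    """Greedily pair each deleted line with its closest unused added line.

    Pairs at or under the (hypothetical) threshold count as changed lines;
    everything else stays a genuine deletion or addition.
    """
    changed, removed, unused = [], [], list(added)
    for old in deleted:
        best = min(unused, key=lambda new: normalized_distance(old, new),
                   default=None)
        if best is not None and normalized_distance(old, best) <= threshold:
            changed.append((old, best))
            unused.remove(best)
        else:
            removed.append(old)
    return changed, removed, unused

changed, removed, inserted = classify_lines(
    deleted=["int count = 0;", "log.debug(msg);"],
    added=["int counter = 0;", "return result;"],
)
```

Here the renamed variable is reported as one changed line instead of one deletion plus one addition, which is the over-counting the abstract describes.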

107 citations

Proceedings Article•10.1109/MSR.2007.26•

Prioritizing Warning Categories by Analyzing Software History

Sunghun Kim¹, Michael D. Ernst¹•Institutions (1)

Massachusetts Institute of Technology¹

20 May 2007

TL;DR: This paper proposes a preliminary algorithm for warning category prioritizing by analyzing the software change history, and indicates that different warning categories have very different lifetimes.

Abstract: Automatic bug finding tools tend to have high false positive rates: most warnings do not indicate real bugs. Usually bug finding tools prioritize each warning category. For example, the priority of "overflow" is 1 and the priority of "jumbled incremental" is 3, but the tools' prioritization is not very effective. In this paper, we prioritize warning categories by analyzing the software change history. The underlying intuition is that if warnings from a category are resolved quickly by developers, the warnings in the category are important. Experiments with three bug finding tools (FindBugs, JLint, and PMD) and two open source projects (Columba and jEdit) indicate that different warning categories have very different lifetimes. Based on that observation, we propose a preliminary algorithm for warning category prioritizing.
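
The intuition that quickly resolved categories matter more can be sketched as a ranking by mean warning lifetime. This is only an assumed reading of the idea, not the paper's algorithm; the tuple layout, the integer day stamps, and the "unused import" category are invented for the example, and warnings that are never removed are not handled.

```python
from collections import defaultdict
from statistics import mean

def prioritize_categories(warnings):
    """Rank warning categories by mean lifetime, shortest first.

    warnings is a list of (category, introduced_day, removed_day) tuples,
    where the days are hypothetical integer timestamps from the history.
    """
    lifetimes = defaultdict(list)
    for category, introduced, removed in warnings:
        lifetimes[category].append(removed - introduced)
    averages = {cat: mean(days) for cat, days in lifetimes.items()}
    return sorted(averages, key=averages.get)

# Made-up warning histories; shorter lifetimes rank higher.
ranking = prioritize_categories([
    ("overflow", 10, 12),
    ("overflow", 40, 44),
    ("jumbled incremental", 5, 95),
    ("unused import", 0, 300),
])
```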

102 citations

Proceedings Article•10.1109/MSR.2007.21•

Mining Software Repositories with iSPARQL and a Software Evolution Ontology

Christoph Kiefer¹, Abraham Bernstein¹, Jonas Tappolet¹•Institutions (1)

University of Zurich¹

20 May 2007

TL;DR: EvoOnt is presented, a software repository data exchange format based on the Web Ontology Language (OWL), which includes software, release, and bug-related information and allows assertions to be derived through its inherent Description Logic reasoning capabilities.

Abstract: One of the most important decisions researchers face when analyzing the evolution of software systems is the choice of a proper data analysis/exchange format. Most existing formats have to be processed with special programs written specifically for that purpose and are not easily extendible. Most scientists, therefore, use their own database(s), requiring each of them to repeat the work of writing the import/export programs to their format. We present EvoOnt, a software repository data exchange format based on the Web Ontology Language (OWL). EvoOnt includes software, release, and bug-related information. Since OWL describes the semantics of the data, EvoOnt (1) is easily extendible, (2) comes with many existing tools, and (3) allows assertions to be derived through its inherent Description Logic reasoning capabilities. The paper also shows iSPARQL, our SPARQL-based Semantic Web query engine containing similarity joins. Together with EvoOnt, iSPARQL can accomplish a sizable number of tasks sought in software repository mining projects, such as an assessment of the amount of change between versions or the detection of bad code smells. To illustrate the usefulness of EvoOnt (and iSPARQL), we perform a series of experiments with a real-world Java project. These show that a number of software analyses can be reduced to simple iSPARQL queries on an EvoOnt dataset.

93 citations

Proceedings Article•10.1109/MSR.2007.6•

Detecting Patch Submission and Acceptance in OSS Projects

Christian Bird¹, Alex Gourley¹, Prem Devanbu¹•Institutions (1)

University of California, Davis¹

20 May 2007

TL;DR: It is argued that the process of patch submission and acceptance into the codebase is an important piece of the open source puzzle and that the use of patch-related data can be helpful in understanding how OSS projects work.

Abstract: The success of open source software (OSS) is completely dependent on the work of volunteers who contribute their time and talents. The submission of patches is the major way that participants outside of the core group of developers make contributions. We argue that the process of patch submission and acceptance into the codebase is an important piece of the open source puzzle and that the use of patch-related data can be helpful in understanding how OSS projects work. We present our methods in identifying the submission and acceptance of patches and give results and evaluation in applying these methods to the Apache webserver, Python interpreter, Postgres SQL database, and (with limitations) MySQL database projects. In addition, we present valuable ways in which this data has been and can be used.

85 citations

Proceedings Article•10.1109/MSR.2007.20•

Mining Eclipse Developer Contributions via Author-Topic Models

Erik Linstead¹, Paul Rigor¹, Sushil Bajracharya¹, Cristina V. Lopes¹, Pierre Baldi¹•Institutions (1)

University of California, Irvine¹

20 May 2007

TL;DR: This study shows that topic models provide a meaningful, effective, and statistical basis for developer similarity analysis and provides an intuitive and automated framework to mine developer contributions and competencies from a given code base while simultaneously extracting software function in the form of topics.

Abstract: We present the results of applying statistical author-topic models to a subset of the Eclipse 3.0 source code consisting of 2,119 source files and 700,000 lines of code from 59 developers. This technique provides an intuitive and automated framework with which to mine developer contributions and competencies from a given code base while simultaneously extracting software function in the form of topics. In addition to serving as a convenient summary for program function and developer activities, our study shows that topic models provide a meaningful, effective, and statistical basis for developer similarity analysis.

72 citations

Proceedings Article•10.1109/MSR.2007.19•

Mining CVS Repositories to Understand Open-Source Project Developer Roles

Liguo Yu¹, Srini Ramaswamy²•Institutions (2)

Indiana University¹, University of Arkansas at Little Rock²

20 May 2007

TL;DR: A model that represents the interactions of distributed open-source software developers and uses data mining techniques to derive developer roles is presented and applied to case studies of ORAC-DR and Mediawiki, with encouraging results.

Abstract: This paper presents a model to represent the interactions of distributed open-source software developers and utilizes data mining techniques to derive developer roles. The model is then applied to case studies of two open-source projects, ORAC-DR and Mediawiki, with encouraging results.

63 citations

Proceedings Article•10.1109/MSR.2007.22•

Mining Workspace Updates in CVS

Thomas Zimmermann¹•Institutions (1)

Saarland University¹

20 May 2007

TL;DR: This paper analyzes the CVS activity data of four large open-source projects (GCC, JBOSS, JEDIT, and PYTHON) to investigate parallel development.

Abstract: The version control archive CVS records not only all changes in a project but also activity data such as when developers create or update their workspaces. Furthermore, CVS records when it has to integrate changes because of parallel development. In this paper, we analyze the CVS activity data of four large open-source projects (GCC, JBOSS, JEDIT, and PYTHON) to investigate parallel development: What is the degree of parallel development? How frequently do conflicts occur during updates and how are they resolved? How do we identify changes that contain integrations?

61 citations

Proceedings Article•10.1109/MSR.2007.1•

Analysis of the Linux Kernel Evolution Using Code Clone Coverage

Simone Livieri¹, Yoshiki Higo¹, Makoto Matsushita¹, Katsuro Inoue¹•Institutions (1)

Osaka University¹

20 May 2007

TL;DR: This paper examined 136 versions of the stable Linux kernel using a distributed extension of the code clone detection tool CCFinder and the code-clone coverage metrics.

Abstract: Most studies of the evolution of software systems are based on the comparison of simple software metrics. In this paper, we present our preliminary investigation of the evolution of the Linux kernel using code-clone analysis and the code-clone coverage metrics. We examined 136 versions of the stable Linux kernel using a distributed extension of the code clone detection tool CCFinder. The result is shown as a heat map.

46 citations

Proceedings Article•10.1109/MSR.2007.29•

Spam Filter Based Approach for Finding Fault-Prone Software Modules

Osamu Mizuno¹, Shiro Ikami¹, Shuya Nakaichi¹, Tohru Kikuno¹•Institutions (1)

Osaka University¹

20 May 2007

TL;DR: A novel approach detects fault-prone modules by treating source code modules as text files and applying them directly to a spam filter; experiments on Java-based open source repositories classify more than 75% of software modules correctly.

Abstract: Because of the increasing need for spam e-mail detection, the spam filtering technique has been improved as a convenient and effective technique for text mining. We propose a novel approach to detect fault-prone modules in which the source code modules are considered as text files and are applied to the spam filter directly. In order to show the applicability of our approach, we conducted experimental applications using source code repositories of Java-based open source developments. The results of the experiments show that our approach can classify more than 75% of software modules correctly.

44 citations

Proceedings Article•10.1109/MSR.2007.28•

Release Pattern Discovery via Partitioning: Methodology and Case Study

Abram Hindle¹, Michael W. Godfrey¹, Richard Holt¹•Institutions (1)

University of Waterloo¹

20 May 2007

TL;DR: This paper proposes an approach to characterizing a project's behavior around the time of major and minor releases by partitioning the observed activities, such as artifact check-ins, around the dates of major or minor releases, and then looks for recognizable patterns.

Abstract: The development of Open Source systems produces a variety of software artifacts such as source code, version control records, bug reports, and email discussions. Since the development is distributed across different tool environments and developer practices, any analysis of project behavior must be inferred from whatever common artifacts happen to be available. In this paper, we propose an approach to characterizing a project's behavior around the time of major and minor releases; we do this by partitioning the observed activities, such as artifact check-ins, around the dates of major and minor releases, and then look for recognizable patterns. We validate this approach by means of a case study on the MySQL database system; in this case study, we found patterns which suggested MySQL was behaving consistently within itself. These patterns included testing and documenting that took place more before a release than after and that the rate of source code changes dipped around release time.

Proceedings Article•10.1109/MSR.2007.4•

Correlating Social Interactions to Release History during Software Evolution

Olga Baysal¹, A.J. Malton¹•Institutions (1)

University of Waterloo¹

20 May 2007

TL;DR: An information retrieval approach is employed to find correlation between source code change history and history of social interactions surrounding these changes, and identifies a set of correlation patterns between discussion and changed code vocabularies.

Abstract: In this paper, we propose a method to reason about the nature of software changes by mining and correlating discussion archives. We employ an information retrieval approach to find correlation between source code change history and history of social interactions surrounding these changes. We apply our correlation method on two software systems, LSEdit and Apache Ant. The results of these exploratory case studies demonstrate the evidence of similarity between the content of free-form text emails among developers and the actual modifications in the code. We identify a set of correlation patterns between discussion and changed code vocabularies and discover that some releases referred to as minor should instead fall under the major category. These patterns can be used to give estimations about the type of a change and time needed to implement it.

Proceedings Article•10.1109/MSR.2007.5•

Defect Data Analysis Based on Extended Association Rule Mining

Shuji Morisaki¹, Akito Monden¹, Tomoko Matsumura¹, Haruaki Tamada¹, Kenichi Matsumoto¹•Institutions (1)

Nara Institute of Science and Technology¹

20 May 2007

TL;DR: It is confirmed that you need to pay attention to types of defects having large mean effort as well as those having large standard deviation of effort since such defects sometimes cause excess effort.

Abstract: This paper describes an empirical study to reveal rules associated with defect correction effort. We defined defect correction effort as a quantitative (ratio scale) variable, and extended conventional (nominal scale based) association rule mining to directly handle such quantitative variables. An extended rule describes the statistical characteristic of a ratio or interval scale variable in the consequent part of the rule by its mean value and standard deviation so that conditions producing distinctive statistics can be discovered. As an analysis target, we collected various attributes of about 1,200 defects found in a typical medium-scale, multi-vendor (distance development) information system development project in Japan. Our findings based on extracted rules include: (1) Defects detected in coding/unit testing were easily corrected (less than 7% of mean effort) when they were related to data output or validation of input data. (2) Nevertheless, they sometimes required much more effort (lift of standard deviation was 5.845) in case of low reproducibility. (3) Defects introduced in coding/unit testing often required large correction effort (mean was 12.596 staff-hours and standard deviation was 25.716) when they were related to data handling. From these findings, we confirmed that we need to pay attention to types of defects having large mean effort as well as those having large standard deviation of effort since such defects sometimes cause excess effort.

Proceedings Article•10.1109/MSR.2007.2•

Combining Single-Version and Evolutionary Dependencies for Software-Change Prediction

Huzefa Kagdi¹, Jonathan I. Maletic¹•Institutions (1)

Kent State University¹

20 May 2007

TL;DR: The paper advocates the need for the investigation and development of a software-change prediction methodology that combines the change sets estimated from software dependency analysis and the actual change sets found in software version histories.

Abstract: The paper advocates the need for the investigation and development of a software-change prediction methodology that combines the change sets estimated from software dependency analysis (via single-version analysis) and the actual change sets found in software version histories (via multiple-version analysis). Traditionally prescribed methodologies such as Impact Analysis (IA) are based on the former, whereas a more recent methodology, mining software repository (MSR), is based on the latter. The research hypothesis is that combining these two methodologies will result in an overall improved support for software-change prediction.

Proceedings Article•10.1109/MSR.2007.9•

Finding Relevant Applications for Prototyping

Mark Grechanik¹, Kevin Michael Conroy¹, Katharina Probst¹•Institutions (1)

Accenture¹

20 May 2007

TL;DR: This work proposes a novel approach called Exemplar (EXEcutable exaMPLes ARchive) for finding highly relevant software projects from a large archive of executable applications that implement high-level concepts.

Abstract: When gathering requirements for new software projects, it is often cost-effective to find similar applications that can be used as the basis for prototypes rather than building them from scratch. However, finding such sample applications can be difficult, often making prototyping time-consuming and expensive. We offer a novel approach called Exemplar (EXEcutable exaMPLes ARchive) for finding highly relevant software projects from a large archive of executable applications. After a programmer enters a query that contains high-level concepts (e.g., toolbar, download, smart card), Exemplar uses information retrieval and program analysis to retrieve applications that implement these concepts. We hypothesize that Exemplar will be effective and efficient in helping programmers to quickly find highly relevant applications to support prototyping.

Proceedings Article•10.1109/MSR.2007.32•

Using Software Distributions to Understand the Relationship among Free and Open Source Software Projects

Daniel M. German¹•Institutions (1)

University of Victoria¹

20 May 2007

TL;DR: It is demonstrated that some applications that are invisible to the final user (such as libraries) are widely used by end-user applications and can be used as a proxy to measure success of small, slowly evolving free and open source software.

Abstract: Success in the open source software world has been measured in terms of metrics such as number of downloads, number of commits, number of lines of code, number of participants, etc. These metrics tend to discriminate against applications that are small and tend to evolve slowly. A problem is, however, how to identify applications in these latter categories that are important. Software distributions specify the dependencies needed to build and to run a given software application. We use this information to create a dependency graph of the applications contained in such a distribution. We explore the characteristics of this graph, and use it to define some metrics to quantify the dependencies (and dependents) of a given software application. We demonstrate that some applications that are invisible to the final user (such as libraries) are widely used by end-user applications. This graph can be used as a proxy to measure success of small, slowly evolving free and open source software.
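
One way to picture a dependents metric on such a graph is to count, for each package, how many other packages reach it directly or transitively. The sketch below is an assumed illustration of that idea, not the metric defined in the paper; the function name and the package names are hypothetical.

```python
from collections import defaultdict

def dependents_count(dependencies):
    """Count direct and transitive dependents for every package.

    dependencies maps a package name to the packages it needs
    (hypothetical data standing in for a distribution's metadata).
    """
    reverse = defaultdict(set)          # package -> packages that need it
    packages = set(dependencies)
    for package, needs in dependencies.items():
        for needed in needs:
            reverse[needed].add(package)
            packages.add(needed)

    counts = {}
    for package in packages:
        seen, stack = set(), [package]
        while stack:                    # depth-first walk over reverse edges
            for dependent in reverse[stack.pop()]:
                if dependent not in seen:
                    seen.add(dependent)
                    stack.append(dependent)
        seen.discard(package)           # ignore self-reachability via cycles
        counts[package] = len(seen)
    return counts

counts = dependents_count({
    "editor": ["gui-toolkit", "libxml"],
    "browser": ["gui-toolkit", "libssl"],
    "gui-toolkit": ["libpng"],
})
```

In this toy graph the library that no end user would ever launch has the most dependents, which is the kind of hidden importance the abstract points to.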

Proceedings Article•10.1109/MSR.2007.3•

Comparing Approaches to Mining Source Code for Call-Usage Patterns

Huzefa Kagdi¹, Michael L. Collard², Jonathan I. Maletic¹•Institutions (2)

Kent State University¹, Ashland University²

20 May 2007

TL;DR: The trade-off between the additional ordering context given by sequential-pattern mining and the efficiency of itemset mining is investigated and results show that mining ordered patterns is worth the additional cost.

Abstract: Two approaches for mining function-call usage patterns from source code are compared. The first approach, itemset mining, has recently been applied to this problem. The other approach, sequential-pattern mining, has not been previously applied to this problem. Here, a call-usage pattern is a composition of function calls that occur in a function definition. Both approaches look for frequently occurring patterns that represent standard usage of functions and identify possible errors. Itemset mining produces unordered patterns, i.e., sets of function calls, whereas sequential-pattern mining produces partially ordered patterns, i.e., sequences of function calls. The trade-off between the additional ordering context given by sequential-pattern mining and the efficiency of itemset mining is investigated. The two approaches are applied to the Linux kernel v2.6.14 and results show that mining ordered patterns is worth the additional cost.
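
The difference between unordered and ordered patterns can be seen in a toy sketch restricted to pairs of calls. Real itemset and sequential-pattern miners handle longer patterns and are far more efficient; the function name, the support threshold, and the call lists below are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(functions, min_support=2):
    """Return frequent unordered call pairs and frequent ordered call pairs.

    functions is a list of call sequences, one per function definition.
    Support is the number of definitions in which a pair occurs.
    """
    unordered, ordered = Counter(), Counter()
    for calls in functions:
        seen_sets, seen_seqs = set(), set()
        # combinations() keeps the original call order within each pair
        for first, second in combinations(calls, 2):
            if first == second:
                continue
            seen_sets.add(frozenset((first, second)))
            seen_seqs.add((first, second))
        unordered.update(seen_sets)
        ordered.update(seen_seqs)
    itemsets = {p for p, n in unordered.items() if n >= min_support}
    sequences = {p for p, n in ordered.items() if n >= min_support}
    return itemsets, sequences

# Made-up call lists; the third one unlocks before locking.
itemsets, sequences = frequent_pairs([
    ["spin_lock", "list_add", "spin_unlock"],
    ["spin_lock", "kfree", "spin_unlock"],
    ["spin_unlock", "spin_lock"],
])
```

The unordered pattern is satisfied by all three call lists, while the ordered pattern is violated by the third, which illustrates the extra context that ordering provides.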

Proceedings Article•10.1109/MSR.2007.17•

Local and Global Recency Weighting Approach to Bug Prediction

Hemant Joshi¹, Chuanlei Zhang¹, Srini Ramaswamy¹, Coskun Bayrak¹•Institutions (1)

University of Arkansas at Little Rock¹

20 May 2007

TL;DR: In this paper, the Eclipse project's recorded software bug history is used to predict occurrence of future bugs.

Abstract: Finding and fixing software bugs is a challenging maintenance task, and a significant amount of effort is invested by software development companies on this issue. In this paper, we use the Eclipse project's recorded software bug history to predict occurrence of future bugs. The history contains information on when bugs have been reported and subsequently fixed.

Proceedings Article•10.1109/MSR.2007.30•

Studying Versioning Information to Understand Inheritance Hierarchy Changes

Filip Van Rysselberghe¹, Serge Demeyer¹•Institutions (1)

University of Antwerp¹

20 May 2007

TL;DR: A study of the hierarchy changes stored in a versioning system to explore the answers to three research questions and formulate 7 hypotheses which should be investigated further to make conclusive interpretations on how hierarchy changes fit in the actual change process.

Abstract: With the widespread adoption of object-oriented programming, changing the inheritance hierarchy became an inherent part of today's software maintenance activities. Unfortunately, little is known about the "state-of-the-practice" with respect to changing an application's inheritance hierarchy, and consequently we do not know how the change process can be improved. In this paper, we report on a study of the hierarchy changes stored in a versioning system to explore the answers to three research questions: (1) why are hierarchy changes made? (2) what kind of hierarchy changes are made? (3) what is the impact of these changes? Based on the results of this study, we formulate 7 hypotheses which should be investigated further to make conclusive interpretations on how hierarchy changes fit in the actual change process.

Proceedings Article•10.1109/MSR.2007.15•

Impact of the Creation of the Mozilla Foundation in the Activity of Developers

Jesus M. Gonzalez-Barahona, Gregorio Robles, Israel Herraiz

20 May 2007

TL;DR: An analysis of the CVS repository of Mozilla is performed, using the CVSAnalY tool, finding little impact on activity, but dramatic changes in the composition of the development team.

Abstract: During 2003, the Mozilla project transitioned from company-promoted (sponsored by AOL) to community-promoted (sponsored by the Mozilla Foundation). What happened to the group of developers during this transition? Was there any significant impact on its activity or composition? To answer these questions, we have performed an analysis of the CVS repository of Mozilla, using the CVSAnalY tool, finding little impact on activity, but dramatic changes in the composition of the development team.

Proceedings Article•10.1109/MSR.2007.24•

Predicting Defects and Changes with Import Relations

Adrian Schroter¹•Institutions (1)

Saarland University¹

20 May 2007

TL;DR: This paper presents a method to train models with import relations to decrease the number of defects by more efficient testing and to assess the effort needed in respect to the number of changes.

Abstract: Lowering the number of defects and estimating the development time of a software project are two important goals of software engineering. To predict the number of defects and changes we train models with import relations. This enables us to decrease the number of defects by more efficient testing and to assess the effort needed in respect to the number of changes.

Proceedings Article•10.1109/MSR.2007.16•

Lightweight Risk Mitigation for Software Development Projects Using Repository Mining

Stephen P. Masticola¹•Institutions (1)

Princeton University¹

20 May 2007

TL;DR: An approach of lightweight risk mitigation is proposed: mine risk data from configuration management and defect tracking systems, integrate this data with project-cost data in a flexible dashboard, and facilitate strategic refactoring with semi-custom transforms where necessary.

Abstract: Many software projects fail to deliver their needed results on-time and on-budget. There are a variety of reasons why this may occur. For some of these reasons (notably deterioration of the codebase), corrective action is often difficult to cost-justify or to implement efficiently in practice. To address this, an approach of lightweight risk mitigation is proposed: mine risk data from configuration management and defect tracking systems, integrate this data with project-cost data in a flexible dashboard, and facilitate strategic refactoring with semi-custom transforms where necessary. This prescriptive information would simultaneously help the project manager to cost-justify repair efforts and lower the cost of finding and fixing hot spots.

Proceedings Article•10.1109/MSR.2007.35•

What Can OSS Mailing Lists Tell Us? A Preliminary Psychometric Text Analysis of the Apache Developer Mailing List

Peter C. Rigby¹, Ahmed E. Hassan¹•Institutions (1)

University of Victoria¹

20 May 2007

TL;DR: A psychometrically-based linguistic analysis tool, the LIWC, is used to examine the Apache httpd server developer mailing list and shows promise in understanding why developers join and leave a project.

Abstract: Developer mailing lists are a rich source of information about Open Source Software (OSS) development. The unstructured nature of email makes extracting information difficult. We use a psychometrically-based linguistic analysis tool, the LIWC, to examine the Apache httpd server developer mailing list. We conduct three preliminary experiments to assess the appropriateness of this tool for information extraction from mailing lists. First, using LIWC dimensions that are correlated with the big five personality traits, we assess the personality of four top developers against a baseline for the entire mailing list. The two developers that were responsible for the major Apache releases had similar personalities. Their personalities were different from the baseline and the other developers. Second, the first and last 50 emails for two top developers who have left the project are examined. The analysis shows promise in understanding why developers join and leave a project. Third, we examine word usage on the mailing list for two major Apache releases. The differences may reflect the relative success of each release.

Proceedings Article•10.1109/MSR.2007.18•

Mining a Change-Based Software Repository

Romain Robbes¹•Institutions (1)

University of Lugano¹

20 May 2007

TL;DR: This paper presents an alternative information repository which stores incremental changes to the system under study, retrieved from the IDE used to build the software, and uses this change-based model of system evolution to assess when refactorings happen, and compares the findings with refactoring detection approaches on classical versioning system repositories.

Abstract: Although state-of-the-art software repositories based on versioning system information are useful to assess the evolution of a software system, the information they contain is limited in several ways. Versioning systems such as CVS or Subversion store only snapshots of text files, leading to a loss of information: the exact sequence of changes between two versions is hard to recover. In this paper we present an alternative information repository which stores incremental changes to the system under study, retrieved from the IDE used to build the software. We then use this change-based model of system evolution to assess when refactorings happen in two case studies, and compare our findings with refactoring detection approaches on classical versioning system repositories.

Proceedings Article•10.1109/MSR.2007.34•

Visual Data Mining in Software Archives to Detect How Developers Work Together

Peter Weissgerber¹, Mathias Pohl¹, Michael Burch¹•Institutions (1)

University of Trier¹

20 May 2007

TL;DR: Three visualization techniques are described that help to examine how programmers work together, e.g., whether they work as a team or develop their parts of the software separately from each other.

Abstract: Analyzing the check-in information of open source software projects which use a version control system such as CVS or SUBVERSION can yield interesting and important insights into the programming behavior of developers. As in every major project, tasks are assigned to many developers, so the development must be coordinated between these programmers. This paper describes three visualization techniques that help to examine how programmers work together, e.g., whether they work as a team or develop their parts of the software separately from each other. Furthermore, phases of stagnation in the lifetime of a project can be uncovered and thus possible problems are revealed. To demonstrate the usefulness of these visualization techniques we performed case studies on two open source projects. In these studies interesting patterns of developers' behavior, e.g., the specialization on a certain module, can be observed. Moreover, modules that have been changed by many developers can be identified, as well as those that have been altered by only one programmer.


Proceedings Article•10.1109/MSR.2007.25•

Predicting Eclipse Bug Lifetimes


Lucas D. Panjer¹•Institutions (1)

University of Victoria¹

20 May 2007

TL;DR: This research explores the viability of using data mining tools to predict the time to fix a bug given only the basic information known at the beginning of a bug's lifetime.


Abstract: In non-trivial software development projects, planning and allocation of resources is an important and difficult task. Estimation of the work time needed to fix a bug is commonly used to support this process. This research explores the viability of using data mining tools to predict the time to fix a bug given only the basic information known at the beginning of a bug's lifetime. To address this question, a historical portion of the Eclipse Bugzilla database is used for modeling and predicting bug lifetimes. A bug history transformation process is described and several data mining models are built and tested. Interesting behaviours derived from the models are documented. The models can correctly predict up to 34.9% of the bugs into a discretized log-scaled lifetime class.
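
The abstract does not say how lifetimes are discretized. As a rough illustration only, the sketch below maps a lifetime in days to a class index on a base-2 log scale; the function name `lifetime_class`, the base, and the number of classes are all assumptions made for this example, not the paper's scheme.

```python
import math


def lifetime_class(days: float, n_classes: int = 7) -> int:
    """Map a bug lifetime in days to a log-scaled class index.

    Assumed scheme (not taken from the paper): lifetimes under one day
    fall into class 0, and each further class covers a doubling of the
    lifetime. The last class collects everything longer.
    """
    if days < 0:
        raise ValueError("lifetime must be non-negative")
    if days < 1:
        return 0
    # floor(log2(days)) grows by one each time the lifetime doubles.
    index = int(math.floor(math.log2(days))) + 1
    return min(index, n_classes - 1)
```

A classifier would then be trained to predict this class index from the fields available when the bug is reported.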


Proceedings Article•10.1109/MSR.2007.27•

Recommending Emergent Teams


Shawn Minto¹, Gail C. Murphy¹•Institutions (1)

University of British Columbia¹

20 May 2007

TL;DR: This paper introduces the emergent expertise locator (EEL) that uses emergent team information to propose experts to a developer within their development environment as the developer works and finds that EEL produces, on average, results with higher precision and higher recall than an existing heuristic for expertise recommendation.


Abstract: To build successful complex software systems, developers must collaborate with each other to solve issues. To facilitate this collaboration, specialized tools, such as chat and screen sharing, are being integrated into development environments. Currently, these tools require a developer to maintain a list of other developers with whom they may wish to communicate and to determine who within this list has expertise for a specific situation. For large, dynamic projects, like several successful open-source projects, these requirements place an unreasonable burden on the developer. In this paper, we show how the structure of a team emerges from how developers change software artifacts. We introduce the emergent expertise locator (EEL) that uses emergent team information to propose experts to a developer within their development environment as the developer works. We found that EEL produces, on average, results with higher precision and higher recall than an existing heuristic for expertise recommendation.
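
The idea of deriving expertise from change history can be sketched in a few lines. This is a deliberately simplified stand-in, not EEL's actual algorithm: it assumes a hypothetical list of `(developer, file)` change records and ranks other developers by how often they changed the files currently being edited.

```python
from collections import Counter


def recommend_experts(history, current_files, me, top_n=3):
    """Rank developers by how often they changed the files being edited.

    history:       iterable of (developer, file_path) change records
                   (hypothetical input format for this sketch)
    current_files: set of file paths the requesting developer is editing
    me:            the requesting developer, excluded from the result
    """
    scores = Counter()
    for developer, path in history:
        if developer != me and path in current_files:
            scores[developer] += 1
    # most_common sorts by descending change count.
    return [developer for developer, _ in scores.most_common(top_n)]
```

For example, with records in which `bob` changed two of the open files and `ann` one, the function returns `bob` ahead of `ann`.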


Proceedings Article•10.1109/MSR.2007.8•

Evaluating the Harmfulness of Cloning: A Change Based Experiment


Angela Lozano¹, Michel Wermelinger¹, Bashar Nuseibeh¹•Institutions (1)

Open University¹

20 May 2007

TL;DR: A prototype tool, CloneTracker, is developed in order to study the rate of change of applications containing clones and its preliminary application on a case study is illustrated.


Abstract: Cloning is considered a harmful practice for software maintenance because it requires consistent changes of the entities that share a cloned fragment. However, this claim has been neither refuted nor confirmed empirically. Therefore, we have developed a prototype tool, CloneTracker, in order to study the rate of change of applications containing clones. This paper describes CloneTracker and illustrates its preliminary application on a case study.


Proceedings Article•10.1109/MSR.2007.10•

Forecasting the Number of Changes in Eclipse Using Time Series Analysis


Israel Herraiz, Jesus M. Gonzalez-Barahona, Gregorio Robles

20 May 2007

TL;DR: In order to predict the number of changes in the following months for the project Eclipse, a statistical (non-explanatory) model based on time series analysis is applied, using the CVSAnalY tool.


Abstract: In order to predict the number of changes in the following months for the project Eclipse, we have applied a statistical (non-explanatory) model based on time series analysis. We have obtained the monthly number of changes in the CVS repository of Eclipse, using the CVSAnalY tool. The input to our model was the filtered series of the number of changes per month, and the output was the number of changes per month for the next three months. Then we aggregated the results of the three months to obtain the total number of changes in the given period in the challenge.
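
The pipeline described here (filter the monthly series, forecast three months, sum the forecasts) can be sketched as follows. The abstract names neither the filter nor the model, so this example substitutes a trailing moving average and a least-squares AR(1) fit; both choices, and all function names, are assumptions for illustration.

```python
def smooth(series, window=3):
    """Trailing moving average; a stand-in for the unspecified filter."""
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def fit_ar1(series):
    """Least-squares fit of x[t] = a + b * x[t-1]; returns (a, b)."""
    x, y = series[:-1], series[1:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # A constant series has zero variance; fall back to predicting the mean.
    b = cov_xy / var_x if var_x else 0.0
    a = mean_y - b * mean_x
    return a, b


def forecast_total(monthly_changes, horizon=3):
    """Forecast the next `horizon` months and return (forecasts, total)."""
    filtered = smooth(monthly_changes)
    a, b = fit_ar1(filtered)
    last, forecasts = filtered[-1], []
    for _ in range(horizon):
        last = a + b * last  # iterate the fitted recurrence one month ahead
        forecasts.append(last)
    return forecasts, sum(forecasts)
```

Calling `forecast_total(counts)` on a list of monthly change counts returns the three monthly forecasts together with their sum, which corresponds to the aggregated figure the abstract mentions.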


Proceedings Article•10.1109/MSR.2007.31•

Towards a Theoretical Model for Software Growth


Israel Herraiz, Jesus M. Gonzalez-Barahona, Gregorio Robles

20 May 2007

TL;DR: A comprehensive study, based on the analysis of about 700,000 C source code files, calculating several size and complexity metrics for all of them, finds double Pareto statistical distributions for all metrics considered, and a high correlation between any two of them.


Abstract: Software growth (and more broadly, software evolution) is usually considered in terms of size or complexity of source code. However, different studies usually use different metrics, which makes it difficult to compare approaches and results. In addition, not all metrics are equally easy to calculate for a given source code, which leads to the question of which one is the easiest to calculate without losing too much information. To address both issues, in this paper we present a comprehensive study, based on the analysis of about 700,000 C source code files, calculating several size and complexity metrics for all of them. For this sample, we have found double Pareto statistical distributions for all metrics considered, and a high correlation between any two of them. This would imply that any model addressing software growth should produce these Pareto distributions, and that analysis based on any of the considered metrics should show a similar pattern, provided the sample of files considered is large enough.
