Top 31 papers presented at Mining Software Repositories in 2009

Proceedings Article•10.1145/2597073.2597074•

The promises and perils of mining GitHub

Eirini Kalliamvakou¹, Georgios Gousios², Kelly Blincoe¹, Leif Singer¹, Daniel M. German¹, Daniela Damian¹•Institutions (2)

University of Victoria¹, Delft University of Technology²

16 May 2009

TL;DR: It is shown, for example, that the majority of the projects are personal and inactive; that GitHub is also being used for free storage and as a Web hosting service; and that almost 40% of all pull requests do not appear as merged, even though they were.

Abstract: We are now witnessing the rapid growth of decentralized source code management (DSCM) systems, in which every developer has her own repository. DSCMs facilitate a style of collaboration in which work output can flow sideways (and privately) between collaborators, rather than always up and down (and publicly) via a central repository. Decentralization comes with both the promise of new data and the peril of its misinterpretation. We focus on git, a very popular DSCM used in high-profile projects. Decentralization, and other features of git, such as automatically recorded contributor attribution, lead to richer content histories, giving rise to new questions such as “How do contributions flow between developers to the official project repository?” However, there are pitfalls. Commits may be reordered, deleted, or edited as they move between repositories. The semantics of terms common to SCMs and DSCMs sometimes differ markedly, potentially creating confusion. For example, a commit is immediately visible to all developers in centralized SCMs, but not in DSCMs. Our goal is to help researchers interested in DSCMs avoid these and other perils when mining and analyzing git data.

979 citations

Proceedings Article•10.1109/MSR.2009.5069475•

The promises and perils of mining git

Rigby, Barr, Hamilton, German, Devanbu (+1 more)

1 Jan 2009

309 citations

Proceedings Article•10.1109/MSR.2009.5069491•

Assigning bug reports using a vocabulary-based expertise model of developers

Dominique Matter¹, Adrian Kuhn¹, Oscar Nierstrasz¹•Institutions (1)

University of Bern¹

16 May 2009

TL;DR: This paper presents an approach to automatically suggest developers who have the appropriate expertise for handling a bug report. It models developer expertise using the vocabulary found in their source code contributions and compares this vocabulary to the vocabulary of bug reports.

Abstract: For popular software systems, the number of daily submitted bug reports is high. Triaging these incoming reports is a time-consuming task. Part of the bug triage is the assignment of a report to a developer with the appropriate expertise. In this paper, we present an approach to automatically suggest developers who have the appropriate expertise for handling a bug report. We model developer expertise using the vocabulary found in their source code contributions and compare this vocabulary to the vocabulary of bug reports. We evaluate our approach by comparing the suggested experts to the persons who eventually worked on the bug. Using eight years of Eclipse development as a case study, we achieve 33.6% top-1 precision and 71.0% top-10 recall.
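
The vocabulary comparison described in this abstract can be illustrated with a minimal sketch. It assumes plain term-frequency vectors and cosine similarity, which the abstract does not specify; the developer names and texts are hypothetical toy data, not the paper's Eclipse corpus.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_developers(bug_report, contributions, k=10):
    """Return up to k developers, most similar vocabulary first."""
    report_vec = Counter(tokenize(bug_report))
    scores = {
        dev: cosine(report_vec, Counter(tokenize(" ".join(texts))))
        for dev, texts in contributions.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical contribution vocabularies (assumption, for illustration only).
contributions = {
    "alice": ["parser token stream syntax error", "parser grammar rule"],
    "bob": ["render widget layout paint", "layout resize widget"],
}
ranking = rank_developers("syntax error in parser when reading token", contributions)
```

Here `ranking` places the developer whose past contributions share the most vocabulary with the report first; a real study would then compare such suggestions with the person who eventually worked on the bug.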

264 citations

Proceedings Article•10.1109/MSR.2009.5069476•

Amassing and indexing a large sample of version control systems: Towards the census of public source code history

Audris Mockus¹•Institutions (1)

Avaya¹

16 May 2009

TL;DR: This work describes methods developed over the last six years to gather, index, and update an approximation of such a universal repository for publicly accessible version control systems and for the source code inside a large corporation.

Abstract: The source code and its history represent the output and process of software development activities and are an invaluable resource for study and improvement of software development practice. While individual projects and groups of projects have been extensively analyzed, some fundamental questions, such as the spread of innovation or genealogy of the source code, can be answered only by considering the entire universe of publicly available source code and its history. We describe methods we developed over the last six years to gather, index, and update an approximation of such a universal repository for publicly accessible version control systems and for the source code inside a large corporation. While challenging, the task is achievable with limited resources. The bottlenecks in network bandwidth, processing, and disk access can be dealt with using inherent parallelism of the tasks and suitable tradeoffs between the amount of storage and computations, but a completely automated discovery of public version control systems may require enticing participation of the sampled projects. Such a universal repository would allow studies of global properties and origins of the source code that are not possible through other means.

127 citations

Proceedings Article•10.1109/MSR.2009.5069488•

On the use of Internet Relay Chat (IRC) meetings by developers of the GNOME GTK+ project

Emad Shihab¹, Zhen Ming Jiang¹, Ahmed E. Hassan¹•Institutions (1)

Queen's University¹

16 May 2009

TL;DR: The findings show that IRC meetings are gaining popularity among open source developers and maintainers: the IRC meeting discussions are increasing in volume, have increasing attendance levels, and the participants actively contribute to the meetings.

Abstract: Developers of open source projects are distributed across the world. They rely on email, mailing lists, instant messaging, IRC channels and, more recently, IRC meetings to communicate. Most of the studies thus far focus on the use of mailing lists by OSS developers; however, an increasing number of open source projects are using IRC meetings to hold developer meetings. In this paper, we mine the #gtk-devel IRC meeting channel and study the usage of the IRC meetings held by the GNOME GTK+ core developers and maintainers. We look at three different dimensions: the discussion volume of the meetings, the number of participants attending the meetings and the activity of these participants. Our findings show that IRC meetings are gaining popularity among open source developers and maintainers: the IRC meeting discussions are increasing in volume, have increasing attendance levels, and the participants actively contribute to the meetings. To the best of our knowledge, this is the first study on the use of developer IRC meetings by OSS developers.

57 citations

Proceedings Article•10.1109/MSR.2009.5069501•

SourcererDB: An aggregated repository of statically analyzed and cross-linked open source Java projects

Joel Ossher¹, Sushil Bajracharya¹, Erik Linstead¹, Pierre Baldi¹, Cristina V. Lopes¹•Institutions (1)

University of California, Irvine¹

16 May 2009

TL;DR: The goal in building SourcererDB is to provide a rich dataset of source code to facilitate the sharing of extracted data and to encourage reuse and repeatability of experiments.

Abstract: The open source movement has made vast quantities of source code available online for free, providing an extremely large dataset for empirical study and potential reuse. A major difficulty in exploiting this potential fully is that the data are currently scattered between competing source code repositories, none of which are structured for empirical analysis and cross-project comparison. As a result, software researchers and developers are left to compile their own datasets, resulting in duplicated effort and limited results. To address this challenge, we built SourcererDB, an aggregated repository of statically analyzed and cross-linked open source Java projects. SourcererDB contains local snapshots of 2,852 Java projects taken from Sourceforge, Apache and Java.net. These projects are statically analyzed to extract rich structural information, which is then stored in a relational database. References to entities in the 16,058 external jars are resolved and grouped, allowing for cross-project usage information to be accessed easily. This paper describes: (a) the mechanism for resolving and grouping these cross-project references, (b) the structure of and the metamodel for the SourcererDB repository, and (c) end-user dataset access mechanisms. Our goal in building SourcererDB is to provide a rich dataset of source code to facilitate the sharing of extracted data and to encourage reuse and repeatability of experiments.

55 citations

Proceedings Article•10.1109/MSR.2009.5069477•

MapReduce as a general framework to support research in Mining Software Repositories (MSR)

Weiyi Shang¹, Zhen Ming Jiang¹, Bram Adams¹, Ahmed E. Hassan¹•Institutions (1)

Queen's University¹

16 May 2009

TL;DR: This paper migrates J-REX, an optimized evolutionary code extractor, to run on Hadoop, an open source implementation of MapReduce, and highlights the benefits and challenges of the MapReduce framework in the MSR community.

Abstract: Researchers continue to demonstrate the benefits of Mining Software Repositories (MSR) for supporting software development and research activities. However, as the mining process is time and resource intensive, they often create their own distributed platforms and use various optimizations to speed up and scale up their analysis. These platforms are project-specific, hard to reuse, and offer minimal debugging and deployment support. In this paper, we propose the use of MapReduce, a distributed computing platform, to support research in MSR. As a proof-of-concept, we migrate J-REX, an optimized evolutionary code extractor, to run on Hadoop, an open source implementation of MapReduce. Through a case study on the source control repositories of the Eclipse, BIRT and Datatools projects, we demonstrate that the migration effort to MapReduce is minimal and that the benefits are significant, as running time of the migrated J-REX is only 30% to 50% of the original J-REX's. This paper documents our experience with the migration, and highlights the benefits and challenges of the MapReduce framework in the MSR community.
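
To make the MapReduce programming model concrete, the following is a minimal single-process sketch of its map, shuffle, and reduce phases applied to a typical repository-mining task: counting how often each file was changed. It is not J-REX and does not use Hadoop; the commit records and function names are assumptions made for illustration.

```python
from collections import defaultdict

def map_commit(commit):
    """Map step: emit a (file, 1) pair for each file touched by a commit."""
    for path in commit["files"]:
        yield path, 1

def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(key, values):
    """Reduce step: total number of changes for one file."""
    return key, sum(values)

def run_job(commits):
    """Run the three phases in sequence within one process."""
    pairs = (pair for commit in commits for pair in map_commit(commit))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())

# Hypothetical commit records (assumption, for illustration only).
commits = [
    {"id": "c1", "files": ["A.java", "B.java"]},
    {"id": "c2", "files": ["A.java"]},
    {"id": "c3", "files": ["A.java", "C.java"]},
]
change_counts = run_job(commits)
```

Because each commit is mapped independently and each key is reduced independently, a distributed framework can spread both phases across machines, which is the property the abstract relies on for its speed-up.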

53 citations

Proceedings Article•10.1109/MSR.2009.5069493•

Using association rules to study the co-evolution of production & test code

Zeeger Lubsen, Andy Zaidman¹, Martin Pinzger¹•Institutions (1)

Delft University of Technology¹

16 May 2009

TL;DR: This paper shows that the association rule mining approach allows one to assess the co-evolution of product and test code in a software project and to uncover the distribution of programmer effort over pure coding, pure testing, or a more test-driven-like practice.

Abstract: Unit tests are generally acknowledged as an important aid to produce high quality code, as they provide quick feedback to developers on the correctness of their code. In order to achieve high quality, well-maintained tests are needed. Ideally, tests co-evolve with the production code to test changes as soon as possible. In this paper, we explore an approach based on association rule mining to determine whether production and test code co-evolve synchronously. Through two case studies, one with an open source and another one with an industrial software system, we show that our association rule mining approach allows one to assess the co-evolution of product and test code in a software project and, moreover, to uncover the distribution of programmer effort over pure coding, pure testing, or a more test-driven-like practice.
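
As a rough illustration of association rule mining over commits, the sketch below treats each commit as a transaction and computes the support and confidence of the rule "production code changed, therefore test code changed". The file-naming convention used to recognize tests and the sample commits are assumptions; the paper's actual rule set and classification are not given in the abstract.

```python
def is_test(path):
    """Hypothetical naming convention: test files end with 'Test.java'."""
    return path.endswith("Test.java")

def classify(commit_files):
    """Return the set of code categories ('prod', 'test') touched by a commit."""
    return {"test" if is_test(p) else "prod" for p in commit_files}

def rule_metrics(transactions, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent."""
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent in t and consequent in t)
    ante = sum(1 for t in transactions if antecedent in t)
    support = both / n if n else 0.0
    confidence = both / ante if ante else 0.0
    return support, confidence

# Hypothetical commit history (assumption, for illustration only).
commits = [
    ["Parser.java", "ParserTest.java"],
    ["Parser.java"],
    ["Lexer.java", "LexerTest.java"],
    ["LexerTest.java"],
]
transactions = [classify(files) for files in commits]
support, confidence = rule_metrics(transactions, "prod", "test")
```

In this toy history, half of all commits change production and test code together, and two of the three commits that touch production code also touch tests. High confidence for this rule would suggest synchronous co-evolution; commits containing only one category correspond to the pure coding or pure testing effort mentioned in the abstract.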

48 citations

Proceedings Article•10.1109/MSR.2009.5069481•

Does calling structure information improve the accuracy of fault prediction?

Yonghee Shin¹, Robert M. Bell², Thomas J. Ostrand², Elaine J. Weyuker²•Institutions (2)

North Carolina State University¹, AT&T Labs²

16 May 2009

TL;DR: The addition of calling structure information to a model based solely on non-calling structure code attributes provided noticeable improvement in prediction accuracy, but only marginally improved the best model based on history and non-calling structure code attributes.

Abstract: Previous studies have shown that software code attributes, such as lines of source code, and history information, such as the number of code changes and the number of faults in prior releases of software, are useful for predicting where faults will occur. In this study of an industrial software system, we investigate the effectiveness of adding information about calling structure to fault prediction models. The addition of calling structure information to a model based solely on non-calling structure code attributes provided noticeable improvement in prediction accuracy, but only marginally improved the best model based on history and non-calling structure code attributes. The best model based on history and non-calling structure code attributes outperformed the best model based on calling and non-calling structure code attributes.

41 citations

Proceedings Article•10.1109/MSR.2009.5069486•

Mining the coherence of GNOME bug reports with statistical topic models

Erik Linstead¹, Pierre Baldi¹•Institutions (1)

University of California, Irvine¹

16 May 2009

TL;DR: This work adapts Latent Dirichlet Allocation to the problem of mining bug reports in order to define a new information-theoretic measure of coherence and applies this technique to a snapshot of the GNOME Bugzilla database.

Abstract: We adapt Latent Dirichlet Allocation to the problem of mining bug reports in order to define a new information-theoretic measure of coherence. We then apply our technique to a snapshot of the GNOME Bugzilla database consisting of 431,863 bug reports for multiple software projects. In addition to providing an unsupervised means for modeling report content, our results indicate substantial promise in applying statistical text mining algorithms for estimating bug report quality. Complete results are available from our supplementary materials website at http://sourcerer.ics.uci.edu/msr2009/gnome_coherence.html.

39 citations

Proceedings Article•10.1109/MSR.2009.5069478•

A platform for software engineering research

Georgios Gousios¹, Diomidis Spinellis¹•Institutions (1)

Athens University of Economics and Business¹

16 May 2009

TL;DR: The Alitheia Core platform is presented in detail and its usefulness in mining software repositories is demonstrated by guiding the reader through the steps required to execute a simple experiment.

Abstract: Research in the fields of software quality, maintainability and evolution requires the analysis of large quantities of data, which often originate from open source software projects. Collecting and preprocessing data, calculating metrics, and synthesizing composite results from a large corpus of project artifacts is a tedious and error-prone task lacking direct scientific value. The Alitheia Core tool is an extensible platform for software quality analysis that is designed specifically to facilitate software engineering research on large and diverse data sources, by integrating data collection and preprocessing phases with an array of analysis services, and presenting the researcher with an easy-to-use extension mechanism. Alitheia Core aims to be the basis of an ecosystem of shared tools and research data that will enable researchers to focus on their research questions at hand, rather than spend time on re-implementing analysis tools. In this paper, we present the Alitheia Core platform in detail and demonstrate its usefulness in mining software repositories by guiding the reader through the steps required to execute a simple experiment.

Proceedings Article•10.1109/MSR.2009.5069499•

Automatic labeling of software components and their evolution using log-likelihood ratio of word frequencies in source code

Adrian Kuhn¹•Institutions (1)

University of Bern¹

16 May 2009

TL;DR: A lexical approach that uses the log-likelihood ratios of word frequencies to automatically provide labels for software components and applies the approach to detect trends in the evolution of a software system.

Abstract: As more and more open-source software components become available on the internet we need automatic ways to label and compare them. For example, a developer who searches for reusable software must be able to quickly gain an understanding of retrieved components. This understanding cannot be gained at the level of source code due to the semantic gap between source code and the domain model. In this paper we present a lexical approach that uses the log-likelihood ratios of word frequencies to automatically provide labels for software components. We present a prototype implementation of our labeling/comparison algorithm and provide examples of its application. In particular, we apply the approach to detect trends in the evolution of a software system.
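
The labeling idea can be sketched with one common formulation of the log-likelihood ratio for comparing a word's frequency in two corpora (the G² statistic over observed and expected counts). Whether the paper uses exactly this formulation is an assumption, and the word lists below are hypothetical stand-ins for identifiers extracted from source code.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """G2 statistic for a word seen a times among c words and b times among d words."""
    e1 = c * (a + b) / (c + d)  # expected count in the first corpus
    e2 = d * (a + b) / (c + d)  # expected count in the second corpus
    g2 = 0.0
    if a:
        g2 += a * math.log(a / e1)
    if b:
        g2 += b * math.log(b / e2)
    return 2 * g2

def labels(component_words, reference_words, k=3):
    """Top-k words that are over-represented in the component, by G2 score."""
    comp, ref = Counter(component_words), Counter(reference_words)
    c, d = sum(comp.values()), sum(ref.values())
    if not c or not d:
        return []
    scored = []
    for word, a in comp.items():
        b = ref[word]
        if a / c > b / d:  # keep only words more frequent in the component
            scored.append((log_likelihood(a, b, c, d), word))
    return [word for _, word in sorted(scored, reverse=True)[:k]]

# Hypothetical word lists (assumption, for illustration only).
component = "socket socket socket connect buffer get set".split()
reference = "get set get set list map socket widget paint get".split()
top_labels = labels(component, reference)
```

Generic words such as `get` and `set` are frequent everywhere and are filtered out, while words concentrated in the component rise to the top. Applying the same comparison between successive versions of one component, instead of between a component and a reference corpus, gives a simple way to surface evolution trends.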

Proceedings Article•10.1109/MSR.2009.5069487•

Visualizing Gnome with the Small Project Observatory

Mircea Lungu¹, Jacopo Malnati¹, Michele Lanza¹•Institutions (1)

University of Lugano¹

16 May 2009

TL;DR: This work analyzes the Gnome family of systems with the Small Project Observatory (SPO), the authors' online ecosystem visualization platform. It introduces the model of SPO and looks at how the contributors are distributed between writing source code and doing other activities such as internationalization.

Abstract: We analyzed the Gnome family of systems with the Small Project Observatory, our online ecosystem visualization platform. We begin by briefly introducing the model of SPO. We then observe and discuss several phases in the activity of the Gnome ecosystem. Next, we look at how the contributors are distributed between writing source code and doing other activities such as internationalization. We end with a visual overview of the activity of more than 900 contributors in the 10 years of existence of Gnome.

Proceedings Article•10.1109/MSR.2009.5069495•

Mining the Jazz repository: Challenges and opportunities

Kim Herzig¹, Andreas Zeller¹•Institutions (1)

Saarland University¹

16 May 2009

TL;DR: The initial experiences from mining the Jazz repository are shared, a short overview of the retrieved data sets is given, and possible problems of the Jazz repository and the platform itself are discussed.

Abstract: By integrating various development and collaboration tools into one single platform, the Jazz environment offers several opportunities for software repository miners. In particular, Jazz offers full traceability from the initial requirements via work packages and work assignments to the final changes and tests; all these features can be easily accessed and leveraged for better prediction and recommendation systems. In this paper, we share our initial experiences from mining the Jazz repository. We also give a short overview of the retrieved data sets and discuss possible problems of the Jazz repository and the platform itself.

Proceedings Article•10.1109/MSR.2009.5069498•

On mining data across software repositories

Prasanth Anbalagan¹, Mladen A. Vouk¹•Institutions (1)

North Carolina State University¹

16 May 2009

TL;DR: This paper describes a framework that uses web scraping to automatically mine repositories and link information across repositories; the percentage of security bugs identified using the tool is consistent with that reported by other researchers.

Abstract: Software repositories provide an abundance of valuable information about open source projects. With the increase in the size of the data maintained by the repositories, automated extraction of such data from individual repositories, as well as of linked information across repositories, has become a necessity. In this paper we describe a framework that uses web scraping to automatically mine repositories and link information across repositories. We discuss two implementations of the framework. In the first implementation, we automatically identify and collect security problem reports from project repositories that deploy the Bugzilla bug tracker using related vulnerability information from the National Vulnerability Database. In the second, we collect security problem reports for projects that deploy the Launchpad bug tracker along with related vulnerability information from the National Vulnerability Database. We have evaluated our tool on various releases of Fedora, Ubuntu, Suse, RedHat, and Firefox projects. The percentage of security bugs identified using our tool is consistent with that reported by other researchers.

Proceedings Article•10.1109/MSR.2009.5069485•

Evaluating process quality in GNOME based on change request data

Holger Schackmann¹, Horst Lichter¹•Institutions (1)

RWTH Aachen University¹

16 May 2009

TL;DR: A quality model for the analysis of quality characteristics that is based on evaluating metrics on the Bugzilla database is presented, and a comparative evaluation for 25 of the largest products within GNOME is illustrated.

Abstract: The lifecycle of defect reports and enhancement requests collected in the Bugzilla database of the GNOME project provides valuable information on the evolution of the change request process and for the assessment of process quality in the GNOME subprojects. We present a quality model for the analysis of quality characteristics that is based on evaluating metrics on the Bugzilla database, and illustrate it with a comparative evaluation for 25 of the largest products within GNOME.

Proceedings Article•10.1109/MSR.2009.5069494•

On what basis to recommend: Changesets or interactions?

Sarah Rastkar¹, Gail C. Murphy¹•Institutions (1)

University of British Columbia¹

16 May 2009

TL;DR: There is no direct relationship between the bug reports found similar with the different methods, suggesting that each comparison method captures a different aspect of the problem.

Abstract: Different flavours of recommendation systems have been proposed to help software developers perform software evolution tasks. A number of these recommendation systems are based on changesets. When changeset information is used, recommendations are based on only the end result of the activity undertaken to complete a task. In this paper, we report on an investigation of how recommendations based on changesets compare to recommendations based on interactions collected as a programmer performed the task that resulted in a changeset. To provide a common basis for the comparison, our investigation considered how bug reports considered similar based on changeset information compare to bug reports considered similar based on interaction information. We found that there is no direct relationship between the bug reports found similar with the different methods, suggesting that each comparison method captures a different aspect of the problem.

Book•10.1007/978-3-642-04590-5•

Metadata and Semantic Research

Fabio Sartori¹, Miguel-Angel Sicilia², Nikos Manouselis³•Institutions (3)

University of Milan¹, University of Alcalá², Greek Research and Technology Network³

1 Jan 2009

TL;DR: This volume constitutes the selected papers of the third international conference on Metadata and Semantic Research, MTSR 2009, held in Milan, Italy, in September/October 2009, and mirrors the structure of the Congress.

Proceedings Article•10.1109/MSR.2009.5069500•

Learning from defect removals

Nathaniel Ayewah¹, William Pugh¹•Institutions (1)

University of Maryland, College Park¹

16 May 2009

TL;DR: Not all changes linked to bug tracking systems are fixing bugs; some are enhancing the code; and not all fixes are applied at the point in the code where the bug was originally introduced.

Abstract: Recent research has tried to identify changes in source code repositories that fix bugs by linking these changes to reports in issue tracking systems. These changes have been traced back to the point in time when they were previously modified as a way of identifying bug introducing changes. But we observe that not all changes linked to bug tracking systems are fixing bugs; some are enhancing the code. Furthermore, not all fixes are applied at the point in the code where the bug was originally introduced. We flesh out these observations with a manual review of several software projects, and use this opportunity to see how many defects are in the scope of static analysis tools.

Proceedings Article•10.1109/MSR.2009.5069502•

On the transfer of evolutionary couplings to industry

Piërre van de Laar

16 May 2009

TL;DR: This paper describes a case study at Philips Healthcare MRI on evolutionary couplings, i.e., a technique to infer relationships among modules by analyzing their history of changes in the source code archive. The attempt to transfer CouplingViewer, a tool implementing the current state of the art in evolutionary couplings, to industry failed.

Abstract: In this paper, we describe a case study at Philips Healthcare MRI focusing on evolutionary couplings, i.e., a technique to infer relationships among modules by analyzing their history of changes in the source code archive. In this case study, we failed to transfer CouplingViewer, a tool implementing the current state of the art in evolutionary couplings, to industry. According to the industrial experts, an important industrial requirement was not met: the signal-to-noise ratio was too low.
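
A basic form of evolutionary coupling can be sketched by counting how often pairs of modules change in the same commit. This is a plain co-change count with a support threshold, not the CouplingViewer tool; the module names, the history, and the threshold are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def co_change_counts(commits):
    """Count, for each unordered module pair, the commits that changed both."""
    pair_counts = Counter()
    for modules in commits:
        for pair in combinations(sorted(set(modules)), 2):
            pair_counts[pair] += 1
    return pair_counts

def coupled_pairs(commits, min_support=2):
    """Keep only pairs that changed together at least min_support times."""
    return {
        pair: n
        for pair, n in co_change_counts(commits).items()
        if n >= min_support
    }

# Hypothetical change history: each entry lists the modules touched by one commit.
history = [
    ["scanner", "reconstruction"],
    ["scanner", "reconstruction", "ui"],
    ["ui"],
    ["scanner", "ui"],
]
couplings = coupled_pairs(history)
```

Raising `min_support` trades recall for precision, which is one simple lever on the signal-to-noise concern the industrial experts raised; large commits that touch many unrelated modules are a typical source of the noise.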

Proceedings Article•10.1109/MSR.2009.5069474•

A brief history of software — from Bell Labs to Microsoft Research

Thomas Ball¹•Institutions (1)

Microsoft¹

16 May 2009

TL;DR: In the mid 1990s, I was (tangentially) part of an effort in Bell Labs called the “Code Decay” project, which awakened me to the power of combining statistical expertise with software engineering expertise to address pressing problems of software production in a statistically valid manner.

Abstract: In the mid 1990s, I was (tangentially) part of an effort in Bell Labs called the “Code Decay” project. The hypothesis of this project was that over time code becomes fragile (more difficult to change without introducing problems), and that this process of decay could be empirically validated. This effort awakened me to the power of combining statistical expertise with software engineering expertise to address pressing problems of software production in a statistically valid manner. I will revisit some of the work we did in the Code Decay project at Bell Labs and then turn to what has been happening in this area in Microsoft in the last five years. In particular, I will trace how we have progressed from studying the data produced by product teams to validate hypotheses, to being actively involved with the product groups in creating and evaluating new tools and techniques for empirically-based software production.

Proceedings Article•10.1109/MSR.2009.5069479•

Evaluating the relation between coding standard violations and faults within and across software versions

C. Boogerd¹, Leon Moonen²•Institutions (2)

Delft University of Technology¹, Simula Research Laboratory²

16 May 2009

TL;DR: It is found that 10 rules in the standard are significant predictors of fault location, and three different aspects of the relation between violations and faults on a larger case study are investigated.

Abstract: In spite of the widespread use of coding standards and tools enforcing their rules, there is little empirical evidence supporting the intuition that they prevent the introduction of faults in software. In previous work, we performed a pilot study to assess the relation between rule violations and actual faults, using the MISRA C 2004 standard on an industrial case. In this paper, we investigate three different aspects of the relation between violations and faults on a larger case study, and compare the results across the two projects. We find that 10 rules in the standard are significant predictors of fault location.

Proceedings Article•10.1109/MSR.2009.5069480•

Tracking concept drift of software projects using defect prediction quality

Jayalath Ekanayake¹, Jonas Tappolet¹, Harald C. Gall¹, Abraham Bernstein¹•Institutions (1)

University of Zurich¹

16 May 2009

TL;DR: The experiments uncover that software systems are subject to considerable concept drifts in their evolution history, and suggest that project managers using defect prediction models for decision making should be aware of the actual phase of stability or instability due to a potential concept drift.

Abstract: Defect prediction is an important task in the mining of software repositories, but the quality of predictions varies strongly within and across software projects. In this paper we investigate why the prediction quality fluctuates so strongly as a result of the changing nature of the bug (or defect) fixing process. Therefore, we adopt the notion of a concept drift, which denotes that the defect prediction model has become unsuitable as the set of influencing features has changed, usually due to a change in the underlying bug generation process (i.e., the concept). We explore four open source projects (Eclipse, OpenOffice, Netbeans and Mozilla) and construct file-level and project-level features for each of them from their respective CVS and Bugzilla repositories. We then use this data to build defect prediction models and visualize the prediction quality along the time axis. These visualizations allow us to identify concept drifts and, as a consequence, phases of stability and instability expressed in the level of defect prediction quality. Further, we identify those project features that influence the defect prediction quality, using both a tree-induction algorithm and a linear regression model. Our experiments uncover that software systems are subject to considerable concept drifts in their evolution history. Specifically, we observe that the change in the number of authors editing a file and the number of defects fixed by them contribute to a project's concept drift and therefore influence the defect prediction quality. Our findings suggest that project managers using defect prediction models for decision making should be aware of the actual phase of stability or instability due to a potential concept drift.
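
The idea of tracking prediction quality along the time axis can be sketched as follows. The sketch uses plain accuracy per period and flags a possible drift when accuracy drops sharply from one period to the next; the quality measure, the drop threshold, and the sample records are all assumptions, since the abstract does not name the metric or the detection rule.

```python
from collections import defaultdict

def quality_by_period(records):
    """Accuracy of defect predictions per time period.

    records: iterable of (period, predicted_defective, actually_defective).
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for period, predicted, actual in records:
        totals[period] += 1
        hits[period] += int(predicted == actual)
    return {p: hits[p] / totals[p] for p in sorted(totals)}

def drift_points(quality, max_drop=0.2):
    """Periods whose accuracy falls more than max_drop below the previous period."""
    periods = list(quality)
    return [
        cur
        for prev, cur in zip(periods, periods[1:])
        if quality[prev] - quality[cur] > max_drop
    ]

# Hypothetical prediction outcomes per quarter (assumption, for illustration only).
records = [
    ("2005-Q1", True, True), ("2005-Q1", False, False),
    ("2005-Q2", True, True), ("2005-Q2", False, False),
    ("2005-Q3", True, False), ("2005-Q3", False, True),
    ("2005-Q4", True, True), ("2005-Q4", False, True),
]
quality = quality_by_period(records)
drifts = drift_points(quality)
```

Plotting `quality` over time gives the kind of visualization the abstract describes: flat stretches correspond to stable phases, and a flagged period marks a point where a model trained on earlier data may no longer fit.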

Proceedings Article•10.1109/MSR.2009.5069489•

Mining search topics from a code search engine usage log

Sushil Bajracharya¹, Cristina V. Lopes¹•Institutions (1)

University of California, Irvine¹

16 May 2009

TL;DR: A general categorization of these topics that provides insights on the different ways code search engine users express their queries is presented, supporting the conclusion that existing code search engines provide only a subset of the various information needs of the users when compared to the categories of queries they look at.

Abstract: We present a topic modeling analysis of a year-long usage log of Koders, one of the major commercial code search engines. This analysis contributes to the understanding of what users of code search engines are looking for. Observations on the prevalence of these topics among the users, and on how search and download activities vary across topics, lead to the conclusion that users who find code search engines usable are those who already know to a high level of specificity what to look for. This paper presents a general categorization of these topics that provides insights on the different ways code search engine users express their queries. The findings support the conclusion that existing code search engines address only a subset of the users' various information needs when compared to the categories of queries they look at.

Proceedings Article•10.1109/MSR.2009.5069492•

Mining the history of synchronous changes to refine code ownership

Lile Hattori¹, Michele Lanza¹•Institutions (1)

University of Lugano¹

16 May 2009

TL;DR: This paper illustrates how the information mined by the Syde tool can help to provide a refined notion of code ownership and breaks new ground in terms of how such information can assist developers.

Abstract: When software repositories are mined, two distinct sources of information are usually explored: the history log and snapshots of the system. Results of analyses derived from these two sources are biased by the frequency with which developers commit their changes. We argue that the usage of mainstream SCM systems influences the way that developers work. For example, since it is tedious to resolve conflicts due to parallel commits, developers tend to minimize conflicts by not modifying the same file at the same time. This, however, defeats one of the purposes of such systems. We mine repositories created by our Syde tool, which records every change by every developer in multi-developer projects. This new source of information can improve the accuracy of analyses and breaks new ground in terms of how such information can assist developers. In this paper we illustrate how the information we mine can help to provide a refined notion of code ownership. As a case study, we analyze developers' activities during the development of a commercial system.

Proceedings Article•10.1109/MSR.2009.5069482•

Mining source code to automatically split identifiers for software analysis

Eric Enslen¹, Emily Hill¹, Lori Pollock¹, K. Vijay-Shanker¹•Institutions (1)

University of Delaware¹

16 May 2009

TL;DR: An algorithm to automatically split identifiers into sequences of words by mining word frequencies in source code and using a scoring technique to automatically select the most appropriate partitioning for an identifier is presented.

Abstract: Automated software engineering tools (e.g., program search, concern location, code reuse, quality assessment, etc.) increasingly rely on natural language information from comments and identifiers in code. The first step in analyzing words from identifiers is splitting identifiers into their constituent words. Unlike natural languages, where spaces and punctuation are used to delineate words, identifiers cannot contain spaces. One common way to split identifiers is to follow programming language naming conventions. For example, Java programmers often use camel case, where words are delineated by uppercase letters or non-alphabetic characters. However, programmers also create identifiers by concatenating sequences of words together with no discernible delineation, which poses challenges to automatic identifier splitting. In this paper, we present an algorithm to automatically split identifiers into sequences of words by mining word frequencies in source code. With these word frequencies, our identifier splitter uses a scoring technique to automatically select the most appropriate partitioning for an identifier. In an evaluation of over 8000 identifiers from open source Java programs, our Samurai approach outperforms the existing state-of-the-art techniques.
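
The general idea of frequency-based splitting can be sketched as follows. This is a simplified illustration, not the Samurai algorithm itself: the frequency table `WORD_FREQ`, the scoring function `word_score`, and all function names are hypothetical, and the paper's actual scoring technique is not described in the abstract. The sketch first splits on naming conventions, then chooses the best-scoring partition of each remaining token by dynamic programming.

```python
import math
import re

# Hypothetical word frequencies, standing in for counts mined from source code.
WORD_FREQ = {"get": 900, "file": 700, "name": 650, "user": 500,
             "list": 400, "str": 300, "len": 250}


def split_on_conventions(identifier):
    """Split on camel-case boundaries and non-alphabetic characters."""
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", identifier)


def word_score(word):
    """Assumed scoring: reward long, frequent words; penalize unknown ones."""
    freq = WORD_FREQ.get(word)
    if freq is None:
        # The extra -1 per word makes needless splits of unknown text lose.
        return -10.0 * len(word) - 1.0
    return len(word) * math.log(freq)


def split_same_case(token):
    """Pick the highest-scoring partition of a token with no case boundaries."""
    token = token.lower()
    n = len(token)
    # best[i] holds (score, words) for the best partition of token[:i].
    best = [(0.0, [])] + [(-math.inf, [])] * n
    for end in range(1, n + 1):
        for start in range(end):
            word = token[start:end]
            score = best[start][0] + word_score(word)
            if score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    return best[n][1]


def split_identifier(identifier):
    """Return the lowercased words that make up an identifier."""
    words = []
    for token in split_on_conventions(identifier):
        words.extend(split_same_case(token))
    return words
```

With the toy table above, `split_identifier("getFilename")` yields `["get", "file", "name"]`: the camel-case boundary separates `get`, and the frequency scores separate `file` from `name`. Real frequency tables would be far larger, and the penalty constants here are arbitrary choices.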

Proceedings Article•10.1109/MSR.2009.5069497•

Evolution of the core team of developers in libre software projects

Gregorio Robles¹, Jesus M. Gonzalez-Barahona¹, Israel Herraiz¹•Institutions (1)

King Juan Carlos University¹

16 May 2009

TL;DR: This position paper proposes a quantitative methodology to study the evolution of core teams by analyzing information from source code management repositories, and identifies the most active developers in different periods of development.

Abstract: In many libre (free, open source) software projects, most of the development is performed by a relatively small number of persons, the “core team”. The stability and permanence of this group of most active developers is of great importance for the evolution and sustainability of the project. In this position paper we propose a quantitative methodology to study the evolution of core teams by analyzing information from source code management repositories. The most active developers in different periods are identified, and their activity is calculated over time, looking for core team evolution patterns.
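
One way to picture the per-period identification of the most active developers is the sketch below. It assumes commits are available as (author, period) pairs and that the core team is the top fraction of committers by commit count in each period; the function name `core_teams` and the default `fraction=0.2` are illustrative assumptions, not values taken from the paper.

```python
import math
from collections import Counter, defaultdict


def core_teams(commits, fraction=0.2):
    """Map each period to its most active authors.

    commits: iterable of (author, period) pairs, e.g. ("ana", 2007).
    fraction: assumed share of authors counted as the core team.
    """
    per_period = defaultdict(Counter)
    for author, period in commits:
        per_period[period][author] += 1

    teams = {}
    for period in sorted(per_period):
        counts = per_period[period]
        size = max(1, math.ceil(fraction * len(counts)))
        # Rank by commit count, breaking ties alphabetically for stability.
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        teams[period] = [author for author, _ in ranked[:size]]
    return teams
```

Comparing the team lists of consecutive periods (for example, by set intersection) then gives a rough view of how stable the core group is over time.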

Proceedings Article•10.1109/MSR.2009.5069496•

Using Latent Dirichlet Allocation for automatic categorization of software

Kai Tian¹, Meghan Revelle¹, Denys Poshyvanyk¹•Institutions (1)

College of William & Mary¹

16 May 2009

TL;DR: The results indicate that LACT can be used effectively for the automatic categorization of software systems regardless of the underlying programming language or paradigm and can identify several new categories that are based on libraries, architectures, or programming languages.

Abstract: In this paper, we propose a technique called LACT for automatically categorizing software systems in open-source repositories. LACT is based on Latent Dirichlet Allocation, an information retrieval method which is used to index and analyze source code documents as mixtures of probabilistic topics. For an initial evaluation, we performed two studies. In the first study, LACT was compared against an existing tool, MUDABlue, for classifying 41 software systems written in C into problem domain categories. The results indicate that LACT can automatically produce meaningful category names and yield classification results comparable to MUDABlue. In the second study, we applied LACT to 43 software systems written in different programming languages such as C/C++, Java, C#, PHP, and Perl. The results indicate that LACT can be used effectively for the automatic categorization of software systems regardless of the underlying programming language or paradigm. Moreover, both studies indicate that LACT can identify several new categories that are based on libraries, architectures, or programming languages, which is a promising improvement as compared to manual categorization and existing techniques.

Proceedings Article•10.1109/MSR.2009.5069484•

Author entropy vs. file size in the GNOME suite of applications

Jason R. Casebolt¹, Jonathan L. Krein¹, Alexander C. MacLean¹, Charles D. Knutson¹, Daniel P. Delorey²•Institutions (2)

Brigham Young University¹, Google²

16 May 2009

TL;DR: The results suggest that when two authors contribute to a file, large files are more likely to have a dominant author than smaller files.

Abstract: We present the results of a study in which author entropy was used to characterize author contributions per file. Our analysis reveals three patterns: banding in the data, uneven distribution of data across bands, and file size dependent distributions within bands. Our results suggest that when two authors contribute to a file, large files are more likely to have a dominant author than smaller files.
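
As a rough illustration of the measure, the sketch below assumes author entropy is the Shannon entropy of each author's share of the lines in a file; the abstract does not spell out the definition, so this is an assumption, and the function name `author_entropy` is hypothetical. Under this definition a two-author file has entropy between 0 bits (one author wrote everything) and 1 bit (an even split), so a dominant author shows up as a low value.

```python
import math
from collections import Counter


def author_entropy(line_authors):
    """Shannon entropy (in bits) of authorship shares for one file.

    line_authors: one author label per line of the file.
    """
    counts = Counter(line_authors)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Each author contributes p * log2(1 / p), where p = count / total.
    return sum((c / total) * math.log2(total / c) for c in counts.values())
```

For example, a file split evenly between two authors gives 1.0 bit, while a file where one author wrote nine of ten lines gives roughly 0.47 bits.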

Proceedings Article•10.1109/MSR.2009.5069483•

Code siblings: Technical and legal implications of copying code between applications

Daniel M. German¹, Massimiliano Di Penta², Yann-Gaël Guéhéneuc³, Giuliano Antoniol³•Institutions (3)

University of Victoria¹, University of Sannio², École Polytechnique de Montréal³

16 May 2009

TL;DR: This paper uses clone detection, license mining and classification, and change history techniques to understand how code siblings—under different licenses—flow in one direction or the other between Linux and two BSD Unixes, FreeBSD and OpenBSD.

Abstract: Source code cloning does not happen within a single system only. It can also occur between one system and another. We use the term code sibling to refer to a code clone that evolves in a different system than the code from which it originates. Code siblings can only occur when the source code copyright owner allows it and when the conditions imposed by its license are not incompatible with the license of the destination system. In some situations copying of source code fragments is allowed, legally, in one direction, but not in the other. In this paper, we use clone detection, license mining and classification, and change history techniques to understand how code siblings under different licenses flow in one direction or the other between Linux and two BSD Unixes, FreeBSD and OpenBSD. Our results show that, in most cases, this migration appears to happen according to the terms of the license of the original code being copied, always favoring copying from less restrictive licenses towards more restrictive ones. We also discovered that sometimes code is inserted into the kernels from an outside source.
