Source Code Author Identification Based on N-gram Author Profiles
Georgia Frantzeskou, Efstathios Stamatatos, Stefanos Gritzalis, Sokratis K. Katsikas
- 07 Jun 2006
- pp 508-515
TL;DR: The SCAP approach is language independent, its n-gram author profiles are better able than more complicated features to capture the idiosyncrasies of source code authors, and it deals surprisingly well with cases where only a limited number of very short programs per programmer is available for training.
Abstract: Source code author identification deals with the task of identifying the most likely author of a computer program, given a set of predefined author candidates. This is usually based on the analysis of other program samples of undisputed authorship by the same programmer. There are several cases where the application of such a method could be of major benefit, such as authorship disputes, proof of authorship in court, tracing the source of code left in the system after a cyber attack, etc. We present a new approach, called the SCAP (Source Code Author Profiles) approach, based on byte-level n-gram profiles in order to represent a source code author’s style. Experiments on data sets of different programming languages (Java or C++) and varying difficulty (6 to 30 candidate authors) demonstrate the effectiveness of the proposed approach. A comparison with a previous source code authorship identification study based on more complicated information shows that the SCAP approach is language independent and that n-gram author profiles are better able to capture the idiosyncrasies of the source code authors. Moreover, the SCAP approach is able to deal surprisingly well with cases where only a limited number of very short programs per programmer is available for training. It is also demonstrated that the effectiveness of the proposed model is not affected by the absence of comments in the source code, a condition usually met in cyber-crime cases.
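The abstract names byte-level n-gram profiles as the style representation but does not spell out how profiles are built or compared. The following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes a profile is the set of the most frequent byte n-grams in an author's concatenated training programs, and that similarity is simply the number of n-grams two profiles share. The function names, the n-gram length `n=3`, and the profile size `size=1500` are all hypothetical choices made for illustration.

```python
from collections import Counter


def ngram_profile(source: bytes, n: int = 3, size: int = 1500) -> set:
    """Return the `size` most frequent byte-level n-grams of `source`.

    Assumption: the profile keeps only the n-grams, not their frequencies.
    """
    counts = Counter(source[i:i + n] for i in range(len(source) - n + 1))
    return {gram for gram, _ in counts.most_common(size)}


def profile_similarity(a: set, b: set) -> int:
    """Assumed similarity measure: number of n-grams common to both profiles."""
    return len(a & b)


def attribute(unknown: bytes, training: dict, n: int = 3, size: int = 1500) -> str:
    """Pick the candidate author whose profile overlaps most with `unknown`.

    `training` maps an author name to a list of that author's programs (bytes).
    """
    unknown_profile = ngram_profile(unknown, n, size)
    scores = {}
    for author, samples in training.items():
        # Build one profile per author from all of that author's samples.
        author_profile = ngram_profile(b"\n".join(samples), n, size)
        scores[author] = profile_similarity(author_profile, unknown_profile)
    return max(scores, key=scores.get)


# Toy usage with two made-up authors who format loops differently.
training = {
    "alice": [b"for (int i = 0; i < n; i++) { total += values[i]; }\n"],
    "bob": [b"i=0\nwhile i<n :\n    total=total+values[i]\n    i=i+1\n"],
}
unknown = b"for (int j = 0; j < m; j++) { sum += items[j]; }\n"
print(attribute(unknown, training))
```

Because the profile works on raw bytes, nothing in the sketch depends on the programming language of the samples, which is consistent with the language independence claimed in the abstract. Whitespace and punctuation habits contribute n-grams just as identifiers do.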
Citations
Code Authorship Attribution: Methods and Challenges
TL;DR: This article presents the first comprehensive review of research on code authorship attribution, summarizes the various attribution methods, and highlights challenges in the field.
83
DroidKin: Lightweight Detection of Android Apps Similarity
Hugo Gonzalez, Natalia Stakhanova, Ali A. Ghorbani
- 24 Sep 2014
TL;DR: The appearance of the Android platform and its popularity has resulted in a sharp rise in the number of reported vulnerabilities and consequently in the number of mobile threats. Malware authors commonly plagiarize Android applications, boosting the amount of malware on the markets and consequently the infection rate.
Comparing techniques for authorship attribution of source code
TL;DR: All previous techniques for source code authorship attribution are summarized, feature sets that are motivated by the literature are implemented, and information retrieval ranking methods or machine classifiers are applied for each approach.
68
Application of Information Retrieval Techniques for Source Code Authorship Attribution
Steven Burrows, Alexandra L. Uitdenbogerd, Andrew Turpin
- 16 Mar 2009
TL;DR: This paper explores novel methods for converting C code into documents suitable for retrieval systems, investigates several possible program derivations, and partitions attribution results by original program length to measure effectiveness on modest and lengthy programs separately.
61
Source code authorship analysis for supporting the cybercrime investigation process
Georgia Frantzeskou, Stefanos Gritzalis, Stephen G. MacDonell
- 01 Jan 2004
TL;DR: In this paper, the authors present a set of tools and techniques used to achieve the goal of authorship identification, a review of the research efforts in the area and a new taxonomy on source code authorship analysis.
References
Refereed paper: Authorship analysis: identifying the author of a program
Ivan Krsul, Eugene H. Spafford
TL;DR: The goal is to show that it is possible to identify the author of a program by examining programming style characteristics, and to find a set of characteristics that remain constant for a significant portion of the programs that this programmer might produce.
196
Extraction of Java program fingerprints for software authorship identification
Haibiao Ding, M. H. Samadzadeh
TL;DR: This study investigated using a program written in Visual C++ to extract, from given Java source code, a set of software metrics that could be used as a fingerprint to identify the author of the Java code.
85
Software forensics applied to the task of discriminating between program authors
Stephen G. MacDonell, Andrew R. Gray
- 01 Jan 2001
TL;DR: All of the examined modeling techniques have prediction accuracy rates over 80%, supporting the claim that it is feasible to use such techniques for the task of discriminating program authors based on source-code measurements in a majority of cases.
16
Supporting the Cybercrime Investigation Process: Effective Discrimination of Source Code Authors Based on Byte-Level Information
Georgia Frantzeskou, Efstathios Stamatatos, Stefanos Gritzalis
- 03 Oct 2005
TL;DR: This paper proposes a simplified profile and a new similarity measure that is less complicated than the algorithm followed in text authorship attribution. It seems more suitable for source code identification since it is better able to deal with very small training sets.
15
N-gram-based author profiles for authorship attribution
Vlado Kešelj, Fuchun Peng, Nick Cercone, Calvin Thomas
- 01 Jan 2003
TL;DR: This work presents a novel method for computer-assisted authorship attribution based on character-level n-gram author profiles, which is motivated by an almost-forgotten, pioneering method from 1976.