Doctor of Philosophy

Summary

I commenced my PhD at RMIT city campus in February 2007, submitted the thesis in November 2010, and had the thesis passed subject to minor amendments in February 2011. The final version of my thesis was submitted in April 2011. I graduated on 14 December 2011.

Highlights include chairing the 2007 CS&IT Research Students Conference (CRSC) and four awards for various conference papers and presentations. Publications can be found on my publications page. More information about the current Doctor of Philosophy program can be found here.

Above: Search Engines Group barbecue 2007: Milad, Sarvnaz, Nik, Falk, Steven, Ying, Jelita and Ranjan.

Achievements & Activities

  • Best Student Paper Award (Computer Software & Applications Conference, IEEE, 2009).
  • Best Engineering Oral Presentation Honourable Mention (College of Science, Engineering and Health Higher Degree by Research Student Conference, RMIT University, 2009).
  • Best Paper Award Second Place (Computer Science and Information Technology Research Student Conference, RMIT University, 2008).
  • Best First Year Paper Award (Computer Science and Information Technology Research Student Conference, RMIT University, 2007).
  • Australian Postgraduate Award scholarship for PhD research (RMIT University, 2007).
  • Chair of the RMIT University Computer Science Research Student Conference 2007.
  • Expert witness in a software copyright infringement legal case.
  • Maintenance and improvement of research group computational resources.
  • Informal mentor for less experienced research students in my research group.
  • Reviewing work.

Thesis

Title
Source Code Authorship Attribution

Abstract
To attribute authorship means to identify the true author among many candidates for samples of work of unknown or contentious authorship. Authorship attribution is a prolific research area for natural language, but much less so for source code, with eight other research groups having published empirical results concerning the accuracy of their approaches to date. Authorship attribution of source code is the focus of this thesis.

We first review, reimplement, and benchmark all existing published methods to establish a consistent set of accuracy scores. This is done using four newly constructed and significant source code collections comprising samples from academic sources, freelance sources, and multiple programming languages. The collections developed are the most comprehensive to date in the field.

We then propose a novel information retrieval method for source code authorship attribution. In this method, source code features from the collection samples are tokenised, converted into n-grams, and indexed for stylistic comparison to query samples using the Okapi BM25 similarity measure. Authorship of the top ranked sample is used to classify authorship of each query, and the proportion of times that this is correct determines overall accuracy. The results show that this approach is more accurate than the best approach from the previous work for three of the four collections.
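The retrieval pipeline described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the regex tokeniser, the n-gram length of 3, the parameter values k1 = 1.2 and b = 0.75, the simplified query-term weighting, and all names (tokenise, ngrams, BM25Index, accuracy) are assumptions chosen for illustration only.

```python
import math
import re
from collections import Counter


def tokenise(source):
    # Assumed feature extraction: identifiers/keywords, integer literals,
    # and any other non-whitespace character as a single token.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)


def ngrams(tokens, n=3):
    # Overlapping token n-grams, each joined into one index term.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


class BM25Index:
    """Indexes (author, source) samples and ranks them against a query."""

    def __init__(self, samples, n=3, k1=1.2, b=0.75):
        self.n, self.k1, self.b = n, k1, b
        self.authors = [author for author, _ in samples]
        self.docs = [Counter(ngrams(tokenise(src), n)) for _, src in samples]
        self.lengths = [sum(doc.values()) for doc in self.docs]
        # Guard against a zero average when every sample is shorter than n.
        self.avg_len = (sum(self.lengths) / len(self.docs)) or 1.0
        # Document frequency: number of samples containing each n-gram.
        self.df = Counter()
        for doc in self.docs:
            self.df.update(doc.keys())

    def score(self, query, i):
        # BM25 score of indexed sample i for a query given as n-gram counts.
        # Query term frequency is used as a plain multiplier, which is a
        # simplification of the full Okapi query-weighting component.
        doc, total, n_docs = self.docs[i], 0.0, len(self.docs)
        for term, qtf in query.items():
            tf = doc.get(term, 0)
            if tf == 0:
                continue
            df = self.df[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            norm = tf + self.k1 * (1 - self.b + self.b * self.lengths[i] / self.avg_len)
            total += qtf * idf * tf * (self.k1 + 1) / norm
        return total

    def attribute(self, source):
        # Classify the query as the author of the top-ranked indexed sample.
        query = Counter(ngrams(tokenise(source), self.n))
        best = max(range(len(self.docs)), key=lambda i: self.score(query, i))
        return self.authors[best]


def accuracy(index, labelled_queries):
    # Proportion of queries whose top-ranked sample has the true author.
    correct = sum(index.attribute(src) == author for author, src in labelled_queries)
    return correct / len(labelled_queries)
```

In this sketch, an index is built once from the labelled collection with BM25Index(samples), and accuracy(index, queries) reports the proportion of query samples attributed to their true author. In a real evaluation the query samples would be kept out of the index.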

The accuracy of the new method is then explored in the context of author style evolving over time, by experimenting with a collection of student programming assignments that spans three semesters with established relative timestamps. We find that it takes one full semester for individual coding styles to stabilise, which is essential knowledge for ongoing authorship attribution studies and quality control in general.

We conclude the research by extending both the new information retrieval method and previous methods to provide a complete set of benchmarks for advancing the field. In the final evaluation, we show that the n-gram approaches are leading the field, with accuracy scores for some collections around 90% for a one-in-ten classification problem.