

This page lists the collections described in my publications that are freely available to the public.


Collection Webis-SDMbridge-12 provides the simulation data mining community with a collection of 14,641 bridge models and simulated behavior. Refer to this webpage.


Collection Webis-CPC-11 is a corpus of 7,859 document pairs with a roughly equal balance of paraphrased and non-paraphrased content. Refer to this webpage and the following publication:

  • Steven Burrows, Martin Potthast, and Benno Stein. Paraphrase Acquisition via Crowdsourcing and Machine Learning. ACM Transactions on Intelligent Systems and Technology, 4(3):43:1-43:21, June 2013. 

Coll-A, Coll-T, Coll-P, Coll-PO, Coll-J, and Coll-JO

Collections Coll-A, Coll-T, Coll-P, Coll-PO, Coll-J, and Coll-JO are resources for researchers working in source code authorship attribution and related areas. Descriptions of these collections and the associated experiments are given in the following publications, as summarized here.

All collections:

  • Steven Burrows, Alexandra L. Uitdenbogerd, and Andrew Turpin. Comparing Techniques for Authorship Attribution of Source Code. Software: Practice and Experience, 44(1):1-32, January 2014.
  • Steven Burrows. Source Code Authorship Attribution. PhD thesis, School of Computer Science and Information Technology, RMIT University, Melbourne, Australia, November 2010.

Coll-T only:

  • Steven Burrows, Alexandra L. Uitdenbogerd, and Andrew Turpin. Temporally Robust Software Features for Authorship Attribution. In Tony Hey, Elisa Bertino, Vladimir Getov, and Lin Liu, editors, Proceedings of the Thirty-Third Annual IEEE International Computer Software and Applications Conference, pages 599-606, Seattle, Washington, July 2009.

Coll-A only:

  • Steven Burrows, Alexandra L. Uitdenbogerd, and Andrew Turpin. Application of Information Retrieval Techniques for Source Code Authorship Attribution. In Xiaofang Zhou, Haruo Yokota, Ramamohanarao Kotagiri, and Xuemin Lin, editors, Proceedings of the Fourteenth International Conference on Database Systems for Advanced Applications, pages 699-713, Brisbane, Australia, April 2009.

Coll-A only (early version):

  • Steven Burrows and Seyed M. M. Tahaghoghi. Source Code Authorship Attribution using N-Grams. In Amanda Spink, Andrew Turpin, and Mingfang Wu, editors, Proceedings of the Twelfth Australasian Document Computing Symposium, pages 32-39, Melbourne, Australia, December 2007.

The following resources are provided for this work:

Decompress with "gtar -xzvf filename.gtz". Refer to the README.txt files in the above archives for information on the included files.
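
Where GNU tar is not available, the same extraction can be sketched with Python's standard tarfile module. This is only an illustrative sketch: the function name extract_archive and the destination directory are assumptions made for this example and are not part of the distributed resources.

```python
import sys
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir="."):
    """Extract a gzip-compressed tar archive and return its member names.

    extract_archive is a hypothetical helper for illustration only.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Mode "r:gz" reads a gzip-compressed tar file, matching the -x and -z
    # flags of the gtar command.
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest)
    return names


if __name__ == "__main__":
    # Usage: python extract.py <archive> [destination]
    if len(sys.argv) > 1:
        target = sys.argv[2] if len(sys.argv) > 2 else "."
        for name in extract_archive(sys.argv[1], target):
            print(name)  # Lists each extracted member, similar to -v.
```

As with any archive obtained over the network, it is prudent to extract into an empty directory and inspect the listed member names before using the files.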

The following persistent URL redirects to this web page. Please cite this URL in publications:

The data is also recorded at Research Data Australia: