
Summary

This page describes in detail the research interests that are summarised on my research page.


Broad Fields

Information Retrieval, in my work, covers both adversarial and applied information retrieval. Adversarial Information Retrieval applies the science of search to managing the often huge volumes of content involved in dealing with criminality and fraud. Search engines and advanced index data structures play a big part in this domain. Adversarial topics within information retrieval include assisting with academic misconduct investigations, detecting copyright infringement, identifying authors of malicious software, and resolving authorship disputes. Applied Information Retrieval involves solving real-world information problems with practical solutions. For example, the Zettair search engine implements well-known information retrieval similarity measures such as Okapi BM25 and language models on top of highly scalable index structures. In previous work, this search engine has been extended to demonstrate efficient plagiarism detection and effective authorship attribution for source code.
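
As an illustration of the kind of similarity measure mentioned above, the following is a minimal sketch of Okapi BM25 scoring over a small in-memory collection. It is not Zettair's implementation: the function name `bm25_scores`, the parameter defaults `k1=1.2` and `b=0.75`, and the non-negative form of the inverse document frequency are assumptions chosen for illustration, and the collection is assumed to be non-empty and already tokenised.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenised document in `docs` against the tokenised `query`.

    Hypothetical helper for illustration; assumes `docs` is a non-empty
    list of non-empty token lists.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Assumed IDF variant with "+1" inside the log, so it is never negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalisation.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    ["plagiarism", "detection", "source", "code"],
    ["search", "engine", "index"],
    ["source", "code", "authorship"],
]
scores = bm25_scores(["plagiarism", "detection"], docs)
```

Only the first document contains the query terms, so it is the only one with a non-zero score. A production search engine would compute the same quantities from an inverted index rather than scanning every document.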

Data Mining refers to finding patterns in large data sets. This can involve clustering data into groups, classifying data as part of a group based upon patterns in existing groups, finding functions to model patterns in data, and identifying common rules. Much data mining research has been conducted on text documents, but the same techniques can also be applied to non-text data, as in the Matilda project.

Machine Learning uses computers to find patterns in data to assist with decision making. A typical machine learning experiment involves training a classification algorithm using existing known cases to make decisions about new unseen cases. Machine learning has many applications including machine perception, authorship misrepresentation, and learning overlap optimization.
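
The train-then-classify workflow described above can be sketched with a nearest-centroid classifier, one of the simplest possible choices. This is an illustrative assumption rather than a method used in any particular project mentioned on this page; the function names `train_centroids` and `classify` and the toy data are hypothetical.

```python
def train_centroids(samples, labels):
    """Learn one centroid (mean feature vector) per class from known cases."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def classify(centroids, x):
    """Assign an unseen case to the class with the closest centroid."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(centroids, key=lambda y: sq_dist(centroids[y], x))

# Toy training data: two well-separated classes in two dimensions.
train_x = [[0, 0], [0, 1], [5, 5], [6, 5]]
train_y = ["a", "a", "b", "b"]
centroids = train_centroids(train_x, train_y)
prediction = classify(centroids, [0.2, 0.3])
```

In a real experiment the known cases would be split into training and held-out test sets so that the decisions on unseen cases can be evaluated.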

Natural Language Processing involves the use of computers to better understand natural languages. The field is closely related to computational linguistics. An example is paraphrasing student assignments, newswire articles, and free books for the purposes of simulating plagiarism. Another example is grading essays and short answer responses to exam and test questions with automated methods.


Specific Topics

Plagiarism Detection involves finding cases where the work of others is passed off as one's own, either deliberately or in ignorance. Natural language and source code solutions exist to address this recurring problem in both academia and industry. Algorithms to detect plagiarism need to identify chunks of content copied either verbatim or with local modifications.
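
One common way to find copied chunks, sketched here under simplifying assumptions, is to compare overlapping word n-grams (shingles) between a suspect text and a candidate source. The helper names `shingles` and `containment`, the choice of `n=3`, and whitespace tokenisation are illustrative assumptions, not a description of any specific system mentioned on this page.

```python
def shingles(text, n=3):
    """Return the set of overlapping word n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def containment(suspect, source, n=3):
    """Fraction of the suspect's n-grams that also occur in the source."""
    s = shingles(suspect, n)
    if not s:
        return 0.0
    return len(s & shingles(source, n)) / len(s)

source = "the quick brown fox jumps over the lazy dog"
suspect = "the quick brown fox leaps over the lazy dog"  # one word changed
score = containment(suspect, source)
```

A verbatim copy scores 1.0, while a single substituted word removes only the n-grams that span it, so locally modified copies still score well above unrelated text.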

Authorship Attribution involves identifying the author of work samples of unknown or uncertain authorship. To do this, previous samples of work by candidate authors are examined for stylistic traits consistent with the sample under contention, so authorship attribution is often treated as synonymous with stylometry. Authorship attribution solutions can help to deter academic misconduct, since a change in writing style could reveal assignments completed by external tutors for hire. Authorship attribution has been applied to both natural language and source code.
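
A minimal sketch of this idea compares character n-gram frequency profiles of the disputed sample against known samples from each candidate author and picks the closest match by cosine similarity. This is one simple stylometric technique chosen for illustration, not the method used in the previous work referred to on this page; the names `profile`, `cosine`, and `attribute`, the choice of `n=3`, and the sample snippets are hypothetical.

```python
import math
from collections import Counter

def profile(text, n=3):
    """Count overlapping character n-grams as a crude style profile."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(p, q):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(p[g] * q[g] for g in p if g in q)
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def attribute(disputed, known_samples, n=3):
    """Return the candidate author whose known sample is most similar."""
    target = profile(disputed, n)
    return max(known_samples, key=lambda a: cosine(target, profile(known_samples[a], n)))

# Hypothetical known samples from two candidate authors.
known = {
    "alice": "for (int i = 0; i < n; i++) { total += values[i]; }",
    "bob": "while idx < len(items):\n    total = total + items[idx]\n    idx = idx + 1",
}
disputed = "for (int j = 0; j < m; j++) { sum += data[j]; }"
author = attribute(disputed, known)
```

With realistic data, each candidate would be represented by many samples, and the features would typically be chosen to capture style rather than topic.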

Digital Engineering refers to applying computer science expertise to solve engineering problems. Examples include simulation data mining for predicting behavior of simulated civil engineering models, or applying machine learning to learning overlap optimization for domain decomposition methods.

Crowdsourced Paraphrasing refers to the use of Mechanical Turk or other crowdsourcing platforms for outsourcing work to generate paraphrased samples of various texts. Such samples have been used in the PAN-PC-10 plagiarism corpus as part of the Second International Competition on Plagiarism Detection in 2010. These samples in particular allow researchers to evaluate their intrinsic plagiarism detection algorithms.

Experiment Software is useful for making information retrieval experiments reproducible. With such a system, experiments can be published on the web and co-located with publications. Small groups can collaborate more easily, the research gains visibility, and new communities can develop, such as the PAN competition series, which uses the TIRA experiment software.

Automatic Free Text Grading can be applied to automate the assessment of short answer and essay questions for tests and exams. For summative assessment, automatic grading methods can eliminate human error, fatigue, bias, and ordering effects. There are also broader applications in e-learning and intelligent tutoring systems for formative assessment. Concerning effectiveness, various systems are known to be very competitive with manual grading.

Online Assessment Software is particularly important for large class teaching and courses with several staff for maintaining consistency and developing efficiency with assessment practices. Previous work in this area has dealt with evaluating many collaborative and semi-automated marking and feedback systems that facilitate online assessment. The work has included user analysis testing and focus group discussions for short-listing candidate solutions and identifying features of interest for RMIT University.

Computer Science Education research in previous work has involved reporting on teaching approaches for managing large class teaching in a complex setting: a course spread across two campuses and taught by multiple staff. Continuing to advance effective teaching practices is vitally important for tomorrow's information technology professionals and computer science researchers.