The project for this class will be to design, build, and evaluate a Native Language Identification (NLI) system. NLI is the task of automatically classifying the native language (L1) of a writer based solely on an essay that the writer wrote in another language. This project will give you exposure to a cutting-edge research area, and experience in building a real NLP system.
For this project, we will use the "TOEFL11 corpus" that was created by ETS and specifically designed to support NLI. The corpus consists of English essays written by non-native speakers during a high-stakes college entrance test. The essays have already been split into three data sets for evaluation purposes: training (TOEFL11-TRAIN), development (TOEFL11-DEV), and test (TOEFL11-TEST).
IMPORTANT: The corpus cannot be distributed to anyone else or used for any purpose other than this class. If you wish to use this data for other purposes, please contact me and I will tell you what you need to do.
Development Phase. You will be given the training and development sets to use in developing your NLI systems. You may use these essays and the correct L1 answers in any way that you wish. The training and development data sets can be found in CourseWeb. WARNING: You will be given the answer keys for the training and development sets, but your system is not allowed to use them when predicting L1! The answer keys are being distributed only to show you what the correct answers should be, and to allow you to evaluate your NLI systems automatically if you wish.
Preliminary Evaluation. Each team will first submit the code for their NLI system and we will run the NLI systems on the essays in TOEFL11-DEV. We will score the accuracy of each system and post the results on CourseWeb. The results of the preliminary evaluation will not count toward your final project grade, but should be useful for assessing your progress and seeing how well your system works compared to others in the class.
Final Evaluation. Each team will hand in the final code for their NLI system. We will evaluate the NLI systems on both TOEFL11-TEST (which you will not have access to in advance), and using 10-fold cross-validation on the union of essays in TOEFL11-TRAIN and TOEFL11-DEV. The purpose of evaluating your systems both ways is to balance specificity with generality. You will have several weeks to try to get your NLI system to perform well using cross-validation. Hopefully, everyone will be able to do fairly well in that evaluation. TOEFL11-TEST will be a blind test set that no one will see until the final evaluation. A system that uses general techniques should work just as well here as in cross-validation. But a system that has lots of hacks and tweaks based on the data you've seen will probably perform more poorly on TOEFL11-TEST.
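The 10-fold cross-validation protocol can be sketched as follows. This is only an illustration of the fold arithmetic using the standard library; the essay-loading and classifier details are placeholders, and the exact splitting procedure we use for grading may differ.

```python
import random

def ten_fold_splits(n_items, seed=0):
    """Partition item indices 0..n_items-1 into 10 folds.

    Each fold serves once as the held-out test set while the other
    nine folds together form the training set, as in standard
    10-fold cross-validation.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::10] for i in range(10)]
    for k in range(10):
        test_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx

# Usage: for 1,000 essays, each of the 10 held-out folds has 100 essays.
sizes = [len(test) for _, test in ten_fold_splits(1000)]
```

Your cross-validation accuracy is then the average of the 10 per-fold accuracies, so a system tuned to one particular train/dev split tends to lose its advantage here.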
The input is a .csv file whose column assignment is the same as that of the index-*.csv files in the training and development data. You may (1) put the corresponding text files in a different path from the default path used in the training and development data, and (2) use either the original or the tokenized texts, but please clarify both choices in the README of your submission.
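A minimal sketch of reading such an index file with the standard csv module is shown below. The actual column assignment is whatever index-*.csv uses; this sketch deliberately makes no assumption about it and simply returns the raw rows for your system to interpret.

```python
import csv

def read_index(index_path):
    """Read an input index .csv and return its rows as lists of strings.

    The column meanings are defined by index-*.csv in the training and
    development data; consult those files rather than this sketch for
    the actual layout.
    """
    with open(index_path, newline="") as f:
        return [row for row in csv.reader(f)]
```

Keeping the reader this generic means it keeps working even if you relocate the essay text files, as permitted above.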
The output is a .csv file with a single column containing the predicted language for each essay. The evaluation will consider precision, recall, and F1 for each of the languages.
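The per-language precision, recall, and F1 scores can be computed as in the stdlib-only sketch below. Note that evaluate.py is the authoritative scorer; this is only to illustrate what the metrics measure, with gold and predicted given as parallel lists of L1 labels.

```python
from collections import Counter

def per_language_prf(gold, predicted):
    """Return {language: (precision, recall, f1)} over parallel label lists.

    For each language L: precision = correct predictions of L / all
    predictions of L; recall = correct predictions of L / all essays
    whose true L1 is L; F1 = harmonic mean of the two.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    scores = {}
    for lang in set(gold) | set(predicted):
        prec = tp[lang] / (tp[lang] + fp[lang]) if tp[lang] + fp[lang] else 0.0
        rec = tp[lang] / (tp[lang] + fn[lang]) if tp[lang] + fn[lang] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[lang] = (prec, rec, f1)
    return scores
```

Because the scores are computed per language, a system that does well on a few L1s but collapses others into one label will show that weakness clearly even if its overall accuracy looks reasonable.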
The performance of each NLI system will be scored using the program evaluate.py. Note that evaluate.py is written for Python 2, so it must be run with a Python 2 interpreter.
Use the same submission instructions as for the homeworks. However, if you use third party applications (see Resources below), talk to Yanbing for additional guidelines.
Ideally, the class will be divided into 3-person or 2-person teams for the project. You may form your own team if you know people with whom you'd like to work. Otherwise, I can randomly assign you to a team. If you really want to work by yourself that is also possible.
Note that we expect a working NLI system by November 15! It might not work well and may still be missing some components that you plan to incorporate, but it should be able to process an essay and produce an L1 prediction. Participation in the preliminary evaluation is mandatory. Failure to participate will result in a 10% deduction from your final project grade. This policy is to ensure that everyone is making adequate progress.
Note that the final grading is on a relative, not an absolute, scale. However, this does not mean that the team with the highest average ranking automatically gets an A (e.g., if the best score was no better than chance performance), or that the lowest scoring team fails. If every team produces a good and interesting system, I will be happy to give every team an A.
NLP is not a solved problem, and effective NLI is HARD! Randomly guessing an L1 would yield only chance-level accuracy (about 9% with 11 languages), so anything higher means that you are doing something right!