The project for this class will be to design, build, and evaluate a Native Language Identification (NLI) system. NLI is the task of automatically classifying the native language (L1) of a writer based solely on an essay that the writer wrote in another language. This project will give you exposure to a cutting-edge research area, and experience in building a real NLP system.
For this project, we will use the "TOEFL11 corpus" that was created by ETS and specifically designed to support NLI. The corpus consists of English essays written by non-native speakers during a high-stakes college entrance test. The essays have already been split into three data sets for evaluation purposes: training (TOEFL11-TRAIN), development (TOEFL11-DEV), and test (TOEFL11-TEST).
IMPORTANT: The corpus cannot be distributed to anyone else or used for any purpose other than this class. If you wish to use this data for other purposes, please contact me and I will tell you what you need to do.
Development Phase. You will be given the training and development sets to use in developing your NLI systems. You may use these essays and the correct L1 answers in any way that you wish. The training and development data sets can be found in CourseWeb. WARNING: You will be given the answer keys for the training and development sets, but your system is not allowed to use them when predicting L1! The answer keys are being distributed only to show you what the correct answers should be, and to allow you to evaluate your NLI systems automatically if you wish.
Preliminary Evaluation. Each team will first submit the code for their NLI system and we will run the NLI systems on the essays in TOEFL11-DEV. We will score the accuracy of each system and post the results on CourseWeb. The results of the preliminary evaluation will not count toward your final project grade, but should be useful for assessing your progress and seeing how well your system works compared to others in the class.
Final Evaluation. Each team will hand in the final code for their NLI system. We will evaluate the NLI systems on both TOEFL11-TEST (which you will not have access to in advance), and using 10-fold cross-validation on the union of essays in TOEFL11-TRAIN and TOEFL11-DEV. The purpose of evaluating your systems both ways is to balance specificity with generality. You will have several weeks to try to get your NLI system to perform well using cross-validation. Hopefully, everyone will be able to do fairly well in that evaluation. TOEFL11-TEST will be a blind test set that no one will see until the final evaluation. A system that uses general techniques should work just as well here as in cross-validation. But a system that has lots of hacks and tweaks based on the data you've seen will probably perform more poorly on TOEFL11-TEST.
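The 10-fold cross-validation protocol can be sketched as follows. This is only an illustration of the fold arithmetic using the standard library; the essay-loading and classifier details are placeholders, and the exact splitting procedure we use for grading may differ.

```python
import random

def ten_fold_splits(n_items, seed=0):
    """Partition item indices 0..n_items-1 into 10 folds.

    Each fold serves once as the held-out test set while the other
    nine folds together form the training set, as in standard
    10-fold cross-validation.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::10] for i in range(10)]
    for k in range(10):
        test_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx

# Usage: for 1,000 essays, each of the 10 held-out folds has 100 essays.
sizes = [len(test) for _, test in ten_fold_splits(1000)]
```

Your cross-validation accuracy is then the average of the 10 per-fold accuracies, so a system tuned to one particular train/dev split tends to lose its advantage here.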
The input is a .csv file whose column assignment is the same as that of the index-*.csv files in the training and development data. You may (1) put the corresponding text files in a different path from the default path used in the training and development data, and (2) use either the original or the tokenized texts, but please clarify both choices in the README of your submission.
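A minimal sketch of reading such an index file with the standard csv module is shown below. The actual column assignment is whatever index-*.csv uses; this sketch deliberately makes no assumption about it and simply returns the raw rows for your system to interpret.

```python
import csv

def read_index(index_path):
    """Read an input index .csv and return its rows as lists of strings.

    The column meanings are defined by index-*.csv in the training and
    development data; consult those files rather than this sketch for
    the actual layout.
    """
    with open(index_path, newline="") as f:
        return [row for row in csv.reader(f)]
```

Keeping the reader this generic means it keeps working even if you relocate the essay text files, as permitted above.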
The output is a .csv file with a single column containing the predicted language for each essay. The evaluation will consider precision, recall, and F1 for each of the languages.
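The per-language precision, recall, and F1 scores can be computed as in the stdlib-only sketch below. Note that evaluate.py is the authoritative scorer; this is only to illustrate what the metrics measure, with gold and predicted given as parallel lists of L1 labels.

```python
from collections import Counter

def per_language_prf(gold, predicted):
    """Return {language: (precision, recall, f1)} over parallel label lists.

    For each language L: precision = correct predictions of L / all
    predictions of L; recall = correct predictions of L / all essays
    whose true L1 is L; F1 = harmonic mean of the two.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    scores = {}
    for lang in set(gold) | set(predicted):
        prec = tp[lang] / (tp[lang] + fp[lang]) if tp[lang] + fp[lang] else 0.0
        rec = tp[lang] / (tp[lang] + fn[lang]) if tp[lang] + fn[lang] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[lang] = (prec, rec, f1)
    return scores
```

Because the scores are computed per language, a system that does well on a few L1s but collapses others into one label will show that weakness clearly even if its overall accuracy looks reasonable.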
The performance of each NLI system will be scored using the program evaluate.py. Note that evaluate.py is written for Python 2, so it must be run with a Python 2 interpreter.
Use the same submission instructions as for the homeworks. However, if you use third party applications (see Resources below), talk to Yanbing for additional guidelines.
Ideally, the class will be divided into 3-person or 2-person teams for the project. You may form your own team if you know people with whom you'd like to work. Otherwise, I can randomly assign you to a team. If you really want to work by yourself that is also possible.
Note that we expect a working NLI system by November 15! It might not work well and may still be missing some components that you plan to incorporate, but it should be able to process an essay and produce an L1 prediction. Participation in the preliminary evaluation is mandatory. Failure to participate will result in a 10% deduction from your final project grade. This policy is to ensure that everyone is making adequate progress.
Note that the final grading is on a relative, not an absolute, scale. However, this does not mean that the team with the highest average ranking automatically gets an A (e.g., if the best score was no better than chance performance), or that the lowest scoring team fails. If every team produces a good and interesting system, I will be happy to give every team an A.
NLP is not a solved problem, and effective NLI is HARD! Randomly guessing an L1 would yield only chance-level accuracy (about 9% with 11 languages), so anything higher means that you are doing something right!