HOMEWORK 1 (CS 2731 / ISSP 2230)

Assigned: Jan 24, 2017
Due: Feb 9, 2017, 23:59

In this assignment, you will use language modeling to determine the genre of sentences. You will build unigram and trigram language models of English texts, then score a set of sentences with each model, and determine the genre of each sentence based on perplexity. You will critically examine your results.

If you have any questions, please contact the TA via

Email: yuhuan@cs.pitt.edu
Office Hours: Wed 3PM-6PM @ Room 5422 of SENSQ

Tasks

There are three tasks in this homework:

  1. Language modeling: build the following language models for both corpora:
  2. Intrinsic evaluation: evaluate the language models by perplexity.
  3. Extrinsic evaluation: evaluate the language models by the performance of a downstream task that relies on language models. In this homework, the downstream task will be text genre detection.

Files Provided

Download the data and starting code here. The archive contains the following files:

├── data/
│   ├── genre-task/
│   │   ├── gold.txt
│   │   └── mixed-sentences.txt
│   ├── sb/
│   │   ├── train.txt
│   │   └── test.txt
│   └── wsj/
│       ├── train.txt
│       └── test.txt
│
├── accuracy.py
├── train.py
├── test.py
├── log-prob.py
└── run-task.py

You will find 4 Python scripts (train.py, test.py, log-prob.py, and run-task.py) in the root directory that can help you understand the required input/output and start writing your own implementation (see the detailed description in the next section). The other Python script, accuracy.py, is a utility script that you can use to evaluate your genre detection accuracy.

The two corpora provided are as follows:

  - Wall Street Journal (WSJ): newswire text, under data/wsj/
  - Switchboard (SB): transcripts of conversational telephone speech, under data/sb/

For both corpora, the sentences have already been tokenized for you. We now describe exactly what data is used in each task:

Task I: Language Modeling

For this part, you need to train three LMs (as listed above) for both the Wall Street Journal (WSJ) and Switchboard corpora. For WSJ, use data/wsj/train.txt. Similarly, for the Switchboard LMs, use data/sb/train.txt. The script train.py should load the corpus, do the necessary counts, and save any information that is necessary for re-creating the language model for later use. Describe your implementation details in your report.
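As a rough sketch of the counting step (not the required design; the sentence-boundary markers <s> and </s> and the function name here are illustrative assumptions, not part of the assignment):

    from collections import Counter

    def count_ngrams(path, n):
        """Count n-grams in a tokenized corpus (one sentence per line)."""
        counts = Counter()
        with open(path) as f:
            for line in f:
                # Pad with assumed boundary markers; adapt to your own design.
                tokens = ["<s>"] * (n - 1) + line.split() + ["</s>"]
                for i in range(len(tokens) - n + 1):
                    counts[tuple(tokens[i:i + n])] += 1
        return counts

    # e.g. trigram_counts = count_ngrams("data/wsj/train.txt", 3)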

We will use your log-prob.py to check various aspects of your LMs when grading. For example, we may check whether \(P(w|h)\) is normalized for a given \(h\), so be sure to implement log-prob.py too.
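For example, for any fixed history \(h\), the values \(P(w \mid h)\) summed over the whole vocabulary should be numerically close to 1. A minimal self-check along those lines, where prob is a hypothetical function standing in for however your LM exposes \(P(w \mid h)\):

    def check_normalization(prob, vocab, history, tol=1e-6):
        # prob(word, history) is a stand-in for your LM's P(w | h).
        total = sum(prob(w, history) for w in vocab)
        assert abs(total - 1.0) < tol, "P(w|h) does not sum to 1 for h = %r" % (history,)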

Task II: Intrinsic Evaluation (Perplexity)

Implement the computation of perplexity in test.py, then use it to compute the perplexity of your WSJ LMs and Switchboard LMs on their respective test files (data/wsj/test.txt and data/sb/test.txt). Report the perplexity scores for your LMs in the report. Use Eq. (4.15) from the 3rd edition of the textbook to compute perplexity.
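For reference, the standard definition of the perplexity of a test set \(W = w_1 w_2 \ldots w_N\) is

\[
\mathrm{PP}(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}
             = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log P(w_i \mid w_1 \ldots w_{i-1})\right),
\]

and computing it in log space avoids numerical underflow on long test sets.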

Task III: Extrinsic Evaluation (Genre Detection)

The problem of genre detection is defined as follows:

Input: a sentence
Output: whether the sentence is from wsj or sb

The file data/genre-task/mixed-sentences.txt contains sentences from the WSJ and Switchboard corpora, randomly shuffled. You will use a WSJ LM and a Switchboard LM, and compare their perplexity scores on each sentence, to decide which corpus that sentence came from.

Implement run-task.py to perform this task. In your report, describe your decision strategy and report the accuracy of your method.

Input/Output Requirements

⚠️ NOTICE
Read this section very carefully, as we introduce the input and output of the scripts you will submit. Failing to conform to the requirements will result in point deductions.

To students who use Java:
We will describe the input/output requirements using Python 2 as an example. If you use Java, please refer to the end of this document, or consult the TA for help.

The scripts you need to submit are train.py, test.py, log-prob.py, and run-task.py.

The following diagram shows the input and output of each script. The general idea is that train.py is responsible for training a language model and saving it to disk, while the other scripts load that saved model and use it.

This means that when your train.py finishes running, it must dump the LM to some place on the disk. When your other scripts (such as test.py) run, they can fetch the trained LM from the disk. This allows us to train each LM only once, and then do the various kinds of evaluation without having to train the LM again.
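How you serialize the model is entirely up to you. As one possible sketch (an assumption, not a requirement), you could pickle the counts and vocabulary into the model directory:

    import os
    import pickle

    def save_model(model_dir, model_data):
        # model_data: whatever you need to rebuild the LM (counts, vocab, LM type, ...)
        if not os.path.isdir(model_dir):
            os.makedirs(model_dir)
        with open(os.path.join(model_dir, "model.pkl"), "wb") as f:
            pickle.dump(model_data, f)

    def load_model(model_dir):
        with open(os.path.join(model_dir, "model.pkl"), "rb") as f:
            return pickle.load(f)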

An overview of the input/output of the scripts is summarized in the following figure.

Command Line Arguments

NOTE: [A Dummy Implementation]
The code provided includes a dummy implementation as an example of how to implement a language model that meets all the input/output requirements. After reading the following, follow the instructions in the README.md of the provided code to better understand the input/output requirements.

train.py

The script train.py should take the following command line arguments:

  -t : the type of LM to train (for example, 3s for an add-one smoothed trigram)
  -i : the path to the training file
  -m : the directory in which to dump the trained LM

💡 HINT:
You can always run python train.py -h to find out the input requirements. This applies to other scripts as well.

A typical call to this script looks like

python train.py -t 3s -i data/wsj/train.txt -m model_wsj_3s/

Upon executing this line, your script should start to train an add-one smoothed trigram LM (because of -t 3s) using the sentences found at the user-specified path data/wsj/train.txt. Finally, it should dump the trained LM in the directory model_wsj_3s/ so the LM can be re-created quickly by the other scripts later.
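If you do not use the provided starter code and parse the arguments yourself, a minimal sketch with argparse (flag names taken from the example call above) could look like:

    import argparse

    parser = argparse.ArgumentParser(description="Train a language model.")
    parser.add_argument("-t", required=True,
                        help="LM type, e.g. 3s for an add-one smoothed trigram")
    parser.add_argument("-i", required=True, help="path to the training file")
    parser.add_argument("-m", required=True, help="directory to dump the trained LM")
    args = parser.parse_args()
    # ... train the requested LM on args.i and save it under args.m ...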

⚠️ NOTICE
Do not hard-code the paths to the input and output files in your code. It is bad practice to hard-code a string such as /Users/myname/data/input.txt in your code and then ask the grader to change that string.
Always read from the command line arguments. If you use the Python starter code, the command line arguments are parsed and stored in convenient variables for you already.

test.py

The script test.py is responsible for computing the perplexity of a model on a test set. It should take the following command line arguments:

  -m : the directory of a trained LM (produced by train.py)
  -i : the path to the test file
  -o : the path of the output file to which the perplexity is written

A typical call to this script looks like:

python test.py -m model_wsj_3s -i data/wsj/test.txt -o wsj-perplexity.txt

Upon executing this line, your code should fetch the files cached in model_wsj_3s and re-create the LM. It should then use the LM to compute the perplexity on the test sentences found at data/wsj/test.txt. Finally, the script should write the perplexity number to the file wsj-perplexity.txt.
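A minimal sketch of the perplexity loop (sentence_log_prob is a hypothetical helper standing in for however your LM scores a tokenized sentence in natural-log space):

    import math

    def corpus_perplexity(sentence_log_prob, test_path):
        total_log_prob = 0.0
        total_tokens = 0
        with open(test_path) as f:
            for line in f:
                tokens = line.split()
                total_log_prob += sentence_log_prob(tokens)
                total_tokens += len(tokens) + 1  # +1 if you count an end-of-sentence marker
        return math.exp(-total_log_prob / total_tokens)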

log-prob.py

The script log-prob.py should take the following command line arguments:

  -m : the directory of a trained LM (produced by train.py)
  -i : the path to a file listing the n-grams to score
  -o : the path of the output file to which the log probabilities are written

A typical call to this script looks like

python log-prob.py -m model_wsj_3s/ -i data/wsj/trigrams-to-check.txt -o log-probs.txt

Upon executing this, your script should load the model in model_wsj_3s/, read in the list of n-grams in data/wsj/trigrams-to-check.txt, and then output the log probabilities to the file log-probs.txt.
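A minimal sketch of that read/compute/write loop, assuming one space-separated n-gram per input line and a hypothetical log_prob function (check the provided dummy implementation for the exact formats expected):

    def write_log_probs(log_prob, ngram_path, out_path):
        # log_prob(history, word) is a stand-in for your LM's log P(w | h).
        with open(ngram_path) as fin, open(out_path, "w") as fout:
            for line in fin:
                ngram = line.split()
                if not ngram:
                    continue
                history, word = tuple(ngram[:-1]), ngram[-1]
                fout.write("%f\n" % log_prob(history, word))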

run-task.py

The script run-task.py is for running the genre detection task. It should take the following command line arguments:

  --wsjmodel : the directory of a trained WSJ LM
  --sbmodel : the directory of a trained Switchboard LM
  -i : the path to the file of mixed sentences
  -o : the path of the output answer file

A typical call to this script would be

python run-task.py --wsjmodel model_wsj_3s/ --sbmodel model_sb_3s/ -i data/genre-task/mixed-sentences.txt -o answer.txt

Upon execution, your script should load the two language models from model_wsj_3s/ and model_sb_3s/, and use them to decide, for each sentence in data/genre-task/mixed-sentences.txt, which corpus it came from. Finally, the script should write the answers to the file answer.txt.
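One straightforward decision rule (a sketch of one possible strategy, not necessarily the best one) is to score each sentence with both LMs and pick the corpus whose model yields the lower perplexity. The helper functions and the output label format below are assumptions; match whatever format gold.txt uses:

    def classify_sentences(wsj_perplexity, sb_perplexity, input_path, output_path):
        # wsj_perplexity / sb_perplexity: hypothetical functions giving each LM's
        # perplexity on a list of tokens.
        with open(input_path) as fin, open(output_path, "w") as fout:
            for line in fin:
                tokens = line.split()
                # Lower perplexity = that LM fits the sentence better.
                label = "wsj" if wsj_perplexity(tokens) < sb_perplexity(tokens) else "sb"
                fout.write(label + "\n")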

Utility Scripts

accuracy.py

This script is already implemented for you. Use it to evaluate the performance of your genre detection implementation.

To use this script, you must run run-task.py first. Let answer.txt be the path to the answers produced by your implementation; then call

python accuracy.py -g data/genre-task/gold.txt -a answer.txt

The script will print the accuracy on the screen, in a format like the following:

Accuracy = 3/4 (75%)

Handling OOV Words

All the language models you implement should be open-vocabulary. This means that, to handle out-of-vocabulary (OOV) words, you need to do the following:

Please use <unk> and not <UNK> or other variations in your implementation, as we will check the probabilities your LM learned for <unk>.
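One common open-vocabulary scheme (an assumption here, not the only acceptable one) is to replace rare training words with <unk> when estimating the model, and to map any unseen test word to <unk> at scoring time:

    from collections import Counter

    def build_vocab(train_path, min_count=2):
        # Words seen fewer than min_count times (a threshold chosen for illustration)
        # are treated as <unk>.
        counts = Counter()
        with open(train_path) as f:
            for line in f:
                counts.update(line.split())
        return set(w for w, c in counts.items() if c >= min_count)

    def map_oov(tokens, vocab):
        return [w if w in vocab else "<unk>" for w in tokens]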

Java Implementations

If you use Java, it is recommended to submit a single jar that has 4 main classes, so the four script invocations become:

java -cp your.jar your.code.Train -t 3s -i data/wsj/train.txt -m model_wsj_3s/
java -cp your.jar your.code.Test -m model_wsj_3s -i data/wsj/test.txt -o wsj-perplexity.txt
java -cp your.jar your.code.LogProb -m model_wsj_3s/ -i data/wsj/trigrams-to-check.txt -o log-probs.txt
java -cp your.jar your.code.RunTask --wsjmodel model_wsj_3s/ --sbmodel model_sb_3s/ -i data/genre-task/mixed-sentences.txt -o answer.txt

In this case, you will need to implement the command line argument parsing by yourself. Again, consult the TA when in doubt.

Grading

The maximum possible score for this homework is 100 points, consisting of:

Submission

Submit an archive containing: