# HOMEWORK 1 (CS 2731 / ISSP 2230)

Assigned: Jan 24, 2017
Due: Feb 9, 2017, 23:59

In this assignment, you will use language modeling to determine the genre of sentences. You will build unigram and trigram language models of English texts, then score a set of sentences with each, and determine the genre of the sentence based on perplexity. You will critically examine your results.

If you have any questions, please contact the TA via

Email: yuhuan@cs.pitt.edu
Office Hours: Wed 3PM-6PM @ Room 5422 of SENSQ

There are three tasks in this homework:

1. Language modeling: build the following language models for both corpora:
• Unsmoothed unigram
• Unsmoothed trigram
• Smoothed trigram
2. Intrinsic evaluation: evaluate the language models by perplexity.
3. Extrinsic evaluation: evaluate the language models by the performance of a downstream task that relies on language models. In this homework, the downstream task will be text genre detection.

## Files Provided

Download the data and starting code here. The archive contains the following files:

├── data/
│   ├── genre-task/
│   │   ├── gold.txt
│   │   └── mixed-sentences.txt
│   ├── sb/
│   │   ├── train.txt
│   │   └── test.txt
│   └── wsj/
│       ├── train.txt
│       └── test.txt
│
├── accuracy.py
├── train.py
├── test.py
├── log-prob.py
└── run-task.py

You will find 4 Python scripts (train.py, test.py, log-prob.py, and run-task.py) in the root directory. They will help you understand the required input/output and give you a starting point for your own implementation (see the detailed descriptions in the next section). The other Python script, accuracy.py, is a utility script that you can use to evaluate your genre detection accuracy.

The two corpora provided are as follows:

• The Wall Street Journal (WSJ): news text. We provide 2000 sentences for training, 500 sentences for testing.
• Switchboard: transcripts of speech. We provide 2000 sentences for training, 500 sentences for testing.

For both corpora, the sentences have already been tokenized for you. We now describe exactly what data will be used in each task:

### Task I: Language Modeling

For this part, you need to train three LMs (as listed above) for both the Wall Street Journal (WSJ) and Switchboard corpora. For WSJ, use data/wsj/train.txt. Similarly, for the Switchboard LMs, use data/sb/train.txt. The script train.py should load the corpus, do the necessary counts, and save any information that is necessary for re-creating the language model for later use. Describe your implementation details in your report.
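As a concrete starting point, here is a minimal sketch of the counting step, assuming whitespace-tokenized sentences (one per line) and sentence-boundary padding; the helper name count_ngrams and the <s>/</s> markers are illustrative choices, not requirements of the starter code:

```python
from collections import Counter

def count_ngrams(path, n):
    """Count n-grams and their (n-1)-gram contexts in a tokenized file."""
    ngram_counts, context_counts = Counter(), Counter()
    with open(path) as f:
        for line in f:
            # Pad with boundary markers so every word has a full history.
            tokens = ["<s>"] * (n - 1) + line.split() + ["</s>"]
            for i in range(len(tokens) - n + 1):
                ngram = tuple(tokens[i:i + n])
                ngram_counts[ngram] += 1
                context_counts[ngram[:-1]] += 1
    return ngram_counts, context_counts
```

The unsmoothed MLE estimate is then $$P(w|h) = C(h,w) / C(h)$$; for the add-one trigram model, $$P(w|h) = (C(h,w) + 1) / (C(h) + |V|)$$, where $$|V|$$ is the vocabulary size.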

We will use your log-prob.py to check various aspects of your LMs during grading. For example, we may check whether $$P(w|h)$$ is normalized for a given $$h$$, so be sure to implement log-prob.py as well.
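Since grading checks normalization, it is worth running the same check yourself. A small sketch, where log_prob stands for whatever helper your own log-prob.py uses (the name and signature here are hypothetical):

```python
import math

def check_normalization(log_prob, history, vocab, tol=1e-6):
    """Verify that P(w | history) sums to 1 over the whole vocabulary."""
    # log_prob(ngram) -> natural-log probability (hypothetical helper).
    total = sum(math.exp(log_prob(history + (w,))) for w in vocab)
    assert abs(total - 1.0) < tol, "P(.|h) sums to %r, not 1" % total
```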

### Task II: Intrinsic Evaluation (Perplexity)

Implement the computation of perplexity in test.py, then use it to compute the perplexity of your WSJ LMs and Switchboard LMs on their respective test files (data/wsj/test.txt and data/sb/test.txt). Report the perplexity scores of your LMs in the report. Use Eq. (4.15) from the 3rd edition of the textbook to compute perplexity.
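For reference, perplexity is the inverse probability of the test set, normalized by the number of tokens: for a test set $$W = w_1 w_2 \ldots w_N$$,

$$\mathrm{PP}(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-n+1} \ldots w_{i-1})}}$$

In practice, compute it in log space to avoid numerical underflow: $$\mathrm{PP}(W) = \exp\big(-\frac{1}{N} \sum_{i=1}^{N} \ln P(w_i \mid w_{i-n+1} \ldots w_{i-1})\big)$$.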

### Task III: Extrinsic Evaluation (Genre Detection)

The problem of genre detection is defined as follows:

Input: a sentence
Output: whether the sentence is from wsj or sb

The file data/genre-task/mixed-sentences.txt contains sentences from the WSJ and Switchboard corpora, randomly shuffled. You will use a WSJ LM and a Switchboard LM, and their perplexity scores on each sentence, to decide which corpus the sentence came from.

Implement run-task.py to perform this task. In your report, describe your decision strategy and the performance (accuracy) of your method.
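A minimal sketch of one reasonable decision rule, assuming a perplexity helper like the one in your test.py (the name and signature here are hypothetical): label each sentence with the genre whose LM assigns it the lower perplexity.

```python
def detect_genre(sentence, wsj_model, sb_model, perplexity):
    """Return 'wsj' or 'sb' for one sentence.

    perplexity(model, sentence) is your own helper (hypothetical signature).
    """
    wsj_pp = perplexity(wsj_model, sentence)
    sb_pp = perplexity(sb_model, sentence)
    return "wsj" if wsj_pp < sb_pp else "sb"
```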

## Input/Output Requirements

⚠️ NOTICE
Read this section very carefully, as we introduce the input and output of the scripts you will submit. Failing to conform to the requirements will result in point deductions.

To students who use Java:
We will describe the input/output requirements using Python 2 as an example. If you use Java, please refer to the end of this document, or consult the TA for help.

The scripts you need to submit are:

• train.py: trains a language model. It should be able to handle training any of the three LMs on either of the two corpora.
• test.py: intrinsically evaluates a trained language model using perplexity.
• log-prob.py: computes log-probabilities given a trained model and a list of n-grams.
• run-task.py: runs the genre detection task. You may use the provided accuracy.py to evaluate the performance of your implementation.

The following diagram shows the input and output of each script. The general idea is that train.py is responsible for

• Training the LM, and
• Saving to a directory everything related to the LM that is necessary to re-create the LM when the other scripts need the LM.

This means that when your train.py finishes running, it must dump the LM to some place on disk. When your other scripts (such as test.py) run, they can fetch the trained LM from the disk. This allows us to train each LM only once, and then perform the various kinds of evaluation without having to train the LM again.

An overview of the input/output of the scripts is summarized in the following figure:

[Figure: input/output overview of train.py, test.py, log-prob.py, and run-task.py]

### Command Line Arguments

NOTE: [A Dummy Implementation]
The code provided includes a dummy implementation as an example of how to implement a language model that meets all the input/output requirements. After reading the following, follow the instructions in the README.md of the provided code to better understand the input/output requirements.

#### train.py

The script train.py should take the following command line arguments:

• --type <TYPE> or -t <TYPE>: the type of the model to train, where <TYPE> can be 1, 3, or 3s (meaning unsmoothed unigram LM, unsmoothed trigram LM, trigram LM with add-one smoothing, respectively).
• --input <FILE> or -i <FILE>: the path to the tokenized training text file. Your implementation can safely assume that <FILE> is already tokenized. An example of the training file is as follows:

this is the first sentence .
the is n't the first sentence .
this ca n't be the first sentence .
• --model <DIR> or -m <DIR>: the output directory of this script. Save in the directory <DIR> any files you think are necessary for re-creating the language model. We ask you to cache all information about the trained language model so that the LM does not have to be trained over and over in later use. Typically, you will need to save:

• The vocabulary file
• Raw count files: the counts of n-grams, the counts of contexts
• Smoothed counts files: the smoothed counts (or you can skip this, and compute them on the fly)
• A property file saving information such as the type of the LM, whether it is smoothed, etc.

We do not care about the format of the files your train.py saves; they can be anything, as long as your other scripts (i.e., log-prob.py, test.py, and run-task.py) can read them. You may use serialization tools such as pickle (which is built into Python), or you may simply save text files in a format you define.
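For instance, a minimal persistence sketch using pickle; the file name model.pkl and the dictionary layout are illustrative choices, not requirements:

```python
import os
import pickle

def save_model(model_dir, model):
    """Dump everything needed to re-create the LM into model_dir."""
    if not os.path.exists(model_dir):
        os.makedirs(model_dir)
    with open(os.path.join(model_dir, "model.pkl"), "wb") as f:
        # model could be, e.g., {"type": "3s", "ngrams": ..., "contexts": ..., "vocab": ...}
        pickle.dump(model, f)

def load_model(model_dir):
    """Re-create the LM saved by save_model (called from the other scripts)."""
    with open(os.path.join(model_dir, "model.pkl"), "rb") as f:
        return pickle.load(f)
```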

💡 HINT:
You can always run python train.py -h to find out the input requirements. This applies to other scripts as well.

A typical call to this script looks like

python train.py -t 3s -i data/wsj/train.txt -m model_wsj_3s/

Upon executing this line, your script should train an add-one smoothed trigram LM (because of -t 3s) using the sentences found at the user-specified path data/wsj/train.txt. Finally, it should dump the trained LM into the directory model_wsj_3s/ so the LM can be quickly re-created later by the other scripts.

⚠️ NOTICE
Do not hard-code the paths to the input and output files in your code. It is bad practice to hard-code a string such as /Users/myname/data/input.txt in your code and then ask the grader to change it.
Always read from the command line arguments. If you use the Python starter code, the command line arguments are parsed and stored in convenient variables for you already.
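If you are not starting from the provided code, a minimal sketch of the argument parsing for train.py using the standard argparse module (variable names are illustrative):

```python
import argparse

parser = argparse.ArgumentParser(description="Train a language model.")
parser.add_argument("--type", "-t", required=True, choices=["1", "3", "3s"],
                    help="unsmoothed unigram, unsmoothed trigram, or add-one trigram")
parser.add_argument("--input", "-i", required=True,
                    help="path to the tokenized training text file")
parser.add_argument("--model", "-m", required=True,
                    help="output directory for the trained LM")
args = parser.parse_args()
# Use args.type, args.input, args.model -- never hard-coded paths.
```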

#### test.py

The script test.py is responsible for computing the perplexity of a model on a test set. It should take the following command line arguments:

• --model <DIR> or -m <DIR>: the directory containing the model files. Your implementation will read the files you saved earlier to recreate the language model.
• --input <FILE> or -i <FILE>: the path to the text file containing testing sentences. This will have exactly the same format as the input file to train.py.
• --output <FILE> or -o <FILE>: the path of the output text file to which the perplexity of the test sentences will be written. The output file should contain just one line with the perplexity number:

2.123456789

A typical call to this script looks like:

python test.py -m model_wsj_3s -i data/wsj/test.txt -o wsj-perplexity.txt

Upon executing this line, your code should fetch the files cached in model_wsj_3s and re-create the LM. Then, it should use the LM to compute the perplexity of the test sentences found at data/wsj/test.txt. Finally, the script should write the perplexity number to the file wsj-perplexity.txt.
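A minimal sketch of the core computation, done in log space to avoid underflow; the log_prob helper and the <s>/</s> padding convention are the same hypothetical choices as in the training sketch above:

```python
import math

def perplexity(log_prob, sentences):
    """Perplexity over a list of tokenized sentences (trigram-style padding)."""
    total_log_prob, num_tokens = 0.0, 0
    for tokens in sentences:
        padded = ["<s>", "<s>"] + tokens + ["</s>"]
        for i in range(2, len(padded)):
            # log_prob(ngram) -> natural-log P(w | h) (hypothetical helper).
            total_log_prob += log_prob(tuple(padded[i - 2:i + 1]))
            num_tokens += 1
    return math.exp(-total_log_prob / num_tokens)
```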

#### log-prob.py

The script log-prob.py should take the following command line arguments:

• --model <DIR> or -m <DIR>: the directory containing the model files. Your implementation will read the files you saved earlier to recreate the language model.
• --input <FILE> or -i <FILE>: the path to the text file containing the n-grams to query. An example input file containing 5 trigrams looks like

the united states
winning the election
repealing Obamacare is
have does will
have had had

Note that the trigrams are separated by newlines. In each line, the words of the trigram are separated by spaces.

• --output <FILE> or -o <FILE>: the path to the text file containing the log-probabilities of the queried n-grams, one per line. Note that the example values below are natural logarithms (e.g., $$-0.6931 \approx \ln 0.5$$). The output for the example input above should look like

-0.6931471805599453
-1.6094379124341003
-0.2231435513142097
-4.605170185988091
-2.3025850929940455

A typical call to this script looks like

python log-prob.py -m model_wsj_3s/ -i data/wsj/trigrams-to-check.txt -o log-probs.txt

Upon executing this line, your script should load the model in model_wsj_3s/, read in the list of n-grams in data/wsj/trigrams-to-check.txt, and then output the log-probabilities to the file log-probs.txt.

#### run-task.py

The script run-task.py is for running the genre detection task, which should take the following command line arguments:

• --wsjmodel <DIR>: the path to the directory of the WSJ LM to use.
• --sbmodel <DIR>: the path to the directory of the Switchboard LM to use.
• --input <FILE> or -i <FILE>: the input file consisting of sentences separated by newlines (same format as the training data). However, this time each line may come from either of the two genres:

uh , do you have a pet randy ?
the united states is trying to negate the great efforts made by china .
i 'm gonna do it.
obama will exit office with the republican party resurgent on the state and federal levels .
• --output <FILE> or -o <FILE>: the output file containing your answers. Put the label for each sentence on the corresponding line, i.e., the i-th line of the output file should be the answer for the i-th sentence in the input:

sb
wsj
wsj
wsj

A typical call to this script would be

python run-task.py --wsjmodel model_wsj_3s/ --sbmodel model_sb_3s/ -i data/genre-task/mixed-sentences.txt -o answer.txt

Upon execution, your script should load the two language models from model_wsj_3s/ and model_sb_3s/ and use them to decide, for each sentence in data/genre-task/mixed-sentences.txt, which corpus it came from. Finally, the script should write the answers to the file answer.txt.

### Utility Scripts

#### accuracy.py

This script is already implemented for you. Use it to evaluate the performance of your genre detection implementation. It takes the following command line arguments:

• --gold <FILE> or -g <FILE>: the gold-standard answer.

sb
wsj
sb
wsj
• --auto <FILE> or -a <FILE>: the automatic answer produced by your implementation.

sb
wsj
wsj
wsj

To use this script, run-task.py must be run first. Let answer.txt be the path to the automatic answers produced by your implementation; then call

python accuracy.py -g data/genre-task/gold.txt -a answer.txt

The script will print the accuracy to the screen. With the example inputs above, it prints:

Accuracy = 3/4 (75%)

## Handling OOV Words

All the language models you implement should be open-vocabulary. This means that, to handle out-of-vocabulary (OOV) words, you need to do the following:

• Treat any token that occurs only once in the training data as OOV, and map all such tokens to a special token, <unk>.
• Train your LM in the usual way, as if <unk> were a token naturally occurring in the training data.
• At test time, map any token that your LM did not see in the training data to <unk>.

Please use <unk>, and not <UNK> or other variations, in your implementation, as we will check the probabilities your LM learned for <unk>.
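A minimal sketch of this strategy (helper names are illustrative):

```python
from collections import Counter

def build_vocab(train_path):
    """Tokens occurring at least twice form the vocabulary; the rest become <unk>."""
    counts = Counter()
    with open(train_path) as f:
        for line in f:
            counts.update(line.split())
    return set(tok for tok, c in counts.items() if c >= 2) | {"<unk>"}

def map_oov(tokens, vocab):
    """Replace out-of-vocabulary tokens with <unk> (apply at train and test time)."""
    return [tok if tok in vocab else "<unk>" for tok in tokens]
```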

## Java Implementations

If you use Java, we recommend submitting a single jar with 4 main classes, so that the four scripts become

java -cp your.jar your.code.Train -t 3s -i data/wsj/train.txt -m model_wsj_3s/
java -cp your.jar your.code.Test -m model_wsj_3s -i data/wsj/test.txt -o wsj-perplexity.txt
java -cp your.jar your.code.LogProb -m model_wsj_3s/ -i data/wsj/trigrams-to-check.txt -o log-probs.txt
java -cp your.jar your.code.RunTask --wsjmodel model_wsj_3s/ --sbmodel model_sb_3s/ -i data/genre-task/mixed-sentences.txt -o answer.txt

In this case, you will need to implement the command line argument parsing by yourself. Again, consult the TA when in doubt.

## Grading

The maximum possible score for this homework is 100 points, consisting of:

• Code (60 points)
• 20 points for correctly implementing the unsmoothed unigram and trigram language models.
• 20 points for correctly implementing the add-one smoothing for trigram models.
• 10 points for correctly implementing the intrinsic evaluation (perplexity evaluation).
• 10 points for effectively implementing the extrinsic evaluation (genre detection).
• Report (40 points)
• 25 points for your program description and critical analysis
• 15 points for presenting the requested supporting data

## Submission

Submit an archive containing:

• The 4 scripts: train.py, test.py, log-prob.py, run-task.py. There is no need to submit any data files or any files output by your scripts.
• A report (1 - 2 pages) containing
• Description of how you implemented your language models.
• Perplexity scores of your three LMs on the two corpora.
• Which WSJ model and which Switchboard model perform best in terms of genre detection. (This item was added on Feb 1, 2017.)
• Description of how your run-task.py decides whether a sentence is from WSJ or Switchboard based on the LM perplexities.
• Critical analysis of your results, e.g.,
• Why do your perplexity scores tell you what genre the test data is written in?
• How does the OOV handling strategy impact the genre detection performance?
• What does a comparison of your unigram and trigram scores tell you?
• What does a comparison of your unsmoothed versus smoothed scores tell you?
• etc.
• A readme file describing:
• Version of Python (Python 2 is recommended, as you can begin directly with the provided code)
• Any known issues your code has (e.g., under what condition will it crash, if it does)