HOMEWORK 1: Character-Level Language Models

Assigned: September 3, 2020

Due: September 22, 2020 (before midnight)

In this assignment, you will build unigram, bigram, and trigram character language models (both unsmoothed and smoothed versions) for three languages, score a test document with each, and determine the language it is written in based on perplexity. You will also use your English language models to generate texts. You will critically examine all results. The learning goals of this assignment are to:

Understand how to compute language model probabilities using maximum likelihood estimation.
Implement basic and tuned smoothing and interpolation.
Use the perplexity of a language model to perform language identification.
Use a language model to probabilistically generate texts.

Train Language Models

To complete the assignment, you will need to write a program (from scratch) that:

builds the models: reads in training data, collects counts for all character 1, 2, and 3-grams, estimates probabilities, and writes out the unigram, bigram, and trigram models into files
adjusts the counts: rebuilds the bigram and trigram language models using two different methods: add-one smoothing and linear interpolation with lambdas equally weighted
adjusts the counts using tuned methods: rebuilds the bigram and trigram language models using add-k smoothing (where k is tuned) and with linear interpolation (where lambdas are tuned); tune by choosing from a set of values using held-out data
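
As an illustration of the three steps above, here is a minimal Python sketch, not a reference solution. All function names, the tables layout, and the candidate values for k are assumptions made for this example. It also assumes histories are counted from the n-grams themselves, so that each conditional distribution sums to 1, and that tuning maximizes the log-likelihood of the held-out text.

```python
import math
from collections import Counter


def count_ngrams(text, n):
    """Count every character n-gram of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def history_counts(ngram_counts):
    """Total count of each (n-1)-character history, summed over next characters."""
    hist = Counter()
    for ngram, c in ngram_counts.items():
        hist[ngram[:-1]] += c
    return hist


def mle_prob(ngram, counts, hist):
    """Unsmoothed estimate count(ngram) / count(history); 0.0 for an unseen history."""
    denom = hist[ngram[:-1]]
    return counts[ngram] / denom if denom else 0.0


def add_k_prob(ngram, counts, hist, vocab_size, k=1.0):
    """Add-k estimate; k = 1.0 is add-one smoothing."""
    return (counts[ngram] + k) / (hist[ngram[:-1]] + k * vocab_size)


def interpolated_prob(trigram, tables, lambdas=(1 / 3, 1 / 3, 1 / 3)):
    """Linear interpolation of unigram, bigram, and trigram MLE estimates.

    tables[n] is a (counts, hist) pair for order n (a layout assumed here).
    """
    return sum(
        lam * mle_prob(trigram[3 - n:], *tables[n])
        for n, lam in zip((1, 2, 3), lambdas)
    )


def tune_k(train, heldout, n, vocab_size, candidates=(0.01, 0.1, 0.5, 1.0)):
    """Pick the k from candidates with the highest held-out log-likelihood."""
    counts = count_ngrams(train, n)
    hist = history_counts(counts)

    def log_likelihood(k):
        return sum(
            math.log(add_k_prob(heldout[i:i + n], counts, hist, vocab_size, k))
            for i in range(len(heldout) - n + 1)
        )

    return max(candidates, key=log_likelihood)
```

The same grid-search pattern could be applied to the interpolation lambdas by scoring each candidate triple on the held-out text. The held-out text must be carved out of the training data, never taken from the test document.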

You may make any additional assumptions and design decisions, but state them in your report (see below). For example, you could decide how to handle uppercase and lowercase letters or how to handle digits. The choice is up to you; we only require that you detail these decisions in your report and consider their implications for your results. There is no wrong choice here, and these are the kinds of decisions NLP researchers typically make when pre-processing data.

You may write your program in any TA-approved programming language (Python, Java, C/C++).

For this assignment you must implement the model generation from scratch. You are allowed to use any resources or packages that help you manage your project, e.g., GitHub or file I/O packages. If you have questions about this, please ask.

Use Language Models

Language identification: For all trained language models, read in the test document, apply the language model to all sentences in it, and output perplexity. Based on the results, identify the language of the test document.
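
One way to compute perplexity and pick a language is sketched below in Python. It is only an illustration: the prob callable, the function names, and the choice to return infinity when an n-gram has zero probability are assumptions for this example, and the sketch averages over the n-grams that fit inside the text without any padding.

```python
import math


def perplexity(text, n, prob):
    """Perplexity of text under an n-gram model.

    prob is a hypothetical callable: prob(ngram) returns the probability of
    the last character of ngram given the n-1 characters before it.
    """
    ngrams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not ngrams:
        raise ValueError("text is shorter than n")
    log_sum = 0.0
    for ngram in ngrams:
        p = prob(ngram)
        if p <= 0.0:
            # An unsmoothed model assigns zero probability to unseen n-grams.
            return math.inf
        log_sum += math.log2(p)
    return 2 ** (-log_sum / len(ngrams))


def identify_language(perplexities):
    """Return the language whose model has the lowest perplexity.

    perplexities maps a language label to a perplexity score.
    """
    return min(perplexities, key=perplexities.get)
```

For example, a model that assigns probability 0.25 to every n-gram has perplexity 4.0 on any text, which matches the intuition that the model is choosing uniformly among four characters.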

Text generation: For only the bigram and trigram language models trained on English, extend your programs so that you can generate sentences. That is, given any English letter(s) as input, you should continue the sentence with the most likely characters according to the n-gram model. More specifically, given a letter to begin a sentence with, choose as the next character the one that yields the highest n-gram count when combined with the previous characters into an n-gram. Thus, you will use the previous character for bigrams and the previous two characters for trigrams. Continue generating new characters until you have generated a 100-character sentence or you have reached a dead end (all n-gram counts are zero). The beginning of the sentence (one character for bigrams and two characters for trigrams) should be given as command line parameter(s). Your program should output the resulting sentence.
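
The generation procedure described above can be sketched in Python as follows. This is a rough illustration under stated assumptions: the function name and the counts mapping (n-gram string to count) are hypothetical, ties are broken alphabetically (a choice the assignment leaves open), and reading the seed from the command line is omitted.

```python
def generate(seed, counts, n, max_len=100):
    """Greedily extend seed one character at a time.

    counts maps each n-gram string to its training count. The seed should
    hold at least n-1 characters (one for bigrams, two for trigrams).
    """
    out = seed
    while len(out) < max_len:
        history = out[-(n - 1):]
        # Keep only n-grams that start with the current history.
        candidates = {
            ngram: c
            for ngram, c in counts.items()
            if len(ngram) == n and ngram[:-1] == history and c > 0
        }
        if not candidates:
            break  # dead end: every continuation has a zero count
        # sorted() makes the alphabetically first n-gram win any tie.
        best = max(sorted(candidates), key=candidates.get)
        out += best[-1]
    return out
```

Because the most likely character is always chosen, the output is deterministic for a given seed and tends to fall into a repeating cycle, which is worth keeping in mind when analyzing the generated text.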

Data

The data for this project is available here. It consists of:

training.en - English training data
training.es - Spanish training data
training.de - German training data
test - test document

Report

Your report should include:

a description of how you wrote your program, including all assumptions and design decisions (1 - 2 pages)
an excerpt of the two untuned trigram language models for English, displaying all n-grams and their probability with the two-character history t h
documentation that your probability distributions are valid (sum to 1)
documentation that your tuning did not train on the test set
for your best performing language model, the perplexity scores for each sentence (i.e., line) in the test document, as well as the document average. For all other unsmoothed and smoothed models, you just need to show the document average.
generated text outputs for the following inputs: bigrams starting with each of the 26 letters, and trigrams using the 26 letters as the first character with a second meaningful character of your choice.
critical analysis of your language identification results: e.g., why do your perplexity scores tell you what language the test data is written in? what does a comparison of your unsmoothed versus smoothed scores tell you about which performs best? what does a comparison of your unigram, bigram, and trigram scores tell you about which performs best? etc. (1 - 2 pages)
critical analysis of your generation results: e.g., are there any differences between the sentences generated by bigrams and trigrams, or by the unsmoothed versus smoothed models? (1 - 2 pages)
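
For the item above on documenting that your probability distributions are valid, one possible check is sketched below in Python. The function names, the prob callable, and the tolerance are assumptions for this example: for a fixed history, the probabilities of all characters in the vocabulary are summed and compared with 1.

```python
import math


def distribution_total(prob, history, vocab):
    """Sum P(ch | history) over every character ch in vocab.

    prob is a hypothetical callable taking the n-gram string history + ch.
    """
    return sum(prob(history + ch) for ch in vocab)


def is_valid_distribution(prob, history, vocab, tol=1e-9):
    """True if the conditional distribution for history sums to 1 within tol."""
    return math.isclose(distribution_total(prob, history, vocab), 1.0, abs_tol=tol)
```

Running such a check for several histories, including ones never seen in training, and reporting the totals is one way to present this evidence in the report.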

Submission Procedure

Your full submission should include not only your report, but also include:

the code of your program(s)
a README file explaining
- how to run your code and the computing environment you used; for Python users, please indicate the version of the interpreter
- any additional resources, references, or web pages you've consulted
- any person with whom you've discussed the assignment and describe the nature of your discussions
- any unresolved issues or problems

Make sure your submission works from the command line because the TA will be running your submissions.

The submission should be done using Canvas. The file should have the following naming convention: yourfullname_hw1.zip (e.g., DianeLitman_hw1.zip). The report, the code, and your README file should be submitted inside the archived folder.

The date in Canvas will be used to determine when your assignment was submitted (to implement the late policy).

Grading

Code (75 points)

25 points for correctly implementing unsmoothed unigram, bigram, and trigram language models
20 points for correctly implementing basic smoothing and interpolation for bigram and trigram models
10 points for improving your smoothing and interpolation results with tuned methods
10 points for correctly implementing evaluation via perplexity
10 points for correctly implementing text generation

Report (25 points)

20 points for your program description and critical analysis
5 points for presenting the requested supporting data

Extra Credit (10 points)

for training n-gram models with higher values of n until you can generate text that actually seems like English