HOMEWORK 1: Character-Level Language Models
Assigned: September 3, 2020
Due: September 22, 2020 (before midnight)
In this assignment, you will build unigram,
bigram, and trigram
character language models (both unsmoothed and
smoothed versions) for three languages, score a test document with
each, and determine the language it is written in based on
perplexity. You will also use your English language models to
generate texts. You will critically examine all results.
The learning goals of this assignment are to:
- Understand how to compute language model probabilities using
maximum likelihood estimation.
- Implement basic and tuned smoothing and interpolation.
- Use the perplexity of a language model to perform language identification.
- Use a language model to probabilistically generate texts.
Train Language Models
To complete the assignment, you will need to write
a program (from scratch) that:
- builds the models: reads in training data, collects
counts for all character 1-, 2-, and 3-grams, estimates probabilities, and writes
the unigram, bigram, and trigram models out to files
- adjusts the counts: rebuilds the bigram and trigram language models using two
different methods: add-one smoothing and
linear interpolation with lambdas equally weighted
- adjusts the counts using tuned methods: rebuilds the bigram
and trigram language models using add-k
smoothing (where k is tuned) and with linear interpolation (where
lambdas are tuned); tune by choosing from a set of values using held-out data
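As one possible starting point, the counting and estimation steps above can be sketched for the bigram case. The function names, the toy corpus, and the two-way interpolation below are illustrative assumptions, not required structure:

```python
from collections import Counter

def ngram_counts(text, n):
    """Count all character n-grams in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def mle_bigram_prob(bigrams, unigrams, history, char):
    """Maximum likelihood estimate: count(history+char) / count(history)."""
    return bigrams[history + char] / unigrams[history]

def addk_bigram_prob(bigrams, unigrams, vocab, history, char, k=1.0):
    """Add-k smoothed estimate; k=1 gives add-one (Laplace) smoothing."""
    return (bigrams[history + char] + k) / (unigrams[history] + k * len(vocab))

def interp_prob(p_uni, p_bi, lambdas=(0.5, 0.5)):
    """Linear interpolation of unigram and bigram estimates
    (equal lambdas for the untuned version; tune lambdas on held-out data)."""
    return lambdas[0] * p_uni + lambdas[1] * p_bi

# Toy corpus: "abab" has unigram counts {a: 2, b: 2}
# and bigram counts {ab: 2, ba: 1}.
text = "abab"
uni = ngram_counts(text, 1)
bi = ngram_counts(text, 2)
```

The same structure extends to trigrams by conditioning on a two-character history; tuning add-k and the lambdas then amounts to looping over candidate values and keeping whichever minimizes held-out perplexity.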
You may make any
additional assumptions and design decisions, but state them in your
report (see below).
For example, you might decide how to handle uppercase versus lowercase
letters, or how to handle digits. The choice is up to you; we only
require that you detail these decisions in your report and consider
their implications for your results. There is no wrong choice here:
NLP researchers routinely make such decisions when pre-processing
data.
You may write your program in
any TA-approved programming language (Python, Java, C/C++).
For this assignment you must implement the model generation from
scratch. You are allowed to use any resources or packages that help
you manage your project, e.g., GitHub or file I/O packages. If
you have questions about this, please ask.
Use Language Models
Language identification:
For each trained language model, read in the
test document, apply the model to each sentence in it,
and output the perplexity. Based on these results, identify the language
of the test document.
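The per-sentence perplexity computation can be sketched as follows, treating the model as a function from (history, character) to probability. The function name and the uniform toy model are illustrative assumptions:

```python
import math

def perplexity(sentence, prob_fn, n=2):
    """Per-character perplexity: 2 ** (-average log2 probability).

    prob_fn(history, char) returns P(char | history); for an n-gram
    model the history is the preceding n-1 characters.
    """
    log_sum, count = 0.0, 0
    for i in range(n - 1, len(sentence)):
        history, char = sentence[i - n + 1:i], sentence[i]
        log_sum += math.log2(prob_fn(history, char))
        count += 1
    return 2 ** (-log_sum / count)

# Sanity check: a uniform model over 4 characters has perplexity 4.
uniform = lambda history, char: 0.25
```

Lower perplexity means the model is less surprised by the text, so the test document's language is the one whose models assign it the lowest perplexity. Note that unsmoothed models assign zero probability to unseen n-grams, so this computation will fail on them without smoothing.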
Text generation:
For only the bigram and trigram language models trained on English,
extend your programs so that you can
generate sentences. That is, given
any English letter(s) as input and based on the n-gram model you
should continue the sentence with the most likely characters.
More specifically, given a letter to begin a
sentence with, choose as the next character the one that yields
the highest n-gram count when combined with the preceding character(s) into
an n-gram. Thus, you will use the previous character for bigrams and the
previous two characters for trigrams. Continue generating
new characters until you have generated a 100-character sentence or you have
reached a dead end (all n-gram counts are zero). The beginning of
the sentence (one character for bigrams and two characters for trigrams)
should be command line parameter(s).
Your program should output the resulting sentence.
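The greedy loop described above can be sketched for the bigram case (the function name and toy counts are illustrative; the trigram version is analogous, keyed on the last two characters):

```python
from collections import Counter

def generate(seed, bigram_counts, max_len=100):
    """Greedily extend seed one character at a time.

    At each step, pick the character with the highest bigram count
    given the previous character; stop at max_len characters or at a
    dead end (no bigram starting with the previous character).
    """
    sent = seed
    while len(sent) < max_len:
        prev = sent[-1]
        # Candidate next characters and their counts.
        candidates = {bg[1]: c for bg, c in bigram_counts.items()
                      if bg[0] == prev}
        if not candidates:
            break  # dead end: all n-gram counts are zero
        sent += max(candidates, key=candidates.get)
    return sent
```

With toy counts like `Counter({"ab": 3, "ba": 2})`, seeding with "a" alternates "abab..." until the length cap; note that greedy selection always produces the same output for a given seed.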
Data
The data for this project is available
here.
It consists of:
- training.en - English training data
- training.es - Spanish training data
- training.de - German training data
- test - test document
Report
Your report should include:
- a description of how you wrote your program, including all
assumptions and design decisions (1 - 2 pages)
- an excerpt of the two untuned trigram language models for English, displaying all
n-grams and their probability with the two-character history t h
- documentation that your probability distributions are valid (sum
to 1)
- documentation that your tuning did not train on the test set
- for your best performing language model, the perplexity scores for each sentence (i.e., line) in the test document, as well as the
document average. For all other unsmoothed and smoothed models, you
just need to show the document average.
- generated text outputs for the following inputs: bigrams starting with
each of the 26 letters, and trigrams using the 26 letters as the
first character with a second meaningful character of your choice.
- critical analysis of your language identification results: e.g.,
why do your perplexity scores tell you what language the test data is
written in?
what does a comparison of your unsmoothed versus smoothed scores
tell you about which performs best?
what does a comparison of your unigram, bigram, and trigram scores
tell you about which performs best?
etc. (1 - 2 pages)
- critical analysis of your generation results: e.g.,
are there any differences between the sentences generated by bigrams
and trigrams, or by the unsmoothed versus smoothed models?
(1 - 2 pages)
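For the requirement that your probability distributions be valid, one lightweight way to document it is to sum each conditional distribution over the vocabulary and report the results. The helper name and the uniform toy model below are hypothetical:

```python
def distribution_sums(prob_fn, vocab, histories, tol=1e-9):
    """For each history, check that P(char | history), summed over the
    full vocabulary, equals 1 within a floating-point tolerance."""
    return {h: abs(sum(prob_fn(h, c) for c in vocab) - 1.0) < tol
            for h in histories}
```

Running this over every observed history (or a sample of them) for each model gives a compact table you can include in your report.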
Submission Procedure
Your full submission should include not only your report but also:
- the code of your program(s)
- a README file explaining
- how to run your code and the computing environment you used; for Python users, please indicate the Python version
- any additional resources, references, or web pages you've consulted
- any person with whom you've discussed the assignment and describe
the nature of your discussions
- any unresolved issues or problems
Make sure your submission works from the command line because the TA
will be running your submissions.
The submission should be done using Canvas. The file
should have the following naming convention: yourfullname_hw1.zip (ex:
DianeLitman_hw1.zip). The report, the code, and your README file should be
submitted inside the archived folder.
The date in Canvas will be used to determine when your
assignment was submitted (to implement the late policy).
Grading
Code (75 points)
- 25 points for correctly implementing unsmoothed unigram, bigram,
and trigram language models
- 20 points for correctly implementing basic smoothing and interpolation for
bigram and trigram models
- 10 points for improving your smoothing and interpolation results with tuned methods
- 10 points for correctly implementing evaluation via
perplexity
- 10 points for correctly implementing text generation
Report (25 points)
- 20 points for your program description and critical
analysis
- 5 points for presenting the requested supporting data
Extra Credit (10 points)
- for training n-gram models with higher values of n until you can generate text
that actually seems like English