HOMEWORK 1 (CS 1671)
Assigned: January 16, 2020
Due: February 4, 2020 (before midnight)
In this assignment, you will use language modeling to detect which of
three languages a document is written in. You will build unigram,
bigram, and trigram
letter language models (both unsmoothed and
smoothed versions) for three languages, score a test document with
each model, and determine the language it is written in based on
perplexity. You will critically examine your results.
Tasks
To complete the assignment, you will need to write
a program (from scratch) that:
- builds the models: reads in a text, collects counts for all letter n-grams of size 1, 2, and 3, estimates probabilities, and writes out the unigram, bigram, and trigram models into files (a counting sketch appears after this list)
- adjusts the counts: rebuilds the trigram language model using two different methods: Laplace smoothing and linear interpolation with equally weighted lambdas (a sketch of both appears at the end of this section)
- evaluates all unsmoothed and smoothed models: reads in a
test document, applies the language models to all sentences in it, outputs their perplexity, and determines the language of the test document
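As a concrete (but not prescriptive) reference, the model-building step amounts to collecting letter n-gram counts and converting them into relative-frequency (maximum-likelihood) estimates. The following is a minimal sketch in Python, assuming lowercasing and per-line processing; the function names, the tuple keys, and the absence of start/end-of-line padding are illustrative design decisions, not requirements.

from collections import defaultdict

def ngram_counts(lines, n):
    # Count letter n-grams (as tuples of characters) over an iterable of text lines.
    counts = defaultdict(int)
    for line in lines:
        chars = list(line.strip().lower())  # example pre-processing choice: lowercase
        for i in range(len(chars) - n + 1):
            counts[tuple(chars[i:i + n])] += 1
    return counts

def mle_probs(counts_n, counts_hist):
    # Relative-frequency estimate P(c | history) = count(history + c) / count(history).
    # For unigrams the history is empty, so the denominator is the total letter count.
    total = sum(counts_n.values())
    probs = {}
    for gram, count in counts_n.items():
        history = gram[:-1]
        probs[gram] = count / (counts_hist[history] if history else total)
    return probs

# Example: build the English unigram and bigram models from the training file.
with open("training.en", encoding="utf-8") as f:
    lines = f.readlines()
unigram_counts = ngram_counts(lines, 1)
bigram_counts = ngram_counts(lines, 2)
bigram_model = mle_probs(bigram_counts, unigram_counts)

Note that without start/end-of-line padding, the conditional distribution for a history that occurs at the end of a line sums to slightly less than 1; whether and how you pad is exactly the kind of decision to document in your report.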
You may make any additional assumptions and design decisions, but state them in your report (see below). For example, you will need to decide how to handle uppercase versus lowercase letters and how to handle digits. The choice is yours; we only require that you detail these decisions in your report and consider their implications in your results. There is no wrong choice here, and NLP researchers routinely make such decisions when pre-processing data.
You may write your program in any TA-approved programming language (so far, Java or Python).
For this assignment, you must implement the model generation from scratch. You may use any resources or packages that help you manage your project, e.g., GitHub or file I/O packages. If you have questions about this, please ask.
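For the two count adjustments above, each trigram probability can be computed as a single lookup. The sketch below is one hedged way to do it, reusing the count dictionaries from the earlier sketch; V (the number of distinct letters seen in training, e.g. len(unigram_counts)), total_letters (e.g. sum(unigram_counts.values())), and the helper names are assumptions for illustration only.

def laplace_prob(trigram, tri_counts, bi_counts, V):
    # Add-one (Laplace) estimate: P(c3 | c1 c2) = (count(c1 c2 c3) + 1) / (count(c1 c2) + V)
    return (tri_counts.get(trigram, 0) + 1) / (bi_counts.get(trigram[:2], 0) + V)

def interpolated_prob(trigram, uni_counts, bi_counts, tri_counts, total_letters):
    # Equal-weight linear interpolation of the unigram, bigram, and trigram MLE estimates:
    # P(c3 | c1 c2) = 1/3 * P(c3) + 1/3 * P(c3 | c2) + 1/3 * P(c3 | c1 c2)
    c1, c2, c3 = trigram
    p_uni = uni_counts.get((c3,), 0) / total_letters
    p_bi = bi_counts.get((c2, c3), 0) / uni_counts[(c2,)] if uni_counts.get((c2,), 0) else 0.0
    p_tri = tri_counts.get(trigram, 0) / bi_counts[(c1, c2)] if bi_counts.get((c1, c2), 0) else 0.0
    return (p_uni + p_bi + p_tri) / 3.0

Setting all three lambdas to 1/3 satisfies the equal-weighting requirement; how you handle unseen histories (the zero-denominator cases above) is another decision worth a sentence in your report.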
Data
A tar file containing the data for this project is available
here.
It consists of:
- training.en - English training data
- training.es - Spanish training data
- training.de - German training data
- test - test document
Report
Your report should include:
- a description of how you wrote your program, including all assumptions and design decisions (1 - 2 pages)
- documentation that your probability distributions are valid (i.e., that they sum to 1); a sketch of this check and of the per-sentence perplexity computation appears after this list
- the perplexity scores for all unsmoothed and smoothed language
models for each sentence (i.e., line) in the test document, as well as the
document average
- critical analysis of your results (1 - 2 pages), addressing questions such as:
  - Why do your perplexity scores tell you what language the test data is written in?
  - What does a comparison of your unsmoothed versus smoothed scores tell you about which performs best?
  - What does a comparison of your unigram, bigram, and trigram scores tell you about which performs best?
  - Why do you think your models performed the way they did?
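Both the validity check and the per-sentence perplexity can be produced with a few lines of code. The sketch below assumes each model exposes its probability as a function prob_fn(ngram) (an illustrative interface, not a requirement); the perplexity of a line scored over N letters is exp(-(1/N) * sum of log P(letter | previous n-1 letters)).

import math

def sums_to_one(probs, tol=1e-9):
    # Group conditional probabilities by their history and check each group sums to ~1.
    totals = {}
    for gram, p in probs.items():
        totals[gram[:-1]] = totals.get(gram[:-1], 0.0) + p
    return all(abs(total - 1.0) <= tol for total in totals.values())

def perplexity(line, prob_fn, n):
    # PP = exp(-(1/N) * sum_i log P(c_i | c_{i-n+1} ... c_{i-1})) over the N letters scored.
    chars = list(line.strip().lower())
    log_sum, N = 0.0, 0
    for i in range(n - 1, len(chars)):
        p = prob_fn(tuple(chars[i - n + 1:i + 1]))
        if p == 0.0:
            return float("inf")  # unsmoothed models can assign zero probability to unseen n-grams
        log_sum += math.log(p)
        N += 1
    return math.exp(-log_sum / N) if N else float("inf")

# Example: score one test line with the earlier bigram model.
# pp = perplexity("some test line", lambda g: bigram_model.get(g, 0.0), 2)

The document average can then be taken over the per-line perplexities (or computed from the pooled log-probabilities); either is defensible, so state which you chose.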
Submission Details
In addition to your report, your submission should include:
- the code of your program(s)
- a README file that explains:
  - how to run your code
  - the computing environment you used, including the programming language and its major and minor version
  - any additional resources, references, or web pages you've consulted
  - any people with whom you've discussed the assignment and the nature of those discussions
  - any unresolved issues or problems
Make sure your submission works from the command line because the TA will be running your submissions.
The submission should be done using the Assignment Tool in CourseWeb/Blackboard. The file should be a zip file with the following naming convention: yourfullname_hw1.zip (e.g., DianeLitman_hw1.zip). The report, the code, and your README file should be submitted inside the archived folder.
The date in CourseWeb will be used to determine when your
assignment was submitted (to implement the late policy).
Grading
Code (70 points):
- 30 points for correctly implementing unsmoothed unigram, bigram,
and trigram language models
- 30 points for correctly implementing smoothing and interpolation for
trigram models
- 10 points for correctly implementing evaluation via
perplexity
Report (30 points):
- 20 points for your program description and critical
analysis
- 10 points for presenting the requested supporting data