HOMEWORK 1: Character-Level Language Models

Assigned: September 3, 2020

Due: September 22, 2020 (before midnight)

In this assignment, you will build unigram, bigram, and trigram character language models (both unsmoothed and smoothed versions) for three languages, score a test document with each, and determine the language it is written in based on perplexity. You will also use your English language models to generate texts. You will critically examine all results. The learning goals of this assignment are to:

Train Language Models

To complete the assignment, you will need to write a program (from scratch) that:

You may make any additional assumptions and design decisions, but state them in your report (see below). For example, you will need to decide how to handle uppercase and lowercase letters and how to handle digits. The choice is up to you; we only require that you detail these decisions in your report and consider their implications for your results. There is no wrong choice here, and NLP researchers routinely make such decisions when pre-processing data.
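As an illustration of the kind of design decision described above, the following minimal Python sketch lowercases the text and maps every digit to a single placeholder character. The function name `normalize`, its parameters, and the placeholder `"0"` are hypothetical choices made for this example only; they are not required by the assignment, and other choices are equally valid.

```python
import re

def normalize(text, lowercase=True, digit_token="0"):
    """Apply two illustrative pre-processing choices to raw text."""
    if lowercase:
        # Choice 1 (assumption): treat "A" and "a" as the same character.
        text = text.lower()
    if digit_token is not None:
        # Choice 2 (assumption): map every digit to one placeholder
        # so that all numbers share the same n-gram counts.
        text = re.sub(r"\d", digit_token, text)
    return text

print(normalize("Room 42 opens in 2020."))  # prints: room 00 opens in 0000.
```

Whatever you decide, apply the same pre-processing to the training data and to the test document, and describe it in your report.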

You may write your program in any TA-approved programming language (Python, Java, C/C++).

For this assignment you must implement the model generation from scratch. You are allowed to use any resources or packages that help you manage your project, e.g., GitHub or file I/O packages. If you have questions about this, please ask.
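To make "from scratch" concrete, here is one possible sketch of n-gram training that uses only the Python standard library. It is not a reference solution. The names `train_ngram` and `prob`, the boundary markers `^` and `$`, and the use of add-one (Laplace) smoothing are all assumptions made for this example; the assignment does not prescribe a particular smoothing method or padding scheme.

```python
from collections import Counter

def train_ngram(lines, n):
    """Count character n-grams and their (n-1)-character contexts."""
    ngrams, contexts, vocab = Counter(), Counter(), set()
    for line in lines:
        # Assumption: "^" pads the start and "$" marks the end of a sentence.
        padded = "^" * (n - 1) + line + "$"
        vocab.update(line + "$")  # characters the model can predict
        for i in range(len(padded) - n + 1):
            gram = padded[i:i + n]
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1
    return ngrams, contexts, vocab

def prob(gram, ngrams, contexts, vocab, smooth=True):
    """P(last char | preceding chars): unsmoothed, or add-one smoothed."""
    k = 1 if smooth else 0
    denom = contexts[gram[:-1]] + k * len(vocab)
    return (ngrams[gram] + k) / denom if denom else 0.0

ngrams, contexts, vocab = train_ngram(["abab"], n=2)
print(prob("ab", ngrams, contexts, vocab, smooth=False))  # 2/2 = 1.0
print(prob("ab", ngrams, contexts, vocab, smooth=True))   # (2+1)/(2+3) = 0.6
```

With `n=1` the context is the empty string, so the same two functions also cover the unigram model.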

Use Language Models

  • Language identification: For all trained language models, read in the test document, apply the language model to all sentences in it, and output perplexity. Based on the results, identify the language of the test document.
  • Text generation: For only the bigram and trigram language models trained on English, extend your programs so that you can generate sentences. That is, given any English letter(s) as input, your program should continue the sentence with the most likely characters under the n-gram model. More specifically, given the beginning of a sentence, choose as the next character the one that yields the highest n-gram count when combined with the preceding character(s) into an n-gram. Thus, you will use the previous character for bigrams and the previous two characters for trigrams. Continue generating new characters until you have generated a 100-character sentence or have reached a dead end (all n-gram counts are zero). The beginning of the sentence (one character for bigrams and two characters for trigrams) should be given as command-line parameter(s). Your program should output the resulting sentence.
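The two tasks above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a reference solution: the helper names (`ngram_counts`, `perplexity`, `generate`), the made-up training strings, the add-one smoothing, and the vocabulary shared across languages are all hypothetical choices. Ties between equally frequent n-grams are broken by first occurrence here, which is itself a design decision you would report.

```python
import math
from collections import Counter

def ngram_counts(text, n):
    """Count every run of n consecutive characters in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def perplexity(text, bigrams, vocab_size):
    """Perplexity of text under an add-one smoothed bigram model."""
    context = Counter()
    for gram, count in bigrams.items():
        context[gram[0]] += count
    log_sum, steps = 0.0, 0
    for i in range(len(text) - 1):
        gram = text[i:i + 2]
        p = (bigrams[gram] + 1) / (context[gram[0]] + vocab_size)
        log_sum += math.log(p)
        steps += 1
    return math.exp(-log_sum / steps) if steps else float("inf")

def generate(seed, counts, n, max_len=100):
    """Greedily extend seed (n - 1 characters, n >= 2) by the top n-gram."""
    out = seed
    while len(out) < max_len:
        prefix = out[-(n - 1):]
        options = {g: c for g, c in counts.items()
                   if g.startswith(prefix) and c > 0}
        if not options:
            break  # dead end: no n-gram continues this prefix
        out += max(options, key=options.get)[-1]
    return out

# Language identification: the lowest perplexity wins.
training = {"lang_a": "abababab", "lang_b": "cdcdcdcd"}  # toy data
test_doc = "abab"
vocab_size = len(set("".join(training.values()) + test_doc))
scores = {name: perplexity(test_doc, ngram_counts(text, 2), vocab_size)
          for name, text in training.items()}
guess = min(scores, key=scores.get)
print(guess)  # prints: lang_a

# Text generation with a bigram model, capped at 7 characters here.
print(generate("a", ngram_counts("abcabcabd", 2), n=2, max_len=7))  # abcabca
```

In an actual submission, the seed would be read from the command line (for example via `sys.argv`) and `max_len` would stay at 100.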

    Data

    The data for this project is available here. It consists of:

    Report

    Your report should include:

    Submission Procedure

    Your full submission should include not only your report but also your code and your README file. Make sure your submission works from the command line, because the TA will be running your submissions.

    The submission should be done using Canvas. The file should have the following naming convention: yourfullname_hw1.zip (e.g., DianeLitman_hw1.zip). The report, the code, and your README file should be submitted inside the archived folder.

    The date in Canvas will be used to determine when your assignment was submitted (to implement the late policy).

    Grading

    Code (75 points)

    Report (25 points)

    Extra Credit (10 points)