HOMEWORK 1 (CS 1671)

Assigned: January 16, 2020

Due: February 4, 2020 (before midnight)

In this assignment, you will use language modeling to detect which of three languages a document is written in. You will build unigram, bigram, and trigram letter language models (both unsmoothed and smoothed versions) for three languages, score a test document with each model, and determine the language it is written in based on perplexity. You will critically examine your results.
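To make the scoring step concrete, here is a minimal sketch of a bigram letter model with add-one smoothing and a perplexity function. It is an illustration, not the required design: the function names `train_bigram` and `perplexity` are hypothetical, as is the choice to reserve one extra vocabulary slot for characters never seen in training.

```python
import math
from collections import Counter


def train_bigram(text):
    """Count letter bigrams and their one-character contexts in `text`."""
    bigrams = Counter(zip(text, text[1:]))   # (previous char, current char) -> count
    contexts = Counter(text[:-1])            # previous char -> count
    vocab = set(text)
    return bigrams, contexts, vocab


def perplexity(text, model):
    """Perplexity of `text` under an add-one smoothed bigram letter model."""
    bigrams, contexts, vocab = model
    # Assumption: one extra slot stands in for any character unseen in training.
    v = len(vocab) + 1
    pairs = list(zip(text, text[1:]))
    if not pairs:
        raise ValueError("text must contain at least two characters")
    log_prob = 0.0
    for prev, cur in pairs:
        # Add-one (Laplace) estimate of P(cur | prev).
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + v)
        log_prob += math.log2(p)
    # Perplexity is 2 raised to the negative average log2 probability.
    return 2 ** (-log_prob / len(pairs))


model = train_bigram("abab")
print(perplexity("abab", model))  # text resembling the training data
print(perplexity("bbbb", model))  # text unlike the training data scores higher
```

Under this sketch, the predicted language would be the one whose model assigns the lowest perplexity to the test document. Note that the smoothed estimates depend on the vocabulary size, so the way the vocabulary is defined for each language affects how comparable the scores are.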

Tasks

To complete the assignment, you will need to write a program (from scratch) that:

You may make any additional assumptions and design decisions, but state them in your report (see below). For example, you will need to decide how to handle uppercase and lowercase letters and how to handle digits. The choice is up to you; we only require that you document these decisions in your report and consider their implications for your results. There is no wrong choice here; NLP researchers routinely make such decisions when pre-processing data.
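As one illustration of such a design decision, the sketch below folds case and maps every digit to a single placeholder character before any counting is done. The function name `preprocess`, its parameters, and the placeholder `"0"` are all hypothetical; other choices are equally valid as long as they are documented in the report.

```python
def preprocess(text, lowercase=True, digit_token="0"):
    """Normalize raw text before counting n-grams (illustrative choices only)."""
    if lowercase:
        # Treat 'A' and 'a' as the same letter.
        text = text.lower()
    if digit_token is not None:
        # Collapse all digits into one symbol so that numbers do not
        # fragment the counts across ten separate characters.
        text = "".join(digit_token if ch.isdigit() else ch for ch in text)
    return text


print(preprocess("Room 42B"))
print(preprocess("Room 42B", lowercase=False, digit_token=None))
```

Whatever normalization is chosen, applying the same function to both the training data and the test document keeps the models and the test text consistent.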

You may write your program in any TA-approved programming language (so far, Java or Python).

For this assignment, you must implement the model generation from scratch. You may use any resources or packages that help you manage your project, e.g., GitHub or file I/O packages. If you have questions about this, please ask.

Data

A tar file containing the data for this project is available here. It consists of:

Report

Your report should include:

Submission Details

In addition to your report, your full submission should include:

Make sure your submission works from the command line because the TA will be running your submissions.

Submit using the Assignment Tool in CourseWeb/Blackboard. The file should be a zip file with the following naming convention: yourfullname_hw1.zip (e.g., DianeLitman_hw1.zip). The report, the code, and your README file should all be inside the archived folder.

The date in CourseWeb will be used to determine when your assignment was submitted (to implement the late policy).

Grading

Code (70 points):

Report (30 points):