HOMEWORK 1 (CS 1671)

Assigned: September 11, 2018

Due: September 27, 2018 (before midnight)

In this assignment, you will use language modeling to detect which of three languages a document is written in. You will build unigram, bigram, and trigram letter language models (both unsmoothed and smoothed versions) for three languages, score a test document with each, and determine the language it is written in based on perplexity. You will critically examine your results.
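As a rough illustration of the scoring idea described above, here is a minimal sketch of a letter n-gram model with add-one smoothing and a perplexity comparison. It is not a reference solution. The function names, the "_" start-padding symbol, the choice of add-one smoothing, and the toy training strings are all assumptions made for this example, and the vocabulary is simply taken to be every character seen in the training and test text. Only the smoothed case is shown: without smoothing, any n-gram unseen in training gets probability zero, so the perplexity becomes infinite.

```python
import math
from collections import Counter

PAD = "_"  # hypothetical start-of-text padding symbol (assumes "_" is not in the data)


def train_ngram(text, n):
    """Count letter n-grams and their (n-1)-letter contexts in text."""
    padded = PAD * (n - 1) + text
    ngrams, contexts = Counter(), Counter()
    for i in range(len(text)):
        gram = padded[i:i + n]
        ngrams[gram] += 1
        contexts[gram[:-1]] += 1  # for n=1 the context is the empty string
    return ngrams, contexts


def perplexity(text, model, n, vocab_size):
    """Perplexity of text under an add-one smoothed letter n-gram model."""
    ngrams, contexts = model
    padded = PAD * (n - 1) + text
    log_prob = 0.0
    for i in range(len(text)):
        gram = padded[i:i + n]
        # Add-one (Laplace) estimate of P(letter | previous n-1 letters)
        p = (ngrams[gram] + 1) / (contexts[gram[:-1]] + vocab_size)
        log_prob += math.log2(p)
    return 2 ** (-log_prob / len(text))


def detect_language(test_text, training_texts, n):
    """Return (best_language, scores); the lowest perplexity wins."""
    vocab = set(test_text)
    for text in training_texts.values():
        vocab.update(text)
    scores = {}
    for lang, text in training_texts.items():
        model = train_ngram(text, n)
        scores[lang] = perplexity(test_text, model, n, len(vocab))
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    # Toy data for illustration only; the real training files come with the assignment.
    training = {
        "lang_a": "the quick brown fox jumps over the lazy dog",
        "lang_b": "le renard brun rapide saute par dessus le chien",
        "lang_c": "la volpe marrone veloce salta sopra il cane pigro",
    }
    best, scores = detect_language("the dog and the fox", training, n=2)
    print(best, scores)
```

Running the same test text through n = 1, 2, and 3 and comparing the perplexities across languages is one way to produce the numbers to examine in the report.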

Tasks

To complete the assignment, you will need to write a program that:

You may make any additional assumptions and design decisions, but state them in your report (see below).

You may write your program in any TA-approved programming language (so far: C++, Java, Python, MATLAB).

Data

The data for this project is available here. It consists of:

Report

Your report should include:

Your full submission should include not only your report, but also:

Submission Procedure

Submit your assignment using the Assignment Tool in CourseWeb/Blackboard. Name the file yourfullname_hw1.zip (e.g., DianeLitman_hw1.zip). The report, the code, and your README file should all be inside the archive.

The date in CourseWeb will be used to determine when your assignment was submitted (to implement the late policy).

Grading

Code (70 points):

Report (30 points):