Please create a separate file, one for each question, using the following naming convention: yourLastName + questionNumber + "HW1" + .extension. Thus, Smith's program for question 2, assuming it is in Python, should be named "smithQ2HW1.py".
The teaching assistant will run your code on the data given to you at http://www.cs.pitt.edu/~wiebe/courses/CS1671/Sp2014/Assign1Files. He will also run your code on different data that you will not have seen before (unseen test data).
Since the TA has to run your code, it is important that you follow these guidelines: to make grading feasible, he must be able to run all submissions in a standard way. A "TA-Grief" penalty of up to 20% will be assessed if the TA cannot run your code as specified.
Please see the teaching assistant's webpage for submission instructions: http://www.cs.pitt.edu/~hux10/CS1671.html
Please use doc.01, doc.02, doc.03, and doc.04 as the text corpus for the remaining questions. They are available at http://www.cs.pitt.edu/~wiebe/courses/CS1671/Sp2014/Assign1Files
The words "type" and "token" are used in NLP to distinguish distinct words from individual word instances, respectively. Thus, counting case-insensitively, the sentence "The little brown bunny ate the little brown carrot" has 6 types ("the", "little", "brown", "bunny", "ate", "carrot") and 9 tokens.
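The type/token distinction can be illustrated with a few lines of Python (a sketch for intuition only, not part of any question):

```python
# Count types vs. tokens in the example sentence, ignoring case.
sentence = "The little brown bunny ate the little brown carrot"

tokens = sentence.lower().split()  # every word instance
types = set(tokens)                # distinct words only

print(len(types), len(tokens))  # 6 types, 9 tokens
```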
Questions:
For these questions, we define a word as any (maximal) sequence of letters and '-'s. (This is not a correct definition of a word, but we will use it for this assignment.) All letters should be mapped to lower case.
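One way to implement this word definition is with a regular expression that matches maximal runs of letters and hyphens after lower-casing the text (a sketch; the function name `tokenize` is just an illustrative choice):

```python
import re

def tokenize(text):
    # A "word" is a maximal sequence of letters and '-'s, mapped to lower case.
    return re.findall(r"[a-z-]+", text.lower())

print(tokenize("Carrots are easy to grow."))  # punctuation is dropped
```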
Q1. Find the most frequent 10 words (types) in the corpus. Take as input a file listing the names of the documents in the corpus (one file name per line) and print the 10 most frequent words to standard output. The input file should be a command line argument.
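A minimal sketch of Q1's input/output behavior might look like the following, assuming the same letters-and-hyphens word definition (the helper names `tokenize` and `top_words` are illustrative, not required):

```python
import re
import sys
from collections import Counter

def tokenize(text):
    # A word is a maximal sequence of letters and '-'s, lower-cased.
    return re.findall(r"[a-z-]+", text.lower())

def top_words(filenames, n=10):
    # Accumulate token counts across all documents, then take the top n types.
    counts = Counter()
    for name in filenames:
        with open(name) as f:
            counts.update(tokenize(f.read()))
    return counts.most_common(n)

if __name__ == "__main__" and len(sys.argv) > 1:
    # The command-line argument is a file listing one document name per line.
    with open(sys.argv[1]) as f:
        docs = [line.strip() for line in f if line.strip()]
    for word, count in top_words(docs):
        print(word, count)
```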
Q2. Calculate document similarity. In particular:
Represent each document as a vector. The dimensionality of each vector should equal the number of unique words (types) in the entire corpus (i.e., doc.01, doc.02, doc.03, and doc.04 combined). Each dimension in a document's vector is a word frequency (i.e., the number of tokens of that word type in the document). All letters should be mapped to lower case.
For example, suppose we have the following four (short) documents:
doc.01: The little brown bunny ate the little brown carrot
doc.02: Carrots are easy to grow.
doc.03: May I please have a carrot?
doc.04: A bunny ate my carrots.
We would have the following vectors:
doc.01: a:0 are:0 ate:1 brown:2 bunny:1 carrot:1 carrots:0 easy:0 grow:0 have:0 i:0 little:2 may:0 my:0 please:0 to:0 the:2
doc.02: a:0 are:1 ate:0 brown:0 bunny:0 carrot:0 carrots:1 easy:1 grow:1 have:0 i:0 little:0 may:0 my:0 please:0 to:1 the:0
doc.03: a:1 are:0 ate:0 brown:0 bunny:0 carrot:1 carrots:0 easy:0 grow:0 have:1 i:1 little:0 may:1 my:0 please:1 to:0 the:0
doc.04: a:1 are:0 ate:1 brown:0 bunny:1 carrot:0 carrots:1 easy:0 grow:0 have:0 i:0 little:0 may:0 my:1 please:0 to:0 the:0
Take as input a file listing the names of the documents in the corpus (one file name per line) and print to standard output the document similarity between each pair of documents in the corpus. The input file should be a command line argument. Compute similarity as the cosine similarity between the documents' vector representations. The cosine similarity between two document vectors d1 and d2 is:
S(d1, d2) = d1 . d2 / (|d1| |d2|)
The cosine measure computes the cosine of the angle between two vectors in a high-dimensional space. The numerator in the equation above is the dot product of d1 and d2, and |d1| and |d2| are the Euclidean lengths of the vectors.
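As one possible sketch, cosine similarity over word-count dictionaries can be computed directly from the formula (the function name `cosine` is illustrative; the example reuses the doc.03 and doc.04 vectors shown earlier, which overlap only in the dimension "a"):

```python
import math

def cosine(d1, d2):
    # d1, d2: word -> count dictionaries (missing words count as 0).
    dot = sum(d1.get(w, 0) * d2.get(w, 0) for w in d1)
    norm1 = math.sqrt(sum(v * v for v in d1.values()))
    norm2 = math.sqrt(sum(v * v for v in d2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

doc3 = {"a": 1, "carrot": 1, "have": 1, "i": 1, "may": 1, "please": 1}
doc4 = {"a": 1, "ate": 1, "bunny": 1, "carrots": 1, "my": 1}
print(cosine(doc3, doc4))  # 1 / (sqrt(6) * sqrt(5))
```

Note that "carrot" and "carrots" are different types under the assignment's word definition, so they contribute nothing to the dot product.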
Please see this presentation by Peter Burden for more background and an example of computing the Cosine measure.
Please see this file for the formula and another example.