Unigrams, Bigrams and Trigrams


Description: In this exercise, we apply basic counts to a small corpus to concretely explore some of the ideas discussed in class (the impact of the value of n, the choice of training corpus, the definition of "word", etc.).

To turn in: You should do everything that's suggested in this assignment and answer all questions in writing -- if a sentence ends with a question mark, it's a question you need to answer.

Credits: Part of this exercise was written by Philip Resnik of the University of Maryland.


A. Examining the Corpus

  1. Go to the corpora directory.

  2. There you will find three corpora. We will start with hound.dyl, which contains the Sherlock Holmes story The Hound of the Baskervilles. It is a small corpus by current standards -- somewhere on the order of 60,000 words. What words (unigrams) would you expect to have high frequency in this corpus? What bigrams do you think might be frequent?
    (You will not be graded for this part, so try to be honest with yourself and don't use the results from the next part.)

B. Computing Unigram, Bigram and Trigram Counts

  1. Write a program (script) in the language of your choice (or find one on the web) that computes the counts for each of the following n-gram models: unigrams, bigrams, and trigrams. Your program should read the corpus from standard input and write the n-gram counts to standard output. Each output line should contain the n-gram count, a tab character, and the n-gram itself (for bigrams and trigrams, separate the words with a space character). Output the n-grams in decreasing order of count. (A sketch of one possible approach appears at the end of this section.)
    Note that we haven't yet defined what a word is. To keep things simple, assume that a word is a sequence of letters (a-zA-Z) and treat all other characters as separators. Also note that words should be treated as case-sensitive ("And" and "and" are two different words).

  2. Examine the output for unigrams. Are the high frequency words what you would expect?

  3. Analogously, look at the bigram and trigram counts. Again, are the high frequency bigrams and trigrams what you would expect?

    Answer the questions in writing and submit your programs for computing n-gram counts.
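
The following Python sketch shows one possible way to satisfy B.1; it is an illustration, not a required implementation, and the choice of language and the command-line interface (taking n as an argument) are assumptions:

    import re
    import sys
    from collections import Counter

    def main():
        # Illustrative interface (an assumption): n is given as a command-line
        # argument (1, 2, or 3); the corpus is read from standard input.
        n = int(sys.argv[1]) if len(sys.argv) > 1 else 1
        text = sys.stdin.read()
        # A word is a maximal sequence of letters (a-zA-Z); everything else
        # is a separator. Case is preserved, so "And" and "and" differ.
        tokens = re.findall(r"[a-zA-Z]+", text)
        # Collect n-grams as space-joined strings.
        counts = Counter(" ".join(tokens[i:i + n])
                         for i in range(len(tokens) - n + 1))
        # One n-gram per line: count, a tab, then the n-gram,
        # in decreasing order of count.
        for ngram, count in counts.most_common():
            print(f"{count}\t{ngram}")

    if __name__ == "__main__":
        main()

For example (the script name ngrams.py is hypothetical):

    python ngrams.py 2 < hound.dyl > hound.bigrams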


C. Corpus Impact

  1. One thing you may have noticed is that treating uppercase and lowercase as distinct contributes to data sparseness, e.g. "Door" is counted as a different word from "door". Create a lowercase version of hound.dyl called hound.lc. Redo B.2 and B.3 for this corpus. What, if anything, changes? (See the first sketch after this list.)

  2. The corpora directory contains a second Sherlock Holmes story already in lowercase: A Study in Scarlet (study.lc). Same author, same main character, same genre ... Redo B.2 and B.3 for study.lc. How do the unigrams, bigrams, and trigrams compare between the two (lowercase) Holmes corpora? (See the comparison sketch after this list.)

  3. The remaining file in corpora is an annotated version of the book of Genesis, King James version, already in lowercase (GEN.EN.lc). Note, however, that v (verse), c (chapter), id, and GEN are part of the markup in the file, used to identify verse boundaries. Compute the same statistics for this very different corpus. Other than issues resulting from markup (a good example of why we need pre-processing to handle markup), how do the three n-gram models compare with the ones from the previous Holmes corpora? (See the markup-filtering sketch after this list.)
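
For C.1, here is a minimal Python sketch for producing the lowercase copy; it is just one option (a shell one-liner such as tr 'A-Z' 'a-z' < hound.dyl > hound.lc would work equally well):

    # Write a lowercase copy of the corpus (filenames come from the assignment).
    with open("hound.dyl", encoding="utf-8") as src, \
         open("hound.lc", "w", encoding="utf-8") as dst:
        dst.write(src.read().lower())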
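
For C.2, one way to make the comparison concrete is to look at how much the top of the two ranked lists overlaps. The sketch below assumes count files produced by the B.1 program (count, tab, n-gram per line, already sorted by count); the filenames and the choice of k are illustrative, not part of the assignment:

    # Compare the top-k n-grams of two count files produced by the B.1 program.
    def top_k(path, k=20):
        with open(path, encoding="utf-8") as f:
            return [line.rstrip("\n").split("\t", 1)[1] for line in f][:k]

    hound = top_k("hound.bigrams")   # hypothetical output of: ngrams.py 2 < hound.lc
    study = top_k("study.bigrams")   # hypothetical output of: ngrams.py 2 < study.lc
    shared = set(hound) & set(study)
    print("shared top bigrams:", sorted(shared))
    print("only in hound.lc:  ", [g for g in hound if g not in shared])
    print("only in study.lc:  ", [g for g in study if g not in shared])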
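
For C.3, a simple option is to filter out the markup tokens before counting. The sketch below keys on token identity only (v, c, id, gen), which is crude but enough to see how much the markup distorts the counts; a real pipeline would parse the markup structure instead:

    import re
    import sys

    # Tokens that belong to the markup, not the text (corpus is already lowercase).
    MARKUP = {"v", "c", "id", "gen"}
    text = sys.stdin.read()
    tokens = [t for t in re.findall(r"[a-zA-Z]+", text) if t not in MARKUP]
    # Emit the cleaned text so it can be piped into the B.1 counting program.
    print(" ".join(tokens))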