Unigrams, Bigrams and Trigrams
Description: In this exercise, we apply basic n-gram counts to a
small corpus to explore concretely some of the ideas discussed in class (impact
of the value of n, choice of training corpus, definition of "word", etc.).
To turn in: You should do
everything that's suggested in this assignment and answer all
questions in writing -- if a sentence ends with a question mark, it's a question you need
to answer.
Credits: Part of this exercise was written
by Philip Resnik of the University of Maryland.
A. Examining the Corpus
- Go to the corpora directory.
- There you will find three corpora. We will start with hound.dyl,
which contains the Sherlock Holmes story The Hound of the Baskervilles.
It is a small corpus, by current standards -- somewhere on the order of 60,000
words. What words (unigrams) would you expect to have high frequency in this
corpus? What bigrams do you think might be frequent?
(You will not be graded for this part, so try to be honest with yourself and
don't use the results from the next part.)
B. Computing Unigram, Bigram and Trigram Counts
- Write a program (script) in the language of your choice (or find
one on the web) that computes the counts for each of the following
n-gram models: unigrams, bigrams and trigrams. Your program should read
its input (the corpus) from standard input and write the n-gram counts
to standard output. Each output line should contain the n-gram count, a
tab character and the n-gram (for bigrams and trigrams, separate the
words with a space character). Output the n-grams in decreasing order
of count. (A sketch of one possible script appears at the end of this
section.)
Note that we haven't yet defined what a word is. To keep things simple,
assume that a word is a sequence of letters (a-zA-Z) and treat all
other characters as separators. Also note that words are case-sensitive
("And" and "and" should be treated as two different words).
- Examine the output for unigrams. Are the high frequency words what you would
expect?
- Analogously, look at the bigram and trigram counts. Again, are the high
frequency bigrams and trigrams what you would expect?
Answer the questions in writing and submit your programs for computing n-gram
counts.
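For reference, here is a minimal sketch of such a counting script in Python.
It is only one possible implementation; the file name ngrams.py and the
convention of passing n as a command-line argument are choices made here,
not part of the assignment.

    # ngrams.py -- read a corpus from standard input and print n-gram counts,
    # one per line, as: count <TAB> n-gram (words separated by spaces),
    # in decreasing order of count.
    import re
    import sys
    from collections import Counter

    def main():
        n = int(sys.argv[1])  # 1 = unigrams, 2 = bigrams, 3 = trigrams
        text = sys.stdin.read()
        # A word is a maximal sequence of letters; everything else is a separator.
        # Case is preserved, so "And" and "and" remain distinct.
        words = re.findall(r"[A-Za-z]+", text)
        counts = Counter(" ".join(words[i:i + n])
                         for i in range(len(words) - n + 1))
        for ngram, count in counts.most_common():
            print(f"{count}\t{ngram}")

    if __name__ == "__main__":
        main()

For example, the twenty most frequent bigrams of hound.dyl could then be
listed with: python ngrams.py 2 < hound.dyl | head -20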
C. Corpus Impact
- One thing you may have noticed is that there is data sparseness because
uppercase and lowercase are distinct; e.g., "Door" is treated as a different
word from "door". Create a lowercase version of hound.dyl called hound.lc
(a one-line lowercasing sketch appears at the end of this section). Redo
B.2 and B.3 for this corpus. What, if anything, changes?
- The corpora directory contains a second Sherlock Holmes story, already in
lowercase: A Study in Scarlet (study.lc). Same author, same main character,
same genre ... Redo B.2 and B.3 for study.lc. How do the unigrams, bigrams,
and trigrams compare between the two lowercase Holmes corpora?
- The remaining file in the corpora directory is an annotated version of the
book of Genesis, King James Version, already in lowercase (GEN.EN.lc).
Note, however, that v (verse), c (chapter), id, and GEN are part of the
markup in the file, used to identify verse boundaries.
Compute the same statistics for this very different corpus (a pre-processing
sketch that strips the markup appears at the end of this section).
Other than issues resulting from the markup (which is a good example of why
we need pre-processing), how do the three n-gram models compare with the
ones from the previous Holmes corpora?
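To create the lowercase corpus asked for in C.1, any lowercasing tool will
do; the following is a minimal sketch (the file name lowercase.py is just an
example, not required by the assignment):

    # lowercase.py -- copy standard input to standard output, lowercased,
    # e.g.: python lowercase.py < hound.dyl > hound.lc
    import sys
    sys.stdout.write(sys.stdin.read().lower())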
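For the Genesis corpus, one possible pre-processing pass is sketched below.
The exact markup format of GEN.EN.lc is not described here, so the sketch
simply assumes the markup appears as the bare tokens v, c, id, and GEN in
the token stream and drops them; adjust it to whatever the file actually
contains. The file name strip_markup.py is again just an example.

    # strip_markup.py -- drop the markup tokens named in the assignment
    # (v, c, id, GEN) from the word stream before counting, e.g.:
    # python strip_markup.py < GEN.EN.lc | python ngrams.py 2
    import re
    import sys

    MARKUP = {"v", "c", "id", "gen"}
    words = re.findall(r"[A-Za-z]+", sys.stdin.read())
    kept = [w for w in words if w.lower() not in MARKUP]
    sys.stdout.write(" ".join(kept) + "\n")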