Unigrams, Bigrams and Trigrams
Description: In this exercise, we apply basic counts to a
small corpus to concretely explore some of the ideas discussed in class (the impact
of the value of n, the choice of training corpus, the definition of "word", etc.). We will:
- Count unigrams, bigrams and trigrams
- Use the bigrams and trigrams to
generate sentences
To turn in: You should do
everything that's suggested in this assignment and answer all
questions in writing -- if a sentence ends with a question mark, it's a question you need
to answer.
Credits: Part of this exercise was written
by Philip Resnik of the University of Maryland.
A. Examining the Corpus
- Go to /afs/cs.pitt.edu/usr0/mrotaru/public/cs2731/hw2/ngram
cd /afs/cs.pitt.edu/usr0/mrotaru/public/cs2731/hw2/ngram
- You will find there 4 corpora. We will concentrate on GEN.EN. Take a look at the file. You can do this as
follows:
more GEN.EN
(Press the spacebar for more pages, and "q" to quit.) This contains an
annotated version of the book of Genesis, King James version. It is a small
corpus, by current standards -- somewhere on the order of 40,000 or 50,000
words. What words (unigrams) would you expect to have high frequency in this
corpus? What bigrams do you think might be frequent?
(You will not be graded for this part, so try to be honest with yourself and
don't use the results from the next part.)
B. Computing Unigram, Bigram and Trigram Counts
- Write a program (script) in the language of your choice that computes the
counts for each of the following n-gram models: unigrams, bigrams and trigrams. Your program should read the
corpus from standard input and write the n-gram counts to standard output.
Each output line should contain the n-gram count, a tab character, and the
n-gram itself (for bigrams and trigrams, separate the words with a single space). You should
output the n-grams in decreasing order of count.
Note that we haven't yet defined what a word is. To keep things simple, you
should assume that a word is a sequence of letters (a-zA-Z) and treat
all other characters as separators. Please note that words should be treated
case-sensitively ("And" and "and" are two different words).
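The counting step above can be sketched in Python (the assignment lets you pick any language; the command-line argument `n` selecting the model is my addition, not part of the spec):

```python
import re
import sys
from collections import Counter

def ngram_counts(text, n):
    """Count n-grams; a word is a run of letters (a-zA-Z), case-sensitive,
    and every other character acts as a separator."""
    words = re.findall(r"[a-zA-Z]+", text)
    return Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))

if __name__ == "__main__":
    n = int(sys.argv[1]) if len(sys.argv) > 1 else 1
    counts = ngram_counts(sys.stdin.read(), n)
    # One line per n-gram: count, tab, the n-gram, most frequent first
    for ngram, count in counts.most_common():
        print(f"{count}\t{ngram}")
```

Run it as, e.g., `python3 ngrams.py 2 < GEN.EN` (script name hypothetical) to get the bigram counts.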
- Examine the output for unigrams. Note that v (verse),
c (chapter), id, and GEN
are part of the markup in file GEN.EN, for identifying verse
boundaries. Other than those (which are a good example of why we need
pre-processing to handle markup), are the high frequency words what you would
expect?
- Analogously, look at the bigram and trigram counts. Markup aside, are the
high-frequency bigrams and trigrams what you would expect?
Answer the questions in writing and submit your programs for computing ngram
counts.
C. Time for Fun
- Extend your programs from part B so that you can generate sentences based
on the n-gram model. You will be given the beginning of the sentence, and based
on the n-gram model you should continue the sentence with the most likely
words (similar to the example in the book and lecture -- see the Approximating
Shakespeare slides). More specifically, given the beginning of the
sentence, you should choose as the next word the word that yields the highest
n-gram count when combined with the previous words into an n-gram. Thus, you will
use the previous word for bigrams and the previous two words for trigrams. You
should continue generating new words until you have generated a 15-word
sentence or you have reached a dead end (all n-gram counts are zero). The beginning of
the sentence (one word for bigrams and two words for trigrams) will be given
to you as command-line parameter(s). For example:
bigram_sent God
bigram_sent And
trigram_sent God said
Your program should output the resulting sentence.
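The greedy procedure described above can be sketched as follows (a Python sketch under the part B word definition; it reads the corpus from standard input rather than a fixed file, and ties between equally frequent continuations are broken arbitrarily):

```python
import re
import sys
from collections import Counter

def generate(text, start_words, max_len=15):
    """Greedily extend start_words: at each step append the word whose n-gram
    with the preceding context has the highest count; stop at max_len words
    or at a dead end (no n-gram continues the context)."""
    n = len(start_words) + 1          # 1 start word -> bigrams, 2 -> trigrams
    words = re.findall(r"[a-zA-Z]+", text)
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    sentence = list(start_words)
    while len(sentence) < max_len:
        context = tuple(sentence[-(n - 1):])
        candidates = {g: c for g, c in counts.items() if g[:-1] == context}
        if not candidates:            # dead end: all n-gram counts are zero
            break
        sentence.append(max(candidates, key=candidates.get)[-1])
    return " ".join(sentence)

if __name__ == "__main__":
    print(generate(sys.stdin.read(), sys.argv[1:]))
```

For example, `python3 gen_sent.py God < GEN.EN` (script name hypothetical) would play the role of `bigram_sent God` above.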
- Submit your output for the following inputs:
- Bigrams: "he", "And", "father", "God".
- Trigrams: for each bigram start word above, use that word together with
the first word your bigram model generated after it as the two-word trigram start.
- Is there any difference between the sentences generated by bigrams and
trigrams? Which one of the models do you think will generate more reasonable
sentences?
- Do you think that the sentence you generated has the highest probability
among all sentences with the same start, given the n-gram model? Why or
why not?
Answer the questions in writing and submit your programs for generating
sentences.
D. Corpus Impact
- One thing you may have noticed is that there's data sparseness because
uppercase and lowercase are distinct, e.g. "Door" is treated as a different
word from "door". In the corpora directory, you will find a lowercase version
of GEN.EN in the file GEN.EN.lc. Redo B.2 and B.3, C.2.I and C.2.II for this corpus. What, if
anything, changes?
- The corpora subdirectory contains the Sherlock Holmes stories A Study in Scarlet (study.dyl)
and The Hound of the Baskervilles (hound.dyl). Redo B.2 and B.3 for study.dyl. How
do the n-gram models compare with the ones from the previous corpus?
- Compute the same statistics for the second Holmes story (hound.dyl). Same
author, same main character, same genre ... How do the unigrams, bigrams, and trigrams
compare between the two Holmes cases?