Unigrams, Bigrams and Trigrams


Description: In this exercise, we apply basic counts to a small corpus to concretely explore some of the ideas discussed in class (the impact of the value of n, the choice of training corpus, the definition of "word", etc.).

To turn in: You should do everything that's suggested in this assignment and answer all questions in writing -- if a sentence ends with a question mark, it's a question you need to answer.

Credits: Part of this exercise was written by Philip Resnik of the University of Maryland.

 


A. Examining the Corpus

  1. Go to /afs/cs.pitt.edu/usr0/mrotaru/public/cs2731/hw2/ngram
        	cd /afs/cs.pitt.edu/usr0/mrotaru/public/cs2731/hw2/ngram

  2. You will find there 4 corpora. We will concentrate on GEN.EN. Take a look at the file. You can do this as follows:
      	more GEN.EN
    
    (Press the spacebar to advance a page, and type "q" to quit.) This file contains an annotated version of the book of Genesis, King James version. It is a small corpus by current standards -- somewhere on the order of 40,000 or 50,000 words. What words (unigrams) would you expect to have high frequency in this corpus? What bigrams do you think might be frequent?
    (You will not be graded for this part, so try to be honest with yourself and don't use the results from the next part.)

B. Computing Unigram, Bigram and Trigram Counts

  1. Write a program (script) in the language of your choice that computes the counts for each of the following n-gram models: unigrams, bigrams and trigrams. Your program should read its input (the corpus) from standard input and write the n-gram counts to standard output. Each line of output should contain the n-gram count, a tab character, and the n-gram itself (for bigrams and trigrams, separate the words with a space character). Output the n-grams in decreasing order of count.
    Note that we haven't yet defined what a word is. To keep things simple, you should assume that a word is a sequence of letters (a-zA-Z). You should treat all other characters as separators. Please note that words should be treated as case-sensitive ("And" and "and" are two different words).
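The counting step above can be sketched as follows in Python. This is only one possible approach, and the single command-line argument selecting n is a convention chosen here for illustration, not something the assignment prescribes:

```python
import re
import sys
from collections import Counter


def ngram_counts(text, n):
    # A word is a maximal run of letters (a-zA-Z); everything else is a
    # separator. Matching is case-sensitive, so "And" and "and" differ.
    words = re.findall(r"[a-zA-Z]+", text)
    return Counter(" ".join(words[i:i + n])
                   for i in range(len(words) - n + 1))


if __name__ == "__main__":
    n = int(sys.argv[1])  # 1 = unigrams, 2 = bigrams, 3 = trigrams
    counts = ngram_counts(sys.stdin.read(), n)
    for gram, count in counts.most_common():
        print(f"{count}\t{gram}")
```

It would be run as, e.g., `python ngrams.py 2 < GEN.EN` to get bigram counts. `Counter.most_common()` already yields entries in decreasing order of count, matching the required output order.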

  2. Examine the output for unigrams. Note that v (verse), c (chapter), id, and GEN are part of the markup in file GEN.EN, for identifying verse boundaries. Other than those (which are a good example of why we need pre-processing to handle markup), are the high frequency words what you would expect?

  3. Analogously, look at the bigram and trigram counts. Markup aside, again, are the high frequency bigrams and trigrams what you would expect?

    Answer the questions in writing and submit your programs for computing ngram counts.


C. Time for Fun

  1. Extend your programs from part B so that you can generate sentences based on the n-gram model. You will be given the beginning of the sentence, and based on the n-gram model you should continue the sentence with the most likely words (similar to the example in the book and lecture - see the Approximating Shakespeare slides). More specifically, given the beginning of the sentence, choose as the next word the word that yields the highest n-gram count when composed with the previous words into an n-gram. Thus, you will use the previous word for bigrams and the previous two words for trigrams. Continue generating new words until you have generated a 15-word sentence or you have reached a dead end (all n-gram counts are zero). The beginning of the sentence (one word for bigrams and two words for trigrams) will be given to you as command-line parameter(s). For example:
        	bigram_sent God
    	bigram_sent And
    	trigram_sent God said

    Your program should output the resulting sentence.
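The greedy generation loop described above can be sketched like this (again a Python sketch, not a prescribed solution; it reads the corpus from standard input and takes the start words as command-line arguments, and ties between equal counts are broken arbitrarily):

```python
import re
import sys
from collections import Counter


def generate(text, start_words, max_len=15):
    # Greedy n-gram generation: repeatedly extend the sentence with the
    # word whose n-gram (context + word) has the highest count. Stop at
    # max_len words or at a dead end (no n-gram continues the context).
    n = len(start_words) + 1  # one start word -> bigrams, two -> trigrams
    words = re.findall(r"[a-zA-Z]+", text)
    counts = Counter(tuple(words[i:i + n])
                     for i in range(len(words) - n + 1))
    sentence = list(start_words)
    while len(sentence) < max_len:
        context = tuple(sentence[-(n - 1):])
        candidates = {g: c for g, c in counts.items() if g[:-1] == context}
        if not candidates:
            break  # dead end: every possible continuation has count zero
        sentence.append(max(candidates, key=candidates.get)[-1])
    return " ".join(sentence)


if __name__ == "__main__":
    print(generate(sys.stdin.read(), sys.argv[1:]))
```

A bigram run would look like `python gen.py God < GEN.EN`, and a trigram run like `python gen.py God said < GEN.EN`.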

  2. Submit your output for the following inputs:
    1. Bigrams: "he", "And", "father", "God".
    2. Trigrams: for each bigram start word above, take that word together with the first word your bigram model generated, and use the resulting two words as the start words for the trigram model.
    3. Is there any difference between the sentences generated by bigrams and trigrams? Which one of the models do you think will generate more reasonable sentences?
    4. Do you think that the sentence you generated has the highest probability among the sentences that have the same start given the n-gram model? Why? Or why not?
    
      

    Answer the questions in writing and submit your programs for generating sentences.


D. Corpus Impact

  1. One thing you may have noticed is that there is data sparseness because uppercase and lowercase are distinct, e.g. "Door" is treated as a different word from "door". In the corpora directory, you will find a lowercase version of GEN.EN in the file GEN.EN.lc. Redo B.2, B.3, C.2.1 and C.2.2 for this corpus. What, if anything, changes?

  2. The corpora subdirectory contains the Sherlock Holmes stories A Study in Scarlet (study.dyl) and The Hound of the Baskervilles (hound.dyl). Redo B.2 and B.3 for study.dyl. How do the n-gram models compare with the ones from the previous corpus?

  3. Compute the same statistics for the second Holmes story (hound.dyl). Same author, same main character, same genre ... How do the unigrams, bigrams, and trigrams compare between the two Holmes cases?