Unigram and Bigram Counts


Description: In this exercise, we apply basic counting to a small corpus. We will generate unigram and bigram counts for a text, examine the most frequent items, filter out common stopwords, and compare results across corpora.

To turn in: You should do everything that's suggested in this assignment and answer all questions -- if a sentence ends with a question mark, it's a question you need to answer.

Credits: The scripts and most of this exercise were written by Philip Resnik of the University of Maryland. The scripts were derived from Ken Church's "NGRAMS" tutorial at ACL-1995.

Prerequisites: This exercise assumes basic familiarity with typical Unix commands, and the ability to create text files (e.g. using a text editor such as vi or emacs). No programming is required.

Notational Convention: The symbol <== identifies a comment from the instructor on lines where you're typing something in. So, for example, in

    
    %  cp file1.txt file2.txt   <== The "cp" is short for "copy"
  

what you're supposed to type at the prompt (identified by the percent sign, here) is

    cp file1.txt file2.txt
  
followed by a carriage return.


Getting the code

  1. You will use ftp (file transfer protocol) to get the software for this exercise. You'll be downloading source code and compiling it.

    Here are the steps:

      % mkdir stats                 <== Create a subdirectory called "stats"
      % cd stats                    <== Go into that directory
      % ftp umiacs.umd.edu          <== Invoke the "ftp" program

      Name (yourname): anonymous    <== Type "anonymous" (without quotes)
      Password: name@address        <== Type your e-mail address

      ftp> cd pub/resnik/723        <== Go to directory pub/resnik/723
      ftp> binary                   <== Use binary transfer mode
      ftp> get ngrams.tar           <== Download the file
      ftp> bye                      <== Exit from ftp

      % tar xvf ngrams.tar          <== Extract code from the file
      % rm ngrams.tar               <== Delete to conserve space
      % cd ngrams                   <== Go into the directory with the code
      % chmod u+x *.pl              <== Make perl scripts executable
      % gcc -o filter_stopwords filter_stopwords.c   <== Compile filter_stopwords
      % gcc -o lr_simple lr_simple.c -lm             <== Compile lr_simple
    
    Don't forget the -lm flag when you're compiling lr_simple! If you omit it, the linker will complain that the log function is undefined; that flag tells gcc to link against the C math library.

    Execute the command

      % which perl
    
    to see the full path name for the version of perl you're running on your system. Edit file lr_filter.pl and replace /usr/imports/bin/perl with that path name (on most Unix systems you'll be replacing it with /usr/local/bin/perl).
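
    If you'd rather not open an editor, perl itself can rewrite the line for you. This one-liner is just a sketch; substitute whatever path "which perl" actually reported (it may not be /usr/local/bin/perl):

      % perl -i -pe 's|/usr/imports/bin/perl|/usr/local/bin/perl|' lr_filter.pl   <== Edit lr_filter.pl in place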


Generating Statistics for a Corpus

  1. Take a look at file corpora/GEN.EN. You can do this as follows:
      %  more corpora/GEN.EN
    
    (Type the spacebar for more pages, and "q" for "quit".) This contains an annotated version of the book of Genesis, King James Version. It is a small corpus by current standards -- somewhere on the order of 40,000 or 50,000 words. (A quick way to check that for yourself appears at the end of this section.) What words (unigrams) would you expect to have high frequency in this corpus? What bigrams do you think might be frequent?

  2. Create a subdirectory called genesis to contain the files with statistics generated from this corpus:
      %  mkdir genesis
    

    Then run the Stats program to analyze the corpus. The program requires an input file and a "prefix" to use when creating output files. The input file will be corpora/GEN.EN, and the prefix will be genesis/out, so that the output files are created in the genesis subdirectory. That is, you should execute the following:

      %  Stats corpora/GEN.EN genesis/out
    
    The program will tell you what it's doing as it counts unigrams and bigrams. (It will also compute mutual information and likelihood ratio statistics, which you should ignore for this homework.) This may take some time to run.

  3. You should now have a subdirectory called genesis containing a bunch of files that begin with out.
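
    Incidentally, if you're curious what the counting stage of Stats amounts to, the same idea can be sketched with standard Unix tools, much in the spirit of Church's tutorial. This is not the actual Stats script, just a rough equivalent using the modern POSIX forms of tr and tail; the counts will include the markup tokens, and the wc line gives you the rough corpus size mentioned earlier:

      % tr -sc 'A-Za-z' '\n' < corpora/GEN.EN > /tmp/words     <== One token per line
      % wc -l /tmp/words                                       <== Rough corpus size in tokens
      % sort /tmp/words | uniq -c | sort -nr | head            <== Top unigram counts
      % tail -n +2 /tmp/words > /tmp/nextwords                 <== Same tokens, shifted by one
      % paste /tmp/words /tmp/nextwords | sort | uniq -c | sort -nr | head    <== Top bigram counts
      % rm /tmp/words /tmp/nextwords                           <== Clean up the temporary files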


Examining Unigram and Bigram Counts

  1. Go into directory genesis.
      %  cd genesis
    

  2. Look at file out.unigrams:
      %  more out.unigrams
    
    Seeing the vocabulary in alphabetical order isn't very useful, so let's sort the file by the unigram frequency, from highest to lowest:
      %  sort -nr out.unigrams > out.unigrams.sorted
      %  more out.unigrams.sorted
    
    Now examine out.unigrams.sorted. Note that v (verse), c (chapter), id, and GEN are part of the markup in file GEN.EN, for identifying verse boundaries. Other than those (which are a good example of why we need pre-processing to handle markup), are the high frequency words what you would expect?

  3. Analogously, look at the bigram counts out.bigrams:
      %  sort -nr out.bigrams > out.bigrams.sorted
      %  more out.bigrams.sorted
    
    Markup aside, again, are the high frequency bigrams what you would expect?

  4. There are a lot of common English words in there, so try filtering them out using the filter_stopwords program. First, link to the program and its stopword list so they're available in this directory:
      %  ln -s ../filter_stopwords      <== Creates a symbolic link
      %  ln -s ../stop.wrd              <== Creates a symbolic link
    
    Then run it on out.lr (the bigrams scored by the likelihood ratio statistic mentioned earlier):
      %  filter_stopwords stop.wrd < out.lr > out.lr.filtered
    

    How does out.lr.filtered look as a file containing bigrams that are characteristic of this corpus?
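
    Incidentally, if you ever want a rough approximation of this kind of filtering without the compiled program, grep can do a crude version. This assumes stop.wrd lists one stopword per line; unlike filter_stopwords, it simply drops every line that contains any stopword as a whole word:

      % grep -v -w -F -f stop.wrd out.bigrams.sorted | head -20   <== Crude stopword filter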


Time for Fun

  1. One thing you may have noticed is that the counts are sparser than they need to be because uppercase and lowercase are treated as distinct, e.g. "Door" is counted as a different word from "door". In the corpora directory, you can create an all-lowercase version of GEN.EN by doing this:
      %   cat GEN.EN | tr "A-Z" "a-z" > GEN.EN.lc
    
    To save disk space, assuming you're done with GEN.EN, delete the original:
      %   rm GEN.EN
    
    Try re-doing the entire exercise with this version. (Yes, all of it!) What, if anything, changes?
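
    One caveat about that tr command: the "A-Z" range can misbehave in some locales, so on modern systems the character-class form below is the safer way to write the same thing:

      %   cat GEN.EN | tr "[:upper:]" "[:lower:]" > GEN.EN.lc   <== Locale-safe lowercasing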

  2. Ok, perhaps that last one wasn't exactly fun. But this probably will be. Go into your corpora subdirectory. Then ftp to site umiacs.umd.edu and go to directory pub/resnik/723/ebooks, which contains Sherlock Holmes stories such as A Study in Scarlet and The Hound of the Baskervilles. Get these two stories. E.g.:
      % cd corpora                
      % ftp umiacs.umd.edu        
    
      Name (yourname): anonymous  
      Password: name@address      
    
      ftp> cd pub/resnik/723/ebooks
      ftp> dir
      ftp> get hound.dyl       
      ftp> get study.dyl
      ftp> bye                       <==   Exit from ftp
    

    Now get back into your ngrams directory, create an output directory, say, holmes1, and run the Stats program on the file of interest, e.g.:

      %   cd ..
      %   mkdir holmes1
      %   Stats corpora/study.dyl holmes1/out
      %   cd holmes1
    

    As before, you'll also want a lowercase version, so convert the story and re-run Stats on the lowercase file:

      %   cd ../corpora
      %   cat study.dyl | tr "A-Z" "a-z" > study.lc
      %   rm study.dyl
      %   cd ..
      %   Stats corpora/study.lc holmes1/out
      %   cd holmes1
    

    Look at the various outputs. How do they compare with the previous corpus?

    Now go through the same process again, but create a directory holmes2 and use the other Holmes file. Same author, same main character, same genre... how do the unigrams and bigrams compare between the two Holmes cases? If you use filter_stopwords (the same recipe as before works here; see below), how do the results look -- what kinds of bigrams are you getting? What natural language processing problems might this be useful for?
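
    As a reminder, the recipe is the same as in the genesis directory: from inside holmes1 (or holmes2), link to the program and the stopword list, then filter:

      %   ln -s ../filter_stopwords        <== Creates a symbolic link
      %   ln -s ../stop.wrd                <== Creates a symbolic link
      %   filter_stopwords stop.wrd < out.lr > out.lr.filtered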


When You Are Done, We Need to Recover Disk Space

Corpora and data take up a lot of disk space. When you are done, PLEASE delete the output directories you have created, and even the corpus directory itself if you no longer need it. For example, if you are in your ngrams directory, you can type:

  %   /bin/rm -rf corpora genesis holmes1 holmes2

to delete those directories entirely. Your housekeeping will be much appreciated.
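
If you're curious how much space all of this occupied, du will tell you (run it before deleting; on systems without the -h flag, plain "du -s" reports sizes in disk blocks):

  %   du -sh corpora genesis holmes1 holmes2   <== Show disk usage per directory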