Part of Speech Tagging


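Description: In this exercise, we will evaluate the output of a part-of-speech tagger on a set of documents. We will compare the tagger's output to gold-standard tags and discuss individual errors, evaluate the tagger quantitatively with confusion matrices and Kappa, and consider whether part-of-speech information helps in locating date and time expressions.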

To turn in: Please answer (on paper) all of the questions marked below by SUBMIT.

Credits: This exercise was developed for Johanna Moore's class at the University of Edinburgh.


Finding and Running the Software

The tagger is the Brill tagger described in Chapter 8 of the textbook; it can be downloaded from Eric Brill's Home Page. This paper describes the tagger in more detail. The tagger has already been downloaded and installed for you in /afs/cs.pitt.edu/usr0/mrotaru/public/cs2731/hw2/RULE_BASED_TAGGER_V1.14. Here are some instructions that you might need.

You must "tokenize" the input to the tagger. In particular, you must perform the following substitutions:

The texts provided in question 1 have been tokenized for you; if you want to try tagging any other text, please make sure that it is properly tokenized.

To run the tagger, go into the appropriate directory by typing
% cd /afs/cs.pitt.edu/usr0/mrotaru/public/cs2731/RULE_BASED_TAGGER_V1.14/Bin_and_Data
Then type
% ./tagger LEXICON filename BIGRAMS LEXICALRULEFILE CONTEXTUALRULEFILE
Here filename is the name of the file to tag, while LEXICON, BIGRAMS, LEXICALRULEFILE, and CONTEXTUALRULEFILE are literal strings that you type exactly as shown. The tagger prints the tagged text to standard output; to save the output in a file called outfile, redirect it like this:
% ./tagger LEXICON filename BIGRAMS LEXICALRULEFILE CONTEXTUALRULEFILE > outfile


    Questions

    1. Run the tagger on the following texts from the Penn Treebank and compare the output to the "gold standard" texts. In each of the lines below, the link to "Text n" (e.g., "Text 1") is to a version of the text formatted with one sentence per line -- this is easier to read, but you should not use it for the actual tagging experiments. The "Tagged" link is to the tagged file from the Treebank, and the "Untagged" file is formatted the same way as the Tagged one for ease of comparison.

      Choose five tagging errors and discuss the possible reasons for these errors.

      SUBMIT: Print-outs showing the tagging errors you are discussing, and your discussion of the errors.
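One way to locate candidate errors for question 1 is to line up the tagger's output with the gold-standard file token by token. The following is a minimal sketch, not part of the provided software. It assumes both files contain whitespace-separated word/TAG tokens in the same order; the function names and the example sentence are invented for illustration, and the Treebank files may need extra cleanup (for example, bracket or separator lines) before this works on them.

```python
def parse_tagged(text):
    """Split whitespace-separated word/TAG tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        if "/" not in token:
            continue  # skip anything that is not a word/TAG token
        # Split on the last slash only, so a word that itself contains
        # a slash keeps its full form.
        word, tag = token.rsplit("/", 1)
        pairs.append((word, tag))
    return pairs


def find_tag_errors(gold_text, tagger_text):
    """Return (position, word, gold_tag, tagger_tag) for each disagreement."""
    gold = parse_tagged(gold_text)
    predicted = parse_tagged(tagger_text)
    if [w for w, _ in gold] != [w for w, _ in predicted]:
        raise ValueError("token sequences differ; check the tokenization")
    return [
        (i, word, gold_tag, pred_tag)
        for i, ((word, gold_tag), (_, pred_tag)) in enumerate(zip(gold, predicted))
        if gold_tag != pred_tag
    ]


# Invented example sentence, not taken from the Treebank texts.
gold = "The/DT back/NN door/NN was/VBD open/JJ ./."
tagged = "The/DT back/JJ door/NN was/VBD open/VB ./."
for position, word, gold_tag, tagger_tag in find_tag_errors(gold, tagged):
    print(f"token {position}: {word}  gold={gold_tag}  tagger={tagger_tag}")
```

The length and word check is deliberate: if the two files were tokenized differently, a positional comparison would report spurious errors, so the sketch stops rather than guessing at an alignment.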

    2. Quantitatively evaluate the performance of the tagger. To do this, you will use this program to compute the confusion matrices comparing the tagger's output to the gold standard and to compute Kappa.

      SUBMIT: Kappa value, and answers to the above questions.
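To make the quantities in question 2 concrete, here is a small sketch of a confusion matrix and the Kappa statistic. It assumes that Kappa here means Cohen's kappa, kappa = (P_o - P_e) / (1 - P_e), where P_o is the observed agreement between tagger and gold standard and P_e is the agreement expected by chance from each side's tag frequencies. The provided program may estimate chance agreement differently, so treat this as an illustration rather than a replacement for it. The function names and the toy tag sequences are invented.

```python
from collections import Counter


def confusion_matrix(gold_tags, tagger_tags):
    """Count (gold_tag, tagger_tag) pairs over two aligned tag sequences."""
    if len(gold_tags) != len(tagger_tags):
        raise ValueError("tag sequences must have the same length")
    return Counter(zip(gold_tags, tagger_tags))


def cohen_kappa(matrix):
    """Compute kappa = (P_o - P_e) / (1 - P_e) from a confusion matrix."""
    total = sum(matrix.values())
    # Observed agreement: proportion of tokens on the diagonal.
    observed = sum(n for (g, t), n in matrix.items() if g == t) / total
    # Marginal totals for each tag, per side.
    gold_totals = Counter()
    tagger_totals = Counter()
    for (g, t), n in matrix.items():
        gold_totals[g] += n
        tagger_totals[t] += n
    # Chance agreement: product of the two marginal proportions, summed over tags.
    expected = sum(gold_totals[tag] * tagger_totals[tag] for tag in gold_totals) / total**2
    # Undefined (division by zero) if both sides use a single identical tag.
    return (observed - expected) / (1 - expected)


# Invented toy data: six tokens, two disagreements.
gold_tags = ["DT", "NN", "NN", "VBD", "JJ", "."]
tagger_tags = ["DT", "JJ", "NN", "VBD", "VB", "."]
matrix = confusion_matrix(gold_tags, tagger_tags)
print(f"kappa = {cohen_kappa(matrix):.3f}")  # kappa = 0.600
```

In the toy data, P_o = 4/6 and P_e = 6/36, so kappa = (2/3 - 1/6) / (5/6) = 0.6. The off-diagonal cells of the matrix, such as ("NN", "JJ"), show which tag pairs the tagger confuses.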

    3. Try tagging the following texts. They have already been tokenized as specified on the tagging instructions page.

      Examine the results. Do you think that having this part-of-speech information would have made the task of locating the date and time expressions on our first homework easier? Why or why not?

      SUBMIT: Your answer to the above question, along with any parts of the tagged output that support your answer.