Part-of-Speech Tagging

Description: In this exercise, we will evaluate the output of a part-of-speech tagger on a set of documents. We will:

- Perform POS tagging.
- Evaluate our results using Kappa.

To turn in: Please answer (on paper) all of the questions marked below with SUBMIT.

Credits: This exercise was developed for Johanna Moore's class at the University of Edinburgh.

Finding and Running the Software

The tagger is the Brill tagger described in Chapter 8 of the textbook; it can be downloaded from Eric Brill's Home Page. This paper describes the tagger in more detail. The tagger has already been downloaded (/afs/cs.pitt.edu/usr0/mrotaru/public/cs2731/hw2/RULE_BASED_TAGGER_V1.14) and installed for you. Here are some instructions that you might need.

You must "tokenize" the input to the tagger. In particular, you must perform the following substitutions:

Split punctuation from adjoining words
Convert each double quote (") to a pair of single quotes: two backquotes (``) for an opening quote and two apostrophes ('') for a closing quote
Split verb contractions and possessive 's into their component morphemes:
- children's -> children 's
- parents' -> parents '
- won't -> wo n't
- gonna -> gon na
- I'm -> I 'm

The texts provided in question 1 have been tokenized for you; if you want to try tagging any other text, please make sure that it is properly tokenized.
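
The substitutions above can be sketched in Python. This is a rough approximation under stated assumptions, not the tokenizer used to prepare the course texts. The function name tokenize is hypothetical. A double quote is treated as opening only at the start of the text or after whitespace or an opening bracket, and punctuation is split off only when whitespace or the end of the text follows it, so an abbreviation such as "Mr." would be split incorrectly.

```python
import re

def tokenize(text):
    """Return a list of tokens, approximating the substitutions listed above."""
    # Opening double quote (start of text, or after whitespace/open bracket) -> ``
    text = re.sub(r'(^|[\s(\[{])"', r"\1`` ", text)
    # Assumption: every remaining double quote is a closing quote -> ''
    text = text.replace('"', " '' ")
    # Brackets are split off wherever they occur.
    text = re.sub(r"([()\[\]{}])", r" \1 ", text)
    # Other punctuation is split off only when it ends a word.
    text = re.sub(r"([.,;:?!])(?=\s|$)", r" \1", text)
    # Negative contractions: won't -> wo n't
    text = re.sub(r"(\w)n't\b", r"\1 n't", text)
    # Clitics and possessive 's: I'm -> I 'm, children's -> children 's
    text = re.sub(r"(\w)'(s|m|re|ve|d|ll)\b", r"\1 '\2", text)
    # Plural possessive: parents' -> parents '
    text = re.sub(r"(\ws)'(?=\s|$)", r"\1 '", text)
    # Irregular form from the list above: gonna -> gon na
    text = re.sub(r"\b([Gg]on)(na)\b", r"\1 \2", text)
    return text.split()

if __name__ == "__main__":
    print(" ".join(tokenize("I'm gonna say the children's parents' car won't start.")))
```

Joining the returned tokens with single spaces gives text in the form the tagger expects. Check the result by hand before tagging, since the rules here cover only the cases listed above.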

To run the tagger, go into the appropriate directory by typing

% cd /afs/cs.pitt.edu/usr0/mrotaru/public/cs2731/RULE_BASED_TAGGER_V1.14/Bin_and_Data

Then type

% ./tagger LEXICON filename BIGRAMS LEXICALRULEFILE CONTEXTUALRULEFILE

Here filename is the name of the file to tag, while LEXICON, BIGRAMS, LEXICALRULEFILE, and CONTEXTUALRULEFILE are literal strings that you type exactly as shown. This will print the tagged file to standard output; if you want to save the output in a file called outfile, you can redirect it like this:

% ./tagger LEXICON filename BIGRAMS LEXICALRULEFILE CONTEXTUALRULEFILE > outfile

Questions

Run the tagger on the following texts from the Penn Treebank and compare the output to the "gold standard" texts. In each of the lines below, the link to "Text n" (e.g., "Text 1") is to a version of the text formatted with one sentence per line -- this is easier to read, but you should not use it for the actual tagging experiments. The "Tagged" link is to the tagged file from the Treebank, and the "Untagged" file is formatted the same way as the Tagged one for ease of comparison.
- Text 1 Untagged Tagged
- Text 2 Untagged Tagged
- Text 3 Untagged Tagged
- Text 4 Untagged Tagged
- Text 5 Untagged Tagged
- Text 6 Untagged Tagged
- Text 7 Untagged Tagged
- Text 8 Untagged Tagged
- Text 9 Untagged Tagged
- Text 10 Untagged Tagged
- Text 11 Untagged Tagged
- Text 12 Untagged Tagged
- Text 13 Untagged Tagged
- All the texts in one file: Untagged Tagged
Choose five tagging errors and discuss the possible reasons for these errors.

SUBMIT: Print-outs showing the tagging errors you are discussing, and your discussion of the errors.
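
One way to locate candidate errors is to compare the tagger's output with the gold standard token by token. The sketch below assumes that both files consist of whitespace-separated word/TAG tokens and that they contain the same tokens in the same order; neither assumption is guaranteed by this handout, so check the actual files first. The function names read_tagged and find_errors are hypothetical, and the example sentence is invented toy data.

```python
def read_tagged(text):
    """Parse 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST slash so a word such as '1/2' keeps its own slash.
        word, sep, tag = token.rpartition("/")
        if sep:  # assumption: anything without a slash is not a tagged token
            pairs.append((word, tag))
    return pairs

def find_errors(system, gold, window=2):
    """Return (index, context, system_tag, gold_tag) for each disagreement."""
    if len(system) != len(gold):
        raise ValueError("token counts differ; the files are not aligned")
    errors = []
    for i, ((_, sys_tag), (_, gold_tag)) in enumerate(zip(system, gold)):
        if sys_tag != gold_tag:
            lo, hi = max(0, i - window), i + window + 1
            # Show a few neighbouring words so the error can be read in context.
            context = " ".join(word for word, _ in gold[lo:hi])
            errors.append((i, context, sys_tag, gold_tag))
    return errors

if __name__ == "__main__":
    system = read_tagged("The/DT can/MD rusted/VBD ./.")  # toy tagger output
    gold = read_tagged("The/DT can/NN rusted/VBD ./.")    # toy gold standard
    for index, context, sys_tag, gold_tag in find_errors(system, gold):
        print(f"token {index}: tagger={sys_tag} gold={gold_tag} | {context}")
```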
Quantitatively evaluate the performance of the tagger. To do this, you will use this program to compute the confusion matrices comparing the tagger's output to the gold standard and to compute Kappa.
- Compute Kappa.
- What is causing the errors? Use the confusion matrices to identify any systematic errors. Describe three of them and show an example of each.
SUBMIT: Kappa value, and answers to the above questions.
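
To make the quantities concrete, here is a minimal sketch of a confusion matrix and a Kappa computation. It assumes that Kappa means Cohen's kappa, (P_o - P_e) / (1 - P_e), where P_o is the observed agreement between tagger and gold standard and P_e is the agreement expected by chance from each side's tag frequencies. The provided program may define or report things differently, so treat its output as authoritative. The function names are hypothetical and the tag sequences are toy data.

```python
from collections import Counter

def confusion_matrix(system_tags, gold_tags):
    """Count how often each (gold_tag, system_tag) pair occurs."""
    return Counter(zip(gold_tags, system_tags))

def cohen_kappa(system_tags, gold_tags):
    """Cohen's kappa between two equally long tag sequences."""
    n = len(gold_tags)
    if n == 0 or n != len(system_tags):
        raise ValueError("sequences must be non-empty and of equal length")
    # Observed agreement: fraction of tokens given the same tag by both.
    observed = sum(s == g for s, g in zip(system_tags, gold_tags)) / n
    # Chance agreement: product of the two marginal frequencies, summed over tags.
    sys_counts = Counter(system_tags)
    gold_counts = Counter(gold_tags)
    expected = sum(sys_counts[t] * gold_counts[t] for t in gold_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both sides use one identical tag throughout
    return (observed - expected) / (1 - expected)

if __name__ == "__main__":
    gold = ["DT", "NN", "VBD", "NN"]    # toy gold-standard tags
    system = ["DT", "MD", "VBD", "NN"]  # toy tagger output
    for (gold_tag, sys_tag), count in sorted(confusion_matrix(system, gold).items()):
        print(f"gold={gold_tag} tagger={sys_tag} count={count}")
    print("kappa =", cohen_kappa(system, gold))
```

Off-diagonal cells with large counts (gold tag differs from tagger tag) are the places to look for systematic errors.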
Try tagging the following texts. They have already been tokenized as specified on the tagging instructions page.
Examine the results. Do you think that having this part-of-speech information would have made the task of locating the date and time expressions on our first homework easier? Why or why not?

SUBMIT: Your answer to the above question, along with any parts of the tagged output that support your answer.