Project Description (CS 2731 / ISSP 2230)

Introduction

The project for our class will be to design, build, and evaluate a word sense disambiguation (WSD) system for English. This will give you exposure to a cutting-edge research area, and experience in building a real NLP system.

More specifically, you will write a program that will disambiguate the senses of any ambiguous word. You will be provided with tagged training data, and later untagged test data, for many words. You should probably begin by making a straightforward implementation of some previously tried technique for word sense disambiguation (there are a huge number of research papers on this topic), and then see whether you can improve it; the aim is to make this program as good at word sense disambiguation as possible. Your final submission should consist of a single system that performs as well as possible. The program may rely on the tagged training instances and on any other lexical resources that you choose to use. You are free (and, where appropriate, encouraged) to make use of existing code and systems such as taggers, parsers, etc. as part of your project. You should make sure their use is properly acknowledged, and make clear what additional value your project is adding.

The Data

The data materials that you need for this assignment can be downloaded from the following directory. Each training or testing file will provide a set of lexical items to be disambiguated (specified using the 'lexelt' tags), along with an associated set of instances for each lexical item (the 'instance' tags). Within each instance, the word to be disambiguated is specified using the 'head' tag, while the context that can be exploited is specified using the 'context' tag. Here is an example excerpt:

<lexelt item="activate.v">
<instance id="activate.v.bnc.00024693" docsrc="BNC">
<answer instance="activate.v.bnc.00024693" senseid="38201"/>
<context>
Do you know what it is , and where I can get one ? We suspect you had seen the Terrex Autospade , which is made by Wolf Tools . It is quite a hefty spade , with bicycle - type handlebars and a sprung lever at the rear , which you step on to <head>activate</head> it . Used correctly , you should n't have to bend your back during general digging , although it wo n't lift out the soil and put in a barrow if you need to move it ! If gardening tends to give you backache , remember to take plenty of rest periods during the day , and never try to lift more than you can easily cope with .
</context>
</instance>
<instance id="activate.v.bnc.02016434" docsrc="BNC">
<answer instance="activate.v.bnc.02016434" senseid="38202"/>
<context>
Between 9.09.5d.p.c . , these patterns becomes refined with a progressive restriction leading to the appearance of stripes of expression in r2 and r4 and a stream of crest cells migrating from each ( Fig. 1 i ' ) . These dynamic changes in spatial expression are identical to those observed with Hox - B1 , therefore the combination of Hox genes expressed in r4 ( ref . 10 ) is reproduced in r2 upon exposure to retinoic acid . The Hox - B1 experiments showed that a marker for r5 ( expression in neural cell bodies ) was <head>activated</head> in r3 and we wanted to examine whether there were other changes to segmental expression in r3 . We have used a line of transgenic mice that accurately reveals the normal spatial and temporal patterns of Krox - 20 expression in r3 and r5 ( Fig. 3a - e ; ref . 23 ) . In initial stages , as reported for both Xenopus and mouse embryos treated with retinoic acid , we find that although there is a clearly defined segment in the r3 position , the r3 stripe of Krox - 20 expression is absent .
</context>
</instance>
</lexelt>

Note that, in context, the actual word to be disambiguated may be a morphologically inflected or capitalized form of the word in the training data. Also note that some words have more training data than others. (CHANGE: your test file will no longer contain any word that was not seen in the training data.) Since some techniques may work well with larger amounts of training data, but poorly with less data, you should consider such factors before choosing your algorithms.
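As a starting point, the format above can be read with a standard XML parser. The following is a minimal sketch, not the required approach: it assumes the file is well-formed XML wrapped in a single root element (the `<corpus>` root here is hypothetical, since the excerpt does not show one), and the function name `parse_instances` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# A shortened version of the excerpt, wrapped in a hypothetical <corpus> root.
SAMPLE = """<corpus>
<lexelt item="activate.v">
<instance id="activate.v.bnc.00024693" docsrc="BNC">
<answer instance="activate.v.bnc.00024693" senseid="38201"/>
<context>
which you step on to <head>activate</head> it .
</context>
</instance>
</lexelt>
</corpus>"""


def parse_instances(xml_text):
    """Yield one dict per instance: lexelt, id, sense (or None), and context."""
    root = ET.fromstring(xml_text)
    for lexelt in root.iter("lexelt"):
        for inst in lexelt.iter("instance"):
            answer = inst.find("answer")    # absent in untagged test data
            context = inst.find("context")
            head = context.find("head")
            yield {
                "lexelt": lexelt.get("item"),
                "id": inst.get("id"),
                "sense": answer.get("senseid") if answer is not None else None,
                # Tokens before the head word (the data is already tokenized).
                "left": (context.text or "").split(),
                # Surface form; may be inflected or capitalized.
                "head": head.text,
                # Tokens after the head word.
                "right": (head.tail or "").split(),
            }


if __name__ == "__main__":
    for row in parse_instances(SAMPLE):
        print(row["lexelt"], row["id"], row["sense"], row["head"])
```

If the real files turn out not to be strictly well-formed XML, a more forgiving parser or a line-based reader may be needed instead.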

The Answers

In the training data (but not in the test data), the 'answer' tag shown above is provided. These are to be used only for evaluating the program's final performance. The idea is that you should use each word's tagged training data to learn about the word's senses, and then predict the senses for the words tagged as 'head' in the test data.
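One simple way to "learn about the word's senses" is a most-frequent-sense baseline: for each lexical item, always predict the sense seen most often in training. The sketch below is only an illustrative baseline to compare against, not a recommended final system; the function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_mfs(training_pairs):
    """Build a lexelt -> most frequent sense table.

    training_pairs: iterable of (lexelt, sense_id) pairs from the training data.
    """
    counts = defaultdict(Counter)
    for lexelt, sense_id in training_pairs:
        counts[lexelt][sense_id] += 1
    return {lexelt: c.most_common(1)[0][0] for lexelt, c in counts.items()}


def predict_mfs(model, lexelt):
    """Return the most frequent training sense, or None for an unseen lexelt."""
    return model.get(lexelt)


if __name__ == "__main__":
    pairs = [
        ("activate.v", "38201"),
        ("activate.v", "38201"),
        ("activate.v", "38202"),
    ]
    model = train_mfs(pairs)
    print(predict_mfs(model, "activate.v"))
```

Any context-sensitive method you build should be able to beat this baseline on held-out data; if it does not, that is a useful signal in itself.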

Redundantly, correct answers are also contained in separate answer files for the training data. For example, if the excerpt above were in the file "train", then you would also be provided with the file "train.key", in the following format:

activate.v activate.v.bnc.00024693 38201
activate.v activate.v.bnc.02016434 38202

Each line contains the following items, in the specified order:

reference id for the lexical item, from the 'lexelt' tag in the data

a single space

reference id for the instance, corresponding to the 'id' attribute of the 'instance' tag

a single space

a sense tag

The output of your WSD program (as well as the "gold standards" that will be used for testing) must be in this format, so that your program's performance can be evaluated using the scoring tools described below.
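A minimal sketch of producing output in this format follows. The helper names `format_key_line` and `write_key` are hypothetical; the only thing taken from the description above is the line layout (lexical item, single space, instance id, single space, sense tag).

```python
def format_key_line(lexelt, instance_id, sense_id):
    """Return one answer line: '<lexelt> <instance id> <sense tag>'."""
    return f"{lexelt} {instance_id} {sense_id}"


def write_key(path, predictions):
    """Write (lexelt, instance_id, sense_id) triples, one per line."""
    with open(path, "w", encoding="utf-8") as f:
        for lexelt, instance_id, sense_id in predictions:
            f.write(format_key_line(lexelt, instance_id, sense_id) + "\n")


if __name__ == "__main__":
    print(format_key_line("activate.v", "activate.v.bnc.00024693", "38201"))
```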

The file "dictionary" provides a way to interpret the sense tags in the answer keys, e.g.,

<lexelt item="activate.v">
<sense id="38201" source="ws" synset="activate actuate energize start stimulate" gloss="to initiate action in; make active."/>
<sense id="38202" source="ws" synset="activate" gloss="in chemistry, to make more reactive, as by heating."/>
<sense id="38203" source="ws" synset="activate assign ready" gloss="to assign (a military unit) to active status."/>
<sense id="38204" source="ws" synset="activate" gloss="in physics, to cause radioactive properties in (a substance)."/>
<sense id="38205" source="ws" synset="activate aerate oxygenate" gloss="to cause decomposition in (sewage) by aerating."/>
</lexelt>
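The dictionary entries can be loaded into a lookup table so that sense ids in your output or error analysis can be mapped back to glosses. This is a sketch under the assumption that the "dictionary" file is well-formed XML with a single root element; the `<dictionary>` root and the function name `load_glosses` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Two senses from the excerpt, wrapped in a hypothetical <dictionary> root.
DICT_SAMPLE = """<dictionary>
<lexelt item="activate.v">
<sense id="38201" source="ws" synset="activate actuate energize start stimulate" gloss="to initiate action in; make active."/>
<sense id="38202" source="ws" synset="activate" gloss="in chemistry, to make more reactive, as by heating."/>
</lexelt>
</dictionary>"""


def load_glosses(xml_text):
    """Return {lexelt: {sense_id: gloss}} for every lexelt in the dictionary."""
    root = ET.fromstring(xml_text)
    return {
        lexelt.get("item"): {
            sense.get("id"): sense.get("gloss") for sense in lexelt.iter("sense")
        }
        for lexelt in root.iter("lexelt")
    }


if __name__ == "__main__":
    glosses = load_glosses(DICT_SAMPLE)
    for sense_id, gloss in glosses["activate.v"].items():
        print(sense_id, gloss)
```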

Judging answers is subjective in nature, so you may sometimes disagree with the senses in the answer key. But people will never completely agree on these things, and it is necessary to choose some set of answers for evaluation purposes, so we will use the provided answers as "The Truth".

The Data Sets and Phases of the Project

You will be using three sets of data (Training Set, Test Set #1, and Test Set #2), during three different phases of the project:

Training phase

Preliminary evaluation

Final evaluation

Training Phase

First, you will be given the Training Set to use in developing your system. You may use the data and the answer keys in any way that you wish. In particular, you can use all the given data for the provided words in any test / train combination you like; you may also create a development/validation set out of the training data. The training material can be found in the data directory.
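If you decide to carve a development set out of the training data, one reasonable option is to split within each lexical item, so that every word appears on both sides of the split. The sketch below assumes instances are available as (lexelt, instance id, sense id) triples; the function name, the 20% default, and the fixed seed are illustrative choices, not requirements.

```python
import random
from collections import defaultdict


def split_by_lexelt(instances, dev_fraction=0.2, seed=0):
    """Split (lexelt, instance_id, sense_id) triples into train and dev lists.

    The split is done separately for each lexelt, with a fixed seed so that
    experiments are repeatable.
    """
    rng = random.Random(seed)
    by_word = defaultdict(list)
    for inst in instances:
        by_word[inst[0]].append(inst)

    train, dev = [], []
    for word in sorted(by_word):
        items = list(by_word[word])
        rng.shuffle(items)
        n_dev = int(len(items) * dev_fraction)
        dev.extend(items[:n_dev])
        train.extend(items[n_dev:])
    return train, dev


if __name__ == "__main__":
    data = [("a.v", f"a.v.{i}", "1") for i in range(10)]
    data += [("b.n", f"b.n.{i}", "2") for i in range(5)]
    train, dev = split_by_lexelt(data)
    print(len(train), len(dev))
```

Words with very few training instances may end up with no development examples under this scheme; cross-validation is an alternative in that case.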

Preliminary Evaluation

Next, there will be a preliminary evaluation of everyone's system. Each team will hand in the code for their WSD system and we will run the systems on the instances in Test Set #1. We will score the accuracy of each system and post the results on the class web page. The results of the preliminary evaluation will not count toward your final project grade, but should be useful for assessing your progress and seeing how well your system works compared to others in the class. Once the preliminary evaluation is over, we will make Test Set #1 available to everyone.

Final Evaluation

At this point, each team will hand in the final code for their WSD system. We will run the systems on the words in both Test Set #1 and Test Set #2. Your final project grade will be based on the performance of your WSD system on both of the test sets.

The purpose of evaluating your systems on both test sets is to balance specificity with generality. You will have several weeks to try to get your systems to perform well on Test Set #1. Hopefully, everyone will be able to do fairly well on that test set. Test Set #2 will be a blind test set that no one will see until the final evaluation. A system that uses general techniques should work just as well on Test Set #2 as Test Set #1. But a system that has lots of hacks and tweaks based on Test Set #1 probably will perform very poorly on Test Set #2.

WARNING: You will be given the answer keys for Test Set #1, but your system is not allowed to use them when doing WSD! The answer keys are being distributed only to show you what the correct answers should be, and to allow you to evaluate your systems automatically if you wish. Your system should use general techniques that can apply to a wide variety of texts.

Scoring Tools

We will be using the evaluation software that was developed for the actual Senseval competitions. The scoring materials can be found in the scoring directory. Make sure you can evaluate the output of your project using this software before submitting anything to Ali! Note that the tools support more functionality than is used in our class assignment, so not all options will be relevant (e.g. weighting).
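For quick sanity checks during development, a plain accuracy computation over two key files can be handy. The sketch below is not the Senseval scorer and is no substitute for it: it assumes exactly one sense per instance, ignores weighting, and uses hypothetical function names.

```python
def read_key(lines):
    """Map instance id -> sense tag from lines of '<lexelt> <id> <sense>'."""
    key = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 3:
            key[parts[1]] = parts[2]
    return key


def accuracy(gold, predicted):
    """Fraction of gold instances whose predicted sense matches exactly."""
    if not gold:
        return 0.0
    correct = sum(
        1 for instance_id, sense in gold.items()
        if predicted.get(instance_id) == sense
    )
    return correct / len(gold)


if __name__ == "__main__":
    gold = read_key([
        "activate.v activate.v.bnc.00024693 38201",
        "activate.v activate.v.bnc.02016434 38202",
    ])
    predicted = read_key([
        "activate.v activate.v.bnc.00024693 38201",
        "activate.v activate.v.bnc.02016434 38201",
    ])
    print(accuracy(gold, predicted))
```

Instances missing from the predictions count as wrong here, which may differ from how the official tools treat unattempted instances.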

Teams

Ideally, the class will be divided into 2-person teams for the project. If you really want to have a team of 3 people, or if you want to work by yourself, that is also possible. You are encouraged to work in a group, however, so you can attempt something larger and more interesting (although the amount of work should then be appropriately scaled to the size of the group).

Schedule

The schedule for the projects is shown below:

October 21: Training Set (data with answers) is released.

November 11: Preliminary evaluation on Test Set #1. The data and answer keys for Test Set #1 will be released after the preliminary evaluation is finished.

December 2: Final evaluation on Test Set #1 and Test Set #2.

December 9: Project reports due.

December 9: Project presentations.

By November 11, we expect each team to have a working WSD system! It might not work well and may still be missing some components that you plan to incorporate, but it should be able to process a training/testing file and produce a sense for each instance.

Participation in the preliminary evaluation is mandatory. Failure to participate will result in a 10% deduction off your final project grade. This policy is to ensure that everyone is making adequate progress.

Grading

For the final submission, you need to submit both your program and a written report. Your project will be graded according to the following criteria:

33% of the grade will be based on your WSD system's performance on Test Set #1 during the final evaluation

33% of the grade will be based on your WSD system's performance on Test Set #2 during the final evaluation

33% of the grade will be based on your project report and presentation

The grade for the report and presentation will be based on clarity, as well as the creativity and ambitiousness shown in the design of your system. Thus, if you incorporate novel ideas and/or complex algorithms, then I will take that into account. As in the Olympics, difficulty can in effect boost your raw performance scores.

Your report should be modeled after the conference papers that we have read throughout the semester. An "A" paper would be considered suitable for submission to a future Senseval meeting! Your report should detail the method used by your program, the architecture of your program, testing you did, dead-end paths, and model revisions you pursued. In particular, your report should contain:

A clear discussion of the algorithms/method used.

A discussion of the testing you did and results you obtained. If you redivided the provided data into training and testing portions, mention it in the report.

Data analysis, especially of errors, and the insights from it that you used to improve your program.

Experiments and observations on the strengths and weaknesses of the technique(s) that you implemented.

A discussion of alternatives or things you tried to improve performance, and how they fared.

Your report should also contain clear (but brief) answers to the following questions:

The WSD program works better at disambiguating some words than others. What are the factors that seem to be determining performance, and what is their relative weight?

Discuss the technique(s) that you implemented. If you implemented more than one, did they perform well on the same test sets? Did they depend on the amount of training data available? What were the major differences between the methods?

To compute the final grade for the project, each WSD system will be ranked relative to the other systems in the class. For example, if your system ranks 1st on Test Set #1 performance, 3rd on Test Set #2 performance, and 5th on the project report and presentation, then your average ranking would be (1+3+5)/3=3. Note that the final grading is thus on a relative, not an absolute, scale. However, this does not mean that the team with the highest average ranking automatically gets an A (e.g., if the best score was no better than chance performance), or that the lowest scoring team fails. If every team produces a good and interesting system, I will be happy to give every team an A.
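The ranking arithmetic can be sketched as follows. The team names and scores are made up for illustration, and the sketch does not say how ties would be handled, since the description above does not specify it.

```python
def ranks_from_scores(scores):
    """Map team -> rank (1 = best), given team -> score (higher is better)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {team: position + 1 for position, team in enumerate(ordered)}


def average_rank(ranks):
    """Average a team's ranks across the graded components."""
    return sum(ranks) / len(ranks)


if __name__ == "__main__":
    # Hypothetical accuracies on one test set for three hypothetical teams.
    print(ranks_from_scores({"team_a": 0.58, "team_b": 0.61, "team_c": 0.49}))
    # The example from the text: ranks 1, 3, and 5 average to 3.
    print(average_rank([1, 3, 5]))
```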

Encouragement

NLP is not a solved problem, and effective word sense disambiguation is HARD! Top projects in other classes that have done WSD projects have achieved accuracies below 60%. Randomly choosing a sense will yield extremely low accuracy, so anything higher means that you are doing something good!!
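To see what "chance" means for your own data, you can estimate the expected accuracy of uniform random guessing: each instance contributes one over the number of senses listed for its lexical item. This is a rough sketch with a hypothetical function name; it assumes every sense in the dictionary is an equally likely guess.

```python
def chance_accuracy(instance_lexelts, sense_counts):
    """Expected accuracy of picking a sense uniformly at random.

    instance_lexelts: the lexelt of each test instance (one entry per instance).
    sense_counts: lexelt -> number of senses listed in the dictionary.
    """
    if not instance_lexelts:
        return 0.0
    total = sum(1.0 / sense_counts[lexelt] for lexelt in instance_lexelts)
    return total / len(instance_lexelts)


if __name__ == "__main__":
    # "activate.v" has five senses in the dictionary excerpt.
    print(chance_accuracy(["activate.v", "activate.v"], {"activate.v": 5}))
```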