Project Description (CS 2731 / ISSP 2230)

Introduction

The project for this class will be to design and build a question answering system. This will give you exposure to a cutting-edge research area and experience in building a real NLP system.

For this question answering (QA) project, we will use actual reading comprehension tests that are given to grade school children in the United States. These reading comprehension tests consist of short stories with 5 questions following each story. The materials come from levels 2-5 of "The 5 W's" books written by Linda Miller and obtained from Remedia Publications for research purposes. A sample reading comprehension test is shown below.

1989 Remedia Publications, Comprehension/5Ws - 5

Library of Congress Has Books for Everyone

(WASHINGTON, D.C., 1964) - It was 150 years ago this year that our nation's biggest library burned to the ground. Copies of all the written books of the time were kept in the Library of Congress. But they were destroyed by fire in 1814 during a war with the British.
That fire didn't stop book lovers. The next year, they began to rebuild the library. To give it a boost, Thomas Jefferson gave 6,457 of his books.
The first libraries in the United States could be used by members only. But the Library of Congress was built for all the people. From the start, it was our national library.
Today, the Library of Congress is one of the largest libraries in the world. People can find a copy of just about every book and magazine printed.
Libraries have been with us since people first learned to write. One of the oldest to be found dates back to about 800 years B.C. The books were written on tablets made from clay. The people who took care of the books were called "men of the written tablets."

1. Who gave 6,457 books to the new library?
2. What is the name of our national library?
3. When did this library burn down?
4. Where can this library be found?
5. Why were some early people called "men of the written tablets"?

Figure 1: A Sample Reading Comprehension Test

IMPORTANT: These on-line materials cannot be distributed to anyone else or used for any purpose other than this class. If you wish to use this data for other purposes, please contact me and I will tell you what you need to do.

The Remedia exams include an answer key that gives the correct answer for each question. Creating a Q/A system that can identify exact answers is difficult, so for this project we will focus on answer sentence identification. Your Q/A system should identify the sentence in the story that best answers each question. This is a much easier task and, from a practical perspective, nearly as useful for most real-world applications!

The Remedia books contain a key with the exact answers to each question, but we will use an answer key created by the MITRE Corporation for answer sentence identification. The MITRE folks marked the sentence(s) in each story that they thought best answered each question. The MITRE answer key sometimes lists more than one correct sentence for a question, in which case any one of them is counted as correct. But more importantly, the MITRE answer key contains no correct sentences for about 11% of the questions! This occurs when the answer really spans two (or more) sentences and no single sentence is sufficient on its own. Consider the following example:

Question: What is the name of our national library?

Excerpts from the original story:
SentenceA: But the Library of Congress was built for all the people.
SentenceB: From the start, it was our national library.

The exact answer is "Library of Congress". But neither SentenceA nor SentenceB is sufficient by itself to answer the question. The pronoun "it" must be resolved across these sentences to determine the correct answer. This example illustrates a problem with trying to identify answer sentences instead of exact answers. Because of this problem with the answer sentence identification task, you should be aware that 11% of the questions will be impossible to get right! So the best possible performance that your Q/A system can achieve is 89%.

The Answer Key

Library of Congress Has Books for Everyone

<ANSQ4>(WASHINGTON, D.C., 1964) - </ANSQ4> <ANSQ3> It was 150 years ago this year that our nation's biggest library burned to the ground. </ANSQ3> Copies of all the written books of the time were kept in the Library of Congress. But they were destroyed by fire in 1814 during a war with the British.
That fire didn't stop book lovers. The next year, they began to rebuild the library. <ANSQ1> To give it a boost, Thomas Jefferson gave 6,457 of his books. </ANSQ1>
The first libraries in the United States could be used by members only. But the Library of Congress was built for all the people. From the start, it was our national library.
Today, the Library of Congress is one of the largest libraries in the world. People can find a copy of just about every book and magazine printed.
Libraries have been with us since people first learned to write. One of the oldest to be found dates back to about 800 years B.C. The books were written on tablets made from clay. The people who took care of the books were called "men of the written tablets."

Figure 2: A Sample Answer Key

Figure 2 shows the answer key for the Library of Congress story. The sentence(s) that best answer each question are surrounded by SGML tags labeled with the question number. For example, the tags <ANSQ3> and </ANSQ3> surround the sentence that best answers question #3. Remember that some questions will not have any corresponding answer sentence because the MITRE folks judged that no single sentence was sufficient to answer the question. For example, question #5 has no corresponding answer sentence in Figure 2.
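If you want to read the answer key automatically, the tagged sentences can be pulled out with a regular expression. The following is only a minimal sketch: the function name parse_answer_key is hypothetical, and it assumes that each tag pair looks like the ones in Figure 2, with or without extra spaces inside the angle brackets.

```python
import re
from collections import defaultdict

# Matches one tagged span; tolerates optional spaces inside the tags,
# e.g. "<ANSQ3>" or "< ANSQ3 >" and "</ANSQ3>" or "< /ANSQ3 >".
ANS_SPAN = re.compile(r"<\s*ANSQ(\d+)\s*>(.*?)<\s*/ANSQ\1\s*>", re.DOTALL)
# Matches any single answer tag, used to clean up nested tags.
ANY_TAG = re.compile(r"<\s*/?ANSQ\d+\s*>")


def parse_answer_key(text):
    """Map each question number to the list of spans tagged for it.

    Hypothetical helper: questions with no tagged sentence are simply
    absent from the returned dictionary.
    """
    answers = defaultdict(list)
    for match in ANS_SPAN.finditer(text):
        number = int(match.group(1))
        # Drop any other answer tags that happen to sit inside this span.
        sentence = ANY_TAG.sub("", match.group(2)).strip()
        answers[number].append(sentence)
    return dict(answers)
```

Note that a tagged by-line keeps its trailing dash under this sketch (for example "(WASHINGTON, D.C., 1964) -"), so you may need to normalize it before comparing it with your system's output.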

Note that in some cases the answer to a question may come from the by-line (e.g., ANSQ4 in Figure 2). Answers to WHEN and WHERE questions are often found in the by-line, so you should strip off the by-line and treat it as a separate sentence in the text. For example, in The Output section (below) you will notice that (ABERDEEN, S.D., September 14, 1963) is a legal answer to Question 4.
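One possible way to strip off the by-line is sketched below. It assumes that the by-line is a parenthesized prefix followed by a dash, as in Figure 1; the name split_byline is hypothetical, and stories that do not follow this pattern are returned unchanged.

```python
import re

# Assumed by-line shape: "(PLACE, DATE) - " at the very start of the story.
BYLINE = re.compile(r"^\s*(\([^)]*\))\s*-\s*")


def split_byline(story_text):
    """Return (byline, rest_of_story); byline is None if none is found."""
    match = BYLINE.match(story_text)
    if match is None:
        return None, story_text
    return match.group(1), story_text[match.end():]
```

For the story in Figure 1, this would yield "(WASHINGTON, D.C., 1964)" as its own sentence and leave the remaining text starting at "It was 150 years ago".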

Judging answers is subjective in nature, so you may sometimes disagree with MITRE's decisions in the answer key. But people will never completely agree on these things, and it is necessary to choose some set of answers for evaluation purposes, so we will use MITRE's judgements as "The Truth".

The Data Sets and Phases of the Project

You will be using three sets of data at different points in the project:

Training Set: 30 stories and answer key

Test Set #1: 30 stories and answer key

Test Set #2: 55 stories and answer key

The project will involve three phases:

Development

Preliminary evaluation

Final evaluation

Development Phase

You will be given the Training Set to use in developing your Q/A systems. You may use these stories and the answer keys in any way that you wish. The training data can be found in:

/afs/cs.pitt.edu/usr0/alanjawi/public/cs2731/training set/

Preliminary Evaluation

First, there will be a preliminary evaluation of everyone's Q/A systems. Each team will hand in the code for their Q/A system and we will run the Q/A systems on the stories in Test Set #1. We will score the accuracy of each system and post the results on the class web page. The results of the preliminary evaluation will not count toward your final project grade, but should be useful for assessing your progress and seeing how well your system works compared to others in the class. Once the preliminary evaluation is over, we will make Test Set #1 available to everyone.

Final Evaluation

At this point, each team will hand in the final code for their Q/A system. We will run the Q/A systems on both the stories in Test Set #1 and Test Set #2. Your final project grade will be based on the performance of your Q/A system on both of the test sets.

The purpose of evaluating your systems on both test sets is to balance specificity with generality. You will have several weeks to try to get your Q/A systems to perform well on Test Set #1. Hopefully, everyone will be able to do fairly well on that test set. Test Set #2 will be a blind test set that no one will see until the final evaluation. A system that uses general techniques should work just as well on Test Set #2 as Test Set #1. But a system that has lots of hacks and tweaks based on Test Set #1 probably will perform very poorly on Test Set #2.

WARNING: You will be given the answer keys for Test Set #1, but your system is not allowed to use them when answering questions! The answer keys are being distributed only to show you what the correct answers should be, and to allow you to evaluate your Q/A systems automatically if you wish. Your system should use general techniques that can apply to a wide variety of texts.

The Gory Details

The Input

Your Q/A system should accept a single input file. The first line of the file will be a directory path and all subsequent lines will be story filenames. Your Q/A system should then process each story file in the list from the specified directory. A sample input file is below, which indicates that 6 story files should be processed and they can all be found in the directory /home/clinton/qa/testset/.
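Reading an input file in this format might look like the sketch below. The function name read_input_file is hypothetical, and the sketch assumes that blank lines in the input file can be ignored, which the format description does not state.

```python
import os


def read_input_file(path):
    """Return a list of (filename, full_path) pairs from an input file.

    The first non-blank line is taken as the directory path and every
    later non-blank line as a story filename.
    """
    with open(path, encoding="utf-8") as handle:
        lines = [line.strip() for line in handle if line.strip()]
    if not lines:
        raise ValueError("input file is empty")
    directory, filenames = lines[0], lines[1:]
    return [(name, os.path.join(directory, name)) for name in filenames]
```

Keeping the bare filename next to the full path is convenient because the filename is what has to be printed in the output file.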

Each story file will be formatted like Figure 1. The first line is a Remedia header that can be ignored. The second line is the headline of the story, which in some cases may contain the answer to a question. After that, the main story begins. Each story begins with a by-line that contains the date and/or location of the story. The answers to many WHEN and WHERE questions come from the by-line, so you should strip off the by-line and treat it as a separate sentence.

At the end of each story is a set of 5 questions. There will always be 5 questions per story, numbered 1 through 5. You can identify the first question by looking for a line that begins with "1.".
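The layout described above can be turned into a simple parser. This is a rough sketch under stated assumptions: the Remedia header and the headline each occupy one non-blank line, as in Figure 1, and no story line before the questions begins with "1.". The name parse_story_file and the dictionary keys are hypothetical.

```python
import re

# A question line looks like "3. When did this library burn down?"
QUESTION = re.compile(r"^([1-5])\.\s*(.+)$")


def parse_story_file(text):
    """Split a story file into header, headline, body, and questions."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # The questions start at the first line that begins with "1.".
    first_q = next(
        (i for i, line in enumerate(lines) if line.startswith("1.")), None
    )
    if first_q is None or first_q < 2:
        raise ValueError("unexpected story layout")
    questions = {}
    for line in lines[first_q:]:
        match = QUESTION.match(line)
        if match:
            questions[int(match.group(1))] = match.group(2)
    return {
        "header": lines[0],      # Remedia header, can be ignored
        "headline": lines[1],    # may contain an answer
        "body": " ".join(lines[2:first_q]),  # still includes the by-line
        "questions": questions,
    }
```

The body returned here is one long string, so sentence splitting and by-line handling are left as separate steps.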

The Output

As its output, your Q/A system should produce a single file that contains the answers for all of the stories specified in the input file. The output of your system should be formatted as follows:

<filename>
Question 1
<answer>
Question 2
<answer>
...

For example, a real output file might look like this:

rm500-21.txt

Question 1
Elvis loves french fries.

Question 2
Daffy is a duck.

Question 3
The natural language processing course rocks!

Question 4
(ABERDEEN, S.D., September 14, 1963)

Question 5
John watches Sesame Street every morning.

rm502-43.txt

Question 1
Mary can't wait until the ski resorts open.

etc.

Please make sure that each answer is printed as a single line, and that you print each sentence EXACTLY as it appears in the story! Otherwise, your output may not be scored correctly. Each answer should be an exact sentence from the story, or the complete by-line from the story (as illustrated by the answer to Question 4 above).
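A small formatting routine along these lines could produce the output file. It is only a sketch: the name format_output is hypothetical, and it follows the blank-line spacing of the example above, which is an assumption about what the scorer accepts. Newlines inside an answer are replaced by spaces so that each answer stays on a single line; no other characters are changed.

```python
def format_output(results):
    """Build the output text.

    results is a list of (filename, answers) pairs, where answers maps
    a question number to the chosen answer sentence.
    """
    blocks = []
    for filename, answers in results:
        lines = [filename, ""]
        for number in sorted(answers):
            # Keep the sentence as-is apart from forcing it onto one line.
            answer = answers[number].replace("\n", " ")
            lines += [f"Question {number}", answer, ""]
        blocks.append("\n".join(lines))
    return "\n".join(blocks)
```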

Evaluation

The performance of each Q/A system will be scored based on the percentage of questions answered correctly. An answer will be scored as correct if it exactly matches one of the sentences marked as correct for the corresponding question in the answer key.
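If you want to score your own system against an answer key, the rule above can be sketched as follows. The function name accuracy and the data layout are hypothetical. The sketch assumes that questions with no correct sentence in the key still count toward the total, which is consistent with the 89% ceiling mentioned earlier.

```python
def accuracy(system_answers, answer_key):
    """Fraction of questions whose answer exactly matches the key.

    system_answers maps (story, question_number) to one sentence.
    answer_key maps (story, question_number) to a list of correct
    sentences; the list is empty when no single sentence is correct.
    """
    if not answer_key:
        return 0.0
    correct = sum(
        1
        for question, gold_sentences in answer_key.items()
        if system_answers.get(question) in gold_sentences
    )
    return correct / len(answer_key)
```

Because the comparison is an exact string match, any difference in spacing, capitalization, or punctuation makes an answer count as wrong.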

Teams

The class will be divided into 2-person teams for the project. You may form your own team if you know someone with whom you'd like to work. Otherwise, I will randomly assign you to a team.

Schedule

The schedule for the projects is shown below:

Mid-late October: Training set of 30 stories and answer keys is released.

November 5: Preliminary evaluation on Test Set #1. The stories and answer keys for Test Set #1 will be released after the preliminary evaluation is finished.

November 26 or December 3: Final evaluation on Test Set #1 and Test Set #2.

December 5: Project reports due.

December 5 and 10: Project presentations.

By November 5, we expect each team to have a working Q/A system! It might not work well and may still be missing some components that you plan to incorporate, but it should be able to process a story and produce an answer for each question.

Participation in the preliminary evaluation is mandatory. Failure to participate will result in a 10% deduction off your final project grade. This policy is to ensure that everyone is making adequate progress.

Grading

Each project will be graded according to the following criteria:

33% of the grade will be based on your Q/A system's performance on Test Set #1 during the final evaluation

33% of the grade will be based on your Q/A system's performance on Test Set #2 during the final evaluation

33% of the grade will be based on your project report and presentation

To compute the final grade for the project, each Q/A system will be ranked relative to the other systems in the class. For example, if your system ranks 1st on Test Set #1 performance, 3rd on Test Set #2 performance, and 5th on the project report and presentation, then your average ranking would be (1+3+5)/3=3.

The grade for the report and presentation will be based on clarity, as well as the creativity and ambitiousness shown in the design of your system. Thus, if you incorporate novel ideas and/or complex algorithms, then I will take that into account. Like the Olympics, difficulty can in effect boost your raw performance scores.

Note that the final grading is on a relative, not an absolute, scale. However, this does not mean that the team with the highest average ranking automatically gets an A (e.g., if the best score was no better than chance performance), or that the lowest scoring team fails. If every team produces a good and interesting system, I will be happy to give every team an A.

Encouragement

NLP is not a solved problem, and effective QA'ing is HARD! The best research system performance on these exams has been only about 40% accuracy. Randomly choosing a sentence will only produce about 5% accuracy, so anything higher than this means that you are doing something good!!
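To see where a random-choice figure like this comes from, you can estimate the expected accuracy of random guessing from the number of sentences in each story. The sketch below is illustrative only: the function name is hypothetical and the counts in the usage example are made up, not taken from the Remedia data.

```python
def random_baseline_accuracy(questions):
    """Expected accuracy of picking one sentence uniformly at random.

    questions is a list of (num_correct, num_candidates) pairs, one per
    question: how many sentences the key marks as correct and how many
    sentences the story offers as candidates.
    """
    return sum(correct / total for correct, total in questions) / len(questions)


# Hypothetical counts: four questions on a 20-sentence story.
example = [(1, 20), (0, 20), (2, 20), (1, 20)]
print(f"{random_baseline_accuracy(example):.2%}")
```

With about 20 candidate sentences per story and roughly one correct sentence per question, random guessing lands near 5%.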

Credits

This project and these instructions were developed by Professor Ellen Riloff at the University of Utah. Thanks to MITRE for the use of the Remedia data.