The project for this class will be to design and build a question answering system. This will give you exposure to a cutting-edge research area and experience in building a real NLP system.
For this question answering (QA) project, we will use actual reading comprehension tests that are given to grade school children in the United States. These reading comprehension tests consist of short stories with 5 questions following each story. The materials come from levels 2-5 of "The 5 W's" books written by Linda Miller and obtained from Remedia Publications for research purposes. A sample reading comprehension test is shown below.
1989 Remedia Publications, Comprehension/5Ws - 5
Library of Congress Has Books for Everyone
(WASHINGTON, D.C., 1964)
- It was 150 years ago this year that our nation's biggest library
burned to the ground. Copies of all the written books of the time were
kept in the Library of Congress. But they were destroyed by fire in
1814 during a war with the British.
That fire didn't stop book
lovers. The next year, they began to rebuild the library. To give it a
boost, Thomas Jefferson gave 6,457 of his books.
The first libraries
in the United States could be used by members only. But the Library of
Congress was built for all the people. From the start, it was our
national library.
Today, the Library of Congress is one of the largest
libraries in the world. People can find a copy of just about every
book and magazine printed.
Libraries have been with us since people
first learned to write. One of the oldest to be found dates back to
about 800 years B.C. The books were written on tablets made from
clay. The people who took care of the books were called "men of the
written tablets."
1. Who gave 6,457 books to the new library?
2. What is the name of our national library?
3. When did this library burn down?
4. Where can this library be found?
5. Why were some early people called "men of the written tablets"?
Figure 1: A Sample Reading Comprehension Test
IMPORTANT: These on-line materials cannot be distributed to anyone else or used for any purpose other than this class. If you wish to use this data for other purposes, please contact me and I will tell you what you need to do.
The Remedia exams include an answer key that gives the correct answer for each question. Creating a Q/A system that can identify exact answers is difficult, so for this project we will focus on answer sentence identification. Your Q/A system should identify the sentence in the story that best answers each question. This is a much easier task and, from a practical perspective, nearly as useful for most real-world applications!
The Remedia books contain a key with the exact answers to each question, but we will use an answer key created by the MITRE Corporation for answer sentence identification. The MITRE folks marked the sentence(s) in each story that they thought best answered each question. The MITRE answer key sometimes lists more than one correct sentence for a question, in which case any one of them counts as correct. More importantly, the MITRE answer key contains no correct sentences for about 11% of the questions! This occurs when the answer really spans two (or more) sentences and no single sentence is sufficient on its own. Consider the following example:
Question: What is the name of our national library?
Excerpts from the original story:
SentenceA: But the Library of Congress was built for all the people.
SentenceB: From the start, it was our national library.
The exact answer is "Library of Congress". But neither SentenceA nor SentenceB is sufficient by itself to answer the question. The pronoun "it" must be resolved across these sentences to determine the correct answer. This example illustrates a problem with trying to identify answer sentences instead of exact answers. Because of this problem with the answer sentence identification task, you should be aware that 11% of the questions will be impossible to get right! So the best possible performance that your Q/A system can achieve is 89%.
Library of Congress Has Books for Everyone
<ANSQ4>(WASHINGTON, D.C., 1964) - </ANSQ4> <ANSQ3>It was 150 years ago
this year that our nation's biggest library burned to the
ground.</ANSQ3> Copies of all the written books of the time were kept
in the Library of Congress. But they were destroyed by fire in 1814
during a war with the British.
That fire didn't stop book lovers. The next year, they began to
rebuild the library. <ANSQ1>To give it a boost, Thomas Jefferson gave
6,457 of his books.</ANSQ1>
The first libraries in the United
States could be used by members only. But the Library of Congress was
built for all the people. From the start, it was our national
library.
Today, the Library of Congress is one of the largest
libraries in the world. People can find a copy of just about every
book and magazine printed.
Libraries have been with us since people
first learned to write. One of the oldest to be found dates back to
about 800 years B.C. The books were written on tablets made from
clay. The people who took care of the books were called "men of the
written tablets."
Figure 2: A Sample Answer Key
Figure 2 shows the answer key for the Library of Congress story. The sentence(s) that best answer each question are surrounded by SGML tags labeled with the question number. For example, the tags <ANSQ3> and </ANSQ3> surround the sentence that best answers question #3. Remember that some questions will not have any corresponding answer sentence because the MITRE folks judged that no single sentence was sufficient to answer the question. For example, question #5 has no corresponding answer sentence in Figure 2.
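The tag convention described above can be read mechanically. The following is a minimal sketch, not part of the official project materials: the function name extract_answer_key is made up for illustration, and the code assumes that answer tags never nest inside one another. The pattern tolerates optional spaces inside the tags.

```python
import re
from collections import defaultdict

# Matches an opening ANSQn tag, the tagged text, and the closing tag with the
# same question number. Optional spaces inside the tags are tolerated.
# Assumption: tags for different questions are never nested.
ANS_TAG = re.compile(r"<\s*ANSQ(\d+)\s*>(.*?)<\s*/\s*ANSQ\1\s*>", re.DOTALL)

def extract_answer_key(annotated_story):
    """Return {question number: [answer sentences]} for one annotated story."""
    key = defaultdict(list)
    for match in ANS_TAG.finditer(annotated_story):
        number = int(match.group(1))
        # Join text that wraps across lines into a single line.
        sentence = " ".join(match.group(2).split())
        key[number].append(sentence)
    return dict(key)
```

Questions that have no tagged sentence (such as question #5 in Figure 2) simply do not appear in the returned dictionary.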
Note that in some cases the answer to a question may come from the by-line (e.g., ANSQ4 in Figure 2). Answers to WHEN and WHERE questions are often found in the by-line, so you should strip off the by-line and treat it as a separate sentence in the text. For example, in the Output section (below) you will notice that (ABERDEEN, S.D., September 14, 1963) is a legal answer to Question 4.
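One way to strip off the by-line is sketched below. This is only an illustration under an assumption: the by-line is taken to be a parenthesized span at the very start of the story body, optionally followed by a dash, as in Figure 1. The name split_byline is hypothetical.

```python
import re

# Assumption: the by-line is a parenthesized span at the start of the story
# body, optionally followed by a dash, e.g. "(WASHINGTON, D.C., 1964) - It was".
BYLINE = re.compile(r"^\s*(\([^)]*\))\s*-?\s*")

def split_byline(story_body):
    """Return (by_line, remaining_text); by_line is None if none is found."""
    match = BYLINE.match(story_body)
    if match is None:
        return None, story_body
    return match.group(1), story_body[match.end():]
```

The returned by-line can then be added to the list of candidate answer sentences alongside the sentences of the remaining text.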
Judging answers is subjective in nature, so you may sometimes disagree with MITRE's decisions in the answer key. But people will never completely agree on these things, and it is necessary to choose some set of answers for evaluation purposes, so we will use MITRE's judgements as "The Truth".
You will be using three sets of data at different points in the project:
The project will involve three phases:
You will be given the Training Set to use in developing your Q/A systems. You may use these stories and the answer keys in any way that you wish. The training data can be found in:
At this point, each team will hand in the final code for their Q/A system. We will run the Q/A systems on both the stories in Test Set #1 and Test Set #2. Your final project grade will be based on the performance of your Q/A system on both of the test sets.
The purpose of evaluating your systems on both test sets is to balance specificity with generality. You will have several weeks to try to get your Q/A systems to perform well on Test Set #1. Hopefully, everyone will be able to do fairly well on that test set. Test Set #2 will be a blind test set that no one will see until the final evaluation. A system that uses general techniques should work just as well on Test Set #2 as on Test Set #1. But a system that has lots of hacks and tweaks based on Test Set #1 will probably perform very poorly on Test Set #2.
WARNING: You will be given the answer keys for Test Set #1, but your system is not allowed to use them when answering questions! The answer keys are being distributed only to show you what the correct answers should be, and to allow you to evaluate your Q/A systems automatically if you wish. Your system should use general techniques that can apply to a wide variety of texts.
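If you do want to evaluate your system automatically, a scoring routine might look like the sketch below. It is not the official scorer. It assumes answers and keys are stored in dictionaries indexed by (filename, question number), and it compares sentences after collapsing whitespace, which is a leniency of this sketch only. Questions whose key lists no sentence are counted as wrong, consistent with the 89% ceiling discussed earlier.

```python
def normalize(sentence):
    """Collapse runs of whitespace (an assumption of this sketch)."""
    return " ".join(sentence.split())

def accuracy(system_answers, answer_key):
    """Fraction of questions answered with a sentence listed in the key.

    system_answers: {(filename, question_number): sentence}
    answer_key:     {(filename, question_number): [acceptable sentences]}
    A question with an empty list of acceptable sentences can never be
    answered correctly.
    """
    total = len(answer_key)
    if total == 0:
        return 0.0
    correct = 0
    for question_id, acceptable in answer_key.items():
        answer = system_answers.get(question_id)
        if answer is None:
            continue
        if normalize(answer) in {normalize(s) for s in acceptable}:
            correct += 1
    return correct / total
```

The scoring code should live outside the Q/A system itself, so that the system never reads the answer keys while answering.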
Your Q/A system should accept a single input file. The first line of the file will be a directory path and all subsequent lines will be story filenames. Your Q/A system should then process each story file in the list from the specified directory. A sample input file is below, which indicates that 6 story files should be processed and they can all be found in the directory /home/clinton/qa/testset/.
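Reading this input file takes only a few lines. The sketch below assumes one entry per line and skips blank lines; the function name read_input_file is illustrative, not a required interface.

```python
import os

def read_input_file(input_path):
    """Return the full path of every story file named in the input file.

    The first non-blank line is the directory path; each later line is a
    story filename inside that directory.
    """
    with open(input_path, encoding="utf-8") as handle:
        lines = [line.strip() for line in handle if line.strip()]
    if not lines:
        return []
    directory, filenames = lines[0], lines[1:]
    return [os.path.join(directory, name) for name in filenames]
```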
Each story file will be formatted like Figure 1. The first line is a Remedia header that can be ignored. The second line is the headline of the story, which in some cases may contain the answer to a question. After that, the main story begins. Each story begins with a by-line that contains the date and/or location of the story. The answers to many WHEN and WHERE questions come from the by-line, so you should strip off the by-line and treat it as a separate sentence.
At the end of each story is a set of 5 questions. There will always be 5 questions per story, numbered 1 through 5. You can identify the first question by looking for a line that begins with "1.".
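The layout just described (header, headline, story, then numbered questions) can be separated as sketched below. This is an illustration only: parse_story_file is a made-up name, each question is assumed to fit on one line, and the story body is joined into a single string, which discards the original line breaks.

```python
import re

# A question line looks like "3. When did this library burn down?"
QUESTION_LINE = re.compile(r"^([1-5])\.\s*(.+)$")

def parse_story_file(text):
    """Split a story file into header, headline, body, and questions."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # The first question is the first line that begins with "1."
    first_q = next(
        (i for i, line in enumerate(lines) if line.startswith("1.")), None
    )
    if first_q is None or first_q < 2:
        raise ValueError("story file does not match the expected layout")
    questions = {}
    for line in lines[first_q:]:
        match = QUESTION_LINE.match(line)
        if match:
            questions[int(match.group(1))] = match.group(2)
    return {
        "header": lines[0],                   # Remedia header, can be ignored
        "headline": lines[1],                 # may itself contain an answer
        "body": " ".join(lines[2:first_q]),   # begins with the by-line
        "questions": questions,
    }
```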
<filename>
Question 1
<answer>
Question 2
<answer>
...
For example, a real output file might look like this:
rm500-21.txt
Question 1
Elvis loves french fries.
Question 2
Daffy is a duck.
Question 3
The natural language processing course rocks!
Question 4
(ABERDEEN, S.D., September 14, 1963)
Question 5
John watches Sesame Street every morning.
rm502-43.txt
Question 1
Mary can't wait until the ski resorts open.
etc.
Please make sure that each answer is printed as a single line. And make sure that you print each sentence EXACTLY as it appears in the story! Otherwise, your output may not be scored correctly. Each answer should be an exact sentence from the story, or the complete by-line from the story (as illustrated by the answer to Question 4 above).
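A small helper can produce this layout. The sketch below is one possible way to do it; the name format_output and the shape of its argument are assumptions. Because story sentences often wrap across several lines in the story file, the helper joins each answer onto a single line. Collapsing whitespace this way is a guess about what the scorer accepts, so check it against the stories you are given.

```python
def format_output(results):
    """Build the output text for a list of (filename, answers) pairs.

    answers is the list of answer sentences for questions 1, 2, ... of
    that story, in order.
    """
    lines = []
    for filename, answers in results:
        lines.append(filename)
        for number, answer in enumerate(answers, start=1):
            lines.append(f"Question {number}")
            # Assumption: joining wrapped lines with single spaces still
            # counts as the exact sentence.
            lines.append(" ".join(answer.split()))
    return "\n".join(lines) + "\n"
```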
The class will be divided into 2-person teams for the project. You may form your own team if you know someone with whom you'd like to work. Otherwise, I will randomly assign you to a team.
The schedule for the projects is shown below:
By November 5, we expect each team to have a working Q/A system! It might not work well and may still be missing some components that you plan to incorporate, but it should be able to process a story and produce an answer for each question.
Participation in the preliminary evaluation is mandatory. Failure to participate will result in a 10% deduction from your final project grade. This policy is to ensure that everyone is making adequate progress.
Each project will be graded according to the following criteria:
The grade for the report and presentation will be based on clarity, as well as the creativity and ambitiousness shown in the design of your system. Thus, if you incorporate novel ideas and/or complex algorithms, then I will take that into account. Like the Olympics, difficulty can in effect boost your raw performance scores.
Note that the final grading is on a relative, not an absolute, scale. However, this does not mean that the team with the highest average ranking automatically gets an A (e.g., if the best score was no better than chance performance), or that the lowest scoring team fails. If every team produces a good and interesting system, I will be happy to give every team an A.
NLP is not a solved problem, and effective QA'ing is HARD! The best research system performance on these exams has been only about 40% accuracy. Randomly choosing a sentence will only produce about 5% accuracy, so anything higher than this means that you are doing something good!!
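As a possible first step beyond random selection, the sketch below picks the sentence that shares the most non-stopword tokens with the question. This is a generic word-overlap heuristic, not a method prescribed by this project, and the stopword list is a small hand-picked placeholder. No accuracy figure is claimed for it.

```python
import re

# A tiny, hand-picked stopword list (placeholder, not a standard resource).
STOPWORDS = {
    "the", "a", "an", "of", "to", "in", "is", "was", "were", "did", "do",
    "does", "this", "that", "be", "can", "by", "for", "our", "some",
}

def content_words(text):
    """Lowercased word tokens of text with stopwords removed."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return {token for token in tokens if token not in STOPWORDS}

def best_sentence(question, sentences):
    """Return the sentence with the largest word overlap with the question.

    Ties are broken in favor of the earliest sentence.
    """
    question_words = content_words(question)
    return max(sentences, key=lambda s: len(question_words & content_words(s)))
```

A heuristic like this ignores the question type (WHO, WHEN, WHERE, and so on), which is one obvious place to improve on it.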
This project and these instructions were developed by Professor Ellen Riloff at the University of Utah. Thanks to MITRE for the use of the Remedia data.