Project Description (CS 2731 / ISSP 2230)


Introduction

The project for this class is to design, build, and evaluate a question answering system. This will give you exposure to a cutting-edge research area and experience in building a real NLP system.

For this question answering (QA) project, we will use the "CBC Reading Comprehension Corpus". This corpus consists of 125 news stories, each accompanied by a set of approximately 6-10 "Reading Comprehension" questions of the Who, What, When, Where, How, and Why variety. The news stories themselves were obtained from the "CBC 4 Kids" website, hosted by the Canadian Broadcasting Corporation. The questions and an answer key were added by the MITRE Corporation, and are in the style of actual reading comprehension tests that are given to grade school children in the United States. A sample CBC story and reading comprehension test is shown below.

All stories have been split into sentences (one sentence per line) for you using the MXTERMINATOR sentence splitter developed by Adwait Ratnaparkhi. Paragraphs from the original story are separated by an empty line. The first line in the file is the title of the story and the second is the date of the story.

IMPORTANT: These on-line materials cannot be distributed to anyone else or used for any purpose other than this class. If you wish to use this data for other purposes, please contact me and I will tell you what you need to do. The news stories are copyrighted by CBC/SRC, and were obtained by the MITRE Corporation for research purposes only.


The Answer Key

The corpus includes an answer key created by MITRE that gives the correct answer for each question. Creating a Q/A system that can identify exact answers is difficult, so for this project we will focus on answer sentence identification. The answer key that we will use for our project marks the sentence(s) in each story that contain the exact answers that MITRE thought best answered each question. The sentence answer key sometimes lists more than one correct sentence for a question, in which case any one of them is correct. Your Q/A system should identify the sentence in the story that best answers each question. This is a much easier task and, from a practical perspective, nearly as useful for most real-world applications!
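To make the task concrete, here is a deliberately naive sketch of answer sentence identification: pick the story line that shares the most words with the question. Python is used here only for illustration. The function names and the word-overlap scoring are assumptions made for this example, not a required or recommended design, and a real system will need to do much better than this.

```python
import re


def tokens(text):
    """Lowercase a string and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_line_by_overlap(question, story_lines):
    """Return the line number whose text shares the most words with the question.

    story_lines maps 1-based line numbers to line text (hypothetical layout).
    Empty lines are skipped; ties go to the earliest line.
    """
    question_words = tokens(question)
    best_number, best_score = None, -1
    for number in sorted(story_lines):
        text = story_lines[number]
        if not text.strip():
            continue
        score = len(question_words & tokens(text))
        if score > best_score:
            best_number, best_score = number, score
    return best_number
```

Note that this baseline has no notion of question type (Who, When, Why, ...), which is one obvious place where a more careful system can improve on it.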

For each set of stories (the training set, Test Set #1, and Test Set #2) there is a file answerkey.txt in the corresponding training or test set directory. This file contains the answers to all the questions from all stories in that directory. Here is the part of the answer key file that refers to the story mentioned above:

<FILE>1999-W04-5.qa
<Q_NUMBER>1
<A_LINE>26
<Q_TXT>What reason did Alexi Yashin give for backing out of his promised donation?
<A_TXT>He says his decision was for personal reasons.

<Q_NUMBER>2
<A_LINE>15
<Q_TXT>How much does Alexi Yashin earn as a hockey player?
<A_TXT>Mr. Yashin makes a salary of more than three million dollars a season.

<Q_NUMBER>3
<A_LINE>11
<Q_TXT>Who does Alexi Yashin play for?
<A_TXT>The government has been looking at a charitable donation by Ottawa Senators hockey star Alexi Yashin.

<Q_NUMBER>4
<A_LINE>23
<Q_TXT>How much money did Alexi Yashin actually donate to the National Arts Centre?
<A_TXT>After giving the Centre $200,000 Alexi Yashin made an about face.

<Q_NUMBER>5
<A_LINE>59
<Q_TXT>What do Alexi Yashin's teammates think about this donation gone bad?
<A_TXT>They said it was Alexi's personal affair and not important to them.

<Q_NUMBER>6
<A_LINE>38,45,19
<Q_TXT>How would Alexi Yashin himself benefit from his donation scheme?
<A_TXT>This was a way for Alexi Yashin to give money to his parents while illegally saving thousands of dollars in taxes. -OR- Mr. Yashin would look like he was being very generous. -OR- Mr. Yashin's popularity soared.

<Q_NUMBER>7
<A_LINE>12
<Q_TXT>Why do people go to the National Arts Centre?
<A_TXT>Earlier this year he promised to give one million dollars to the National Arts Centre, a concert hall where people go to see plays and dance and to hear live music.

<Q_NUMBER>8
<A_LINE>34
<Q_TXT>How would Alexi's parents have benefited from his donation to the National Arts Centre?
<A_TXT>Instead, in a secret agreement, the National Arts Centre would have hired Mr. Yashin's parents at $85,000 a year.

<Q_NUMBER>9
<A_LINE>49,48
<Q_TXT>Where did reporters question Alexi Yashin?
<A_TXT>After the game reporters went to the locker room to ask Alexi Yashin what was going on. -OR- Yesterday evening the Ottawa Senators were playing in Boston.

</FILE>

...

Figure 2. Part of the answer key.

As you can see, for each question there are 4 lines that describe the question and the answer. The first line, preceded by the <Q_NUMBER> tag, gives the question number. The second line, preceded by <A_LINE>, gives the line number(s) of the sentence(s) that answer the question (remember that there is at most one sentence per line). The lines in the files are numbered starting from 1, and empty lines are counted too. If there is more than one answer for a question, the line numbers are separated by commas. The last two lines are for ease of reading only: one contains the question text and the other the sentence(s) that answer the question (separated by " -OR- ").
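The format described above can be read with a few lines of code. The following Python sketch is one possible way to do it; the function name and the returned dictionary layout are assumptions chosen for this example, and the <Q_TXT> and <A_TXT> lines are simply ignored.

```python
from collections import defaultdict


def parse_answer_key(text):
    """Parse answer-key text into {story_name: {question_number: [line_numbers]}}."""
    key = defaultdict(dict)
    story = None
    question = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("<FILE>"):
            story = line[len("<FILE>"):].strip()
        elif line.startswith("</FILE>"):
            story = None
        elif line.startswith("<Q_NUMBER>"):
            question = int(line[len("<Q_NUMBER>"):])
        elif line.startswith("<A_LINE>") and story is not None and question is not None:
            # Several accepted lines are separated by commas, e.g. "38,45,19".
            numbers = line[len("<A_LINE>"):].split(",")
            key[story][question] = [int(n) for n in numbers if n.strip()]
    return dict(key)
```

For the excerpt in Figure 2, question 6 of 1999-W04-5.qa would map to the list [38, 45, 19].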

Note that in some cases the answer to a question may come from the title or the date of the story, so do not strip off those lines (for example, answers to WHEN questions are often found in the story date line).

Judging answers is subjective in nature, so you may sometimes disagree with MITRE's decisions in the answer key. But people will never completely agree on these things, and it is necessary to choose some set of answers for evaluation purposes, so we will use MITRE's judgements as "The Truth".


The Data Sets and Phases of the Project

You will be using three sets of data at different points in the project:

The project will involve three phases:

Development Phase

You will be given the Training Set to use in developing your Q/A systems. You may use these stories and the answer keys in any way that you wish. The training data can be found in:

Preliminary Evaluation

First, there will be a preliminary evaluation of everyone's Q/A systems. Each team will hand in the code for their Q/A system and we will run the Q/A systems on the stories in Test Set #1. We will score the accuracy of each system and post the results on CourseWeb. The results of the preliminary evaluation will not count toward your final project grade, but should be useful for assessing your progress and seeing how well your system works compared to others in the class. Once the preliminary evaluation is over, we will make Test Set #1 available to everyone.

Final Evaluation

At this point, each team will hand in the final code for their Q/A system. We will run the Q/A systems on both the stories in Test Set #1 and Test Set #2. Your final project grade will be based on the performance of your Q/A system on both of the test sets.

The purpose of evaluating your systems on both test sets is to balance specificity with generality. You will have several weeks to try to get your Q/A systems to perform well on Test Set #1. Hopefully, everyone will be able to do fairly well on that test set. Test Set #2 will be a blind test set that no one will see until the final evaluation. A system that uses general techniques should work just as well on Test Set #2 as Test Set #1. But a system that has lots of hacks and tweaks based on Test Set #1 probably will perform very poorly on Test Set #2.

WARNING: You will be given the answer keys for Test Set #1, but your system is not allowed to use them when answering questions! The answer keys are being distributed only to show you what the correct answers should be, and to allow you to evaluate your Q/A systems automatically if you wish. Your system should use general techniques that can apply to a wide variety of texts.


The Gory Details

Your Q/A system should accept two command line parameters. Running your program should look like:

    myQAproject input_filename outputfile_name

The Input

The first parameter is the name of the input file. The first line of the input file will be a directory path and all subsequent lines will be story filenames. Your Q/A system should then process each story file in the list from the specified directory. A sample input file is below, which indicates that 6 story files should be processed and they can all be found in the directory /afs/cs.pitt.edu/usr0/litman/public/cs2731/TrainingSet/.
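Reading such an input file could look roughly like the Python sketch below (this is code, not the sample file itself). The function name is hypothetical, and the UTF-8 encoding with replacement of undecodable bytes is an assumption, since the encoding of the files is not specified here.

```python
import os


def read_input_file(path):
    """Return the full paths of the story files listed in an input file.

    The first non-empty line is taken as the directory; every later
    non-empty line is taken as a story filename inside that directory.
    """
    # Encoding is an assumption; adjust if the files use something else.
    with open(path, encoding="utf-8", errors="replace") as handle:
        entries = [line.strip() for line in handle if line.strip()]
    if not entries:
        return []
    directory, filenames = entries[0], entries[1:]
    return [os.path.join(directory, name) for name in filenames]
```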

Each story file will be formatted like Figure 1. The first line is the story title. The second is the date of the story followed by two empty lines. After that, the main story begins with one sentence per line. Original paragraphs are separated by an empty line.

At the end of each story is a set of 5 to 10 questions. You can identify the question section of the file by looking for a line that contains only the <QUESTIONS> tag. After that, each line will contain one question, starting with a <Qn> tag (where n is the number of the question) followed by the question text.
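As an illustration of the story file layout, the Python sketch below separates the numbered story lines from the questions. It assumes the question tags look literally like <Q1>, <Q2>, and so on; the function name and return layout are also assumptions made for this example. Line numbers start at 1 and empty lines keep their numbers, so the result lines up with the <A_LINE> values in the answer key.

```python
import re

# Assumed tag shape: "<Q" followed by the question number and ">".
QUESTION_RE = re.compile(r"^<Q(\d+)>\s*(.*)$")


def parse_story(text):
    """Split story-file text into (lines, questions).

    lines maps 1-based line number -> line text, for every line before
    the <QUESTIONS> tag (title, date, and empty lines included).
    questions maps question number -> question text.
    """
    lines = {}
    questions = {}
    in_questions = False
    for number, raw in enumerate(text.splitlines(), start=1):
        stripped = raw.strip()
        if stripped == "<QUESTIONS>":
            in_questions = True
        elif in_questions:
            match = QUESTION_RE.match(stripped)
            if match:
                questions[int(match.group(1))] = match.group(2)
        else:
            lines[number] = raw
    return lines, questions
```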

The Output

The second parameter is the name of the output file. The output file will contain your answers for all of the stories specified in the input file. The output of your system should be formatted like the answer key. More specifically, it should have the following structure:

where:

The <Q_TXT> and <A_TXT> lines are optional and you don't need to include them in the output (though you might want to have them so that you can check your system answers faster while developing the system). You will be graded by matching your answer line against the one in the answer key. Please make sure that the <Q_NUMBER> line is followed by the <A_LINE> line. Also, do not forget to end each <FILE> section with a corresponding </FILE>.

You can have as many empty lines as you like in your output (see for example Figure 2). Just make sure that you have the lines <FILE>, <Q_NUMBER>, <A_LINE> and </FILE> in your output.
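A minimal Python sketch of producing output with the required lines follows. The function name and the input dictionary layout are assumptions for this example, and the story name written after <FILE> is whatever the caller passes in. The optional <Q_TXT> and <A_TXT> lines are left out.

```python
def format_answers(answers):
    """Render {story_name: {question_number: line_number}} in answer-key style."""
    out = []
    for story, by_question in answers.items():
        out.append(f"<FILE>{story}")
        for question in sorted(by_question):
            # <Q_NUMBER> is immediately followed by its <A_LINE>.
            out.append(f"<Q_NUMBER>{question}")
            out.append(f"<A_LINE>{by_question[question]}")
            out.append("")
        out.append("</FILE>")
        out.append("")
    return "\n".join(out)
```

The returned string can be written directly to the output file named by the second command line parameter.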

Evaluation

The performance of each Q/A system will be scored based on the percentage of questions answered correctly. An answer will be scored as correct if it exactly matches one of the sentence lines marked as correct for the corresponding question in the answer key (see the Tools section for the grader).
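The scoring rule can be sketched in Python as follows. This is only an illustration of the rule stated above, not the grader script itself; the function name and dictionary layouts are assumptions, and counting an unanswered question as wrong is also an assumption about how the grader behaves.

```python
def accuracy(predictions, answer_key):
    """Fraction of answer-key questions whose predicted line is an accepted line.

    predictions: {story: {question_number: line_number}}
    answer_key:  {story: {question_number: [accepted_line_numbers]}}
    Questions with no prediction are counted as incorrect (assumption).
    """
    total = 0
    correct = 0
    for story, questions in answer_key.items():
        for question, accepted in questions.items():
            total += 1
            predicted = predictions.get(story, {}).get(question)
            if predicted in accepted:
                correct += 1
    return correct / total if total else 0.0
```

For example, answering line 45 for question 6 in Figure 2 would count as correct, because 45 is one of the accepted lines 38, 45, and 19.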


Tools

The scripts can be found in:

GRADER

grader.pl input_filename answerkey_filename your_answer_filename

This is the script that will be used for grading. Input_filename refers to the same file you use as input to your Q/A system. The second command line parameter must be the answer key file name (the one provided to you). The third one is your answer file name. Be careful: if you swap the parameters you will get bogus results. The script output is self-explanatory.

MARKER

marker.pl input_filename answer_filename tag_name new_extension

This script will generate new story files by marking the answers from answer_filename with the <tag_name question_number> </tag_name question_number> tags in the original story. Input_filename refers to the same file you use as input to your Q/A system. Answer_filename refers to an answer file (it can be your answers or the answer key file). The script will process every story by looking up in the answer file the lines that contain answers and marking them with the appropriate tags. The annotated story is saved in a file with the same name but with the new_extension extension (if you use the txt extension it will overwrite the original story).

You might find it useful to run this tool twice: once on the original stories with the answer key and the tag name CORRECT_ANS, and then again on the annotated files with your answers and the tag name MY_ANS. In this way you will get stories that have both the correct answers and your answers marked (hopefully overlapping as much as possible :-)).

Remark: When the answer file is used by marker to annotate the story, the match between story file name and the file name from the answer (whatever follows <FILE>) is done without taking into account the EXTENSION. This will help you when you want to apply the marker two times to annotate with both answers.

If you want to run the scripts from machines other than elements, you must have Perl installed. To run the scripts use something like:

perl_path/perl script script_parameters


Teams

Ideally, the class will be divided into 3-person or 2-person teams for the project. You may form your own team if you know people with whom you'd like to work. Otherwise, I can randomly assign you to a team. If you really want to work by yourself, or have a larger team, that is also possible.


Schedule

The schedule for the projects is shown below:

By November 14, we expect each team to have a working Q/A system! It might not work well and may still be missing some components that you plan to incorporate, but it should be able to process a story and produce an answer for each question.

Participation in the preliminary evaluation is mandatory. Failure to participate will result in a 10% deduction off your final project grade. This policy is to ensure that everyone is making adequate progress.


Grading

Each project will be graded according to the following criteria:

To compute the final grade for the project, each Q/A system will be ranked relative to the other systems in the class. For example, if your system ranks 1st on Test Set #1 performance, 3rd on Test Set #2 performance, and 5th on the project report and presentation, then your average ranking would be (1+3+5)/3=3.

The grade for the report and presentation will be based on clarity, as well as the creativity and ambitiousness shown in the design of your system. Thus, if you incorporate novel ideas and/or complex algorithms, then I will take that into account. Like the Olympics, difficulty can in effect boost your raw performance scores.

Note that the final grading is on a relative, not an absolute, scale. However, this does not mean that the team with the highest average ranking automatically gets an A (e.g., if the best score was no better than chance performance), or that the lowest scoring team fails. If every team produces a good and interesting system, I will be happy to give every team an A.


Encouragement

NLP is not a solved problem, and effective QA'ing is HARD! Randomly choosing a sentence will yield extremely low accuracy, so anything higher means that you are doing something good!!
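To get a feel for how low the random baseline is, the Python sketch below computes the expected accuracy of picking one candidate line uniformly at random per question. The function name is an assumption, and the count of 40 candidate lines used in the note after the code is a made-up number for illustration, not a property of any particular story.

```python
def random_baseline_accuracy(num_candidate_lines, accepted_counts):
    """Expected accuracy of choosing one candidate line uniformly at random.

    num_candidate_lines: number of lines the system may choose from.
    accepted_counts: for each question, how many lines the answer key accepts.
    """
    if num_candidate_lines <= 0 or not accepted_counts:
        return 0.0
    per_question = [
        min(count, num_candidate_lines) / num_candidate_lines
        for count in accepted_counts
    ]
    return sum(per_question) / len(per_question)
```

With the accepted-line counts from Figure 2 (1, 1, 1, 1, 1, 3, 1, 1, 2) and a hypothetical 40 candidate lines, random choice would be expected to answer only about 3% of the questions correctly.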


Credits

This project and these instructions are based on Professor Ellen Riloff's course project at the University of Utah. Thanks to MITRE for the use of the CBC data.