The project for this class will be to design, build, and evaluate a question answering (QA) system. This will give you exposure to a cutting edge research area, and experience in building a real NLP system.
QA is an active research area, and many famous QA systems already exist. For example, the smart assistant in Siri can answer many everyday questions, such as those about the weather or stock prices. We have also seen IBM's Watson win the Jeopardy! game against human champions. An ideal QA system gives a succinct answer to a given question. For example, when a user asks "Who is the president of the US?", the system should respond "Barack Obama". Our term project will give you hands-on experience building such an exciting system!
QA systems are often built on top of a large set of documents. To answer a question, a QA system first locates the document/sentence that contains answers to a given question, and then extracts the answer from the sentence. These two steps correspond to two components. Our term project involves building both components.
In particular, your QA system will have two separate components, one for each of the two challenges described in the following sections. Your project will be graded on your system's performance in both challenges, as well as on your report and presentation.
For this question answering (QA) project, we will use the "CBC Reading Comprehension Corpus". This corpus consists of 125 news stories, each accompanied by a set of approximately 6-10 "Reading Comprehension" questions of the Who, What, When, Where, How, and Why variety. The news stories themselves were obtained from the "CBC 4 Kids" website, hosted by the Canadian Broadcasting Corporation. Our corpus contains answer keys for the first challenge, as well as a subset of the keys for the second challenge.
In the first challenge, your system will take an input document containing (1) an article and (2) a list of query questions. For each query, your system must find the sentence within the given article that best answers the question.
The questions and an answer key for the first challenge were added by the MITRE Corporation, and are in the style of actual reading comprehension tests that are given to grade school children in the United States. A sample CBC story and reading comprehension test is shown below.
Hockey Star's Arts Donation Sours
January 22, 1999
Ever heard the expression "don't look a gift horse in the mouth?"
It means not to be too critical if you get something for free.
Well, that is exactly what the federal government is doing concerning a million dollar gift to the National Arts Centre in Ottawa, Ontario.
And it appears the government does not like what it has found.
To continue the metaphor, it look like this horse has some major dental problems.
The government has been looking at a charitable donation by Ottawa Senators hockey star Alexi Yashin.
Earlier this year he promised to give one million dollars to the National Arts Centre, a concert hall where people go to see plays and dance and to hear live music.
Many Canadians were delighted to see that Alexi Yashin was donating so much money.
Mr. Yashin makes a salary of more than three million dollars a season.
It is very expensive to put on live performances and the National Arts Centre has recently had to raise ticket prices and cut back on performances because of a lack of funds.
It was the most famous donation ever to the National Arts Centre.
Mr. Yashin's popularity soared.
But then this week Mr. Yashin's donation turned sour.
After giving the Centre $200,000 Alexi Yashin made an about face.
He decided not to give the other $800,000 he had promised.
He says his decision was for personal reasons.
But it now looks like Alexi Yashin was not really being honest.
It seems he actually changed his mind because federal Auditor General Denis Desautels told him he was trying to break the law.
Mr. Yashin originally announced he was a lover of the arts.
He said that as a well paid hockey player he wanted to help the National Arts Centre put on new performances.
But behind the scenes his plan was not to give the Arts Centre one million dollars.
Instead, in a secret agreement, the National Arts Centre would have hired Mr. Yashin's parents at $85,000 a year.
They would have been paid out of Alexi Yashin's yearly $200,000 donation.
What's more, his parents would not have to actually work.
This was a way for Alexi Yashin to give money to his parents while illegally saving thousands of dollars in taxes.
As well, a lawyer working for Mr. Yashin was to get $15,000 out of the remaining $115,000.
So in fact the million dollar donation would really only be half of that.
It would have been a great public relations victory.
Mr. Yashin would look like he was being very generous.
But behind the scenes he was really much less so.
Yesterday evening the Ottawa Senators were playing in Boston.
After the game reporters went to the locker room to ask Alexi Yashin what was going on.
But he refused to talk to reporters about his "personal reasons" for cancelling his donation.
He said simply that "I know I didn't do anything illegal.
I know I didn't do anything wrong.
I can't control what they say.
It's a free country."
He also said he wanted to focus on hockey and nothing else.
The other players and coach of the Ottawa Senators would not comment either.
They said it was Alexi's personal affair and not important to them.
Many Ottawa-area hockey fans are deeply disappointed with what is going on.
"People here are puzzled.
They feel let down," says Rick Soweita, who owns a sports bar.
"He benefited from the publicity and now he's got to own up."
<QUESTIONS>
<Q1> What reason did Alexi Yashin give for backing out of his promised donation?
<Q2> How much does Alexi Yashin earn as a hockey player?
<Q3> Who does Alexi Yashin play for?
<Q4> How much money did Alexi Yashin actually donate to the National Arts Centre?
<Q5> What do Alexi Yashin's teammates think about this donation gone bad?
<Q6> How would Alexi Yashin himself benefit from his donation scheme?
<Q7> Why do people go to the National Arts Centre?
<Q8> How would Alexi's parents have benefited from his donation to the National Arts Centre?
<Q9> Where did reporters question Alexi Yashin?
Figure 1. Story example
All stories have been split into sentences for you (one sentence per line) using the MXTERMINATOR sentence splitter developed by Adwait Ratnaparkhi. Paragraphs from the original story are separated by an empty line. The first line in the file is the title of the story and the second is its date.
For each set of stories (training set, test set 1, and test set 2) there is a file answerkey.txt in the corresponding training or test set directory. This file contains the answers to all the questions from all stories in that directory. Here is the part of the answer key file that refers to the story above:
<FILE>1999-W04-5.qa
<Q_NUMBER>1
<A_LINE>26
<Q_TXT>What reason did Alexi Yashin give for backing out of his promised donation?
<A_TXT>He says his decision was for personal reasons.
<Q_NUMBER>2
<A_LINE>15
<Q_TXT>How much does Alexi Yashin earn as a hockey player?
<A_TXT>Mr. Yashin makes a salary of more than three million dollars a season.
<Q_NUMBER>3
<A_LINE>11
<Q_TXT>Who does Alexi Yashin play for?
<A_TXT>The government has been looking at a charitable donation by Ottawa Senators hockey star Alexi Yashin.
<Q_NUMBER>4
<A_LINE>23
<Q_TXT>How much money did Alexi Yashin actually donate to the National Arts Centre?
<A_TXT>After giving the Centre $200,000 Alexi Yashin made an about face.
<Q_NUMBER>5
<A_LINE>59
<Q_TXT>What do Alexi Yashin's teammates think about this donation gone bad?
<A_TXT>They said it was Alexi's personal affair and not important to them.
<Q_NUMBER>6
<A_LINE>38,45,19
<Q_TXT>How would Alexi Yashin himself benefit from his donation scheme?
<A_TXT>This was a way for Alexi Yashin to give money to his parents while illegally saving thousands of dollars in taxes. -OR- Mr. Yashin would look like he was being very generous. -OR- Mr. Yashin's popularity soared.
<Q_NUMBER>7
<A_LINE>12
<Q_TXT>Why do people go to the National Arts Centre?
<A_TXT>Earlier this year he promised to give one million dollars to the National Arts Centre, a concert hall where people go to see plays and dance and to hear live music.
<Q_NUMBER>8
<A_LINE>34
<Q_TXT>How would Alexi's parents have benefited from his donation to the National Arts Centre?
<A_TXT>Instead, in a secret agreement, the National Arts Centre would have hired Mr. Yashin's parents at $85,000 a year.
<Q_NUMBER>9
<A_LINE>49,48
<Q_TXT>Where did reporters question Alexi Yashin?
<A_TXT>After the game reporters went to the locker room to ask Alexi Yashin what was going on. -OR- Yesterday evening the Ottawa Senators were playing in Boston.
</FILE>
...
Figure 2. Part of the answer key.
As you can see, for each question there are four lines that describe the question and the answer. The first line, preceded by the <Q_NUMBER> tag, gives the question number. The second line, preceded by <A_LINE>, gives the line number(s) of the sentence(s) that answer the question (remember that there is at most one sentence per line). Lines in the files are numbered starting from 1, and empty lines are also counted. If there is more than one acceptable answer for a question, the line numbers of the answers are separated by commas. The last two lines are for ease of reading only: one contains the question text and the other the sentence(s) that answer the question (separated by " -OR- ").
Note that in some cases the answer to a question may come from the title or the date of the story, so do not strip off those lines (for example, answers to WHEN questions are often found in the story's date line).
Judging answers is subjective in nature, so you may sometimes disagree with MITRE's decisions in the answer key. But people will never completely agree on these things, and it is necessary to choose some set of answers for evaluation purposes, so we will use MITRE's judgements as "The Truth".
The answer sentence often shares many words with the question. Consider the question "Where is South Queens Junior High School located?": we would expect the answer sentence to mention the word "school". Therefore, to find the answer sentence within a document, we can scan over all sentences and count the words each one shares with the question. The sentence with the greatest number of overlapping words is likely the answer sentence.
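This word-overlap baseline can be sketched in a few lines of Python. The function and variable names below are our own illustration, not part of the provided tools, and the tokenization is deliberately crude:

```python
import re

def tokenize(text):
    """Lowercase a sentence and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9$']+", text.lower()))

def best_answer_sentence(question, sentences):
    """Return the 1-based line number of the sentence sharing the most
    words with the question (earliest line wins ties)."""
    q_words = tokenize(question)
    best_line, best_overlap = 1, -1
    for line_no, sent in enumerate(sentences, start=1):
        overlap = len(q_words & tokenize(sent))
        if overlap > best_overlap:
            best_line, best_overlap = line_no, overlap
    return best_line

sentences = [
    "Hockey Star's Arts Donation Sours",
    "Mr. Yashin makes a salary of more than three million dollars a season.",
    "He says his decision was for personal reasons.",
]
print(best_answer_sentence("How much does Mr. Yashin earn?", sentences))  # 2
```

Obvious refinements include removing stopwords, stemming, and weighting rare words more heavily than common ones.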
In the second challenge, your system will need to extract the phrase from an answer sentence that directly answers the given question. In this phase, we only consider WHEN, WHO, and WHERE questions. We extracted a subset of questions and answers from the CBC corpus and manually annotated the phrases that best answer the questions.
You will receive a file containing a list of question and answer sentences.
FILE: 1999-W02-5.txt Q_NUMBER: 1
Where is South Queens Junior High School located?
A middle school in Liverpool , Nova Scotia is pumping up bodies as well as minds .
FILE: 1999-W02-5.txt Q_NUMBER: 2
Who is the principal of South Queens Junior High School?
Principal Betty Jean Aucoin says the club is a first for a Nova Scotia public school .
FILE: 1999-W02-5.txt Q_NUMBER: 4
When did the metal shop close?
The school has turned its one-time metal shop - lost to budget cuts almost two years ago - into a money-making professional fitness club .
FILE: 1999-W02-5.txt Q_NUMBER: 5
Who runs the club?
The club , operated by a non-profit society made up of school and community volunteers , has sold more than 30 memberships and hired a full-time co-ordinator .
FILE: 1999-W04-5.txt Q_NUMBER: 1
What reason did Alexi Yashin give for backing out of his promised donation ?
He says his decision was for personal reasons .
FILE: 1999-W04-5.txt Q_NUMBER: 3
Who does Alexi Yashin play for?
The government has been looking at a charitable donation by Ottawa Senators hockey star Alexi Yashin .
Figure 3. Phrase level Q/A example
The input file contains blocks separated by blank lines. Each block contains one Q/A sentence pair. The first line gives the filename and question number of the pair, which together uniquely identify it. This line is followed by two lines, corresponding to the question and the answer sentence.
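Reading this block format might be sketched as follows (parse_pairs is an illustrative name of our own, not a provided tool):

```python
def parse_pairs(text):
    """Parse the Part 2 input into (header, question, answer_sentence)
    triples; blocks of three lines are separated by blank lines."""
    pairs = []
    for block in text.strip().split("\n\n"):
        lines = [ln for ln in block.split("\n") if ln.strip()]
        if len(lines) == 3:
            pairs.append(tuple(lines))
    return pairs

sample = (
    "FILE: 1999-W02-5.txt Q_NUMBER: 1\n"
    "Where is South Queens Junior High School located?\n"
    "A middle school in Liverpool , Nova Scotia is pumping up bodies as well as minds .\n"
    "\n"
    "FILE: 1999-W04-5.txt Q_NUMBER: 3\n"
    "Who does Alexi Yashin play for?\n"
    "The government has been looking at a charitable donation by Ottawa Senators hockey star Alexi Yashin .\n"
)
print(len(parse_pairs(sample)))  # 2
```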
Your system needs to scan over all Q/A sentence pairs. For each sentence pair, your system should try to locate the phrase within the answering sentence that best answers the question.
For each set of Q/A sentence pairs, there is a key file containing the solutions to the phrase-level Q/As. For each Q/A sentence pair, the file gives the start and end token indices of the answer phrase. Take the first Q/A sentence pair in Figure 3 as an example: the answer to the question "Where is South Queens Junior High School located?" is Liverpool, which is the 5th word of the answer sentence. Your program therefore needs to output "5 5", meaning that the answer phrase starts at the 5th word and ends at the 5th word.
The answer sentences are already tokenized; simply split them on spaces. Treat every resulting token as a word (including dashes ("-"), commas (",") etc.). You may assume that answer phrases always start and end at word boundaries.
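Under these conventions, mapping an answer phrase to its "start end" output can be sketched as follows (the helper name is our own, for illustration only):

```python
def phrase_indices(answer_sentence, phrase):
    """Return the 1-based inclusive (start, end) token indices of `phrase`
    in the space-tokenized answer sentence, or None if it is not found."""
    tokens = answer_sentence.split()
    target = phrase.split()
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            return (i + 1, i + len(target))
    return None

sent = "A middle school in Liverpool , Nova Scotia is pumping up bodies as well as minds ."
print(phrase_indices(sent, "Liverpool"))    # (5, 5)
print(phrase_indices(sent, "Nova Scotia"))  # (7, 8)
```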
Below we show the expected output for the input file in Figure 3.
FILE: 1999-W02-5.txt Q_NUMBER: 1
Where is South Queens Junior High School located?
A middle school in Liverpool , Nova Scotia is pumping up bodies as well as minds .
5 5
Answer: Liverpool
FILE: 1999-W02-5.txt Q_NUMBER: 2
Who is the principal of South Queens Junior High School?
Principal Betty Jean Aucoin says the club is a first for a Nova Scotia public school .
2 4
Answer: Betty Jean Aucoin
FILE: 1999-W02-5.txt Q_NUMBER: 4
When did the metal shop close?
The school has turned its one-time metal shop - lost to budget cuts almost two years ago - into a money-making professional fitness club .
15 17
Answer: two years ago
FILE: 1999-W02-5.txt Q_NUMBER: 5
Who runs the club?
The club , operated by a non-profit society made up of school and community volunteers , has sold more than 30 memberships and hired a full-time co-ordinator .
6 15
Answer: a non-profit society made up of school and community volunteers
FILE: 1999-W04-5.txt Q_NUMBER: 1
What reason did Alexi Yashin give for backing out of his promised donation ?
He says his decision was for personal reasons .
7 8
Answer: personal reasons
FILE: 1999-W04-5.txt Q_NUMBER: 3
Who does Alexi Yashin play for?
The government has been looking at a charitable donation by Ottawa Senators hockey star Alexi Yashin .
11 12
Answer: Ottawa Senators
Figure 4. Expected output example.
As you can see, the key for each Q/A sentence pair contains 5 lines, and keys are separated by a blank line. The first line gives the document filename and the question number. The following two lines are the question and answer sentences. These are followed by one line containing two integers: the start and end indices (starting from 1) of the answer phrase. The final line contains the identified answer phrase; it is there for your debugging purposes only -- our evaluation script skips it and considers only the line with the start and end indices.
For WHEN, WHO, and WHERE questions, the answer phrase is often a noun phrase, so you may consider using a parser to find the noun phrases within a sentence. For WHEN questions, the answering noun phrase typically describes a time; for WHO questions, it is often a name; for WHERE questions, it is often a location. Named entity recognizers can identify such phrases for you.
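As a very rough starting point, and only as a sketch (a real system should use a parser or a named entity recognizer; these hand-written patterns of ours will miss many cases), one can match the question type against simple regular expressions:

```python
import re

# Hand-written patterns keyed by question type -- purely illustrative,
# and far weaker than a real parser or named entity recognizer.
PATTERNS = {
    "WHO":   r"(?:[A-Z][a-z]+ )+[A-Z][a-z]+",            # capitalized name span
    "WHERE": r"\b(?:in|at|to) ((?:[A-Z][a-z]+[ ,]*)+)",  # place after a preposition
    "WHEN":  r"\b(?:\d{4}|today|yesterday|(?:\w+ )?(?:years?|months?|weeks?|days?) ago)\b",
}

def extract_candidate(question, answer_sentence):
    """Return a candidate answer phrase for WHO/WHERE/WHEN questions,
    or None when no pattern applies or matches."""
    qtype = question.split()[0].upper()
    pattern = PATTERNS.get(qtype)
    if pattern is None:
        return None
    m = re.search(pattern, answer_sentence)
    if m is None:
        return None
    return (m.group(1) if m.groups() else m.group(0)).strip(" ,")

print(extract_candidate("When did the metal shop close?",
                        "The school has turned its one-time metal shop - lost to "
                        "budget cuts almost two years ago - into a club ."))  # two years ago
```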
You will be using three sets of data at different points in the project. The project will involve three phases:
You will be given the Training Set to use in developing your Q/A systems. You may use these stories and the answer keys in any way that you wish.
First, there will be a preliminary evaluation of everyone's Q/A systems. Each team will hand in the code for their Q/A system and we will run the Q/A systems on the stories in Test Set #1. We will score the accuracy of each system and post the results on CourseWeb. The results of the preliminary evaluation will not count toward your final project grade, but should be useful for assessing your progress and seeing how well your system works compared to others in the class. Once the preliminary evaluation is over, we will make Test Set #1 available to everyone.
By Mar 26, we expect each team to have a working Q/A system! It might not work well and may still be missing some components that you plan to incorporate, but it should be able to (1) process a story and produce an answer for each question, and (2) find the phrase within a sentence that best answers a question.
Participation in the preliminary evaluation is mandatory. Failure to participate will result in a 20% deduction off your final project grade. This policy is to ensure that everyone is making adequate progress.
At this point, each team will hand in the final code for their Q/A system. We will run the Q/A systems on the stories in both Test Set #1 and Test Set #2. Your final project grade will be based on the performance of your final Q/A system on both test sets.
The purpose of evaluating your systems on both test sets is to balance specificity with generality. You will have several weeks to try to get your Q/A systems to perform well on Test Set #1. Hopefully, everyone will be able to do fairly well on that test set. Test Set #2 will be a blind test set that no one will see until the final evaluation. A system that uses general techniques should work just as well on Test Set #2 as Test Set #1. But a system that has lots of hacks and tweaks based on Test Set #1 probably will perform very poorly on Test Set #2.
WARNING: You will be given the answer keys for Test Set #1, but your system is not allowed to use them when answering questions! The answer keys are being distributed only to show you what the correct answers should be, and to allow you to evaluate your Q/A systems automatically if you wish. Your system should use general techniques that can apply to a wide variety of texts.
Date | Project Related Events | Released Resources | Need to Turn in
---|---|---|---
Mar 3 | Teams are formed. Please send us your teams' names. Note that after each round of evaluation, we will publish the evaluation results on my website. | Training data for both challenges, evaluation scripts | |
Mar 26 | Start of preliminary evaluation on Test Set #1. Test set released. | Test Set #1 | |
Mar 27 | Please send your code, a one-page description of your system, and your system's output to Huichao's email address. | | System's code (preliminary), one-page description of the system, description of each team member's contribution, system's output (by 11pm) |
Mar 28 | Release of keys for Test Set #1. | Test Set #1's key | |
Apr 9 | Project code submission. Also please let us know your available time slots for the project demo. | | Code (final), available time slots (by 11pm) |
Apr 10 - 12 | Project demo -- final evaluation on Test Set #2. | Test Set #2 | |
Apr 18 | Report due. | | Report (by 11pm) |
Apr 26 | Project presentations, from 8am to 9:50am. | | |
We distribute the data/scripts via courseweb on the release dates.
IMPORTANT: These on-line materials cannot be distributed to anyone else or used for any purpose other than this class. If you wish to use this data for other purposes, please contact me and I will tell you what you need to do. The news stories are copyrighted by CBC/SRC, and were obtained by the MITRE Corporation for research purposes only.
Your Q/A system should accept two command line parameters. Running your program should look like:
myQAprojectPart1 input_filename outputfile_name
The first parameter is the name of the input file. The first line of the input file will be a directory path and all subsequent lines will be story filenames. Your Q/A system should then process each story file in the list from the specified directory. A sample input file is shown below; it indicates that 6 story files should be processed, all found in the directory /afs/cs.pitt.edu/usr0/litman/public/cs2731/TrainingSet/. You will need to change this directory to the path where your data is actually stored.
/afs/cs.pitt.edu/usr0/litman/public/cs2731/TrainingSet/
1999-W02-5.txt
1999-W04-5.txt
1999-W05-5.txt
1999-W07-5.txt
1999-W08-5.txt
1999-W10-1.txt
Each story file will be formatted like Figure 1. The first line is the story title. The second is the date of the story followed by two empty lines. After that, the main story begins with one sentence per line. Original paragraphs are separated by an empty line.
At the end of each story is a set of 5 to 10 questions. You can identify the question section of the file by looking for a line that contains only the <QUESTIONS> tag. After that, each line contains a question: it starts with a <Qn> tag (where n is the question number) followed by the question text.
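Reading a story file under this format might be sketched as follows. parse_story is our own illustrative helper, not a provided tool; note that answer line numbers are 1-based and count empty lines, so the raw lines are kept as-is:

```python
import re

def parse_story(text):
    """Split a story file into its raw lines (preserving 1-based numbering,
    empty lines included) and a {number: text} dict of its questions."""
    lines = text.split("\n")
    story_lines, questions = lines, {}
    for i, line in enumerate(lines):
        if line.strip() == "<QUESTIONS>":
            story_lines = lines[:i]
            for qline in lines[i + 1:]:
                m = re.match(r"<Q(\d+)>\s*(.*)", qline.strip())
                if m:
                    questions[int(m.group(1))] = m.group(2)
            break
    return story_lines, questions

sample = "\n".join([
    "Hockey Star's Arts Donation Sours",   # line 1: title
    "January 22, 1999",                    # line 2: date
    "",
    "",
    "Mr. Yashin's popularity soared.",     # line 5
    "<QUESTIONS>",
    "<Q1> Who does Alexi Yashin play for?",
])
story, qs = parse_story(sample)
print(qs[1])  # Who does Alexi Yashin play for?
```

The sentence on line n of the file is then story[n - 1].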
The second parameter is the name of the output file. The output file will contain your answers for all of the stories specified in the input file. The output of your system should be formatted as the answer key. More specifically, it should have the following structure:
<FILE>file_name
<Q_NUMBER>1
<A_LINE>answer_line
<Q_TXT>question_text
<A_TXT>answer_sentence
<Q_NUMBER>2
<A_LINE>answer_line
<Q_TXT>question_text
<A_TXT>answer_sentence
...
</FILE>
<FILE>file_name
<Q_NUMBER>1
<A_LINE>answer_line
<Q_TXT>question_text
<A_TXT>answer_sentence
...
The <Q_TXT> and <A_TXT> lines are optional and you don't need to include them in the output (though you might want to, so that you can check your system's answers faster while developing). You will be graded by matching your answer line against the one in the answer key.
Please make sure that each <Q_NUMBER> line is followed by the corresponding <A_LINE> line. Also, do not forget to end each <FILE> section with a corresponding </FILE>.
You can have as many empty lines as you like in your output (see, for example, Figure 2). Just make sure that the <FILE>, <Q_NUMBER>, <A_LINE> and </FILE> lines are present.
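Producing the minimal required output (without the optional <Q_TXT>/<A_TXT> lines) might look like this sketch, where format_answers is an illustrative name of our own:

```python
def format_answers(answers):
    """Render answers in the required output format. `answers` maps each
    story filename to a {question_number: 1-based answer line} dict."""
    out = []
    for filename, per_question in answers.items():
        out.append("<FILE>" + filename)
        for q_num in sorted(per_question):
            out.append("<Q_NUMBER>%d" % q_num)
            out.append("<A_LINE>%d" % per_question[q_num])
        out.append("</FILE>")
    return "\n".join(out) + "\n"

print(format_answers({"1999-W04-5.txt": {1: 26, 2: 15}}))
```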
The performance of each Q/A system will be scored based on the percentage of questions answered correctly. An answer will be scored as correct if it exactly matches one of the sentence lines marked as correct for the corresponding question in the answer key (see the Tools section for the grader).
The second part of your system should also accept two command line parameters. Running your program should look like:
myQAprojectPart2 input_filename outputfile_name
The input to the second challenge is a single file with the same format as Figure 3. Your program must output a file with the same format as Figure 4:
FILE: filename Q_NUMBER: q_number
question_text
answer_text
start end
Answer: the_answer_phrase
FILE: filename Q_NUMBER: q_number
question_text
answer_text
start end
Answer: the_answer_phrase
The performance of the phrase-level Q/A system is determined by three evaluation metrics: precision, recall, and the word-level F1 score. For each question, we count the words shared by the gold answer phrase (QA_gold) and the predicted phrase (QA_auto), and divide by the number of words in the gold phrase (for recall) or in the predicted phrase (for precision); these per-question scores are then averaged over all N questions:
sum = 0
for each question Q:
    sum += recall(Q)     # percentage of the words in QA_gold that are in QA_auto
recall = sum / N

sum = 0
for each question Q:
    sum += precision(Q)  # percentage of the words in QA_auto that are in QA_gold
precision = sum / N

f_measure = 2 * precision * recall / (precision + recall)
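Assuming answers are compared as sets of token positions, the macro-averaged metrics above can be sketched in Python 3 as follows (the provided grading script may differ in detail):

```python
def span_tokens(start, end):
    """Token positions covered by a 1-based inclusive (start, end) span."""
    return set(range(start, end + 1))

def evaluate(gold_spans, auto_spans):
    """Macro-averaged precision, recall and F1 over N questions, where each
    answer is a 1-based inclusive (start, end) pair of token indices."""
    n = len(gold_spans)
    precision = recall = 0.0
    for g, a in zip(gold_spans, auto_spans):
        gold, auto = span_tokens(*g), span_tokens(*a)
        overlap = len(gold & auto)
        recall += overlap / len(gold)     # gold words that were found
        precision += overlap / len(auto)  # predicted words that are correct
    precision, recall = precision / n, recall / n
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Gold phrase "two years ago" is tokens 15-17; a prediction of 15-18
# overlaps on all 3 gold tokens but adds one extra word.
print(evaluate([(15, 17)], [(15, 18)]))  # precision 0.75, recall 1.0
```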
We list potentially useful resources on a separate page, Useful Resources. That page may be updated frequently, so we recommend bookmarking it.
To run the evaluation tools, you will need Perl and Python 2 installed on your computer. Below, perl_path and python2_path refer to the installation paths of these two languages.
perl_path/perl grader.pl input_filename answerkey_filename your_answer_filename
This is the script that will be used for grading Part 1. input_filename refers to the same file you use as input to your Q/A system. The second command line parameter must be the answer key file name (the one provided to you), and the third is your answer file name (if you swap these parameters you will get bogus results, so be careful!). The script's output is self-explanatory.
python2_path/python evaluate_part2.py your_output_file gold_standard_output_file
This script will output the three evaluation scores for part 2, as we described in the Evaluation section.
perl_path/perl marker.pl input_filename answer_filename tag_name new_extension
This script generates new story files by marking the answers from answer_filename with <tag_name question_number> ... </tag_name question_number> tags in the original story. input_filename refers to the same file you use as input to your Q/A system. answer_filename refers to an answer file (either your answers or the answer key). The script processes every story, looks up in the answer file the lines that contain answers, and marks them with the appropriate tags. Each annotated story is saved in a file with the same name but with the new_extension extension (if you use the txt extension it will overwrite the original story).
You might find this tool useful by running it twice: first with the answer key and the tag name CORRECT_ANS, and then on the annotated files with your answers and the tag MY_ANS. This way you will get stories with both the correct answers and your answers marked (hopefully overlapping as much as possible :-)).
Remark: when the marker uses an answer file to annotate a story, the match between the story file name and the file name in the answer file (whatever follows <FILE>) is done without taking the extension into account. This helps when you want to apply the marker twice to annotate with both sets of answers.
The project contributes 20 points toward your final grade. Each project will be graded according to the following criteria.
To compute the final grade for the project, each Q/A system will be ranked relative to the other systems in the class. For example, if your system ranks 1st on Test Set #1 sentence-level performance, 2nd on Test Set #1 phrase-level performance ... then your average ranking would be 1*15% + 2*15% + ....
Note that the final grading is on a relative, not an absolute, scale. However, this does not mean that the team with the highest average ranking automatically gets an A (e.g., if the best score was no better than chance performance), or that the lowest scoring team fails. If every team produces a good and interesting system, we will be happy to give every team an A.
The class will be divided into 3-person or 2-person teams for the project. You may form your own team if you know people with whom you'd like to work. Otherwise, we can randomly assign you to a team.
To help us evaluate each team member's contribution to the project, please include in your reports a description of: (1) how the project was split among team members, (2) what each member did, and (3) what each member plans to do. NOTE: please include this description in both the preliminary submission on Mar 27 and the final report on Apr 18.
Based on your contribution, we may assign you grades that are different from other team members.
We suggest that all teams come up with specifications early on and split the tasks among team members. Meet regularly to assess progress and brainstorm about what to do next.
We will also post the leaderboard on this page.
NLP is not a solved problem, and effective QA is HARD! Randomly choosing a sentence will yield extremely low accuracy, so anything higher means that you are doing something right!
This project and these instructions are based on Professor Ellen Riloff's course project at the University of Utah. Thanks to MITRE for the use of the CBC data.