The goal of this task is to assess student answers to exercise questions, a capability useful in tutorial dialogue and e-learning systems. Specifically, given a question, a known correct "reference answer", and a 1- or 2-sentence student answer, the goal is to determine the accuracy of the student's answer. A sample question and answer is shown below.
<question qtype="Q_EXPLAIN_GENERIC" id="VOLTAGE_DEFINE_Q"
stype="QUESTION"
module="FaultFinding">
<questionText>What is voltage?</questionText>
<referenceAnswers>
<referenceAnswer category="BEST" id="answer244" fileID="VOLTAGE_DEFINE_Q_ANS1">Voltage is the difference in electrical
states between two terminals</referenceAnswer>
...
</referenceAnswers>
<studentAnswers>
<studentAnswer id="FaultFinding.VOLTAGE_DEFINE_Q.sbj3-l1.qa213" dialogue_id="sbj3-l1"
accuracy="unknown"
answerMatch="answer244"
count="1">is the difference between two terminals</studentAnswer>
...
</studentAnswers>
</question>
Figure 1. Question and Student Answer example
All question elements contain an ID unique to the question and a module name, which defines the general "topic" of the question.
The questionText element contains the full text of the question as text content.
The referenceAnswers element contains an arbitrary number of referenceAnswer elements, one for each correct answer to the question. The data from the SciEntsBank corpus lists exactly one such answer for every question, but the Beetle corpus may have multiple correct answers, each with a category attribute (BEST, GOOD, MINIMAL, or KEYWORD) and a fileID attribute that allows the answer to be referenced from a studentAnswer element.
The studentAnswers element contains an arbitrary number of studentAnswer elements, one for each student attempt to answer the question.
Each studentAnswer element has text content, which is the text of the student answer, and an accuracy attribute containing one of three values: correct, contradictory (3-way data only), or unknown. The task will be run in the following variants:
- 3-way task, where the system is required to classify the student answer according to one of the following judgments:
- 2-way task, where the system is required to classify the student answer according to one of the following judgments:
The following two datasets will be used in the task:
Element IDs and naming conventions
The format of file names and of id attributes on various elements is described below. Note that there should never be a need to parse names and IDs: all the data is linked through direct matching, and any information present in names can be read off the corresponding attributes. However, we used consistent naming conventions to help with readability.
The student answer IDs in the Beetle data are in the form
The student answer IDs in the SciEntsBank corpus follow the same convention, except that the attemptID is omitted since each student provided only one answer per question.
The project will involve three phases:
You will be given the Training Set to use in developing your SRA systems. You may use these questions and the answers in any way that you wish.
At this point, each team will hand in the final code for their SRA system. Your final project grade will be based on the performance of your SRA system on both the training and test sets.
The purpose of evaluating your systems on both sets of data is to balance specificity with generality. You will have several weeks to try to get your SRA systems to perform well on the Training Set, and hopefully everyone will be able to do fairly well on that set. The Test Set will be a blind set that no one will see until the final evaluation. A system that uses general techniques should work just as well on both sets, but a system with lots of hacks and tweaks based on the Training Set will probably perform very poorly on the Test Set.
WARNING: You will be given the answers for the Training Set, but your system is not allowed to use them when answering questions! The answer keys are distributed only to show you what the correct answers should be, and to allow you to evaluate your SRA systems automatically if you wish. Your system should use general techniques that can apply to a wide variety of texts.
Your SRA system should accept two command line parameters. Running your program should look like:
mySRAproject PathToDataDirectory outputfile_name
The first parameter is the directory containing the core data, which can be found in the training and testing data directories. Your SRA system should process each file in the specified directory. Each file in the directory is formatted as in Figure 1.
Your program's output file (and the gold files it is compared against) must follow this format: the first line is a header with column names, and the remaining lines are data rows. The first column must contain utterance IDs, and the last column must contain the system output classes. All lines must be tab-separated and contain the same number of fields as the header. Note that the files are sorted by utterance ID internally, so the order of entries in the files is ignored.
A sample output file is shown below (fields are tab-separated):
ID	Fold	Actual	Predicted
FaultFinding-BURNED_BULB_LOCATE_EXPLAIN_Q.sbjb12-l1.qa186	1	partially_correct_incomplete	correct
SwitchesBulbsParallel-BURNED_BULB_PARALLEL_WHY_Q.sbjb22-l2.qa87	1	partially_correct_incomplete	contradictory
SwitchesBulbsParallel-HYBRID_BURNED_OUT_WHY_Q3.sbj19-l2.qa143	1	partially_correct_incomplete	partially_correct_incomplete
SwitchesBulbsSeries-CONDITIONS_FOR_BULB_TO_LIGHT.sbj14-l1.qa44	1	partially_correct_incomplete	contradictory
SwitchesBulbsParallel-OPT2_EXPLAIN_Q.sbjb2-l2.qa115	1	partially_correct_incomplete	partially_correct_incomplete
SwitchesBulbsSeries-SHORT_CIRCUIT_EXPLAIN_Q_5.sbj11-l1.qa74	1	partially_correct_incomplete	correct
FaultFinding-VOLTAGE_INCOMPLETE_CIRCUIT_2_Q.sbj27-l1.qa236	1	partially_correct_incomplete	partially_correct_incomplete
FaultFinding-VOLTAGE_GAP_EXPLAIN_WHY4.sbjb37-l1.qa138	1	partially_correct_incomplete	partially_correct_incomplete
The scripts can be found in:
evaluation
./evaluation.sh [-mode 5way|3way|2way] <system> <gold>
The script takes two positional parameters: the system output (first) and the gold output. The evaluation results are printed on standard output. If -mode is not specified, it defaults to the 5-way task.
e.g.,
Requirements
To run the baseline benchmark, you have to install the dependencies.
The schedule for the projects is shown below:
By November, we expect each team to have a working SRA system! It might not work well and may still be missing components that you plan to incorporate, but it should be able to process the student answers in each file and judge their accuracy.
Participation in the preliminary evaluation is mandatory. Failure to participate will result in a 10% deduction off your final project grade. This policy is to ensure that everyone is making adequate progress.
Each project will be graded according to the following criteria:
The grade for the report and presentation will be based on clarity, as well as on the creativity and ambition shown in the design of your system. Thus, if you incorporate novel ideas and/or complex algorithms, I will take that into account. As in Olympic scoring, difficulty can in effect boost your raw performance scores.
Note that the final grading is on a relative, not an absolute, scale. However, this does not mean that the team with the highest average ranking automatically gets an A (e.g., if the best score was no better than chance performance), or that the lowest scoring team fails. If every team produces a good and interesting system, I will be happy to give every team an A.
NLP is not a solved problem, and effective SRA'ing is HARD! Randomly choosing an answer will yield low accuracy, so anything higher means that you are doing something good!!
This project and these instructions are based on SemEval.