Project Description (CS 2731 / ISSP 2230), Fall 2013


Task Description

The goal of the task is to assess student answers to exercise questions that can be useful in tutorial dialogue and/or e-learning systems. Specifically, given a question, a known correct "reference answer" and a 1- or 2-sentence student answer, the goal is to determine the student's answer accuracy. A sample question and answer is shown below.

<question qtype="Q_EXPLAIN_GENERIC" id="VOLTAGE_DEFINE_Q"
          stype="QUESTION"
          module="FaultFinding">
    <questionText>What is voltage?</questionText>
    <referenceAnswers>
        <referenceAnswer category="BEST" id="answer244" fileID="VOLTAGE_DEFINE_Q_ANS1">Voltage is the difference in electrical 
        states between two terminals</referenceAnswer>
        ...
    </referenceAnswers>
    <studentAnswers>
        <studentAnswer id="FaultFinding.VOLTAGE_DEFINE_Q.sbj3-l1.qa213" dialogue_id="sbj3-l1"
                     accuracy="unknown"
                     answerMatch="answer244"
                     count="1">is the difference between two terminals</studentAnswer>
        ...
    </studentAnswers>
</question>

Figure 1. Question and Student Answer example

All question elements contain an ID unique to the question and a module name, which defines the general "topic" of the question.

The questionText element contains the full text of the question as text content.

The referenceAnswers element contains an arbitrary number of referenceAnswer elements, one for each correct answer to the question. The SciEntsBank corpus lists exactly one reference answer for every question, but the Beetle corpus may have multiple correct answers, each with a category attribute (BEST, GOOD, MINIMAL, or KEYWORD) and a fileID attribute that allows the answer to be referenced from a studentAnswer element.

The studentAnswers element contains an arbitrary number of studentAnswer elements, one for each student attempt to answer the question.

Each studentAnswer element has text content, which is the text of the student answer, and an accuracy attribute containing one of three values: correct, contradictory (3-way data only), and unknown.
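The format above can be read with Python's standard xml.etree.ElementTree. The sketch below parses an inline copy of the Figure 1 sample and resolves the student answer's answerMatch attribute against the reference answers' id attributes (in the sample, answerMatch="answer244" points at the reference answer with id="answer244"):

```python
import xml.etree.ElementTree as ET

# Inline copy of the Figure 1 sample (whitespace condensed).
SAMPLE = """<question qtype="Q_EXPLAIN_GENERIC" id="VOLTAGE_DEFINE_Q"
          stype="QUESTION" module="FaultFinding">
    <questionText>What is voltage?</questionText>
    <referenceAnswers>
        <referenceAnswer category="BEST" id="answer244"
            fileID="VOLTAGE_DEFINE_Q_ANS1">Voltage is the difference in
            electrical states between two terminals</referenceAnswer>
    </referenceAnswers>
    <studentAnswers>
        <studentAnswer id="FaultFinding.VOLTAGE_DEFINE_Q.sbj3-l1.qa213"
            dialogue_id="sbj3-l1" accuracy="unknown" answerMatch="answer244"
            count="1">is the difference between two terminals</studentAnswer>
    </studentAnswers>
</question>"""

root = ET.fromstring(SAMPLE)
# Index reference answers by their id attribute so student answers can
# be linked to them via answerMatch.
refs_by_id = {r.get("id"): " ".join(r.text.split())
              for r in root.iter("referenceAnswer")}
for ans in root.iter("studentAnswer"):
    matched_ref = refs_by_id.get(ans.get("answerMatch"))
    print(ans.get("id"), "|", ans.get("accuracy"), "|", matched_ref)
```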

The task can be performed at different levels of granularity, namely:

- 5-way task, where the system is required to classify the student answer according to one of the following judgments: correct, partially correct but incomplete, contradictory, irrelevant, or non-domain (following the SemEval definitions).

- 3-way task, where the system is required to classify the student answer according to one of the following judgments: correct, contradictory, or incorrect.

- 2-way task, where the system is required to classify the student answer according to one of the following judgments: correct or incorrect.
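Assuming SemEval-style label sets (the exact label spellings below are assumptions; check them against the distributed data before relying on them), the coarser tasks can be derived from the 5-way labels by collapsing categories:

```python
# Collapse 5-way judgments into the 3-way and 2-way label sets.
# Label spellings here are assumptions based on the SemEval task this
# project is modeled on; verify against the distributed data.
FIVE_TO_THREE = {
    "correct": "correct",
    "contradictory": "contradictory",
    "partially_correct_incomplete": "incorrect",
    "irrelevant": "incorrect",
    "non_domain": "incorrect",
}

def to_three_way(label5):
    return FIVE_TO_THREE[label5]

def to_two_way(label5):
    # Everything that is not fully correct counts as incorrect.
    return "correct" if label5 == "correct" else "incorrect"

print(to_three_way("irrelevant"))   # -> incorrect
print(to_two_way("contradictory"))  # -> incorrect
```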


Data Set

The following two datasets will be used in the task: the Beetle corpus and the SciEntsBank corpus.

Element IDs and naming conventions

The format of file names and id attributes on various elements is described below. Note that there should never be a need to parse names and IDs: all the data is linked through direct matching, and any information present in names can be read off the corresponding attributes. However, we used consistent naming conventions to help with readability.

The core data and the extras are linked through element IDs. All data is represented as XML, with each XML element (e.g., question, reference answer, student answer) assigned a unique ID. This ID can then be used to link elements inside XML files, and also to find the additional information corresponding to the element in the "extras" directory.

The file names for core data have the following formats:

The student answer IDs in the Beetle data are in the form

where the last element, attemptID, changes when a student attempts the same question more than once.

The student answer IDs in the SciEntsBank corpus follow the same convention, except that attemptID is omitted since each student provided only one answer per question.


The Phases of the Project

The project will involve three phases:

Development Phase

You will be given the Training Set to use in developing your SRA systems. You may use these questions and the answers in any way that you wish.

Preliminary Evaluation

First, there will be a preliminary evaluation of everyone's SRA systems. Each team will hand in the code for their SRA system and we will run the SRA systems on the Training Set. We will score the accuracy of each system and post the results on CourseWeb. The results of the preliminary evaluation will not count toward your final project grade, but should be useful for assessing your progress and seeing how well your system works compared to others in the class.

Final Evaluation

At this point, each team will hand in the final code for their SRA system. Your final project grade will be based on the performance of your SRA system on both the training and test sets.

The purpose of evaluating your systems on both sets of data is to balance specificity with generality. You will have several weeks to try to get your SRA systems to perform well on Training Set. Hopefully, everyone will be able to do fairly well on that set. Test Set will be a blind test set that no one will see until the final evaluation. A system that uses general techniques should work just as well on both sets. But a system that has lots of hacks and tweaks based on Training Set probably will perform very poorly on Test Set.

WARNING: You will be given the answers for Train Set, but your system is not allowed to use them when answering questions! The answer keys are being distributed only to show you what the correct answers should be, and to allow you to evaluate your SRA systems automatically if you wish. Your system should use general techniques that can apply to a wide variety of texts.


The Gory Details

Your SRA system should accept two command line parameters. Running your program should look like:

    mySRAproject PathToDataDirectory outputfile_name

The Input

The first parameter is the directory containing the core data, which can be found in the training and testing data directories. Your SRA system should process each file in the specified directory.

Each file in the directory is formatted as in Figure 1.
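A minimal command-line skeleton under these conventions might look like the following. Here classify() is a hypothetical placeholder for your actual classifier, and the output column names ("id", "class") are assumptions for illustration:

```python
import os
import sys
import xml.etree.ElementTree as ET

def classify(question_text, reference_answers, student_answer):
    # Hypothetical placeholder: replace with your actual SRA classifier.
    return "correct"

def main(data_dir, out_path):
    rows = []
    for name in sorted(os.listdir(data_dir)):
        if not name.endswith(".xml"):
            continue
        root = ET.parse(os.path.join(data_dir, name)).getroot()
        qtext = root.findtext("questionText")
        refs = [r.text for r in root.iter("referenceAnswer")]
        for ans in root.iter("studentAnswer"):
            rows.append((ans.get("id"), classify(qtext, refs, ans.text)))
    with open(out_path, "w") as out:
        out.write("id\tclass\n")  # header; column names are assumptions
        for ans_id, label in rows:
            out.write(f"{ans_id}\t{label}\n")

if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])  # mySRAproject PathToDataDirectory outputfile_name
```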

The Output

Your program should write its output as a tab-separated file. The first line must be a header with column names; every following line is a data row. The first column must contain utterance IDs, and the last column must contain the system output classes. All lines must contain the same number of fields as the header. Note that the files are sorted by utterance ID internally, so the order of entries in the file is ignored.
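A small checker for this tab-separated format can catch formatting mistakes before you hand files to the evaluation script. This is only an illustration; the column names in the demo ("id", "class") are assumptions:

```python
import tempfile

def check_output_format(path):
    # Header first; every line must have the same number of
    # tab-separated fields as the header.
    with open(path) as f:
        lines = [ln.rstrip("\n") for ln in f if ln.strip()]
    n_cols = len(lines[0].split("\t"))
    for ln in lines[1:]:
        assert len(ln.split("\t")) == n_cols, f"field count mismatch: {ln!r}"
    return len(lines) - 1  # number of data rows

# Demo with hypothetical column names.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("id\tclass\nFaultFinding.VOLTAGE_DEFINE_Q.sbj3-l1.qa213\tcorrect\n")
    demo_path = f.name
print(check_output_format(demo_path))  # -> 1
```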

A sample output file is shown below:

Evaluation

The performance of each SRA system will be scored based on overall accuracy (see the Tools section for the evaluation script).


Tools

The scripts can be found in:

evaluation

./evaluation.sh [-mode 5way|3way|2way] <system> <gold>

The script takes two parameters: the system output (first) and the gold output (second). The evaluation results are printed to standard out. If -mode is not specified, it defaults to the 5-way task.

e.g.,

./evaluation.sh -mode 5way ../semevalFormatProcessing/beetleBaselineOutput.txt ../semevalFormatProcessing/trainingGold.txt

Baseline

A baseline classifier (Dzikovska et al., NAACL 2012) is implemented. See the README under the script directory for more details.
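To give a feel for what a simple lexical approach looks like, here is a much-simplified word-overlap classifier for the 2-way task. It is NOT the distributed baseline; the 0.5 Jaccard threshold is an arbitrary assumption:

```python
def word_overlap(a, b):
    # Jaccard overlap between the word sets of two answers.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def baseline_2way(reference_answer, student_answer, threshold=0.5):
    # Arbitrary threshold: tune on the Training Set.
    sim = word_overlap(reference_answer, student_answer)
    return "correct" if sim >= threshold else "incorrect"

print(baseline_2way(
    "voltage is the difference in electrical states between two terminals",
    "is the difference between two terminals"))  # -> correct
```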

Requirements

To run the baseline benchmark, you have to install the dependencies:


Schedule

The schedule for the projects is shown below:

By November, we expect each team to have a working SRA system! It might not work well and may still be missing components that you plan to incorporate, but it should be able to process the student answers in each file and judge them.

Participation in the preliminary evaluation is mandatory. Failure to participate will result in a 10% deduction off your final project grade. This policy is to ensure that everyone is making adequate progress.


Grading

Each project will be graded according to the following criteria:

To compute the final grade for the project, each SRA system will be ranked relative to the other systems in the class. For example, if your system ranks 1st on Training Set performance, 3rd on Test Set performance, and 5th on the project report and presentation, then your average ranking would be (1+3+5)/3=3.

The grade for the report and presentation will be based on clarity, as well as the creativity and ambitiousness shown in the design of your system. Thus, if you incorporate novel ideas and/or complex algorithms, then I will take that into account. Like the Olympics, difficulty can in effect boost your raw performance scores.

Note that the final grading is on a relative, not an absolute, scale. However, this does not mean that the team with the highest average ranking automatically gets an A (e.g., if the best score was no better than chance performance), or that the lowest scoring team fails. If every team produces a good and interesting system, I will be happy to give every team an A.


Encouragement

NLP is not a solved problem, and effective SRA'ing is HARD! Randomly choosing an answer will yield low accuracy, so anything higher means that you are doing something good!!


Credits

This project and these instructions are based on the SemEval Student Response Analysis (SRA) shared task.