Project Description (CS 2731 / ISSP 2230), Fall 2013


Task Description

The goal of the task is to assess student answers to exercise questions that can be useful in tutorial dialogue and/or e-learning systems. Specifically, given a question, a known correct "reference answer" and a 1- or 2-sentence student answer, the goal is to determine the student's answer accuracy. A sample question and answer is shown below.

<question qtype="Q_EXPLAIN_GENERIC" id="VOLTAGE_DEFINE_Q"
          stype="QUESTION"
          module="FaultFinding">
    <questionText>What is voltage?</questionText>
    <referenceAnswers>
        <referenceAnswer category="BEST" id="answer244" fileID="VOLTAGE_DEFINE_Q_ANS1">Voltage is the difference in electrical 
        states between two terminals</referenceAnswer>
        ...
    </referenceAnswers>
    <studentAnswers>
        <studentAnswer id="FaultFinding.VOLTAGE_DEFINE_Q.sbj3-l1.qa213" dialogue_id="sbj3-l1"
                     accuracy="unknown"
                     answerMatch="answer244"
                     count="1">is the difference between two terminals</studentAnswer>
        ...
    </studentAnswers>
</question>

Figure 1. Question and Student Answer example

All question elements contain an ID unique to the question and a module name, which defines the general "topic" of the question.

The questionText element contains the full text of the question as text content.

The referenceAnswers element contains an arbitrary number of referenceAnswer elements, one for each correct answer to the question. The SciEntsBank corpus lists exactly one reference answer for every question, but the Beetle corpus may have multiple correct answers, each with a category attribute (BEST, GOOD, MINIMAL, or KEYWORD) and a fileID attribute that allows the answer to be referenced from a studentAnswer element.

The studentAnswers element contains an arbitrary number of studentAnswer elements, one for each student attempt to answer the question.

Each studentAnswer element has text content, which is the text of the student answer, and an accuracy attribute containing one of three values: correct, contradictory (3-way data only), and unknown.
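The format above can be read with Python's standard xml.etree.ElementTree. The sketch below parses an inline copy of the Figure 1 sample and resolves the student answer's answerMatch attribute against the reference answers' id attributes (in the sample, answerMatch="answer244" points at the reference answer with id="answer244"):

```python
import xml.etree.ElementTree as ET

# Inline copy of the Figure 1 sample (whitespace condensed).
SAMPLE = """<question qtype="Q_EXPLAIN_GENERIC" id="VOLTAGE_DEFINE_Q"
          stype="QUESTION" module="FaultFinding">
    <questionText>What is voltage?</questionText>
    <referenceAnswers>
        <referenceAnswer category="BEST" id="answer244"
            fileID="VOLTAGE_DEFINE_Q_ANS1">Voltage is the difference in
            electrical states between two terminals</referenceAnswer>
    </referenceAnswers>
    <studentAnswers>
        <studentAnswer id="FaultFinding.VOLTAGE_DEFINE_Q.sbj3-l1.qa213"
            dialogue_id="sbj3-l1" accuracy="unknown" answerMatch="answer244"
            count="1">is the difference between two terminals</studentAnswer>
    </studentAnswers>
</question>"""

root = ET.fromstring(SAMPLE)
# Index reference answers by their id attribute so student answers can
# be linked to them via answerMatch.
refs_by_id = {r.get("id"): " ".join(r.text.split())
              for r in root.iter("referenceAnswer")}
for ans in root.iter("studentAnswer"):
    matched_ref = refs_by_id.get(ans.get("answerMatch"))
    print(ans.get("id"), "|", ans.get("accuracy"), "|", matched_ref)
```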

The task can be performed at different levels of granularity, namely:

- 5-way task, where the system is required to classify the student answer according to one of the following judgments: correct, partially correct but incomplete, contradictory, irrelevant, or non-domain (following the SemEval definitions).

- 3-way task, where the system is required to classify the student answer according to one of the following judgments: correct, contradictory, or incorrect.

- 2-way task, where the system is required to classify the student answer according to one of the following judgments: correct or incorrect.
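Assuming SemEval-style label sets (the exact label spellings below are assumptions; check them against the distributed data before relying on them), the coarser tasks can be derived from the 5-way labels by collapsing categories:

```python
# Collapse 5-way judgments into the 3-way and 2-way label sets.
# Label spellings here are assumptions based on the SemEval task this
# project is modeled on; verify against the distributed data.
FIVE_TO_THREE = {
    "correct": "correct",
    "contradictory": "contradictory",
    "partially_correct_incomplete": "incorrect",
    "irrelevant": "incorrect",
    "non_domain": "incorrect",
}

def to_three_way(label5):
    return FIVE_TO_THREE[label5]

def to_two_way(label5):
    # Everything that is not fully correct counts as incorrect.
    return "correct" if label5 == "correct" else "incorrect"

print(to_three_way("irrelevant"))   # -> incorrect
print(to_two_way("contradictory"))  # -> incorrect
```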


Data Set

The following two datasets will be used in the task: the Beetle corpus and the SciEntsBank corpus.

Element IDs and naming conventions

The format of file names and id attributes on various elements is described below. Note that there should never be a need to parse names and IDs: all the data is linked through direct matching, and any information present in names can be read off the corresponding attributes. However, we used consistent naming conventions to help with readability.

The core data and the extras are linked through element IDs. All data is represented as XML, with each XML element (e.g., question, reference answer, student answer) assigned a unique ID. This ID can then be used to link elements inside XML files, and also to find the additional information corresponding to the element in the "extras" directory.

The file names for core data have the following formats:

The student answer IDs in the Beetle data are in the form

where the last element, attemptID, changes when a student attempts the same question more than once.

The student answer IDs in the SciEntsBank corpus follow the same convention, except that attemptID is omitted since each student provided only one answer per question.


The Phases of the Project

The project will involve three phases:

Development Phase

You will be given the Training Set to use in developing your SRA systems. You may use these questions and the answers in any way that you wish.

Preliminary Evaluation

First, there will be a preliminary evaluation of everyone's SRA systems. Each team will hand in the code for their SRA system and we will run the SRA systems on the Training Set. We will score the accuracy of each system and post the results on CourseWeb. The results of the preliminary evaluation will not count toward your final project grade, but should be useful for assessing your progress and seeing how well your system works compared to others in the class.

Final Evaluation

At this point, each team will hand in the final code for their SRA system. Your final project grade will be based on the performance of your SRA system on both the training and test sets.

The purpose of evaluating your systems on both sets of data is to balance specificity with generality. You will have several weeks to try to get your SRA systems to perform well on Training Set. Hopefully, everyone will be able to do fairly well on that set. Test Set will be a blind test set that no one will see until the final evaluation. A system that uses general techniques should work just as well on both sets. But a system that has lots of hacks and tweaks based on Training Set probably will perform very poorly on Test Set.

WARNING: You will be given the answers for Train Set, but your system is not allowed to use them when answering questions! The answer keys are being distributed only to show you what the correct answers should be, and to allow you to evaluate your SRA systems automatically if you wish. Your system should use general techniques that can apply to a wide variety of texts.


The Gory Details

Your SRA system should accept two command line parameters. Running your program should look like:

    mySRAproject PathToDataDirectory outputfile_name

The Input

The first parameter is the directory containing the core data, which can be found in the training and testing data directories. Your SRA system should process each file in the specified directory.

Each file in the directory is formatted as in Figure 1.
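A minimal command-line skeleton under these conventions might look like the following. Here classify() is a hypothetical placeholder for your actual classifier, and the output column names ("id", "class") are assumptions for illustration:

```python
import os
import sys
import xml.etree.ElementTree as ET

def classify(question_text, reference_answers, student_answer):
    # Hypothetical placeholder: replace with your actual SRA classifier.
    return "correct"

def main(data_dir, out_path):
    rows = []
    for name in sorted(os.listdir(data_dir)):
        if not name.endswith(".xml"):
            continue
        root = ET.parse(os.path.join(data_dir, name)).getroot()
        qtext = root.findtext("questionText")
        refs = [r.text for r in root.iter("referenceAnswer")]
        for ans in root.iter("studentAnswer"):
            rows.append((ans.get("id"), classify(qtext, refs, ans.text)))
    with open(out_path, "w") as out:
        out.write("id\tclass\n")  # header; column names are assumptions
        for ans_id, label in rows:
            out.write(f"{ans_id}\t{label}\n")

if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])  # mySRAproject PathToDataDirectory outputfile_name
```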

The Output

Your program should write its output as a tab-separated file. The first line must be a header with column names; every following line is a data row. The first column must contain utterance IDs, and the last column must contain the system output classes. All lines must contain the same number of fields as the header. Note that the files are sorted by utterance ID internally, so the order of entries in the file is ignored.
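A small checker for this tab-separated format can catch formatting mistakes before you hand files to the evaluation script. This is only an illustration; the column names in the demo ("id", "class") are assumptions:

```python
import tempfile

def check_output_format(path):
    # Header first; every line must have the same number of
    # tab-separated fields as the header.
    with open(path) as f:
        lines = [ln.rstrip("\n") for ln in f if ln.strip()]
    n_cols = len(lines[0].split("\t"))
    for ln in lines[1:]:
        assert len(ln.split("\t")) == n_cols, f"field count mismatch: {ln!r}"
    return len(lines) - 1  # number of data rows

# Demo with hypothetical column names.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("id\tclass\nFaultFinding.VOLTAGE_DEFINE_Q.sbj3-l1.qa213\tcorrect\n")
    demo_path = f.name
print(check_output_format(demo_path))  # -> 1
```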

A sample output file is shown below:

Evaluation

The performance of each SRA system will be scored based on overall accuracy (see the Tools section for the evaluation script).


Tools

The scripts can be found in:

evaluation

./evaluation.sh [-mode 5way|3way|2way] <system> <gold>

The script takes two parameters: the system output (first) and the gold output (second). The evaluation results are printed to standard out. If -mode is not specified, it defaults to the 5-way task.

e.g.,

./evaluation.sh -mode 5way ../semevalFormatProcessing/beetleBaselineOutput.txt ../semevalFormatProcessing/trainingGold.txt

Baseline

A baseline classifier (Dzikovska et al., NAACL 2012) is implemented. See the README under the script directory for more details.
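To give a feel for what a simple lexical approach looks like, here is a much-simplified word-overlap classifier for the 2-way task. It is NOT the distributed baseline; the 0.5 Jaccard threshold is an arbitrary assumption:

```python
def word_overlap(a, b):
    # Jaccard overlap between the word sets of two answers.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def baseline_2way(reference_answer, student_answer, threshold=0.5):
    # Arbitrary threshold: tune on the Training Set.
    sim = word_overlap(reference_answer, student_answer)
    return "correct" if sim >= threshold else "incorrect"

print(baseline_2way(
    "voltage is the difference in electrical states between two terminals",
    "is the difference between two terminals"))  # -> correct
```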

Requirements

To run the baseline benchmark, you have to install the dependencies:


Schedule

The schedule for the projects is shown below:

By November, we expect each team to have a working SRA system! It might not work well and may still be missing components that you plan to incorporate, but it should be able to process the student answers in each file and judge them.

Participation in the preliminary evaluation is mandatory. Failure to participate will result in a 10% deduction off your final project grade. This policy is to ensure that everyone is making adequate progress.


Grading

Each project will be graded according to the following criteria:

To compute the final grade for the project, each SRA system will be ranked relative to the other systems in the class. For example, if your system ranks 1st on Training Set performance, 3rd on Test Set performance, and 5th on the project report and presentation, then your average ranking would be (1+3+5)/3=3.

The grade for the report and presentation will be based on clarity, as well as the creativity and ambitiousness shown in the design of your system. Thus, if you incorporate novel ideas and/or complex algorithms, then I will take that into account. Like the Olympics, difficulty can in effect boost your raw performance scores.

Note that the final grading is on a relative, not an absolute, scale. However, this does not mean that the team with the highest average ranking automatically gets an A (e.g., if the best score was no better than chance performance), or that the lowest scoring team fails. If every team produces a good and interesting system, I will be happy to give every team an A.


Encouragement

NLP is not a solved problem, and effective SRA'ing is HARD! Randomly choosing an answer will yield low accuracy, so anything higher means that you are doing something good!!


Credits

This project and these instructions are based on the SemEval Student Response Analysis (SRA) shared task.