CS 2731 Course Project: Fake News Challenge

Instruction written by: Yuhuan Jiang
Last modified: Feb 18, 2017

The project for this class will be to design and build a system for the Fake News Challenge. This will help you gain exposure to a cutting-edge research area and experience in building a real NLP system.

The core task of the project is not to label the news as "truth telling" or "fake". As the organizers' FAQ page mentions, truth labeling poses several major challenges (such as the lack of freely available labeled data, or label bias). The aim of the project is essentially stance detection, which, as the organizers say, could form the basis of a useful tool for real-life human fact checkers.

We are not requiring you to participate in the challenge itself. We are merely using its problem definition and data for the course project. You are, nevertheless, welcome to enter the challenge.

Problem Definition

The problem of this project can be formally defined as follows:

Input
A headline and a body text (either from the same news article or from two different articles).

Output

A classification of the stance of the body text relative to the claim made in the headline. The output label should be one of the following four:

- agrees: the body text agrees with the claim made in the headline.
- disagrees: the body text disagrees with the claim made in the headline.
- discusses: the body text discusses the same claim as the headline, but does not take a position.
- unrelated: the body text discusses a topic other than that of the headline.

The first three labels are collectively referred to as related, but neither the data nor your system's output should use related as a label.

Example

Given a data entry (headline, bodyText), where the headline is

Robert Plant Ripped up $800M Led Zeppelin Reunion Contract

The following are some expected classifications for different values of bodyText.

Data

⚠️ Academic Integrity Warning
Do not attempt to obtain any more data than what we provide below from elsewhere. You must use only the training set and the development set provided below to train and tune your system.

However, this should not prevent you from using extra data to train sub-components of your system. For example, your system might rely on a POS tagger, which itself requires training data. In this case, using data such as the Penn Treebank will not be considered cheating.

Training and Development Set

The training set and development set are available now.

Testing Set

The testing set will be released toward the end of the project. Only a file named test.csv will be released. You need to submit test.answer.csv, which is the output of your system.

Evaluation

The evaluation script scorer.py.txt (you will need to remove the suffix .txt to use it) generates an overall score, based on two weighted components:

- 25% for correctly deciding whether the (headline, bodyText) pair is related or unrelated, and
- 75% for correctly choosing among the three related labels (agrees, disagrees, or discusses).

For example, if a data entry (headline, bodyText) in the test set has the gold-standard label unrelated, then your evaluation score is incremented by 0.25 if your system also labels it as unrelated. If the gold-standard label is disagrees, then you gain 0.25 as long as your system outputs any of the three related labels; if your system's output is indeed disagrees, the score is incremented by another 0.75.
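The scoring scheme described above can be sketched as follows. This is only an illustration of the weighting; the function names here are made up, and scorer.py remains the authoritative implementation.

```python
# Sketch of the two-component weighted score described above.
# Label names follow the four-label scheme used in this project.

RELATED = {"agrees", "disagrees", "discusses"}

def score_pair(gold, pred):
    """Score contribution of a single (gold, predicted) label pair."""
    score = 0.0
    if gold == "unrelated":
        if pred == "unrelated":
            score += 0.25          # correct related/unrelated decision
    else:
        if pred in RELATED:
            score += 0.25          # correct related/unrelated decision
        if pred == gold:
            score += 0.75          # correct fine-grained stance
    return score

def total_score(gold_labels, predicted_labels):
    return sum(score_pair(g, p) for g, p in zip(gold_labels, predicted_labels))
```

For instance, predicting agrees when the gold label is disagrees still earns 0.25 (the pair was correctly judged related), while predicting disagrees earns the full 1.0.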

Run python3 scorer.py (under Python 3) without supplying any arguments to see the usage.

Input/Output Requirement

Input

See the provided train_bodies.csv and train_stances.csv for the exact format.

Your submission should include a main script that can be executed as follows:

python main.py test_bodies.csv test_stances.csv answers.csv

The first argument is the path to the file containing the body texts. The second argument is the path to the file containing the (headline, bodyText) pairs. The third argument is the path to the output file.

Similarly, if you use Java, then you should submit a jar which can be executed as follows:

java -cp your.jar:dependency1.jar:dependency2.jar cs2731.project.Main test_bodies.csv test_stances.csv answers.csv

where dependency1.jar and dependency2.jar are possible external Java libraries that your system might use (such as Stanford CoreNLP, OpenNLP toolkit, ...).
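A minimal main.py skeleton matching the command-line interface above might look like the following. Note that the column names used here ("Body ID", "articleBody", "Headline", "Stance") are assumptions; check the headers of the provided CSV files, and replace the placeholder classify function with your actual model.

```python
import csv
import sys

def classify(headline, body_text):
    # Placeholder classifier: always predicts "unrelated".
    # Replace this with your trained stance-detection model.
    return "unrelated"

def main():
    if len(sys.argv) != 4:
        sys.exit("usage: python main.py test_bodies.csv test_stances.csv answers.csv")
    bodies_path, stances_path, answers_path = sys.argv[1:4]

    # Map each body ID to its text (column names are assumptions).
    with open(bodies_path, newline="", encoding="utf-8") as f:
        bodies = {row["Body ID"]: row["articleBody"] for row in csv.DictReader(f)}

    # Read the (headline, bodyText) pairs to classify.
    with open(stances_path, newline="", encoding="utf-8") as f:
        pairs = list(csv.DictReader(f))

    # Write one predicted label per input pair.
    with open(answers_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["Headline", "Body ID", "Stance"])
        writer.writeheader()
        for row in pairs:
            label = classify(row["Headline"], bodies[row["Body ID"]])
            writer.writerow({"Headline": row["Headline"],
                             "Body ID": row["Body ID"],
                             "Stance": label})

if __name__ == "__main__":
    main()
```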

Output

See scorer.py for the desired output CSV format. If scorer.py accepts your answers.csv, then you have the correct format.

Submission

A ZIP archive containing:

Presentation

Each team should prepare a short oral presentation to share the contents of your project report with the rest of the class. Presentations should be 10-12 minutes (subject to change once I know how many teams there are). The professor will cut off at 12 minutes to allow 3 minutes for questions.