HOMEWORK 3 (CS 1671)

Assigned: March 24, 2020

Due: April 2, 2020 (Before Midnight)

Note: Incorporated into the Project, details below

This assignment provides hands-on experience applying baseline machine learning methods to a text classification task using bag-of-words and vector semantics. You will develop classifiers for the constructiveness of user comments on news articles, using the annotated constructiveness and toxicity corpus from the SFU opinions and comments corpus (if you want more background, click here). In particular, given a comment from column F, predict its constructiveness label in column G.

Please note that the final project for this class will build on this homework assignment by asking you to propose and evaluate some potential improvements. More details will follow.

Main Tasks

  1. Set up your global experimental framework. Select your evaluation metric. Make appropriate cross-validation splits. [NB: you should think about how you want to randomize the data and be able to justify your choice in the write-up. Note that in the file, the instances are sorted by the original articles and then by comment order.]
  2. Build a baseline logistic regression classifier and compare with a majority-vote classifier.
    You already know what a logistic regression classifier is. A majority-vote classifier is an even simpler classifier, used to check whether machine learning is actually helping. It only ever predicts one class: the most common class seen in your training data. For example, if 55% of your training observations are class_1 and 45% are class_2, it will predict class_1 for every observation in the test set.
  3. Attempt an improvement to your logistic regression baseline classifier using sparse vector semantic representations and compare with the previous classifier:
  4. Unlike prior homeworks, you are allowed to use external resources, including:
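
As a rough illustration of Tasks 1 and 2 above, the sketch below shows one possible way to build cross-validation folds that keep all comments on the same article together, and to score a majority-vote classifier on those folds. It is not a required design. The names grouped_folds, MajorityVote, accuracy, and cross_validate_majority are hypothetical, as is the assumption that you have already extracted one article identifier and one label per comment; accuracy is used only as an example metric, and grouping by article is one defensible randomization choice among several.

```python
import random
from collections import Counter


def grouped_folds(article_ids, k=5, seed=0):
    """Return a fold index for each instance, assigning whole articles to folds.

    Assumption: article_ids[i] identifies the article that comment i belongs to.
    Shuffling articles (not comments) keeps comments on one article from
    appearing in both the training and the test portion of a split.
    """
    articles = sorted(set(article_ids))
    random.Random(seed).shuffle(articles)
    fold_of_article = {a: i % k for i, a in enumerate(articles)}
    return [fold_of_article[a] for a in article_ids]


class MajorityVote:
    """Always predicts the most frequent label seen in the training data."""

    def fit(self, labels):
        self.label_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, n_instances):
        return [self.label_] * n_instances


def accuracy(gold, predicted):
    """Fraction of instances whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def cross_validate_majority(article_ids, labels, k=5, seed=0):
    """Score the majority-vote baseline on each held-out fold."""
    folds = grouped_folds(article_ids, k, seed)
    scores = []
    for held_out in range(k):
        train = [y for y, f in zip(labels, folds) if f != held_out]
        test = [y for y, f in zip(labels, folds) if f == held_out]
        if not train or not test:
            continue  # skip folds left empty when there are few articles
        model = MajorityVote().fit(train)
        scores.append(accuracy(test, model.predict(len(test))))
    return scores


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    toy_articles = ["a1", "a1", "a2", "a2", "a3", "a3", "a4"]
    toy_labels = ["yes", "no", "yes", "yes", "no", "yes", "yes"]
    print(cross_validate_majority(toy_articles, toy_labels, k=2))
```

The same fold assignment can be reused when evaluating the logistic regression classifier, so that both classifiers are compared on identical splits.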

What to Submit

Grading Guideline

  1. Code (70 Points)
  2. Report (30 Points)

Acknowledgment

This assignment is adapted from Prof. Hwa