HOMEWORK 3 (CS 2731)

Assigned: October 22, 2019

Due: November 5, 2019 (before midnight)

This assignment provides hands-on experience with 1) applying baseline machine learning methods to a text classification task using bag-of-words and vector semantics, and 2) posing a research question and setting up an experiment to address it. To do so, you will develop classifiers for the toxicity of user comments on news articles, using the annotated constructiveness and toxicity corpus from the SFU Opinion and Comments Corpus (further background on the corpus is available if you want it). In particular, given a comment from Column F, predict its level of toxicity, using the left-most (first) number in Column I of the corresponding row as the label.
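
As a rough illustration of the label extraction described above, the following Python sketch reads (comment, label) pairs from a CSV export of the corpus. It rests on assumptions that are not part of the assignment: the spreadsheet has been exported to CSV with one header row, so Column F and Column I correspond to zero-based indices 5 and 8; the helper names first_number and load_examples are hypothetical; and the sample rows are made up. Because the separator between the numbers in Column I is not specified here, the sketch simply takes the first run of digits in the cell.

```python
import csv
import io
import re

COMMENT_COL = 5   # spreadsheet Column F as a zero-based index (assumption)
TOXICITY_COL = 8  # spreadsheet Column I as a zero-based index (assumption)


def first_number(cell):
    """Return the left-most integer found in a cell, or None if there is none."""
    match = re.search(r"\d+", cell)
    return int(match.group()) if match else None


def load_examples(handle, has_header=True):
    """Read (comment_text, toxicity_level) pairs from an open CSV file object."""
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # skip the header row
    examples = []
    for row in reader:
        if len(row) <= TOXICITY_COL:
            continue  # skip rows that are too short to hold a label
        label = first_number(row[TOXICITY_COL])
        if label is not None:
            examples.append((row[COMMENT_COL], label))
    return examples


# Made-up rows standing in for the real file; only columns F and I matter here.
sample = io.StringIO(
    "A,B,C,D,E,F,G,H,I\n"
    "x,x,x,x,x,Great article,x,x,1 1 2\n"
    'x,x,x,x,x,"This is nonsense, frankly",x,x,3 2 3\n'
)
examples = load_examples(sample)
```

With the real data, the same function could be called on a file opened with open(path, newline="", encoding="utf-8"), where path is wherever you saved the export.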

Main Tasks

  1. Set up your global experimental framework. Make appropriate cross-validation splits. [NB: you should think about how you want to randomize the data and be able to justify your choice in the write-up. Note that in the file, the instances are sorted by the original articles and then by comment order.]
  2. Build a baseline logistic regression classifier and compare with a majority-vote classifier:
  3. Make two improvements to your logistic regression baseline classifier and compare with the previous classifier:
  4. Perform a rigorous comparison between your 3 classifiers using statistical tests.
  5. Pose a simple question based on this classification task; conduct an experiment to answer the question; discuss the outcomes of the experiment and draw conclusions.
  6. Unlike prior homeworks, you are allowed to use external resources, including:
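
As one possible way to think about Steps 1, 2, and 4 above, the sketch below uses only the Python standard library. It shows (a) a cross-validation split that keeps all comments from the same article in the same fold, which is one defensible response to the note that instances are sorted by article, (b) a majority-vote baseline scored per fold, and (c) a paired sign-flip permutation test on per-fold scores. All function names and the toy data are hypothetical, the choice of grouping by article and of this particular test are assumptions rather than requirements, and the logistic regression classifier itself is not shown.

```python
import random
from collections import Counter


def grouped_folds(group_ids, k, seed=0):
    """Assign each example a fold index so that examples sharing a group id
    (for instance, the source article) always land in the same fold."""
    groups = sorted(set(group_ids))
    if k > len(groups):
        raise ValueError("k cannot exceed the number of distinct groups")
    random.Random(seed).shuffle(groups)  # seeded, so the split is reproducible
    fold_of_group = {g: i % k for i, g in enumerate(groups)}
    return [fold_of_group[g] for g in group_ids]


def majority_baseline_scores(labels, folds, k):
    """Per-fold accuracy of always predicting the most common training label."""
    scores = []
    for held_out in range(k):
        train = [y for y, f in zip(labels, folds) if f != held_out]
        test = [y for y, f in zip(labels, folds) if f == held_out]
        majority = Counter(train).most_common(1)[0][0]
        scores.append(sum(y == majority for y in test) / len(test))
    return scores


def paired_permutation_test(scores_a, scores_b, n_resamples=10000, seed=0):
    """Two-sided sign-flip permutation test on paired scores; returns a p-value."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            hits += 1
    return (hits + 1) / (n_resamples + 1)


# Toy data: eight comments drawn from four made-up articles.
article_ids = ["a1", "a1", "a2", "a2", "a3", "a3", "a4", "a4"]
labels = [1, 1, 1, 2, 1, 1, 3, 1]
folds = grouped_folds(article_ids, k=2)
baseline_scores = majority_baseline_scores(labels, folds, k=2)
```

To compare two classifiers, their per-fold scores on the same folds would be passed to paired_permutation_test. With only a handful of folds there are few distinct sign patterns, so the smallest attainable p-value is coarse; pairing at the level of individual test instances is another option worth weighing in the write-up.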

What to Submit

Grading Guideline

Assignments are graded qualitatively on a non-linear five-point scale. Below is a rough guideline:

  1. (40%): A serious attempt at the assignment. The README clearly describes the problems encountered in detail.
  2. (60%): Correctly completed the assignment through Step 2, but encountered significant problems with later steps. Submitted a README documenting the problems and a REPORT for the outcomes of Step 2.
  3. (80%): Correctly completed the assignment through Step 4, but has a significantly flawed Step 5. Submitted a README and a REPORT.
  4. (93%): Correctly completed the assignment through Step 4. For Step 5, the question posed is clear and rigorously answered through experimentation. The REPORT content is solid.
  5. (100%): Correctly completed the assignment through Step 4. For Step 5, the question posed is clear and interesting, and it is rigorously answered through experimentation. The REPORT is well-written and insightful.

Acknowledgment

This assignment is adapted from Prof. Hwa