HOMEWORK 3 (CS 1671)
Assigned: March 24, 2020
Due: April 2, 2020 (Before Midnight)
Note: Incorporated into the Project, details below
This assignment provides hands-on experience with applying
baseline machine learning methods to a text classification task
using bag-of-words features and vector semantics. You will
develop classifiers that predict the constructiveness of user
comments on news articles, using the annotated constructiveness
and toxicity corpus from the SFU Opinion and Comments Corpus
(see the corpus documentation if you want more background).
In particular, given a comment from Column F, predict its
constructiveness label from Column G. Please note that the final
project for this class will build on this homework assignment by
asking you to propose and evaluate potential improvements. More
details will follow.
Main Tasks
- Set up your global experimental framework: select your
evaluation metric and make appropriate cross-validation splits. [NB:
think about how you want to randomize the data, and be
able to justify your choice in the write-up. Note that in the file,
the instances are sorted by original article and then by
comment order.]
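Because the instances are sorted by article, a purely sequential split could put comments from the same article in both train and test. One way to avoid that (a sketch, not a required approach; the toy comments, labels, and article IDs below are placeholders for the real corpus columns) is to group the folds by article:

```python
# Sketch: article-grouped cross-validation splits, assuming the corpus has
# been loaded into a list of comments with a parallel list of article IDs.
from sklearn.model_selection import GroupKFold

comments = ["first comment", "second comment", "third", "fourth", "fifth", "sixth"]
labels = [1, 0, 1, 0, 1, 0]
article_ids = ["a1", "a1", "a2", "a2", "a3", "a3"]  # instances sorted by article

# GroupKFold keeps all comments from one article in the same fold, so the
# classifier is never tested on comments from articles it trained on.
gkf = GroupKFold(n_splits=3)
for train_idx, test_idx in gkf.split(comments, labels, groups=article_ids):
    train_articles = {article_ids[i] for i in train_idx}
    test_articles = {article_ids[i] for i in test_idx}
    assert not (train_articles & test_articles)  # no article leaks across folds
```

Whether grouped or shuffled splits are more appropriate is exactly the kind of choice you should justify in your write-up.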
- Build a baseline logistic regression classifier and compare it
with a majority-vote classifier.
You already know what a logistic
regression classifier is; a majority-vote classifier is an even
simpler classifier used to gauge whether machine learning is
actually helpful. It always predicts a single
class: the most common class in the
training data. So if 55% of your training observations
are class_1 and 45% are class_2, it will
predict class_1 for every observation in the test
set.
- First, extract and preprocess the comment text to
determine your vocabulary for this task.
- Next, train a logistic regression classifier using bag of
words. You may use standard off-the-shelf packages for training the
classifier.
- Finally, record the performance of your logistic regression
classifier using cross-validation. How does it compare against the
majority-vote baseline?
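The majority-vote and bag-of-words baselines above can be sketched with off-the-shelf scikit-learn components (a minimal illustration, not the required implementation; the toy comments, labels, vectorizer settings, and F1 metric are all assumptions you should replace with your own choices):

```python
# Sketch of the Step 2 pipeline: majority-vote vs. bag-of-words logistic
# regression, compared under the same cross-validation splits.
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

comments = [
    "this argument cites evidence and proposes a solution",
    "what a dumb take",
    "the author overlooks the economic data from last year",
    "nonsense, total garbage",
    "a thoughtful point about infrastructure funding",
    "you people are idiots",
] * 5  # toy placeholder data
labels = [1, 0, 1, 0, 1, 0] * 5  # 1 = constructive, 0 = not

# Majority-vote baseline: always predicts the most frequent training class.
majority = make_pipeline(CountVectorizer(), DummyClassifier(strategy="most_frequent"))

# Bag-of-words logistic regression baseline.
logreg = make_pipeline(CountVectorizer(lowercase=True), LogisticRegression(max_iter=1000))

maj_scores = cross_val_score(majority, comments, labels, cv=5, scoring="f1")
lr_scores = cross_val_score(logreg, comments, labels, cv=5, scoring="f1")
print(f"majority F1: {maj_scores.mean():.3f}  logistic regression F1: {lr_scores.mean():.3f}")
```

Running both models through identical splits and the same metric is what makes the comparison in your report meaningful.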
- Attempt to improve on your logistic regression baseline
using sparse vector semantic representations, and compare
with the previous classifier:
- Modify the previous classifier so that it uses sparse vectors
- Measure whether there is an improvement over the
previous models
- Describe how you used the sparse vectors
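One possible sparse-vector modification (a sketch under assumptions, not the only valid choice) is to replace raw counts with TF-IDF-weighted sparse vectors; the vectorizer settings and toy data below are illustrative:

```python
# Sketch of one Step 3 option: TF-IDF-weighted sparse vectors feeding the
# same logistic regression classifier as the baseline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tfidf_model = make_pipeline(
    TfidfVectorizer(sublinear_tf=True),  # illustrative settings
    LogisticRegression(max_iter=1000),
)

# Toy placeholder data; use the SFU corpus comments and labels instead.
comments = [
    "good detailed point",
    "bad rude comment",
    "good detailed point again",
    "bad rude comment again",
]
labels = [1, 0, 1, 0]
tfidf_model.fit(comments, labels)
preds = tfidf_model.predict(["a good detailed point"])
```

Evaluate this model with the same cross-validation splits and metric as the Step 2 baselines so the comparison is apples-to-apples.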
- Unlike previous homework assignments, you are allowed to use external resources,
including:
- Standard off-the-shelf packages such as NLTK, Stanford
CoreNLP, and scikit-learn.
- Pre-trained word embeddings (for the project)
What to Submit
- Your code and data files
- Please document your program well enough to help the TA grade your
work.
- A README file that addresses the following:
- Describe the computing environment you used, especially if you
used some off-the-shelf modules. (Do not use unusual packages. If
you're not sure, please ask.)
- List any additional resources, references, or web pages you've
consulted.
- List any person with whom you've discussed the assignment and
describe the nature of your discussions.
- Discuss any unresolved issues or problems.
- A REPORT document that discusses the following:
- Describe what you did for Step 2, report the baseline
performance, and compare it against majority voting.
- Describe your model for Step 3 and report its performance.
Compare this model against the previous
baselines (Step 2, majority voting).
- Submit all of the above materials to Courseweb as a zip file.
Grading Guideline
- Code (70 Points)
- 20 points: Majority-vote baseline classifier
- 20 points: Logistic Regression baseline classifier
- 30 points: Logistic Regression with Sparse Vectors
- Report (30 Points)
- 30 points for the program description, analysis, and data
supporting that analysis.
Acknowledgment
This assignment is adapted from Prof. Hwa.