HOMEWORK 3 (CS 1671)
Assigned: March 24, 2020
Due: April 2, 2020 (Before Midnight)
Note: Incorporated into the Project, details below
This assignment provides hands-on experience with applying
baseline machine learning methods to a text classification task
using bag-of-words features and vector semantics. You will
develop classifiers that predict the constructiveness of user
comments on news articles, using the annotated constructiveness
and toxicity corpus from the SFU Opinion and Comments Corpus
(see the corpus documentation if you want more background).
In particular, given a comment from Column F, predict its
constructiveness label from Column G. Please note that the final
project for this class will build on this homework assignment by
asking you to propose and evaluate potential improvements. More
details will follow.
Main Tasks
- Set up your global experimental framework: select your
evaluation metric and make appropriate cross-validation splits. [NB:
think about how you want to randomize the data, and be
able to justify your choice in the write-up. Note that in the file,
the instances are sorted by original article and then by
comment order.]
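Because the instances are sorted by article, a purely sequential split could put comments from the same article in both train and test. One way to avoid that (a sketch, not a required approach; the toy comments, labels, and article IDs below are placeholders for the real corpus columns) is to group the folds by article:

```python
# Sketch: article-grouped cross-validation splits, assuming the corpus has
# been loaded into a list of comments with a parallel list of article IDs.
from sklearn.model_selection import GroupKFold

comments = ["first comment", "second comment", "third", "fourth", "fifth", "sixth"]
labels = [1, 0, 1, 0, 1, 0]
article_ids = ["a1", "a1", "a2", "a2", "a3", "a3"]  # instances sorted by article

# GroupKFold keeps all comments from one article in the same fold, so the
# classifier is never tested on comments from articles it trained on.
gkf = GroupKFold(n_splits=3)
for train_idx, test_idx in gkf.split(comments, labels, groups=article_ids):
    train_articles = {article_ids[i] for i in train_idx}
    test_articles = {article_ids[i] for i in test_idx}
    assert not (train_articles & test_articles)  # no article leaks across folds
```

Whether grouped or shuffled splits are more appropriate is exactly the kind of choice you should justify in your write-up.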
- Build a baseline logistic regression classifier and compare it
with a majority-vote classifier.
You already know what a logistic
regression classifier is; a majority-vote classifier is an even
simpler classifier used to gauge whether machine learning is
actually helpful. It always predicts a single
class: the most common class in the
training data. So if 55% of your training observations
are class_1 and 45% are class_2, it will
predict class_1 for every observation in the test
set.
- First, extract and preprocess the comment text to
determine your vocabulary for this task.
- Next, train a logistic regression classifier using bag of
words. You may use standard off-the-shelf packages for training the
classifier.
- Finally, record the performance of your logistic regression
classifier using cross-validation. How does it compare against the
majority-vote baseline?
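The majority-vote and bag-of-words baselines above can be sketched with off-the-shelf scikit-learn components (a minimal illustration, not the required implementation; the toy comments, labels, vectorizer settings, and F1 metric are all assumptions you should replace with your own choices):

```python
# Sketch of the Step 2 pipeline: majority-vote vs. bag-of-words logistic
# regression, compared under the same cross-validation splits.
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

comments = [
    "this argument cites evidence and proposes a solution",
    "what a dumb take",
    "the author overlooks the economic data from last year",
    "nonsense, total garbage",
    "a thoughtful point about infrastructure funding",
    "you people are idiots",
] * 5  # toy placeholder data
labels = [1, 0, 1, 0, 1, 0] * 5  # 1 = constructive, 0 = not

# Majority-vote baseline: always predicts the most frequent training class.
majority = make_pipeline(CountVectorizer(), DummyClassifier(strategy="most_frequent"))

# Bag-of-words logistic regression baseline.
logreg = make_pipeline(CountVectorizer(lowercase=True), LogisticRegression(max_iter=1000))

maj_scores = cross_val_score(majority, comments, labels, cv=5, scoring="f1")
lr_scores = cross_val_score(logreg, comments, labels, cv=5, scoring="f1")
print(f"majority F1: {maj_scores.mean():.3f}  logistic regression F1: {lr_scores.mean():.3f}")
```

Running both models through identical splits and the same metric is what makes the comparison in your report meaningful.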
- Attempt to improve on your logistic regression baseline
using sparse vector semantic representations, and compare
with the previous classifier:
- Modify the previous classifier so that it uses sparse vectors
- Measure whether there is an improvement over the
previous models
- Describe how you used the sparse vectors
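One possible sparse-vector modification (a sketch under assumptions, not the only valid choice) is to replace raw counts with TF-IDF-weighted sparse vectors; the vectorizer settings and toy data below are illustrative:

```python
# Sketch of one Step 3 option: TF-IDF-weighted sparse vectors feeding the
# same logistic regression classifier as the baseline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tfidf_model = make_pipeline(
    TfidfVectorizer(sublinear_tf=True),  # illustrative settings
    LogisticRegression(max_iter=1000),
)

# Toy placeholder data; use the SFU corpus comments and labels instead.
comments = [
    "good detailed point",
    "bad rude comment",
    "good detailed point again",
    "bad rude comment again",
]
labels = [1, 0, 1, 0]
tfidf_model.fit(comments, labels)
preds = tfidf_model.predict(["a good detailed point"])
```

Evaluate this model with the same cross-validation splits and metric as the Step 2 baselines so the comparison is apples-to-apples.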
- Unlike previous homework assignments, you are allowed to use external resources,
including:
- Standard off-the-shelf packages such as NLTK, Stanford
CoreNLP, and scikit-learn.
- Pre-trained word embeddings (for the project)
What to Submit
- Your code and data files
- Please document your program well enough to help the TA grade your
work.
- A README file that addresses the following:
- Describe the computing environment you used, especially if you
used some off-the-shelf modules. (Do not use unusual packages. If
you're not sure, please ask.)
- List any additional resources, references, or web pages you've
consulted.
- List any person with whom you've discussed the assignment and
describe the nature of your discussions.
- Discuss any unresolved issues or problems.
- A REPORT document that discusses the following:
- Describe what you did for Step 2, report the baseline
performance, and compare it against majority voting.
- Describe your model for Step 3 and report its performance.
Compare this model against the previous
baselines (Step 2, majority voting).
- Submit all of the above materials to Courseweb as a zip file.
Grading Guideline
- Code (70 Points)
- 20 points: Majority-vote baseline classifier
- 20 points: Logistic Regression baseline classifier
- 30 points: Logistic Regression with Sparse Vectors
- Report (30 Points)
- 30 points for the program description, analysis, and data
supporting that analysis.
Acknowledgment
This assignment is adapted from Prof. Hwa.