CS2750: Homework 1

Due: 2/9/2017, 11:59pm

Instructions: See "Homework Submission Mechanics" on the main course page. If you do not see the Assignments button, notify the instructor.

Note: If you are asked to implement something by yourself, it is not ok to use or even look at existing Matlab or Python code. For anything you are not asked to implement, feel free to look up relevant functions. If you have questions about what you can use, ask the instructor or the TA.

Part I: Short Answers (10 points)

Propose three new problems that can be solved with machine learning (ones that we have not discussed in class). For each problem, describe how you would go about solving it. Provide your answers in a text file titled part1.txt.

Part II: The Classification Pipeline (30 points)

In this problem, you will train a multi-class model to distinguish between different types of flowers.

You will use the Iris dataset from the UCI Machine Learning Repository. The data is contained in the iris.data file under "Data Folder", while the file iris.names contains a description of the data. The features X are given as the first four comma-separated values in each row of the data file. The labels Y are the last entry in each row, but you should convert the strings to integer IDs.
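
For concreteness, here is a minimal loading sketch in Python with NumPy, assuming the data file has been saved locally as iris.data; it parses the four feature values and maps each species string to an integer ID. Adapt it freely to whatever language and tools you end up using.

    import numpy as np

    # Each row of iris.data looks like: 5.1,3.5,1.4,0.2,Iris-setosa
    # (the file ends with a blank line, which is skipped).
    X_rows, y_rows = [], []
    label_to_id = {}
    with open("iris.data") as f:        # assumed local path
        for line in f:
            line = line.strip()
            if not line:
                continue
            *feats, label = line.split(",")
            if label not in label_to_id:
                label_to_id[label] = len(label_to_id)   # e.g. Iris-setosa -> 0
            X_rows.append([float(v) for v in feats])
            y_rows.append(label_to_id[label])

    X = np.array(X_rows)   # shape (150, 4): the features
    Y = np.array(y_rows)   # shape (150,): integer class IDs
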
  1. First, split the data into three groups: a training set (40%), a validation set (20%), and a test set (40%).
  2. You will train a multi-way SVM classifier. We will talk about SVM classifiers at great length later in the course. The goal in this assignment is to use SVMs as a black box, so that you can experiment with the machine learning pipeline we discussed in the introduction. SVM is one of the most popular classifiers; you might need to use it for your project, and it might be one of the few things you remember how to use after this class is over.
  3. Pick an SVM software package to use among LIBSVM, LIBLINEAR, SVM Light, the SVM built into Matlab, or the SVM in scikit-learn for Python. Look at the different packages to see which one you would feel most comfortable using. Read the documentation and find which function in the package you are using performs learning/fitting/training, and which function performs prediction/classification/testing. You can also look for examples of how these functions are used.
  4. Include a text file part2.txt in your submission in which you describe why you chose the package you did. Then copy-paste the parts of the documentation that show how to train and use an SVM.
  5. Pick an SVM parameter that the package allows you to tune or specify values for. For now, you won't know what this parameter does, but it will affect the success of your learning algorithm in some way. Your goal is to pick the best value of the parameter by tuning your model on the validation set. In other words, you will train one model for every value of the parameter of your choice (try 5 different values) and pick the value that gives the best accuracy on the validation set. You will then use the chosen value of the parameter to classify the samples in the test set (see the pipeline sketch after this list).
  6. In the text file mentioned above, report the final accuracy on (1) the training set and (2) the test set. The difference between these two is called the generalization error. If the package you are using does not include a function to compute accuracy, write one yourself (it will only be a few lines). To compute accuracy, compare the predicted labels for your test samples to the ground-truth labels provided with the dataset, and compute the fraction of samples that were labeled correctly.
  7. Finally, experiment with different amounts of training data, and report how the error on (1) the training data and (2) the test data changes as you add more training data. Include a plot in your write-up with at least 5 different values for the size of the training data on the x-axis and accuracy on the y-axis, for the training and test sets separately (i.e. show two curves); a sketch of this experiment appears after this list. In the text file, explain what you are observing and why it might be happening.
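
Below is a minimal sketch of the split/tune/evaluate pipeline from steps 1, 5 and 6, assuming the X and Y arrays from the loading sketch above and using scikit-learn's SVC as the black-box classifier. The regularization parameter C is used purely as an example of a tunable parameter; tune whichever parameter your chosen package exposes.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)
    perm = rng.permutation(len(X))                 # shuffle before splitting
    n_train, n_val = int(0.4 * len(X)), int(0.2 * len(X))
    train_idx = perm[:n_train]                     # 40% training
    val_idx = perm[n_train:n_train + n_val]        # 20% validation
    test_idx = perm[n_train + n_val:]              # 40% test

    def accuracy(model, idx):
        # Fraction of samples whose predicted label matches the ground truth.
        return np.mean(model.predict(X[idx]) == Y[idx])

    # Train one model per candidate parameter value; keep the one that does
    # best on the validation set.
    best_val, best_model = -1.0, None
    for C in [0.01, 0.1, 1.0, 10.0, 100.0]:        # 5 example values
        model = SVC(C=C).fit(X[train_idx], Y[train_idx])
        val_acc = accuracy(model, val_idx)
        if val_acc > best_val:
            best_val, best_model = val_acc, model

    print("train accuracy:", accuracy(best_model, train_idx))
    print("test accuracy: ", accuracy(best_model, test_idx))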
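
And here is a sketch of the training-set-size experiment from step 7, reusing the variables and accuracy helper from the previous sketch and plotting with matplotlib; the subset sizes and parameter value are only examples.

    import matplotlib.pyplot as plt

    sizes, train_accs, test_accs = [], [], []
    for frac in [0.2, 0.4, 0.6, 0.8, 1.0]:            # 5 training-set sizes
        n = int(frac * len(train_idx))
        subset = train_idx[:n]   # make sure each subset still contains every class
        model = SVC(C=1.0).fit(X[subset], Y[subset])  # use your chosen parameter value
        sizes.append(n)
        train_accs.append(np.mean(model.predict(X[subset]) == Y[subset]))
        test_accs.append(accuracy(model, test_idx))

    plt.plot(sizes, train_accs, label="train")
    plt.plot(sizes, test_accs, label="test")
    plt.xlabel("number of training samples")
    plt.ylabel("accuracy")
    plt.legend()
    plt.savefig("learning_curve.png")
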
Grading rubric:
  1. Loading and splitting the data into train/test/validation and X/Y parts: 5 pts
  2. Package choice and documentation excerpts: 5 pts
  3. Training with different values of a parameter and using the validation set to pick the best value of that parameter: 10 pts
  4. Demonstrating how train/test error changes as more training data is added, including plot: 10 pts
Part III: Segmentation via Clustering (30 points)

For this problem, you will implement the K-means algorithm. You will then use it to perform image clustering, in order to test your implementation.
  1. Write code to perform clustering over an NxD data matrix (where N is the number of samples and D is the dimensionality of your feature representation) that you receive as input from the user. Your code should output (1) an Nx1 vector containing the cluster membership of each sample (denoted by an index from 1 to K, where K is the number of clusters); (2) a KxD matrix containing the mean/center of each cluster; and (3) the final SSD error of the clustering, i.e. the sum of the squared distances between points and their assigned means, summed over all clusters. (The expected interface is sketched after this list.)
  2. In your K-means function, try 10 random restarts and return the clustering with the lowest SSD error.
  3. You will next test your implementation by applying clustering to segment and recolor an image. Download 10 images from the Berkeley Segmentation Dataset and Benchmark. To make sure running your method doesn't take a long time, downsample (reduce the size of) your 10 chosen images.
  4. To perform segmentation, you need a representation for every image pixel. For simplicity, you will use a three-dimensional feature representation for each pixel, consisting of the R, G and B values of each pixel. You can also include the (x, y) location of each pixel in the feature representation if you wish.
  5. Perform clustering over the pixels of each image. Then recolor the pixels according to their cluster membership: replace each pixel with the average R, G, B values of the center to which that pixel belongs (see the recoloring sketch after this list). Include in your submission the recoloring results for all 10 images, with 3 different values of K for each image.
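
For concreteness, the expected interface for step 1 might look like the sketch below (Python/NumPy terms; the names are illustrative). The algorithm body is deliberately omitted, since you must implement K-means yourself.

    import numpy as np

    def kmeans(data, K, n_restarts=10):
        """Cluster an N x D data matrix into K clusters.

        Returns:
          memberships - (N,) integer array with values in 1..K
          centers     - (K, D) array of cluster means
          ssd         - float, sum of squared distances of points
                        to their assigned means
        """
        # ... your implementation goes here, running n_restarts random
        #     restarts and keeping the solution with the lowest SSD ...
        raise NotImplementedError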
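
The following sketch illustrates how steps 4 and 5 fit together, assuming you have written a kmeans function with the interface sketched above; the image file name is hypothetical, and matplotlib is used only to read and save images.

    import numpy as np
    import matplotlib.pyplot as plt

    img = plt.imread("image1_small.jpg")        # a downsampled image (hypothetical name)
    H, W, _ = img.shape

    # Step 4: one 3-dimensional (R, G, B) feature per pixel -> an N x 3 matrix.
    pixels = img.reshape(-1, 3).astype(float)   # N = H * W rows

    # Step 5: cluster the pixels with YOUR kmeans implementation, then replace
    # each pixel with the mean color of its assigned cluster.
    K = 5
    memberships, centers, ssd = kmeans(pixels, K)
    recolored = centers[memberships - 1].reshape(H, W, 3)   # memberships are 1..K

    plt.imsave("image1_recolored_K5.png",
               np.clip(recolored, 0, 255).astype(np.uint8))
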
Grading rubric:
  1. The correctness of your clustering method implementation: 20 pts
  2. Applying your clustering method on images and recoloring depending on cluster membership: 10 pts
Part IV: Linear Regression (30 points)

In this problem, you will solve a regression problem in two ways: using the direct least-squares solution, and using gradient descent.

You will use the Wine Quality dataset. Use only the red wine data. The goal is to predict the quality score of a wine based on its attributes. First, divide the data into a training set and a test set, using approximately 50% for training. You don't need to use cross-validation for this problem.
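
Here is a minimal loading-and-splitting sketch in Python with NumPy, assuming the red-wine file has been saved locally as winequality-red.csv (it is semicolon-separated with a header row, and the quality score is the last column):

    import numpy as np

    data = np.genfromtxt("winequality-red.csv", delimiter=";", skip_header=1)
    X, y = data[:, :-1], data[:, -1]        # attributes and quality scores

    rng = np.random.RandomState(0)
    perm = rng.permutation(len(X))          # shuffle before splitting
    n_train = len(X) // 2                   # roughly 50% for training
    train, test = perm[:n_train], perm[n_train:]
    X_train, y_train = X[train], y[train]
    X_test, y_test = X[test], y[test]
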
  1. Use the built-in linear system / least-squares solver for the language of your choice. In Matlab, that is the backslash operator: for Ax = b, x = A\b. You need to decide what A, x and b are in the case of linear regression. (A sketch covering this and the gradient descent solution appears after this list.)
  2. Use the resulting solution to predict the wine quality scores on the test data. Then measure and report (in a file part4.txt) the L2 distance between the true and predicted scores.
  3. Now implement the gradient descent solution. For this, you will need to initialize the weights in some way (use either random values or all zeros). Then repeat the following update some number of times (for this problem, 10 iterations): compute the gradient of the error function using all training data points, then adjust the weights in the direction opposite to the gradient.
  4. Apply the solution to the test set, then compute and report the L2 distance as above.
  5. Experiment with different learning rates (e.g. ones in the range 10.^(-5:-3), i.e. 0.00001, 0.0001, 0.001) and report your observations.
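
Below is a sketch of both solutions, assuming the arrays from the loading sketch above. np.linalg.lstsq plays the role of Matlab's backslash operator, and a column of ones is appended so the weight vector includes a bias term (whether and how to include a bias is your own design choice). Note that with unnormalized features, some of the learning rates in step 5 may make the weights diverge; that behavior is worth reporting.

    import numpy as np

    def add_bias(M):
        # Append a constant-1 column so the weight vector includes a bias term.
        return np.hstack([M, np.ones((len(M), 1))])

    A, A_test = add_bias(X_train), add_bias(X_test)
    b = y_train

    # (1) Direct least-squares solution: the NumPy analogue of Matlab's x = A\b.
    w_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
    print("least squares L2 distance:",
          np.linalg.norm(A_test @ w_ls - y_test))

    # (2) Gradient descent on the sum-of-squares error, 10 iterations.
    w = np.zeros(A.shape[1])                # or small random values
    lr = 1e-5                               # try 1e-5, 1e-4, 1e-3 as in step 5
    for _ in range(10):
        grad = 2 * A.T @ (A @ w - b)        # gradient of ||Aw - b||^2
        w = w - lr * grad                   # move opposite to the gradient
    print("gradient descent L2 distance:",
          np.linalg.norm(A_test @ w - y_test))
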
Grading rubric:
  1. Data set up and least squares solution: 10 pts
  2. Gradient descent solution: 20 pts