CS1674: Homework 7 - Programming

Due: 3/27/2018, 11:59pm

This assignment is worth 50 points.

In this assignment, you will develop two variants of a scene categorization system. You will write three functions and two scripts. The first function will compute the spatial pyramid match (SPM) representation. The second and third will find labels for your test images using two classifiers: K nearest neighbors (KNN) and support vector machines (SVM). In the scripts, you will set up the dataset, divide it into a training set and a test set, call your first function to compute the SPM representation, and call your other functions to compute labels for the test data in different ways. You will then compare the performance of different levels of the SPM representation and of the different classifiers.

Read the script load_split_dataset.m so that you know how to use its outputs. Specify the path where you downloaded and extracted the dataset, then run the script (it will take a few minutes). In brief, this script loads the scene categorization dataset provided with the assignment (including features and labels), divides the dataset into a training set and a test set by randomly choosing 30 images from each class for training and another 30 for testing, uses the folder membership of images to create labels, and runs K-means clustering. It outputs the following variables: train_labels, test_labels, train_images, test_images, train_sift, test_sift. The first two are vectors of size Mx1 and Nx1, where M is the total number of training images and N is the total number of test images. The remaining variables are cell arrays of size M or N, and contain the grayscale images and the collection of SIFT features for each image.

SIFT features for each image are extracted for you using the vl_sift function of the VLFeat package. Each feature .mat file contains two variables: f and d. Each column in the first variable matches a column in the second variable; both correspond to the same descriptor. You need the first two entries in each column of f to determine the (x, y) coordinates of the descriptor stored in that column of d. You can read more about the format of these two variables in the VLFeat documentation for vl_sift. Images and SIFT features are provided on CourseWeb. For computational reasons (time), we will only use 10 categories, and a randomly chosen set of 60 images from each.
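
As a minimal sketch of how the two variables line up, the snippet below builds a small synthetic f and d and reads off the (x, y) location of each descriptor. The values are made up and stand in for a loaded feature file. The sketch assumes the vl_sift layout in which each column of f is [x; y; scale; orientation] and each column of d is a 128-dimensional descriptor.

```matlab
% Synthetic stand-ins for the f and d variables of one feature file
% (values are made up; real files hold one column per detected keypoint).
f = [10.5 200.0 35.2;    % row 1: x coordinates
     20.0  15.5 90.1;    % row 2: y coordinates
      2.0   3.5  1.2;    % row 3: scale
      0.1   1.4  3.0];   % row 4: orientation (radians)
d = uint8(randi([0 255], 128, 3));   % one 128-D descriptor per column

x = f(1, :);             % x coordinate of each descriptor
y = f(2, :);             % y coordinate of each descriptor

% Column j of f and column j of d describe the same keypoint.
j = 2;
fprintf('Descriptor %d is located at (x, y) = (%.1f, %.1f)\n', j, x(j), y(j));
```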


Part I: Computing the SPM representation (10 points)

The spatial pyramid representation was proposed in 2006 by Svetlana Lazebnik, Cordelia Schmid and Jean Ponce, and won the "test of time" award at CVPR 2016. The procedure of computing the pyramid is summarized in the following image from the paper, and described below.



Write the following: function [pyramid, level_0, level_1, level_2] = computeSPMRepr(im, sift, means); which computes the Spatial Pyramid Match histogram as described in class.

Inputs: Outputs: Instructions:
  1. [2 pts] First, create a "bag of words" histogram representation of the features in the image, using the provided function [bow] = computeBOWRepr(descriptors, means). This will give you the representation shown in the left-hand side of the figure, where the circles, diamonds and crosses denote different "words". This toy example uses K = 3; in your submission, use K = 50. This forms your representation of the image at level L = 0 of the pyramid.
  2. [7 pts] Then, divide the image into four quadrants as shown below. You need to know the locations of the feature descriptors so that you know in which quadrant they fall; VLFeat provides these (see documentation for vl_sift) and they are stored in the f variable. Compute four BOW histograms, using the computeBOWRepr function, but generating a separate BOW representation for each quadrant. The concatenation of the four histograms is your level-1 representation of the image.



  3. [5 pts extra credit] In the original paper, there is one more subdivision into sixteen regions as shown below, and computation of one histogram for each cell in the grid. This is the level-2 representation of the image.



  4. [1 pt] Finally, concatenate the level-0, level-1, and level-2 representations computed in the above steps. This will give you the final image representation, and should be saved in the pyramid variable.
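
The steps above can be sketched on toy data as follows. This is an illustration under stated assumptions, not the expected solution: all inputs are made up, the provided computeBOWRepr is replaced by a simple stand-in (unnormalized counts of nearest-center assignments, which may differ from what the provided function returns), and only levels 0 and 1 are shown. Level 2 would follow the same pattern on a 4x4 grid.

```matlab
% Toy inputs (all values made up for illustration).
K = 3;                                  % toy vocabulary size; the assignment uses K = 50
im = zeros(100, 120);                   % grayscale image: 100 rows (y), 120 columns (x)
xy = [10 10; 100 20; 30 80; 110 90; 15 25];   % one (x, y) location per descriptor
desc = [0 0; 5 5; 0 1; 9 9; 5 4];       % one toy 2-D descriptor per row
means = [0 0; 5 5; 9 9];                % K cluster centers, one per row

% Stand-in for the BOW step: assign each descriptor to its nearest center.
d2 = sum(desc.^2, 2) + sum(means.^2, 2)' - 2 * desc * means';   % n x K squared distances
[~, word] = min(d2, [], 2);             % word(i) = index of the nearest center

% Histogram of word counts over the descriptors selected by a logical mask.
bow = @(mask) sum(word(mask) == (1:K), 1);

% Level 0: one histogram over the whole image.
level_0 = bow(true(size(word)));

% Level 1: one histogram per quadrant, split at the image midlines.
[h, w] = size(im);
right  = xy(:, 1) > w / 2;
bottom = xy(:, 2) > h / 2;
level_1 = [bow(~right & ~bottom), bow(right & ~bottom), ...
           bow(~right & bottom),  bow(right & bottom)];

% Concatenate the levels computed so far (level 2 is omitted in this sketch).
pyramid = [level_0, level_1];
```

With K words, level 0 contributes K entries and level 1 contributes 4K, so this toy pyramid has 15 entries.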

Part II: Training and obtaining labels from two classifiers (15 pts)

In this part, you will write functions to obtain labels on the test data from two classifiers, support vector machines (SVM) and K nearest neighbors (KNN). Note that the value of k in KNN is distinct from the value K in K-means; we'll use k to denote the former and K to denote the latter.

Write the following functions:
  1. [10 pts] function [predicted_labels_test] = findLabelsKNN(pyramids_train, labels_train, pyramids_test, k); which predicts the labels of the test images using the KNN classifier. For each test image, compute the Euclidean distance between its descriptor and each training image's descriptor (the descriptors are now the spatial pyramids). Then find its k closest neighbors among the training images only; you can use the provided dist2 code. Find the mode (most common value; see Matlab's function mode) among the labels of these k neighbors, and assign the test image to this label. In other words, the neighbors are "voting" on the label of the test image. You have to write your own code, and you are NOT allowed to use the built-in Matlab function for KNN!
    Inputs: Outputs:

  2. [5 pts] function [predicted_labels_test] = findLabelsSVM(pyramids_train, labels_train, pyramids_test); which predicts the labels of the test images using an SVM. This function should include training the SVM. The inputs and outputs are defined as above, but now we will use an SVM to determine the outputs. Use the Matlab built-in SVM functions for training and prediction. To train a model, use model = fitcecoc(X, Y); where X (of size mxd) are your features, and Y (of size mx1) are the labels you want to predict. To use the model you just learned, call label = predict(model, X_test); where X_test (of size nxd) are the descriptors for the scenes whose labels you want to predict.
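
The neighbor "voting" described for KNN can be sketched on toy data as follows. The data are made up, with one example per row, and the distances are computed directly rather than with the provided dist2, whose interface is not shown here. Treat this as an illustration of the idea, not the required implementation.

```matlab
% Toy data (made up): 2-D features, two classes, one example per row.
X_train = [0 0; 0 1; 1 0; 5 5; 5 6; 6 5];
y_train = [1; 1; 1; 2; 2; 2];
X_test  = [0.5 0.5; 5.5 5.5; 1 1];
k = 3;

% Squared Euclidean distance between every test row and every training row.
% Sorting by squared distance gives the same neighbor order as sorting by distance.
D = sum(X_test.^2, 2) + sum(X_train.^2, 2)' - 2 * X_test * X_train';   % n_test x n_train

% Sort each row so the closest training examples come first.
[~, order] = sort(D, 2);
nearest = order(:, 1:k);                 % indices of the k nearest training examples

% Look up the neighbors' labels and take a majority vote per test example.
neighbor_labels = reshape(y_train(nearest), size(nearest));   % n_test x k
predicted = mode(neighbor_labels, 2);    % most common label in each row
```

Note that mode returns the smallest value when several labels are tied, which is one reason odd values of k are a common choice.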

Part III: Comparing approaches (25 pts)

In this part, you will compare the KNN and SVM classifiers using the SPM representation. You will also compare how the same classifier performs when it uses different levels of the SPM pyramid. Your classifiers will predict to which scene category each test image belongs.

  1. [10 pts] In a script compare_representations.m:
    1. [5 pts] Call your computeSPMRepr to compute the spatial pyramid match representation on top of the extracted features, for all train/test images, and store the resulting representations in appropriate variables.
    2. [5 pts] Use an SVM classifier. Compare the quality of three representations: pyramid, level_0 and level_1. In other words, compare the full SPM representation to its constituent parts, which are the level-0 histogram and the concatenation of the four histograms in level-1. Compute the accuracy of each representation by measuring what fraction of the test images was assigned the correct label. In a file results1.txt, describe your findings, and give your explanation of the performance of the different representations.

  2. [15 pts] In a script compare_classifiers.m, do the following steps (you can interleave them as you wish, order does not have to be as shown). You can assume the previous script has been run first, so you don't have to recompute the SPM representations.
    1. [5 pts] Apply the SVM and KNN classifiers (i.e. call findLabelsSVM, findLabelsKNN) to predict labels on the test set, using the pyramid variable as the representation for each image. For KNN, use the values k = 1:2:15 (i.e. 1, 3, 5, ..., 15). Each value of k gives a different KNN classifier.
    2. [2 pts] Compute the accuracy of each classifier on (1) the training set, and (2) the test set, by comparing its predictions with the "ground truth" labels.
    3. [5 pts] Plot the training and test accuracy of both types of classifiers, using the values of k on the x-axis, and accuracy on the y-axis. Since SVM does not depend on the value of k, plot its performance as a horizontal line. Save the result as results.png and submit it. For reference, my plot is as follows, but I have omitted some values of k intentionally. Label your axes and show a legend. Useful functions: plot, xlabel, ylabel, legend.
    4. [3 pts] Finally, in a text file results2.txt, explain what you see in your plot (using the full range of k values), and explain the trends on the training and test sets you see as k increases.



  3. [2 pts extra credit] Include level-2 in your comparison above, and give possible reasons for its performance relative to the other representations.
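
The accuracy computation and the plot layout described in the steps above can be sketched as follows. The labels are toy values, and the plotted accuracies are random placeholders used only to show the layout. They are NOT results and carry no information about the trends you should expect.

```matlab
% Toy ground truth and predictions (made up) to show the accuracy computation.
truth     = [1; 1; 2; 2; 3; 3];
predicted = [1; 2; 2; 2; 3; 1];
accuracy  = mean(predicted == truth);    % fraction of correctly labeled images

% Random placeholder accuracies, NOT real results: replace them with values
% computed from your own classifiers as shown above.
ks           = 1:2:15;
knn_test_acc = rand(size(ks));
svm_test_acc = rand();

figure;
plot(ks, knn_test_acc, 'b-o');
hold on;
plot(ks, svm_test_acc * ones(size(ks)), 'r--');   % SVM does not depend on k
hold off;
xlabel('k (number of neighbors)');
ylabel('accuracy');
legend('KNN (test)', 'SVM (test)');
% saveas(gcf, 'results.png');   % uncomment to save the figure
```

The training-set curves can be added in the same way, with one more plot call per curve and a matching legend entry.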


Submission: Provided for you on CourseWeb: