CS1699: Homework 4
Due: 11/24/2015, 11:59pm
Instructions: Please provide your code and your written answers. Your written answers should be in the form of a single PDF or Word document (.doc or .docx). Your code should be written in Matlab. Zip or tar your written answers and .m files and upload the .zip or .tar file on CourseWeb -> CS1699 -> Assignments -> Homework 4. Name the file YourFirstName_YourLastName.zip or YourFirstName_YourLastName.tar. Include your name in your write-up.
Note: This homework includes up to 30 points of extra credit!
Part I: Scene categorization (60 points)
In this problem, you will develop two variants of a scene categorization system.
- Get the scene categorization dataset provided by Svetlana Lazebnik from here (scene_categorization.zip).
- You will need to extract your own features, using the VLFeat package. Use the function vl_sift. To set up VLFeat, download this binary, and follow these instructions. Make sure to run both steps of the demo to see a SIFT descriptor show up.
- Divide the dataset into a training and test set. Use roughly half of the images from each category/class for training, and the rest for testing. If this is causing your program to run too slowly, feel free to use a smaller sample of both training and test images, but ensure you have at least 20 images from each class for both training and testing, and document your subsampling in your code.
- Compute a spatial pyramid over the features. The spatial pyramid representation was proposed in 2006 by Svetlana Lazebnik, Cordelia Schmid and Jean Ponce. The procedure of computing the pyramid is summarized in the following image from the paper (you don't have to recreate this figure, and don't have to make your image square), and described below.
- First, you will create a ''bag of words'' representation of the features in the image. To do this, you will run k-means on the SIFT feature descriptors of all training images (or a subset of all training images, if k-means is running too slowly). Make sure to include images from all classes in the set of SIFT descriptors on which you run k-means. Use kmeansML.m from HW3. For each feature in each image (both training and test, since you need histograms for all images), you will compute its distance to each cluster representative, and assign it to the closest cluster representative. This will give you the representation shown in the left-hand side of the figure, where the circles, diamonds and crosses denote different ''words'', in this toy example with k = 3. In your implementation, use k = 100.
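The assignment step above can be sketched as follows. This is only a sketch: it assumes descriptors come from vl_sift as a 128xn uint8 matrix and that your kmeansML output has been arranged as 128xk centers (transpose as needed for your actual kmeansML output).

```matlab
% Sketch: assign each SIFT descriptor to its nearest codebook center.
% Assumes descriptors is 128xn (from vl_sift, uint8) and centers is
% 128xk; adjust transposes if your kmeansML returns kx128.
descriptors = double(descriptors);           % vl_sift returns uint8
n = size(descriptors, 2);
k = size(centers, 2);
assignments = zeros(1, n);
for i = 1:n
    % squared Euclidean distance from this descriptor to every center
    dists = sum((centers - repmat(descriptors(:, i), 1, k)).^2, 1);
    [~, assignments(i)] = min(dists);        % index of nearest center
end
```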
- You will then create a histogram where you count how many features of each ''word'' are present in the image. This forms your representation of the image, at level L = 0 of the pyramid.
- Then, divide the image into four quadrants as shown below. You need to know the locations of the feature descriptors so that you know in which quadrant they fall; VLFeat provides these (see documentation). Now you will compute histograms as above, but you will compute one histogram vector for each quadrant.
- In the original paper, there is one more subdivision into sixteen regions as shown below, and computation of one histogram for each cell in the grid. However, you don't have to implement this part. You can implement it if you'd like for up to 10 points of extra credit.
- Finally, you will concatenate the histograms computed in the above steps. Make sure you concatenate all histograms in the same order for all images. This will give you a 1xd-dimensional descriptor.
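The level-0 histogram, the four quadrant histograms, and their concatenation can be sketched as below. The variable names are assumptions: `assignments` is the 1xn vector of word indices from the previous step, `f` is the frame matrix from vl_sift (f(1,:) are x coordinates, f(2,:) are y coordinates), and [h, w] is the image size.

```matlab
% Sketch of the L = 0 and L = 1 histograms, with k = 100 words.
k = 100;
h0 = hist(assignments, 1:k);                 % level 0: whole image
pyramid = h0;
for qy = 1:2                                 % level 1: four quadrants
    for qx = 1:2
        inQuad = f(1,:) > (qx-1)*w/2 & f(1,:) <= qx*w/2 & ...
                 f(2,:) > (qy-1)*h/2 & f(2,:) <= qy*h/2;
        pyramid = [pyramid, hist(assignments(inQuad), 1:k)];
    end
end
% pyramid is now 1 x (5*k), with the quadrants in a fixed order
```

The fixed loop order over quadrants is what guarantees that all images concatenate their histograms in the same order.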
- Now that you have a representation for each image, it is time to learn a classifier which can predict, for a test image, to which of the 15 scene categories/classes it belongs. (Each folder in the scene category dataset is a different category.) For this, you will use two types of classifiers. Note that it is easier to run the classification as a multi-class task, as opposed to binary tasks for each class. KNN and the Matlab function mentioned below can both handle multi-class classification.
- The first type of classifier will be KNN (k nearest neighbors). Note that this k (and its value) is not the same as the k in k-means. For each test image, compute the Euclidean distance between its descriptor and each training image's descriptor (the descriptors are now the Spatial Pyramids). Then find its k closest neighbors among only training images. Since these are training images, you know their labels. Find the mode (most common value; see Matlab's function mode) among the labels, and assign the test image to this label. In other words, the neighbors are "voting" on the label of the test image. The value k you use for KNN will be discussed below.
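The KNN procedure above can be sketched as follows, assuming the cell-array pyramids have already been stacked into matrices (train is Mxd, test is Nxd) and knn is the chosen neighbor count:

```matlab
% Sketch of KNN prediction. train is Mxd, test is Nxd, labels_train is
% Mx1, and knn is the number of neighbors (e.g. 5).
labels = zeros(size(test, 1), 1);
for i = 1:size(test, 1)
    diffs = train - repmat(test(i, :), size(train, 1), 1);
    dists = sqrt(sum(diffs.^2, 2));          % Euclidean distances, Mx1
    [~, order] = sort(dists, 'ascend');
    neighbors = order(1:knn);                % knn closest training images
    labels(i) = mode(labels_train(neighbors));  % neighbors "vote"
end
```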
- The other type of classifier is an SVM. You will use Matlab's function model = fitcecoc(X, Y); where X (of size nxd) contains your features, and Y (of size nx1) contains the labels you want to predict. All images from the same scene category will have the same label. The label values should be integers between 1 and 15. To use the model you just learned, you will call label = predict(model, x); where x of size 1xd is the descriptor for a single scene whose label you want to predict.
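In code, the two calls above look like this (variable names are assumptions; X_train is Mxd, Y_train is Mx1 with integer labels 1..15, X_test is Nxd):

```matlab
% Sketch: train the multi-class SVM and predict test labels.
model = fitcecoc(X_train, Y_train);          % error-correcting output codes
labels = predict(model, X_test);             % predict accepts Nxd at once,
                                             % or a single 1xd descriptor
```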
- Finally, you need to evaluate the accuracy of your classifiers: compute what fraction of the test images was assigned the correct label, i.e., the "ground truth" label that came with the dataset.
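This accuracy computation is a one-liner, assuming trueLabels and predictedLabels are both Nx1 vectors:

```matlab
% Sketch: fraction of test images whose prediction matches ground truth.
accuracy = mean(predictedLabels == trueLabels);
fprintf('Accuracy: %.2f%%\n', 100 * accuracy);
```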
What you need to include in your submission:
- [15 points] function [pyramid] = computeSPMHistogram(im, codebook_centers); which computes the Spatial Pyramid Match histogram as discussed above. im should be a grayscale image whose SIFT features you should extract, codebook_centers should be the cluster centers from the bag-of-visual-words clustering operation, and pyramid should be a 1xd feature descriptor for the image. You're allowed to pass in optional extra parameters after the first two.
- [15 points] function [labels] = findLabelsKNN(pyramids_train, pyramids_test, labels_train); which predicts the labels of the test images using the KNN classifier. pyramids_train and pyramids_test should be an Mx1 cell array and an Nx1 cell array, respectively, where M is the size of the training image set and N is the size of your test image set, and each pyramids{i} is the 1xd Spatial Pyramid Match representation of the corresponding training or test image. labels_train should be an Mx1 vector of training labels, and labels should be an Nx1 vector of predicted labels for the test images.
- [5 points] function [labels] = findLabelsSVM(pyramids_train, pyramids_test, labels_train); which predicts the labels of the test images using an SVM. This function should include training the SVM. The inputs and outputs are defined as above, but now use an SVM.
- [5 pts] function [accuracy] = computeAccuracy(trueLabels, predictedLabels); which computes and prints the accuracy of a classifier on the test images, where trueLabels is the Nx1 vector of ground truth labels that came with the dataset, and predictedLabels is the corresponding Nx1 vector of labels predicted by the classifier.
- [20 pts] A script which gets all images and their labels (feel free to reuse code from HW3 that shows how to get the contents of a directory), extracts the features of the training images, runs kmeansML to find the codebook centers, then computes the SPM representations, and runs the KNN and SVM classifiers, including computing their accuracy. In this script, run the KNN classification with the following values of k (different from the k-means k = 100): 1, 5, 25, 125. In other words, you have to run KNN 4 times and report 4 accuracy values, plus 1 for the SVM. Include your accuracy results in your write-up.
Part II: Pedestrian detection (40 points)
In this problem, you will implement a simple pedestrian detection system. This system is somewhat similar to the 2005 paper by Navneet Dalal and Bill Triggs found here.
- Access the INRIA Person dataset in the same AFS directory as linked above (here). At that link, you will find a separate set for training, and one for testing.
- Each positive (= containing a person) image (in the "pos" directory of the training folder, pedestrian_detection_training_data) is a crop ready to use. You will have to generate the negative data yourself. Use the uncropped images in the "neg" directory, and generate a set of crops that are of the same size as the positive crops. Generate as many negative crops as you have positive crops. An easy way to generate these crops is to cycle through random locations in some negative image, set these to be your top-left of the crop, check where the bottom-right would end up being using the size of the positive crops, and skip this location if it's outside the bounds of the image. If it is inside, get the corresponding pixels (in a matrix), and use imwrite to save the crop as a new image (or skip saving to a new file and directly train with the crop). Make sure you use many different negative images to get your negative crops.
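The random-crop procedure described above can be sketched as follows, assuming negIm is one uncropped negative image and [cropH, cropW] is the positive crop size (the output filename is hypothetical):

```matlab
% Sketch: generate one negative crop of size cropH x cropW from negIm.
[h, w, ~] = size(negIm);
done = false;
while ~done
    top  = randi(h);                         % random top-left corner
    left = randi(w);
    % skip this location if the bottom-right corner falls outside negIm
    if top + cropH - 1 <= h && left + cropW - 1 <= w
        crop = negIm(top:top+cropH-1, left:left+cropW-1, :);
        imwrite(crop, 'neg_crop_001.png');   % hypothetical filename
        done = true;
    end
end
```

Repeat over many different negative images until you have as many negative crops as positive crops.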
- Extract HOG features from all positive and negative patches, using VLFeat's vl_hog function.
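A sketch of the HOG extraction, assuming `crop` is one positive or negative patch; vl_hog expects a single-precision image, and the cell size (8 here) is a free parameter you may tune:

```matlab
% Sketch: extract HOG features for one patch with VLFeat.
hog = vl_hog(im2single(crop), 8);            % returns a 3-D HOG array
feat = hog(:)';                              % flatten to a 1xd row vector
```

Flattening every patch the same way gives you the nxd feature matrix X expected by fitcecoc.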
- Now you can use fitcecoc as above to train a model that can predict, for a new patch in the image, whether it contains a person (positive) or does not (negative).
- After training your classifier, you will use it to find pedestrians in new test images. Pick 5 images from the test data, pedestrian_detection_test_data. You will notice that in most test images, the people are not at the same scale as in the positive crops. Normally, you would handle this by searching over an image pyramid (rescaling the image multiple times in the hope that at some scale a person matches the positive patch size). For simplicity, however, just manually resize the few test images you chose that do contain people so the people appear at visually the correct scale; note that this manual step is something you would never do in actual computer vision applications or research.
- You now have to perform a sliding window detection. For each test image, you will slide a window of the same size as the positive patches, extract the HOG features for that window, and run the SVM on them using predict to see whether the SVM predicts positive (person detected) or negative for that window. Save some windows on which the SVM predicts "positive", and include them in your write-up file.
- To implement sliding window detection, start your window at the top-left corner of the test image. For your second window, move 5-10 pixels to the right from the first window. When you reach a window that's over the right border of the image, move 5-10 pixels down and all the way to the left-hand side of the image. Continue until you have run your sliding window over the whole image.
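The sliding-window loop above can be sketched as follows. Assumptions: testIm is a grayscale test image, [winH, winW] is the positive patch size, `model` is the trained SVM, the positive class has label 1, and extractFeatures is a hypothetical helper that wraps the same HOG extraction used at training time.

```matlab
% Sketch of sliding-window detection with a fixed step.
step = 8;                                    % 5-10 pixels, per the spec
[h, w] = size(testIm);
for top = 1:step:(h - winH + 1)              % move down row by row
    for left = 1:step:(w - winW + 1)         % move right along each row
        window = testIm(top:top+winH-1, left:left+winW-1);
        feat = extractFeatures(window);      % 1xd, same as training
        label = predict(model, feat);
        if label == 1                        % assuming 1 = person
            fprintf('Detection at (%d, %d)\n', top, left);
        end
    end
end
```

Stopping the loop bounds at h - winH + 1 and w - winW + 1 ensures the window never runs over the right or bottom border of the image.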
- For up to 20 points of extra credit, you can also compute how correct your detections are. You will have one correctness score for each predicted person detection in each test image. The score measures how well a predicted positive window matches a ground-truth crop (see the folder pedestrian_detection_evaluation). If that folder contains a crop with the same filename as your test image but with an "a" appended before the file extension, i.e. you know from the dataset that there is a person in that image, then you can use Matlab's intersect(A, B); and union(A, B); functions (where A and B are your ground truth and predicted crops) to compute the Intersection Over Union metric (see below) between any crop predicted positive and the ground truth crop, then take the best overlap as your final score for that predicted crop. If there is no crop with the same image filename but you predict a positive window, your score for that box is 0. Finally, you will compute the precision of your system as the fraction of predicted person detections that have at least 0.5 intersection-over-union with a ground truth crop, and the recall as the fraction of ground truth crops that have at least 0.5 intersection-over-union with some predicted positive window. Include both the precision and recall scores in your write-up.
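For reference, Intersection Over Union for two axis-aligned boxes can also be computed directly from box coordinates, without materializing pixel-index sets. A sketch of such a helper (save it as its own boxIOU.m file), assuming each box is given as [top, left, bottom, right]:

```matlab
function iou = boxIOU(a, b)
% Sketch: Intersection Over Union of two boxes [top, left, bottom, right].
iTop    = max(a(1), b(1));                   % intersection rectangle
iLeft   = max(a(2), b(2));
iBottom = min(a(3), b(3));
iRight  = min(a(4), b(4));
% max(0, .) makes the intersection area zero when the boxes don't overlap
interArea = max(0, iBottom - iTop + 1) * max(0, iRight - iLeft + 1);
areaA = (a(3) - a(1) + 1) * (a(4) - a(2) + 1);
areaB = (b(3) - b(1) + 1) * (b(4) - b(2) + 1);
iou = interArea / (areaA + areaB - interArea);
end
```

This gives the same metric as the intersect/union approach on pixel indices, and is much cheaper for large crops.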
What you need to include in your submission:
- [20 points] A script setup_and_train.m that gets the positive crops and generates the negative crops (feel free to just use a sample of each), extracts their features, and trains an SVM on these features.
- [20 points] A script test.m that implements sliding window detection for a test image, plus your write-up which includes your test images and the predicted person detections in each.
- [up to 20 points of extra credit] A script evaluate.m which computes Intersection Over Union scores, and from those, precision and recall. Also include the precision and recall scores in your write-up.