CS1674: Homework 9 - Programming

Due: 11/17/2016, 11:59pm

This assignment is worth 50 points.

In this problem, you will implement a simple pedestrian detection system. This system is similar to the one described in the 2005 paper by Navneet Dalal and Bill Triggs, "Histograms of Oriented Gradients for Human Detection," found here.

Access the INRIA Person dataset in the same AFS directory as linked above (here). At that link, you will find separate sets for training and testing.
Each positive image (that is, one containing a person) in the "pos" directory of the training folder, pedestrian_detection_training_data, is a crop ready to use. You will have to generate the negative data yourself. Use the uncropped images in the "neg" directory to generate a set of crops of the same size as the positive crops, and generate as many negative crops as you have positive crops. An easy way to do this is to pick random locations in a negative image and treat each one as the top-left corner of a crop. Using the size of the positive crops, compute where the bottom-right corner would fall, and skip the location if that corner is outside the bounds of the image. If it is inside, extract the corresponding pixels (as a matrix) and use imwrite to save the crop as a new image (or skip saving to a file and train directly on the crop). Make sure you draw your negative crops from many different negative images.
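The crop-sampling procedure above can be sketched as follows. This is only an illustration of the bounds check, written in Python with the standard library rather than Matlab. The function name, the retry limit, and the 64x128 crop size in the usage note are assumptions made for the example, not part of the assignment. Coordinates here are 0-based with an exclusive bottom-right corner, whereas Matlab indices are 1-based.

```python
import random


def sample_crop_boxes(img_w, img_h, crop_w, crop_h, n_crops, rng=None, max_tries=1000):
    """Return up to n_crops boxes (left, top, right, bottom) that fit in the image.

    right and bottom are exclusive. max_tries is a hypothetical safety limit
    so the loop ends even when the crop is larger than the image.
    """
    rng = rng or random.Random()
    boxes = []
    tries = 0
    while len(boxes) < n_crops and tries < max_tries:
        tries += 1
        # Pick a random top-left corner anywhere in the image.
        left = rng.randrange(img_w)
        top = rng.randrange(img_h)
        # Work out where the bottom-right corner would end up.
        right = left + crop_w
        bottom = top + crop_h
        # Skip this location if the crop would leave the image.
        if right > img_w or bottom > img_h:
            continue
        boxes.append((left, top, right, bottom))
    return boxes
```

For instance, `sample_crop_boxes(320, 240, 64, 128, 5)` would return five boxes inside a hypothetical 320x240 negative image. In Matlab, the pixels for a box would then be read with ordinary matrix indexing before calling imwrite.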
Extract HOG features from all positive and negative patches, using VLFeat's vl_hog function.
Now you can use the built-in Matlab SVM (fitcecoc, which you used in HW8P, or fitcsvm, which is specifically for two-class problems and uses similar syntax) to train a model that can predict, for a new patch in the image, whether it contains a person (positive) or does not (negative).
After training your classifier, you will use it to find pedestrians in new test images. Pick 10 images from the test data, pedestrian_detection_test_data. You will note that in most test images, the people are not at the same scale as in the positive crops. Normally, you would search for a person at multiple scales. For simplicity, however, resize the test images you chose that contain people so that each person is roughly the same size as the people in the positive crops, even though this is something you would never do in actual computer vision applications or research. Also include some test images that do not contain people, to see how many false positives your system returns.
Next, perform sliding window detection. For each test image, slide a window of the same size as the positive patches across the image, extract the HOG features for that window, and run the SVM on it using predict to see whether the SVM predicts positive or negative (person or no person) for that window. Save 10 windows on which the SVM predicts "positive", and include them in your submission.
To implement sliding window detection, start your window at the top-left corner of the test image. For your second window, move 5-10 pixels to the right of the first window. When you reach a window that extends past the right border of the image, move 5-10 pixels down and return all the way to the left-hand side of the image. Continue until you have run your sliding window over the whole image.
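The traversal order described above can be sketched as a small generator. It is written in Python for illustration only, and the function name and the default step of 8 pixels (an arbitrary value in the suggested 5-10 range) are assumptions. It yields only windows that lie fully inside the image, so the window that would cross the right or bottom border is never produced. Feature extraction and the call to predict are deliberately left out.

```python
def sliding_windows(img_w, img_h, win_w, win_h, step=8):
    """Yield the (left, top) corner of every window that fits in the image.

    Windows are visited left to right, then top to bottom. Coordinates are
    0-based; add 1 when porting to Matlab indexing.
    """
    for top in range(0, img_h - win_h + 1, step):
        for left in range(0, img_w - win_w + 1, step):
            yield left, top
```

For a hypothetical 20x10 image and a 10x10 window with a step of 5, the generator yields (0, 0), (5, 0), and (10, 0). An image smaller than the window yields nothing.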
For simplicity, you will score precision and recall manually rather than automatically. Count how many of the boxes your system returns as "positive" actually contain people, and how many do not. Then compute precision and recall as discussed in class. To compute recall, simply count what fraction of the people in the images were detected.
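As a sketch of the arithmetic, the manual counts can be turned into the two scores as shown below. The function and argument names are hypothetical, and the numbers in the usage note are made up for illustration. Precision is taken here as the fraction of predicted-positive boxes that contain a person, and recall as the fraction of people who were detected, following the description above.

```python
def precision_recall(n_boxes, n_correct_boxes, n_people, n_people_found):
    """Compute (precision, recall) from manually counted totals.

    n_boxes:         boxes the system predicted positive
    n_correct_boxes: predicted boxes that actually contain a person
    n_people:        ground-truth people in the test images
    n_people_found:  ground-truth people covered by at least one box
    A score is reported as 0.0 when its denominator is zero (an assumption
    made here to avoid division by zero).
    """
    precision = n_correct_boxes / n_boxes if n_boxes else 0.0
    recall = n_people_found / n_people if n_people else 0.0
    return precision, recall
```

For example, with 40 predicted boxes of which 10 contain people, and 6 of 8 people detected, `precision_recall(40, 10, 8, 6)` returns `(0.25, 0.75)`.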
For reference only: You can also compute precision and recall automatically, based on the overlap between predicted and ground-truth positive bounding boxes. Overlap is computed as intersection over union. A box predicted positive is correct if its intersection over union with any ground-truth positive box is at least 0.5.
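A minimal sketch of the intersection-over-union test follows, in Python for illustration. It assumes boxes are given as (left, top, right, bottom) with an exclusive bottom-right corner; that representation and the function names are choices made for this example, not something the assignment specifies.

```python
def iou(box_a, box_b):
    """Intersection over union of two (left, top, right, bottom) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero for disjoint boxes.
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred_box, gt_boxes, threshold=0.5):
    """True if pred_box overlaps any ground-truth box with IoU >= threshold."""
    return any(iou(pred_box, gt) >= threshold for gt in gt_boxes)
```

Two identical boxes have an overlap of 1.0, and two 10x10 boxes offset by 5 pixels horizontally have an overlap of 50/150, which falls below the 0.5 threshold.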

What you need to include in your submission:

[20 points] A script setup_and_train.m that gets the positive crops and generates the negative crops (feel free to just use a sample for each), extracts their features, and trains an SVM with these.
[20 points] A script test.m that implements sliding window detection for a test image.
[5 points] A text file results.txt that lists how many bounding boxes your system predicted positive, how many of them really were positive, how many ground-truth persons were in the test set, and what the precision/recall of your system is.
[5 points] 10 windows with positive detections, in jpg/png format.