CS1674: Homework 8 - Programming

Due: 11/10/2016, 11:59pm

This assignment is worth 50 points.

In this problem, you will implement a zero-shot recognition system which resembles the system proposed in Christoph Lampert et al.'s paper. All you need to know about this system is that it models the probability of a certain object (e.g. polar bear) being present in the image using the probabilities of being present for each of the attributes that the object is known to have. For example, if we detect the attributes "white", "furry", "not lean", "not brown", "bulbous", etc. in the image, i.e. attributes that a polar bear is known to have, we can be fairly confident that there is a polar bear in the image. Hence, we can recognize a polar bear without ever having seen a polar bear, if (1) we know what attributes a polar bear has, and (2) we have classifiers trained for these attributes, using images from other object classes (i.e. other animals).

We will compare this zero-shot baseline against an SVM classifier which does see training samples from the categories of interest. We'll treat this SVM as an upper bound because it sees the features of the training data, and the zero-shot method does not. Follow the steps below to implement the SVM and the zero-shot approach.

  1. First, copy the Animals with Attributes dataset (originally appearing here) from the following Pitt AFS directory. The dataset includes 50 animal categories, 85 attributes, and 30,475 images (of which we'll use a small sample). The dataset provides a 50x85 predicate-matrix-binary.txt (predicate=attribute). An entry (i, j)=1 in the matrix says that the i-th class has the j-th attribute (e.g. a bear is white), and an entry of (i, j)=0 says that the i-th class doesn't have the j-th attribute (e.g. a bear is not white). You will use SIFT features (already provided).
  2. The paper splits the object classes (not images) into a training (seen) and test (unseen) set, for purposes of zero-shot recognition. In this scenario, the training classes are animals that your system will see, i.e. ones whose images the system has access to. In contrast, the test set contains classes (animals) for which your system will never see example images. Note this is different from the general recognition setup, where the system just does not get to see the labels on the test data, but it does see the test data during testing. The 40 training classes are given in trainclasses.txt and the 10 test classes are given in testclasses.txt. At test time, we will assume that a query image can only be classified as belonging to one of the 10 unseen classes, so chance performance (randomly guessing the label) will be 10%.
  3. See the provided script zero_shot_setup.m for how to read these files. This script also trains one classifier for each of the attributes for you, and computes the probability that each attribute is present in each image. Read this script first. All outputs from it are saved in a MAT file that is provided. Don't run the script, because it takes a while to run.
  4. In that script, we use images from the training classes (or rather, their feature descriptors) to train a classifier for each of the 85 attributes. The predicate matrix mentioned above tells us which animals have which attributes. So if a bear is brown, we assign the "brown=1" tag to all of its images. Similarly, if a dalmatian is not brown, we assign the tag "brown=0" to all of its images. We use the images tagged with "brown=1" as the positive data and the images tagged with "brown=0" as the negative data for the "brown" classifier. We now have one classifier for each attribute. We next apply each attribute classifier j to each image l belonging to any of the test classes.
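The attribute-classifier training described above is already done for you in zero_shot_setup.m, but as a rough illustration, training one attribute classifier could look like the following sketch. The variable names (train_feats, train_animals, predicate_matrix, test_feats) are assumptions for illustration; the provided script defines the real ones.

```matlab
% Hypothetical sketch: train a classifier for one attribute (e.g. attribute j).
% Assumes train_feats is an NxD matrix of descriptors, train_animals gives the
% class index of each training image, and predicate_matrix is the 50x85
% binary attribute matrix from predicate-matrix-binary.txt.
j = 5;  % index of the attribute to train

% Tag each training image with its class's value for attribute j (0 or 1):
attr_labels = predicate_matrix(train_animals, j);

% Train a binary SVM for this attribute:
attr_model = fitcsvm(train_feats, attr_labels);

% Convert SVM scores to posterior probabilities:
attr_model = fitPosterior(attr_model);

% Columns of probs give P(attribute_j = 0 | x) and P(attribute_j = 1 | x):
[~, probs] = predict(attr_model, test_feats);
```

In the provided setup these per-attribute probabilities are precomputed and stored in attr_probs, so you do not need to run anything like this yourself.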

  5. To implement the zero-shot recognition method:

  6. Your task is to predict which animals are present in each test image. (This corresponds to Equation 2 in the paper.) As a pre-processing step, we will actually first split the data from the 10 "unseen" animals into a training and test set. Let's call these "set A" and "set B". (We need to do this split because of the SVM upper-bound below.) To split the data, we take roughly 50% of the data from each of the 10 test/unseen classes to be training data ("set A"), and the rest to be test data ("set B"). This is done for you in zero_shot_setup.m.
  7. To perform classification of a query test image from "set B", you will assign it to the test class (out of 10) whose attribute "signature" it matches the best. How can we compute the probability that an image belongs to some animal category? Let's use a toy example where we only have 2 animal classes and 5 attributes. We know (from a predicate matrix like the one discussed above) that the first class has the first, second, and fifth attributes, but does not have the third and fourth. Then the probability that the query image (with descriptor x) belongs to this class is P(class = 1|x) = P(attribute_1 = 1|x) * P(attribute_2 = 1|x) * P(attribute_3 = 0|x) * P(attribute_4 = 0|x) * P(attribute_5 = 1|x). The "|x" notation means "given x", i.e. we compute some probability using the image descriptor x. Let's say the second class is known to have attributes 3 and 5, and no others. Then the probability that the query image belongs to this class is P(class = 2|x) = P(attribute_1 = 0|x) * P(attribute_2 = 0|x) * P(attribute_3 = 1|x) * P(attribute_4 = 0|x) * P(attribute_5 = 1|x). The P(attribute_i = 0|x) and P(attribute_i = 1|x) probabilities are stored as the first and second columns, respectively, in the variable attr_probs{i}.
  8. You will assign the image with descriptor x to that class i which gives the maximal P(class = i|x). For example, if P(class = 1|x) = 0.80 and P(class = 2|x) = 0.20, then you will assign x to class 1. You can call [~, ind] = max(probs); (or [~, ind] = max(probs, [], 2); depending on the orientation of your matrix) on a vector of probabilities such that probs(i) is P(class = i); then ind will give you the "winning" class to which x should be assigned.
  9. You will classify each test image from "set B", and compute the average accuracy.
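Putting steps 6-9 together, the zero-shot classification loop could be sketched as below. The variable names (set_B_feats, set_B_labels, test_class_ids) are assumptions for illustration; attr_probs and the predicate matrix are as described above, and you should adapt the names to what zero_shot_setup.m actually provides.

```matlab
% Hedged sketch of zero-shot classification over "set B".
% Assumes: set_B_feats has one row per test image, set_B_labels holds the
% ground-truth class index (1..10) of each image, test_class_ids maps the 10
% unseen classes to rows of the 50x85 predicate_matrix, and attr_probs{j}(l,:)
% holds [P(attr_j = 0 | image l), P(attr_j = 1 | image l)].
num_attr = 85;
num_classes = numel(test_class_ids);
num_images = size(set_B_feats, 1);
predictions = zeros(num_images, 1);

for l = 1:num_images
    class_probs = ones(1, num_classes);
    for c = 1:num_classes
        % 0/1 attribute signature of this unseen class:
        signature = predicate_matrix(test_class_ids(c), :);
        for j = 1:num_attr
            % signature(j)+1 picks column 1 (attr absent) or 2 (attr present):
            class_probs(c) = class_probs(c) * attr_probs{j}(l, signature(j) + 1);
        end
    end
    [~, ind] = max(class_probs);  % winning class for image l
    predictions(l) = ind;
end

accuracy = mean(predictions == set_B_labels);
fprintf('Zero-shot accuracy on set B: %.2f%%\n', 100 * accuracy);
```

Since the product of 85 probabilities can underflow, one common design choice is to sum log-probabilities instead of multiplying raw probabilities; both orderings give the same argmax.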

  10. To compute the SVM upper bound:
  11. The multi-class training SVM function in Matlab is called fitcecoc. You call it as model = fitcecoc(X_train, Y_train); where X_train are your training data features, and Y_train are your training labels. X_train, Y_train must have the same number of rows (that correspond to the number of training samples). Your X_train, Y_train will correspond to set_A_samples and set_A_animals.
  12. The test function is called predict and you call it as labels = predict(model, X_test); where X_test are the features of the test data, with rows corresponding to samples/feature vectors. Your X_test will be the features of the images in "set B".
  13. For reference, my accuracy with the SVM version is around 30%, and with the zero-shot version is around 15%. Note these are significantly lower than in the paper, because in zero_shot_setup I subsample the train/test data to use only a small set of data, for faster running of our functions.
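The SVM upper bound in steps 11-12 amounts to just a few calls. The sketch below assumes set_A_samples, set_A_animals, set_B_samples, and set_B_animals are the variables provided by zero_shot_setup.m, with numeric class labels; if your labels are stored differently (e.g. as strings), adjust the accuracy comparison accordingly.

```matlab
% Minimal sketch of the SVM upper bound using the calls described above.
% set_A_* are training data from the unseen classes, set_B_* are test data.
model = fitcecoc(set_A_samples, set_A_animals);   % train multi-class SVM
labels = predict(model, set_B_samples);           % classify set B images

% Fraction of correctly labeled test images (assumes numeric labels):
accuracy = mean(labels == set_B_animals);
fprintf('SVM accuracy on set B: %.2f%%\n', 100 * accuracy);
```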
What to include in your submission:
  1. [30 pts] A script zero_shot.m which implements zero-shot recognition, and at the end prints the average accuracy on "set B" from zero-shot recognition.
  2. [15 pts] A script svm.m which calls the appropriate train/test functions as shown above, and at the end computes accuracy on "set B" from an SVM that trains on images from the unseen classes ("set A"), unlike the zero-shot method, which never trains on those classes and only uses their attribute signatures.
  3. [5 pts] A text file results.txt which shows one accuracy for the zero-shot version and one for the SVM version.