CS1674: Homework 4 - Programming

Due: 2/23/2018, 11:59pm

This assignment is worth 50 points.

Part I: Feature Description (25 points)

In this problem, you will implement a feature description pipeline, as discussed in class. While you will not implement it exactly, the SIFT paper by David Lowe is a useful resource, in addition to Section 4.1 of the Szeliski textbook.

Use the following signature: function [features, x, y, scores] = compute_features(x, y, scores, Ix, Iy). The inputs/outputs x, y, scores, Ix, Iy are defined in HW3P. The output features is an Nxd matrix, whose n-th row contains the d-dimensional descriptor for the n-th keypoint. We'll simplify the histogram creation procedure a bit, compared to the original implementation presented in class. In particular, we'll compute a descriptor with dimensionality d=8 (rather than 4x4x8), which contains an 8-dimensional histogram of gradients computed from an 11x11 grid centered around each detected keypoint (i.e. a -5:+5 neighborhood horizontally and vertically). A sketch putting these steps together appears after the numbered list below.
  1. [5 pts] If any of your detected keypoints is less than 5 pixels from the top/left or less than 5 pixels from the bottom/right of the image, i.e. it lacks 5+5 neighbors in either the horizontal or vertical direction, erase that keypoint from the x, y, scores vectors at the start of your code and do not compute a descriptor for it.
  2. [5 pts] To compute the gradient magnitude m(x, y) and gradient angle θ(x, y) at point (x, y), take L to be the image and use the formula below shown in class, together with Matlab's atand, which returns values in the range [-90, 90]:

     m(x, y) = sqrt( (L(x+1, y) - L(x-1, y))^2 + (L(x, y+1) - L(x, y-1))^2 )
     θ(x, y) = atand( (L(x, y+1) - L(x, y-1)) / (L(x+1, y) - L(x-1, y)) )

     If the gradient magnitude is 0, then both the x and y gradients are 0, and you should ignore the orientation for that pixel (since it won't contribute to the histogram).
  3. [5 pts] Quantize the gradient orientations into 8 bins (so put values between -90 and -67.5 degrees in one bin, values between -67.5 and -45 degrees in another bin, etc.). For example, you can have a variable of the same size as the image that records to which bin (1 through 8) the gradient at each pixel belongs.
  4. [5 pts] To populate the SIFT histogram, consider each of the 8 bins. To populate the first bin, sum the gradient magnitudes of the pixels whose orientations fall between -90 and -67.5 degrees. Repeat analogously for the remaining bins.
  5. [5 pts] Finally, normalize each descriptor to unit length (e.g. using hist_final = hist_final / sum(hist_final);), clip all values greater than 0.2 to 0.2 as discussed in class, and normalize again after the clipping. You do not have to implement any of the more sophisticated details from the Lowe paper.
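Below is a minimal sketch of how these steps could fit together in Matlab. It is only one possible implementation, not the required one, and it assumes that x indexes columns, y indexes rows, and Ix, Iy are the full horizontal/vertical gradient images from HW3P; adjust the indexing if your HW3P code uses a different convention.

function [features, x, y, scores] = compute_features(x, y, scores, Ix, Iy)
  [num_rows, num_cols] = size(Ix);

  % 1. Discard keypoints that lack a full -5:+5 neighborhood.
  keep = x > 5 & x <= num_cols - 5 & y > 5 & y <= num_rows - 5;
  x = x(keep);  y = y(keep);  scores = scores(keep);

  % 2. Gradient magnitude and orientation at every pixel (atand is in [-90, 90]).
  magnitude = sqrt(Ix.^2 + Iy.^2);
  orientation = atand(Iy ./ Ix);

  % 3. Quantize orientations into 8 bins of 22.5 degrees each;
  %    bin 1 covers [-90, -67.5), bin 8 covers [67.5, 90].
  bins = min(floor((orientation + 90) / 22.5) + 1, 8);

  features = zeros(numel(x), 8);
  for n = 1:numel(x)
    rows = y(n)-5 : y(n)+5;          % 11x11 neighborhood around the keypoint
    cols = x(n)-5 : x(n)+5;
    patch_mag = magnitude(rows, cols);
    patch_bin = bins(rows, cols);

    % 4. Sum gradient magnitudes per orientation bin; zero-magnitude pixels
    %    contribute nothing.
    hist_final = zeros(1, 8);
    for b = 1:8
      hist_final(b) = sum(patch_mag(patch_bin == b & patch_mag > 0));
    end

    % 5. Normalize, clip values above 0.2, normalize again
    %    (assumes at least one nonzero gradient in the patch).
    hist_final = hist_final / sum(hist_final);
    hist_final = min(hist_final, 0.2);
    hist_final = hist_final / sum(hist_final);
    features(n, :) = hist_final;
  end
end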
Part II: Image Description with SIFT Bag-of-Words (8 points)

In this part, you will compute a bag-of-words histogram representation of an image. The histogram for image Ij is a k-dimensional vector: F(Ij) = [ freq1, j    freq2, j    ...    freqk, j ], where each entry freqi, j counts the number of occurrences of the i-th visual word in image j, and k is the number of total words in the vocabulary.
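For example, if k = 3 and the descriptors of image Ij map to visual words 1, 1, 3, 2, 1, then F(Ij) = [3 1 1] (before the normalization described below).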

Use the following function signature: function [bow_repr] = computeBOWRepr(features, means) where bow_repr is a normalized bag-of-words histogram, features is the Mx8 set of descriptors computed for the image (output by the function you implemented in Part I above), and means is the kx8 set of cluster means, which is provided for you.
  1. [1 pt] A bag-of-words histogram has as many dimensions as the number of clusters k, so initialize the bow variable accordingly.
  2. [3 pts] Next, for each feature (i.e. each row in features), compute its distance to each of the cluster means, and find the closest mean. A feature is thus conceptually "mapped" to the closest cluster. You can do this efficiently using the provided dist2 function, whose use is described at the top of the function.
  3. [3 pts] To compute the bag-of-words histogram, count how many features are mapped to each cluster.
  4. [1 pt] Finally, normalize the histogram by dividing each entry by the sum of the entries. (A sketch putting these steps together follows the list.)
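Below is a minimal sketch of one way to implement this, assuming dist2(a, b) returns the matrix of distances between each row of a and each row of b (as described at the top of the provided dist2.m).

function [bow_repr] = computeBOWRepr(features, means)
  k = size(means, 1);                % number of clusters / visual words
  bow = zeros(1, k);

  % Distance from every feature (row of features) to every cluster mean.
  d = dist2(features, means);        % M x k matrix
  [~, closest] = min(d, [], 2);      % index of the nearest mean for each feature

  % Count how many features are mapped to each cluster.
  for i = 1:k
    bow(i) = sum(closest == i);
  end

  % Normalize so the entries sum to 1.
  bow_repr = bow / sum(bow);
end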
Part III: Image Description with Texture (5 points)

In this problem, you will use texture to represent images. You will use the responses of images to filters that you computed in HW2P. You will then compute two image representations based on the filter responses. The first will simply be a concatenation of the filter responses for all pixels and all filters. The second will contain, for each filter, the mean response averaged across all pixels.

Write a function [texture_repr_concat, texture_repr_mean] = computeTextureReprs(image, F) where image is the output of an imread, and F is the 49x49xnum_filters matrix of filters you used in HW2P.
  1. [3 pts] First, create a new variable responses of size num_filtersxnum_rowsxnum_cols, where num_rowsxnum_cols is the size of the image. Convert the input image to grayscale, and resize it to 100x100. Compute the responses of the image to each of the filters, and store the results in responses.
  2. [1 pt] Create the first image representation texture_repr_concat by simply converting responses to a vector, i.e. concatenating all pixels in the response images for all filters.
  3. [1 pt] Now let's compute the image representation in a different way. This time, the representation texture_repr_mean will be of size num_filtersx1. Compute each entry in texture_repr_mean as the mean response across all pixels to the corresponding filter. In other words, rather than keeping information about how each pixel responded to the filter, we are collapsing that information to a single value: the mean across all pixels. (See the sketch after this list.)
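Below is a minimal sketch of one way to implement this, assuming imfilter is used to apply the filters (as in HW2P) and that F(:, :, i) is the i-th 49x49 filter.

function [texture_repr_concat, texture_repr_mean] = computeTextureReprs(image, F)
  num_filters = size(F, 3);

  % Grayscale, double-valued, resized to 100x100.
  img = image;
  if size(img, 3) == 3
    img = rgb2gray(img);
  end
  img = imresize(double(img), [100 100]);
  [num_rows, num_cols] = size(img);

  % Response of the image to each filter.
  responses = zeros(num_filters, num_rows, num_cols);
  for i = 1:num_filters
    responses(i, :, :) = imfilter(img, F(:, :, i));
  end

  % Representation 1: all responses concatenated into one long vector.
  texture_repr_concat = responses(:);

  % Representation 2: mean response per filter (num_filters x 1).
  texture_repr_mean = zeros(num_filters, 1);
  for i = 1:num_filters
    texture_repr_mean(i) = mean(responses(i, :));  % average over all pixels
  end
end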
Part IV: Comparison of Image Descriptions (12 points)

In this part, we will test the quality of the different representations. A good representation is one that retains some of the semantics of the image; oftentimes by "semantics" we mean object class label. In other words, a good representation should be one such that two images of the same object have similar representations, and images of different objects have different representations. We will test to what extent this is true, using our images of three object classes from HW2P: two cardinals, two leopards, and two pandas.

To test the quality of the representations, we will compare two averages: the average within-class distance and the average between-class distance. A representation is a vector, and "distance" is the Euclidean distance between two vectors (i.e. the representations of two images). Use the provided eucl.m. "Within-class distances" are distances computed between the vectors for images of the same class (i.e. cardinal-cardinal, panda-panda). "Between-class distances" are those computed between images of different classes, i.e. cardinal-panda, panda-leopard, etc. If you have a good image representation, should the average within-class or the average between-class distance be smaller?

In a script compare_representations.m:
  1. [1 pt] Read in the cardinal, leopard, and panda images from HW2P, and resize them to 100x100.
  2. [2 pts] Use the code you wrote above to compute three image representations (bow_repr, texture_repr_concat, and texture_repr_mean) for each image.
  3. [6 pts] Compute and print the ratio average_within_class_distance / average_between_class_distance for each representation. To do so, use one vector to store within-category distances (distances between images that are of the same animal category), and another to store between-category distances (distances between images showing different animal categories). Compute the mean of each of the two vectors, then compute the ratio of the means (see the sketch after this list).
  4. [3 pts] In addition to your code, answer the following questions in a file answers.txt. For which of the three representations is the within-between ratio smallest? Is this what you expected? Why or why not? Which of the three types of descriptors that you used is the best one? How can you tell?
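The following sketch illustrates the distance-ratio computation only. The names reprs and labels are hypothetical placeholders for however you store one representation vector per image and its animal category, and it assumes eucl(a, b) returns the Euclidean distance between two vectors. Repeat the computation for each of the three representation types.

% reprs: num_images x D matrix, one representation per row
% labels: num_images x 1 vector of class ids (e.g. 1=cardinal, 2=leopard, 3=panda)
within = [];
between = [];
for i = 1:size(reprs, 1)
  for j = i+1:size(reprs, 1)
    d = eucl(reprs(i, :), reprs(j, :));
    if labels(i) == labels(j)
      within(end+1) = d;      % same animal category
    else
      between(end+1) = d;     % different animal categories
    end
  end
end
ratio = mean(within) / mean(between);
fprintf('within/between ratio: %f\n', ratio);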
Submission: