In this problem, you will use texture to represent images. For each image, you will compute the response of each image pixel to the filters from a "filter bank" as discussed in class. You will then compute two image representations based on filters. The first will simply be a concatenation of the filter responses for all pixels and all filters. The second will contain the mean of filter responses (averaged across all pixels) for each image.

You will then compare the within-class and between-class distances for images of different classes, based on each of the representations. You will keep track of the average within-class distance, i.e. the Euclidean distance between image representations for any pair of images where both images belong to the same animal class. You will also keep track of the average between-class distance, i.e. the Euclidean distance between any pair of images where the two images are from different animal classes.

- Download these images: cardinal1, cardinal2, leopard1, leopard2, panda1, panda2. As you can see, there are two images for each of three animal categories. Convert all images to the same square size (e.g. 100x100). Also convert them to grayscale.
- Download the Leung-Malik filter bank from here (this is a Matlab file; in Matlab you can read it with load, and in Python/Scipy you can read it with loadmat). Each filter F(:, :, i) is of size 49x49, and there are 48 filters.
- For each image, you need to do the following. First, read in the image. Then convolve your image with each of the 48 filters. In Matlab, use imfilter; in Python, use scipy.ndimage.convolve.
- For five of the filters, visualize the filters and responses by generating images like the ones shown below. You can use subplots to put the filter and response in a single image, but you don't have to.
- For each image, create an image representation by concatenating all pixels in the response images for all filters. This results in a single vector representation of length 100*100*48 for each image.
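The two steps above can be sketched as follows. This is a minimal example, not the required solution: a random 49x49x48 array stands in for the Leung-Malik bank, and a random 100x100 array stands in for a resized grayscale image (the filename and variable key in the loadmat comment are hypothetical).

```python
import numpy as np
from scipy.ndimage import convolve

# Stand-ins for the real data. In practice you would load the bank with
#   F = scipy.io.loadmat("lm_filter_bank.mat")["F"]   # hypothetical name/key
# and read + resize a grayscale image to 100x100.
rng = np.random.default_rng(0)
F = rng.standard_normal((49, 49, 48))
img = rng.random((100, 100))

# One response image per filter, then concatenate everything into one vector
responses = [convolve(img, F[:, :, i]) for i in range(F.shape[2])]
representation = np.concatenate([r.ravel() for r in responses])
print(representation.shape)  # (480000,) = 100*100*48
```

The order in which you concatenate does not matter for the distance comparison, as long as it is the same for every image.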
- Now compute the Euclidean distance between pairs of images. Use one variable to store within-category distances (distances between images that are of the same animal category), and another to store between-category distances (distance between images showing different animal categories). Print the mean of the within-category and between-category distances. Which one is smaller? By how much? Is this what you would expect? Why or why not? Answer these questions in a text file called responses.txt which you include in your submission zip file.
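A sketch of the distance bookkeeping, using random stand-in vectors instead of real filter-response representations; only the pairing logic is the point here.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
# Hypothetical flattened representations (length 100*100*48 in the real
# problem; shortened here), keyed by image name
feats = {name: rng.random(500) for name in
         ["cardinal1", "cardinal2", "leopard1", "leopard2", "panda1", "panda2"]}

within, between = [], []
for a, b in combinations(feats, 2):
    d = np.linalg.norm(feats[a] - feats[b])
    # Same class iff the names match after stripping the trailing digit
    (within if a[:-1] == b[:-1] else between).append(d)

print("mean within: ", np.mean(within))
print("mean between:", np.mean(between))
```

With six images there are 15 pairs in total: 3 within-category and 12 between-category.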
- Now let's compute the image representation in a different way, again using filters. This time, each image's representation will be the mean response across all pixels to each of the filters, resulting in one mean value per filter and an overall image representation of length 48. Again compute within-category and between-category distances. Now how do the average within-category and between-category distances compare? Is this more in line with what you would expect?
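Given the per-pixel responses computed earlier, the second representation is just a mean over pixels; random stand-in values are used below.

```python
import numpy as np

rng = np.random.default_rng(0)
# Per-pixel filter responses from the earlier step (random stand-in values)
responses = rng.random((100, 100, 48))

# Average each filter's response over all pixels: one value per filter
mean_repr = responses.mean(axis=(0, 1))
print(mean_repr.shape)  # (48,)
```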
- Which of the two types of descriptors that you used is the better one? How can you tell? Include your reasoning in your response.

In this problem, you will create a hybrid image (which looks like one thing zoomed in and another zoomed out) like the one shown in class.

- Download one pair of images: woman_happy and woman_neutral, or baby_happy and baby_weird.
- Read in both images, convert them to grayscale, and resize them to the same square size.
- Create a Gaussian filter; see fspecial in Matlab and scipy.ndimage.gaussian_filter in Python.
- Apply the filter to both, saving the results as im1_blur, im2_blur.
- For the second image, subtract the blur of the image from the image (as we did with the Einstein and Pittsburgh images in class), and save the result as im2_detail.
- Now add im1_blur and im2_detail, show the image, save it, and include it with your submission. Play with scaling it up and down (by dragging in Matlab) to see the "hybrid" effect.
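The hybrid-image pipeline above can be sketched like this; random arrays stand in for the two resized grayscale images, and the sigma value is an assumption to tune by eye (display and saving are omitted).

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
# Stand-ins for the two grayscale images, resized to the same square size
im1 = rng.random((256, 256))
im2 = rng.random((256, 256))

sigma = 5  # assumed blur strength; adjust until the effect looks right
im1_blur = gaussian_filter(im1, sigma)
im2_blur = gaussian_filter(im2, sigma)
im2_detail = im2 - im2_blur          # high-frequency content of image 2
hybrid = im1_blur + im2_detail       # low freqs of im1 + details of im2
```

Viewed large, the high-frequency detail of im2 dominates; scaled down, only the low frequencies of im1 survive.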

In this problem, you will implement feature extraction using the Harris corner detector, as discussed in class.

- Your function should be named extract_keypoints, and should take in a color image as input and convert it to grayscale inside the function. It should output the following: x, y, scores, Ix, Iy. Each of x, y is an *n*x1 vector that denotes the x and y locations, respectively, of each of the *n* detected keypoints (i.e. points with "cornerness" scores greater than a threshold). Keep in mind that *x* denotes the horizontal direction, hence *columns* of the image, and *y* denotes the vertical direction, hence *rows*, counting from the top-left of the image. scores is an *n*x1 vector (denoted by *R* in the lecture slides) that contains, for each detected keypoint, the value to which you applied the threshold. Ix, Iy are matrices with the same number of rows and columns as your input image, and store the gradients in the x and y directions at each pixel.
- You can use a window function of your choice; opt for the simplest one, e.g. 1 inside the window and 0 outside. Use a window size of e.g. 5 pixels.
- Common values for the *k* value in the "Harris Detector: Algorithm" slide are 0.04-0.06.
- You can set the threshold for the "cornerness" score *R* however you like; for example, you can set it to 5 times the average *R* score. Alternatively, you can simply output the top *n* keypoints (e.g. the top 1%).
- After you have written your extract_keypoints function, show what it does on a set of 10 images of your choice. Visualize the keypoints you have detected, for example by drawing circles over them. Use the scores variable and make keypoints with higher scores correspond to larger circles.
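A minimal sketch of extract_keypoints under the simple choices suggested above: Sobel filters stand in for the image gradients, the box window is implemented with a uniform (mean) filter, and the threshold is 5 times the mean *R*. This is one possible set of assumptions, not the required implementation.

```python
import numpy as np
from scipy.ndimage import sobel, uniform_filter

def extract_keypoints(image, k=0.05, window=5):
    """Harris corners: box window, threshold at 5x the mean R (assumptions)."""
    if image.ndim == 3:                      # color -> grayscale
        image = image.mean(axis=2)
    image = image.astype(float)
    Ix = sobel(image, axis=1)                # x = columns (horizontal)
    Iy = sobel(image, axis=0)                # y = rows (vertical)
    # Window sums of the gradient products (uniform_filter averages, which
    # only rescales R and does not change a mean-relative threshold)
    Sxx = uniform_filter(Ix * Ix, window)
    Syy = uniform_filter(Iy * Iy, window)
    Sxy = uniform_filter(Ix * Iy, window)
    R = (Sxx * Syy - Sxy ** 2) - k * (Sxx + Syy) ** 2
    ys, xs = np.nonzero(R > 5 * R.mean())    # rows = y, columns = x
    return xs, ys, R[ys, xs], Ix, Iy
```

For the visualization step you could then draw a circle at each (x, y) with a radius proportional to its score, e.g. with matplotlib.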

In this problem, you will implement a feature descriptor, as discussed in class. While you will not exactly implement it, the SIFT paper by David Lowe is a useful resource, in addition to Section 4.1 of the Szeliski textbook.

- Your function, called compute_features, should take in as inputs: x, y, scores, Ix, Iy (defined as above). It should output features, an *n*x*d* matrix whose *i*-th row contains the *d*-dimensional descriptor for the *i*-th keypoint.
- We'll simplify the histogram creation procedure a bit, compared to the original implementation presented in class. In particular, we'll compute a descriptor with dimensionality *d*=8 (rather than 4x4x8), which contains an 8-dimensional histogram of gradients computed from an 11x11 grid centered around each detected keypoint (i.e. a -5:+5 neighborhood horizontally and vertically).
- Quantize the gradient orientations into 8 bins (so put values between 0 and A degrees in one bin, the A to B degree angles in another bin, etc.). Note that the arctan function you use may return values in a range of 360 or 180 degrees; either is fine to use. To populate the SIFT histogram, consider each of the 8 bins in turn: to populate the first bin, sum the gradient magnitudes for angles between 0 and A degrees, and repeat analogously for all bins.
- Finally, you should clip all values to 0.2, and normalize each descriptor to be of unit length, e.g. using hist_final = hist_final / sum(hist_final). Normalize both before and after the clipping. You do not have to implement any more sophisticated details from the Lowe paper.
- If any of your detected keypoints are less than 5 pixels from the top/left or 5 pixels from the bottom/right of the image, set the descriptor for that keypoint to all 0's.
- To compute the gradient magnitude *m(x, y)* and gradient angle *θ(x, y)* at point (x, y), take *L* to be the image and use the formulas shown in class (from the Lowe paper):

  m(x, y) = sqrt((L(x+1, y) - L(x-1, y))^2 + (L(x, y+1) - L(x, y-1))^2)

  θ(x, y) = arctan((L(x, y+1) - L(x, y-1)) / (L(x+1, y) - L(x-1, y)))

  Remember that the differences between *L* values in these formulas are exactly the image gradients you already computed for Part III.
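The descriptor steps above can be sketched as follows, assuming the gradients Ix, Iy are already available from Part III; the 8 bins are taken as 45-degree slices of [0, 360), which is one admissible choice given that either a 180- or 360-degree arctan range is allowed.

```python
import numpy as np

def compute_features(x, y, scores, Ix, Iy):
    """Simplified 8-bin SIFT-like descriptor per the steps above (a sketch)."""
    H, W = Ix.shape
    mag = np.sqrt(Ix ** 2 + Iy ** 2)
    ang = np.degrees(np.arctan2(Iy, Ix)) % 360        # orientations in [0, 360)
    bin_idx = (ang // 45).astype(int)                 # 8 bins, 45 degrees each
    features = np.zeros((len(x), 8))
    for i, (cx, cy) in enumerate(zip(x, y)):
        # Keypoints within 5 pixels of any border get an all-zero descriptor
        if cx < 5 or cy < 5 or cx >= W - 5 or cy >= H - 5:
            continue
        m = mag[cy - 5:cy + 6, cx - 5:cx + 6]         # 11x11 patch
        b = bin_idx[cy - 5:cy + 6, cx - 5:cx + 6]
        # Sum gradient magnitudes per orientation bin
        hist = np.bincount(b.ravel(), weights=m.ravel(), minlength=8)
        if hist.sum() > 0:
            hist = hist / hist.sum()                  # normalize
            hist = np.minimum(hist, 0.2)              # clip at 0.2
            hist = hist / hist.sum()                  # renormalize
        features[i] = hist
    return features
```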