Learning Attributes from Human Gaze
Nils Murrugarra-Llerena and Adriana Kovashka
Abstract
While semantic visual attributes have been shown to be useful
for a variety of tasks, many attributes are difficult to model
computationally. One of the reasons for this difficulty is
that it is not clear where in an image the attribute lives. We
propose to tackle this problem by involving humans more
directly in the process of learning an attribute model. We
ask humans to examine a set of images to determine if a
given attribute is present in them, and we record where
they looked. We create gaze maps for each attribute, and
use these gaze maps to improve attribute prediction models.
For test images we do not have gaze maps available,
so we predict them based on models learned from collected
gaze maps for each attribute of interest. Compared to six
baselines, we improve prediction accuracies on attributes
of faces and shoes, and we show how our method might
be adapted for scene images. We demonstrate additional
uses of our gaze maps: visualizing attribute models and discovering "schools of thought" among users in terms of how they understand the attribute.
Introduction
Since attributes are less well-defined than object categories, capturing them with computational models poses a different set of challenges. There is a disconnect between how humans and machines perceive attributes, and it negatively impacts tasks that involve communication between a human and a machine, since the machine may not understand what a human user has in mind
when referring to a particular attribute. Since attributes are
human-defined, the best way to deal with their ambiguity is
by learning from humans what these attributes really mean.
We learn the spatial support of attributes by recording where humans look as they judge whether an attribute is present in training images, and we use this support to improve attribute prediction.
We propose to learn attribute models using human gaze
maps that show which part of an image contains the attribute.
To obtain gaze maps for each attribute, we conduct
human subject experiments where we ask viewers to
examine images of faces, shoes, and scenes, and determine
if a given attribute is present in the image or not.
Approach
Step 1: Collect data.
Step 2: Generate gaze map templates.
Step 3: Learn attribute models using gaze templates.
Step 4: Learn attribute models using gaze prediction.
Data collection
Our experiment begins with a screening phase in which
we show ten images to each participant and ask him/her to look at a fixed region in the image marked by a red square or, for faces, at a named part such as the nose or right eye. If
the fixated pixel locations lie within the marked region, the
participant moves on to the data collection session. The latter
consists of 200 images organized into four sub-sessions. To keep participants' performance high, we allow a five-minute break between sub-sessions. We ask the viewer whether a particular attribute is present in an image that we then show him/her. The participant has two seconds to look at the image and answer; his/her gaze locations and answers are recorded. We obtain 2.5 gaze maps on average for each image-attribute question.
Gaze map template generation
First, the gaze maps across all images that correspond to positive attribute labels are OR-ed (the maximum value is taken per pixel) and divided by the maximum value in the map. Thus we arrive at a gaze map gm_m for attribute m with values in the range [0, 1]. Second, a binary template bt_m is created using a threshold of t = 0.1 on gm_m: all locations greater than t are marked as 1 in bt_m and the rest as 0. Third, we apply a 15x15 grid over the binary template to get a grid template gt_m. The process starts with a grid template filled with all 0 values; then, if a pixel of bt_m with value 1 falls inside some grid cell of gt_m, that cell is turned on (all pixels in that cell are set to 1).
Grid templates for the face (top two rows) and shoe attributes.
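To make the construction concrete, here is a minimal NumPy sketch of the three steps above. The function name and the assumption that all gaze maps share the same resolution are ours; the paper's actual implementation may differ.

```python
import numpy as np

def make_grid_template(gaze_maps, t=0.1, grid=15):
    """Build gm_m, bt_m and gt_m from the gaze maps of positively labeled images."""
    # Step 1: OR the maps (per-pixel maximum) and normalize to [0, 1].
    gm = np.maximum.reduce([np.asarray(g, dtype=float) for g in gaze_maps])
    gm = gm / gm.max()
    # Step 2: threshold at t to obtain the binary template bt_m.
    bt = (gm > t).astype(np.uint8)
    # Step 3: overlay a grid x grid lattice; a cell whose region contains any
    # 1-valued pixel of bt_m is turned on entirely in gt_m.
    h, w = bt.shape
    gt = np.zeros_like(bt)
    ys = np.linspace(0, h, grid + 1, dtype=int)
    xs = np.linspace(0, w, grid + 1, dtype=int)
    for i in range(grid):
        for j in range(grid):
            if bt[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].any():
                gt[ys[i]:ys[i + 1], xs[j]:xs[j + 1]] = 1
    return gm, bt, gt
```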
To obtain templates that capture the subtle variations in how an attribute might appear [21], and also to separate different types of objects (for example, boots in one group and high-heels in another), we cluster the images labeled as positive by our human participants using K-means with k = 5. We then repeat the grid template generation, now separately for each of the five clusters.
Grid templates for each positive cluster for the attributes "open" (top) and "chubby" (bottom). At the top, we show multiple templates capturing the nuances of "openness". At the bottom, we show how multiple templates for "chubby" look on the same image.
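A brief sketch of the per-cluster templates, under our own assumptions: positive_feats holds one image-level feature vector per positively labeled image, positive_maps the corresponding gaze maps, and make_grid_template is the helper sketched earlier. Which features the clustering actually runs on is not specified here.

```python
from sklearn.cluster import KMeans

def cluster_templates(positive_feats, positive_maps, k=5):
    """One grid template per K-means cluster of the positively labeled images."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(positive_feats)
    templates = []
    for c in range(k):
        maps_c = [m for m, lab in zip(positive_maps, labels) if lab == c]
        # Reuse the single-template construction within each cluster.
        templates.append(make_grid_template(maps_c)[2])  # keep only gt_m
    return templates
```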
Attribute learning using gaze templates
We consider two approaches:
- For Single Template (ST), the parts of training and test images are multiplied by the grid template values, so that pixels under 0-valued cells are removed and the remaining pixels are kept unchanged. We then extract both local and global features from the remaining part of the image, and train a classifier corresponding to the template using these features. At test time, we apply the template to each image, extract features from the 1-valued part, and apply the classifier.
- For Multiple Templates (MT), we train five different classifiers (one per cluster), each corresponding to one grid template. We classify a new image as positive if at least one of the five classifiers predicts that it contains the attribute. A sketch of both variants appears after this list.
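The sketch below shows how a template could be applied for ST and how MT could combine per-cluster classifiers. The feature extractor is a hypothetical placeholder, and the use of a linear SVM is our assumption; the text above does not prescribe a specific classifier.

```python
import numpy as np
from sklearn.svm import LinearSVC

def masked_features(img, template, extract_features):
    """Zero out pixels in 0-valued template cells, then run a feature extractor
    (extract_features is a hypothetical placeholder, e.g. HOG+GIST or fc6)."""
    mask = template[..., None] if img.ndim == 3 else template
    return extract_features(img * mask)

def train_single_template(train_imgs, labels, template, extract_features):
    # ST: one classifier trained on features of the template-masked images.
    X = np.stack([masked_features(im, template, extract_features) for im in train_imgs])
    return LinearSVC().fit(X, labels)

def predict_multiple_templates(img, templates, classifiers, extract_features):
    # MT: positive if at least one of the five per-cluster classifiers fires.
    return any(int(clf.predict(masked_features(img, t, extract_features)[None])[0]) == 1
               for t, clf in zip(templates, classifiers))
```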
Attribute learning using gaze prediction
Rather than using a fixed template, one can also learn what a gaze
map would look like for a novel test image. We construct
a model following Judd's simple method [19], by inputting
(1) our training gaze templates, from which 0/1 gaze labels
are extracted per pixel, and (2) per-pixel image features (the
same feature set as in [19] including color, intensity, orientation,
etc., but excluding person and car detections). This
saliency model learns an SVM which predicts whether each
pixel will be fixated or not, using the per-pixel features. We
learn a separate saliency model for each attribute.
We call this approach
- Single Template Predicted (STP), or
- Multiple Templates Predicted (MTP),
depending on whether a single template or multiple templates were used per attribute at training time.
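A toy sketch of the per-pixel gaze predictor follows, with deliberately simplified features. The real model follows [19] and uses a much richer feature set (and typically subsamples pixels); the feature choice and classifier settings below are our own simplifications.

```python
import numpy as np
from sklearn.svm import LinearSVC

def pixel_features(img):
    """Toy per-pixel features: RGB plus intensity (a stand-in for the richer
    color/intensity/orientation channels of [19])."""
    img = np.asarray(img, dtype=float)
    intensity = img.mean(axis=2, keepdims=True)
    return np.concatenate([img, intensity], axis=2).reshape(-1, 4)

def train_gaze_predictor(train_imgs, train_templates):
    # 0/1 fixation labels per pixel come from the training gaze templates.
    X = np.vstack([pixel_features(im) for im in train_imgs])
    y = np.concatenate([np.asarray(t).reshape(-1) for t in train_templates])
    return LinearSVC().fit(X, y)

def predict_gaze_map(model, img):
    # Higher decision values = more likely to be fixated for this attribute.
    scores = model.decision_function(pixel_features(img))
    return scores.reshape(np.asarray(img).shape[:2])
```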
Experimental results
Experimental design
Datasets
We select 60 images total per attribute. In order to get representative examples of each attribute, we sample: (a) 30 instances where the attribute is definitely present, (b) 18 instances where it is definitely not present, and (c) 12 instances where it may or may not be present. We use the following datasets:
- Shoes (dataset of [22]) -- We select the following attributes: "feminine", "formal", "open", "pointy", and "sporty".
- Faces (dataset of [25]) -- We select the following attributes: "Asian", "attractive", "baby-faced", "big-nosed", "chubby", "Indian", "masculine", and "youthful".
- Scenes (dataset of [33]) -- We select the following attributes: "climbing", "open area", "cold", "soothing", "competing", "sunny", "driving", "swimming", "natural", and "vegetation".
Evaluation metrics
- The F-measure better captures accuracy when the class distribution is imbalanced (see the formula below).
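For reference, the F-measure (F1) is the harmonic mean of precision and recall:

```latex
F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
```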
Our approaches
- Single Template (ST)
- Multiple Templates (MT)
- Single Template Predicted (STP)
- Multiple Templates Predicted (MTP)
Baselines
- Whole Image (WI) -- a baseline which uses the whole image for both training and testing;
- Data-Driven (DD) -- a baseline which selects features using an L1-regularizer over features extracted on a grid, then sets grid template cells on/off depending on whether at least one feature in that grid cell received a non-zero weight from the regularizer (we do this only for localizable features; a sketch of this selection step appears after this list);
- Unsupervised Saliency (US) -- a baseline which predicts standard saliency using a state-of-the-art method [18] without training on our attribute-specific gaze data; the resulting saliency map is then used to compute a template mask;
- Random (R) -- a baseline which generates a random template over a 15x15 grid, where the number of 1-valued cells is equal to the number of 1-valued cells in the corresponding single template;
- Random Ensemble (RE) -- an ensemble of random template classifiers, the random counterpart to the ensemble used by multiple templates;
- Spatial Extent (SE) -- the method of Xiao and Lee [48], which discovers the spatial extent of relative attributes.
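As an illustration of the DD baseline's selection step, here is a sketch of the L1-based cell selection. X, y, and cell_index (mapping each feature dimension to its grid cell) are hypothetical inputs, and the choice of L1-regularized logistic regression is our assumption about the regularizer's form.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def data_driven_template(X, y, cell_index, grid=15):
    """Turn on a grid cell iff at least one of its features gets a non-zero L1 weight."""
    clf = LogisticRegression(penalty='l1', solver='liblinear', C=1.0).fit(X, y)
    weights = np.abs(clf.coef_).ravel()
    template = np.zeros(grid * grid, dtype=np.uint8)
    for j, cell in enumerate(cell_index):
        if weights[j] > 0:
            template[cell] = 1
    return template.reshape(grid, grid)
```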
Results
Comparison table using a concatenation of HOG and GIST features
F-measure using HOG+GIST features. Bold indicates best performer excluding ties.
Comparison table using fc6 features
F-measure using fc6.
Comparison table using dense SIFT features on fixed and predicted templates
F-measure using gaze maps predicted using the saliency method of [19].
Comparison with the spatial extent approach
Time comparison of our MT and MTP with SE.
On the y-axis is the average F-measure over the attributes
tested. Run1, run2, and run3 use different parameter configurations for SE, each requiring more processing time than the previous one.
Our MT is more accurate than the cheaper SE versions and
as accurate as the most expensive one.
Qualitative results
Representative predicted templates for "chubby" and "pointy". Red = most, blue = least salient.
Note: You can find additional results in our supplementary file.
Applications
Adaptation for scene attributes
The objects most often fixated per scene attribute.
Visualizing attribute models
Model visualizations for (a) the attribute "baby-faced", using whole image features (left) and our template masks (right), and (b) the attribute "big-nosed".
Using gaze to find schools of thought
Quantitative comparison of the original schools of thought approach and our gaze-based approach.
Conclusion
We showed an approach for learning more accurate attribute prediction models by using supervision from humans in the form of gaze locations.
Publication and dataset
Learning Attributes from Human Gaze. N. Murrugarra-Llerena and A. Kovashka. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, California, 2017. IEEE. [pdf] [supp] [shoes-faces gaze data] [scenes gaze data] [readme] [mat_file gaze]