Learning Attributes from Human Gaze

Nils Murrugarra-Llerena and Adriana Kovashka


While semantic visual attributes have been shown useful for a variety of tasks, many attributes are difficult to model computationally. One of the reasons for this difficulty is that it is not clear where in an image the attribute lives. We propose to tackle this problem by involving humans more directly in the process of learning an attribute model. We ask humans to examine a set of images to determine if a given attribute is present in them, and we record where they looked. We create gaze maps for each attribute, and use these gaze maps to improve attribute prediction models. For test images we do not have gaze maps available, so we predict them based on models learned from collected gaze maps for each attribute of interest. Compared to six baselines, we improve prediction accuracies on attributes of faces and shoes, and we show how our method might be adapted for scene images. We demonstrate additional uses of our gaze maps for visualization of attribute models and learning "schools of thought" between users in terms of their understanding of the attribute.


Since attributes are less well-defined than object categories, capturing them with computational models poses a different set of challenges. There is a disconnect between how humans and machines perceive attributes, and it negatively impacts tasks that involve communication between a human and a machine, since the machine may not understand what a human user has in mind when referring to a particular attribute. Since attributes are human-defined, the best way to deal with their ambiguity is by learning from humans what these attributes really mean.

We learn the spatial support of attributes by asking humans to judge if an attribute is present in training images. We use this support to improve attribute prediction.

We propose to learn attribute models using human gaze maps that show which part of an image contains the attribute. To obtain gaze maps for each attribute, we conduct human subject experiments where we ask viewers to examine images of faces, shoes, and scenes, and determine if a given attribute is present in the image or not.


Step 1: Collect data.
Step 2: Generate gaze map templates.
Step 3: Learn attribute models using gaze templates.
Step 4: Learn attribute models using gaze prediction.

Data collection

Our experiment begins with a screening phase in which we show ten images to each participant and ask them to look at a fixed region in the image that is marked by a red square or, for faces, at a named part such as the nose or right eye. If the fixated pixel locations lie within the marked region, the participant moves on to the data collection session. This session consists of 200 images organized in four sub-sessions. To improve the participants' performance, we allow a five-minute break between sub-sessions. For each image, we first ask the viewer whether a particular attribute is present, and then show the image. The participant has two seconds to look at the image and answer. Their gaze locations and answers are recorded. On average, we obtain 2.5 gaze maps for each image-attribute question.
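The screening check described above can be sketched as follows. This is only an illustration under stated assumptions: the function name passes_screening, the (x, y) fixation format, the rectangular region tuple, and the rule that every fixation must fall inside the region are hypothetical choices, not details taken from the experiment.

```python
def passes_screening(fixations, region):
    """Return True if every fixated pixel lies inside the marked region.

    fixations: list of (x, y) pixel coordinates (assumed format).
    region: (x_min, y_min, x_max, y_max) with inclusive bounds (assumed format).
    """
    x_min, y_min, x_max, y_max = region
    if not fixations:
        # Assumption: a trial with no recorded fixations fails the check.
        return False
    return all(
        x_min <= x <= x_max and y_min <= y <= y_max
        for x, y in fixations
    )
```

A more lenient variant could accept a participant when a chosen fraction of the fixations falls inside the region; the strict version is used here only because it is the simplest reading of the description.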

Gaze map template generation

First, the gaze maps across all images that correspond to positive attribute labels are OR-ed (the maximum value is taken per pixel) and divided by the maximum value in the map. Thus we arrive at a gaze map gm_m for the attribute m with values in the range [0, 1]. Second, a binary template bt_m is created using a threshold of t = 0.1 on gm_m. All locations with a value greater than t are marked as 1 in bt_m and the rest as 0. Third, we apply a 15x15 grid over the binary template to get a grid template gt_m. The process starts with a grid template filled with all 0 values. Then, if a pixel with value 1 in bt_m falls inside some grid cell of gt_m, this cell is turned on (all pixels in that cell are replaced with 1).
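The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: the function names are hypothetical, gaze maps are assumed to be equally sized 2-D lists of non-negative numbers, and "15x15 grid" is read as 15 by 15 cells spanning the image (an assumption). Plain lists are used to keep the sketch dependency-free; an array library would be the natural choice in practice.

```python
def gaze_map_for_attribute(gaze_maps):
    """Step 1: per-pixel maximum over all maps, then divide by the global maximum."""
    height, width = len(gaze_maps[0]), len(gaze_maps[0][0])
    merged = [
        [max(g[r][c] for g in gaze_maps) for c in range(width)]
        for r in range(height)
    ]
    peak = max(max(row) for row in merged)
    if peak == 0:
        return merged  # no gaze recorded anywhere; avoid dividing by zero
    return [[value / peak for value in row] for row in merged]


def binary_template(gm, t=0.1):
    """Step 2: mark locations with value greater than t as 1, the rest as 0."""
    return [[1 if value > t else 0 for value in row] for row in gm]


def grid_template(bt, grid=15):
    """Step 3: turn on every grid cell that contains at least one 1 pixel."""
    height, width = len(bt), len(bt[0])
    gt = [[0] * width for _ in range(height)]
    for gi in range(grid):
        r0, r1 = gi * height // grid, (gi + 1) * height // grid
        for gj in range(grid):
            c0, c1 = gj * width // grid, (gj + 1) * width // grid
            cell_is_on = any(
                bt[r][c] for r in range(r0, r1) for c in range(c0, c1)
            )
            if cell_is_on:
                for r in range(r0, r1):
                    for c in range(c0, c1):
                        gt[r][c] = 1
    return gt
```

Calling grid_template(binary_template(gaze_map_for_attribute(maps))) on the positive-labeled gaze maps of one attribute yields a mask of the same size as the input maps.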

Grid templates for the face (top two rows) and shoe attributes.

To get templates that capture the subtle variations of how an attribute might appear [21] and also separate different types of objects, a clustering is performed over the images labeled as positive by our human participants. For example, boots can be in one group and high-heels in another. We use K-means with k = 5. After the clustering procedure, we repeat the grid template generation, but now separately for each of the five clusters.
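The clustering step can be illustrated with a small, dependency-free K-means sketch. Several things here are assumptions rather than details from the text: the function names, the use of one numeric feature vector per positive image (the representation that was actually clustered is not specified here), the random initialization, and the fixed number of iterations. Once images are grouped, the grid template generation would be repeated on the gaze maps of each group separately.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k=5, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def group_by_cluster(labels):
    """Map each cluster label to the indices of the images assigned to it."""
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, []).append(index)
    return groups
```

With k = 5, group_by_cluster(kmeans(features)) gives up to five index lists, one per cluster, each of which selects the gaze maps used for that cluster's template.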

Grid templates for each positive cluster for the attributes "open" (top) and "chubby" (bottom). At the top, we show multiple templates capturing the nuances of "openness". At the bottom, we show how multiple templates for "chubby" look on the same image.

Attribute learning using gaze templates

We consider two approaches:

Attribute learning using gaze prediction

Rather than using a fixed template, one can also learn what a gaze map would look like for a novel test image. We construct a model following Judd's simple method [19], by inputting (1) our training gaze templates, from which 0/1 gaze labels are extracted per pixel, and (2) per-pixel image features (the same feature set as in [19], including color, intensity, orientation, etc., but excluding person and car detections). This saliency model learns an SVM which predicts whether each pixel will be fixated or not, using the per-pixel features. We learn a separate saliency model for each attribute. We call this approach depending on whether a single or multiple templates were used per attribute at training time.
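To make the two inputs concrete, the sketch below shows one way the per-pixel training pairs for such a saliency model could be assembled. It covers only the data preparation, not the SVM training itself. The function name and the data layout (one 2-D map per feature channel, plus a 0/1 template of the same size) are assumptions made for illustration.

```python
def per_pixel_training_pairs(feature_maps, template):
    """Build (feature vector, 0/1 label) pairs, one per pixel.

    feature_maps: list of 2-D lists, one per feature channel (assumed layout).
    template: 2-D list of 0/1 gaze labels with the same height and width.
    """
    features, labels = [], []
    for r, row in enumerate(template):
        for c, fixated in enumerate(row):
            # One feature vector per pixel: the value of every channel at (r, c).
            features.append([channel[r][c] for channel in feature_maps])
            labels.append(int(fixated))
    return features, labels
```

The resulting lists could then be passed to any binary classifier; one such model would be trained per attribute, matching the description above.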

Experimental results

Experimental design


We select 60 images total per attribute. In order to get representative examples of each attribute, we sample: (a) 30 instances where the attribute is definitely present, (b) 18 instances where it is definitely not present, and (c) 12 instances where it may or may not be present. We employed the following datasets

Evaluation metrics
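
The results below are reported as F-measure. As a reminder of what that number summarizes, here is a small sketch that assumes the standard balanced F-measure (F1), the harmonic mean of precision and recall; the exact variant used in the experiments is not stated on this page, and the function name is hypothetical.

```python
def f_measure(y_true, y_pred):
    """Balanced F-measure (F1) for binary labels, where 1 means attribute present."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Unlike plain accuracy, this score is not inflated by predicting the majority label everywhere, which matters when positive and negative examples are unevenly sampled.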

Our approaches



Comparison table using concatenation of HOG and GIST features

F-measure using HOG+GIST features. Bold indicates best performer excluding ties.

Comparison table using fc6 features

F-measure using fc6.

Comparison table using dense SIFT features on fixed and predicted templates

F-measure using gaze maps predicted using the saliency method of [19].

Comparison with spatial extent (SE) approach

Time comparison of our MT and MTP with SE. On the y-axis is the average F-measure over the attributes tested. Run1, run2, and run3 use different parameter configurations for SE (each one requiring more processing time). Our MT is more accurate than the cheaper SE versions and as accurate as the most expensive one.

Qualitative results

Representative predicted templates for "chubby" and "pointy". Red = most, blue = least salient.

Note: You can find additional results in our supplementary file.


Adaptation for scene attributes

The objects most often fixated per scene attribute.

Visualizing attribute models

Model visualizations for (a) the attribute "baby-faced", using whole image features (left) and our template masks (right), and (b) the attribute "big-nosed".

Using gaze to find schools of thought

Quantitative comparison of the original schools of thought approach and our gaze-based approach.


We showed an approach for learning more accurate attribute prediction models by using supervision from humans in the form of gaze locations.

Publication and dataset

Learning Attributes from Human Gaze. N. Murrugarra-Llerena and A. Kovashka. In Proceedings of IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, California, 2017. IEEE. [pdf] [supp] [shoes-faces gaze data] [scenes gaze data] [readme] [mat_file gaze]