Learning Attributes from Human Gaze
Nils Murrugarra-Llerena and Adriana Kovashka
While semantic visual attributes have been shown useful
for a variety of tasks, many attributes are difficult to model
computationally. One of the reasons for this difficulty is
that it is not clear where in an image the attribute lives. We
propose to tackle this problem by involving humans more
directly in the process of learning an attribute model. We
ask humans to examine a set of images to determine if a
given attribute is present in them, and we record where
they looked. We create gaze maps for each attribute, and
use these gaze maps to improve attribute prediction models.
For test images we do not have gaze maps available,
so we predict them based on models learned from collected
gaze maps for each attribute of interest. Compared to six
baselines, we improve prediction accuracies on attributes
of faces and shoes, and we show how our method might
be adapted for scene images. We demonstrate additional
uses of our gaze maps for visualization of attribute models
and learning "schools of thought" between users in terms
of their understanding of the attribute.
Since attributes are less well-defined, capturing them
with computational models poses a different set of challenges
than capturing object categories does. There is a
disconnect between how humans and machines perceive attributes,
and it negatively impacts tasks that involve communication
between a human and a machine, since the machine
may not understand what a human user has in mind
when referring to a particular attribute. Since attributes are
human-defined, the best way to deal with their ambiguity is
by learning from humans what these attributes really mean.
We learn the spatial support of attributes by asking
humans to judge if an attribute is present in training images.
We use this support to improve attribute prediction.
We propose to learn attribute models using human gaze
maps that show which part of an image contains the attribute.
To obtain gaze maps for each attribute, we conduct
human subject experiments where we ask viewers to
examine images of faces, shoes, and scenes, and determine
if a given attribute is present in the image or not.
Step 1: Collect data.
Step 2: Generate gaze map templates.
Step 3: Learn attribute models using gaze templates.
Step 4: Learn attribute models using gaze prediction.
Our experiment begins with a screening phase in which
we show ten images to each participant and ask him/her to
look at a fixed region in the image that is marked by a red
square, or to look at e.g. the nose or right eye for faces. If
the fixated pixel locations lie within the marked region, the
participant moves on to the data collection session. The latter
consists of 200 images organized in four sub-sessions. In
order to increase the participants' performance, we allow a
five-minute break between sub-sessions. We ask the viewer
whether a particular attribute is present in a particular image
which we then show him/her. The participant has two
seconds to look at the image and answer. His/her gaze locations
and answers are recorded. On average, we obtain 2.5 gaze maps
for each image-attribute question.
Gaze map template generation
First, the gaze maps across all images
that correspond to positive attribute labels are OR-ed
(the maximum value is taken per pixel) and divided by the
maximum value in the map. Thus we arrive at a gaze map
gm_m for the attribute m with values in the range [0, 1]. Second,
a binary template bt_m is created using a threshold of
t = 0.1 on gm_m. All locations greater than t are marked as
1 in bt_m and the rest as 0. Third, we apply a 15x15 grid over
the binary template to get a grid template gt_m. The process
starts with a grid template filled with all 0 values. Then if a
pixel with value 1 of bt_m falls inside some grid cell of gt_m,
this cell is turned on (all pixels in that cell are replaced with 1).
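The three steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' code: the function name `grid_template`, the assumption that all gaze maps share one size, and the way cell boundaries are computed with `np.linspace` are choices made here for the example.

```python
import numpy as np

def grid_template(gaze_maps, threshold=0.1, grid=15):
    """Build (gaze map, binary template, grid template) from positive gaze maps.

    gaze_maps: array of shape (n_images, H, W) with non-negative values
    (assumed here to be aligned and of equal size).
    """
    maps = np.asarray(gaze_maps, dtype=float)

    # Step 1: OR the maps (per-pixel maximum) and normalize to [0, 1].
    gm = maps.max(axis=0)
    if gm.max() > 0:
        gm = gm / gm.max()

    # Step 2: binary template, 1 where the gaze map exceeds the threshold.
    bt = gm > threshold

    # Step 3: start from an all-zero grid template and turn on every cell
    # that contains at least one 1-valued pixel of the binary template.
    h, w = bt.shape
    rows = np.linspace(0, h, grid + 1).astype(int)  # assumed cell boundaries
    cols = np.linspace(0, w, grid + 1).astype(int)
    gt = np.zeros((h, w), dtype=np.uint8)
    for i in range(grid):
        for j in range(grid):
            if bt[rows[i]:rows[i + 1], cols[j]:cols[j + 1]].any():
                gt[rows[i]:rows[i + 1], cols[j]:cols[j + 1]] = 1
    return gm, bt.astype(np.uint8), gt
```

Because a single fixated pixel switches on its whole cell, the grid template is always at least as large as the binary template.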
Grid templates for the face (top two rows) and shoe attributes.
To get templates that capture the subtle variations of how
an attribute might appear and also separate different
types of objects, a clustering is performed over the images
labeled as positive by our human participants. For example,
boots can be in one group and high-heels in another.
We use K-means with k = 5. After the clustering procedure,
we repeat the grid template generation, but now separately
for each of the five clusters.
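As a rough illustration of the clustering step, the sketch below runs a plain Lloyd-style K-means over one feature vector per positive image and returns the image indices in each cluster. The template procedure would then be rerun on each group. Which features are clustered is not stated above, so the input `points` is an assumption, as are the helper names and the simple evenly spaced initialization.

```python
import numpy as np

def assign(points, centers):
    """Return the index of the nearest center for every point."""
    dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def kmeans_labels(points, k=5, n_iter=50):
    """Minimal K-means; requires at least k points."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    # Assumption: deterministic, evenly spaced initial centers (illustrative only).
    centers = points[np.linspace(0, n - 1, k).astype(int)].copy()
    for _ in range(n_iter):
        labels = assign(points, centers)
        for c in range(k):
            members = points[labels == c]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[c] = members.mean(axis=0)
    return assign(points, centers)

def group_by_cluster(labels, k):
    """Indices of the positive images that fall in each cluster."""
    return [np.flatnonzero(labels == c) for c in range(k)]
```

Each index group selects the gaze maps of one cluster, so an attribute ends up with k grid templates instead of one.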
Grid templates for each positive cluster for the attributes "open" (top) and "chubby" (bottom). At the top, we show multiple templates capturing the nuances of "openness". At the bottom, we show how multiple templates for "chubby" look on the same image.
Attribute learning using gaze templates
We consider two approaches:
- For Single Template (ST), we multiply each training and test image by the grid template, which removes the image pixels under 0-valued template cells and keeps the other pixels unchanged. We then extract both local and global features from the remaining part of the image, and train a classifier corresponding to the template using these features. At test time, we apply the template to each image, extract features from the 1-valued part, and apply the classifier.
- For Multiple Templates (MT), we train five different classifiers (one per cluster), each corresponding to one grid template. We classify a new image as positive if at least one of the five classifiers predicts it contains the attribute.
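The two approaches can be sketched as follows. This is a minimal illustration under stated assumptions: `extract_features` and the per-template `classifiers` are hypothetical callables standing in for the real feature extractor and trained classifiers, and each classifier is assumed to return a truthy value for a positive prediction.

```python
import numpy as np

def apply_template(image, template):
    """Multiply an image by a 0/1 grid template (pixels under 0 become 0).

    image: (H, W) or (H, W, C); template: (H, W).
    """
    image = np.asarray(image)
    template = np.asarray(template)
    if image.ndim == 3:
        template = template[:, :, None]  # broadcast over color channels
    return image * template

def predict_multiple_templates(image, templates, classifiers, extract_features):
    """MT rule: positive if at least one per-cluster classifier fires."""
    for template, classifier in zip(templates, classifiers):
        features = extract_features(apply_template(image, template))
        if classifier(features):
            return True
    return False
```

Passing a single template and a single classifier to `predict_multiple_templates` reduces it to the ST case.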
Attribute learning using gaze prediction
Rather than using a fixed template, one can also learn what a gaze
map would look like for a novel test image. We construct
a model following Judd's simple method, by inputting
(1) our training gaze templates, from which 0/1 gaze labels
are extracted per pixel, and (2) per-pixel image features (the
same feature set as in prior work, including color, intensity,
orientation, etc., but excluding person and car detections). This
saliency model learns an SVM which predicts whether each
pixel will be fixated or not, using the per-pixel features. We
learn a separate saliency model for each attribute.
Depending on whether a single or multiple templates were used per attribute at training time, we call this approach:
- Single Template Predicted (STP), or
- Multiple Templates Predicted (MTP)
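To make the inputs of the per-pixel saliency model concrete, the sketch below turns one training image's feature maps and its gaze template into (feature vector, 0/1 label) pairs, one per pixel. The function name and array layout are assumptions made for illustration. Computing the per-pixel features and fitting the SVM are out of scope here.

```python
import numpy as np

def per_pixel_training_set(feature_maps, template):
    """Flatten per-pixel features and 0/1 gaze labels into a training set.

    feature_maps: (H, W, D) array, D features per pixel (assumed precomputed).
    template:     (H, W) array of 0/1 gaze labels.
    Returns X of shape (H*W, D) and y of shape (H*W,).
    """
    feature_maps = np.asarray(feature_maps, dtype=float)
    template = np.asarray(template)
    h, w, d = feature_maps.shape
    if template.shape != (h, w):
        raise ValueError("template and feature maps must have the same size")
    X = feature_maps.reshape(h * w, d)             # one row per pixel
    y = (template.reshape(h * w) > 0).astype(int)  # 1 = fixated, 0 = not
    return X, y
```

Stacking the `X` and `y` arrays from all training images of one attribute gives the data on which that attribute's saliency classifier would be trained.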
We select 60 images total per attribute. In order to get representative examples of each attribute, we sample: (a) 30 instances where the attribute is definitely present, (b) 18 instances where it is definitely not present, and (c) 12 instances where it may or may not be present. We employ the following datasets:
- Shoes (dataset of ) -- We select the following attributes: "feminine", "formal", "open", "pointy", and "sporty".
- Faces (dataset of ) -- We select the following attributes: "asian", "attractive", "baby-faced", "big-nosed", "chubby", "Indian", "masculine", and "youthful".
- Scenes (dataset of ) -- We select the following attributes: "climbing", "open area", "cold", "soothing", "competing", "sunny", "driving", "swimming", "natural", and "vegetation".
We report the F-measure, which captures accuracy better when the data distribution is imbalanced. We compare the following methods, four of ours and six baselines:
- Single Template (ST)
- Multiple Templates (MT)
- Single Template Predicted (STP)
- Multiple Templates Predicted (MTP)
- Whole Image (WI) -- a baseline which uses the whole image for both training and testing;
- Data-Driven (DD) -- a baseline which selects features using an L1-regularizer over features extracted on a grid, then sets grid template cells on/off depending on whether at least one feature in that grid cell received a non-zero weight from the regularizer (note we do this only for localizable features);
- Unsupervised Saliency (US) -- a baseline which predicts standard saliency using a state-of-the-art method, but without training on our attribute-specific gaze data; the resulting saliency map is then used to compute a template mask;
- Random (R) -- a baseline which generates a random template over a 15x15 grid, where the number of 1-valued cells is equal to the number of 1-valued cells in the corresponding single template;
- Random Ensemble (RE) -- an ensemble of random template classifiers, which is the random counterpart to the ensemble used by multiple templates;
- Spatial Extent (SE) -- the method of Xiao and Lee, which discovers the spatial extent of relative attributes.
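As an illustration of the Random (R) baseline in the list above, the sketch below draws a random grid with the same number of 1-valued cells as a given single template. It works at the cell level (one entry per grid cell), which is an assumption of this example, as is the function name.

```python
import numpy as np

def random_grid_template(single_template_cells, seed=None):
    """Random on/off grid with as many 1-valued cells as the single template.

    single_template_cells: (15, 15) array of 0/1 values, one per grid cell.
    """
    cells = np.asarray(single_template_cells)
    n_on = int(cells.sum())  # number of cells to switch on
    rng = np.random.default_rng(seed)
    flat = np.zeros(cells.size, dtype=np.uint8)
    # Pick n_on distinct cell positions uniformly at random.
    flat[rng.choice(cells.size, size=n_on, replace=False)] = 1
    return flat.reshape(cells.shape)
```

Matching the number of active cells keeps the comparison fair: the random template covers the same area as the gaze-based template, so only the location of the cells differs.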
Comparison table using concatenation of HOG and GIST features
F-measure using HOG+GIST features. Bold indicates best performer excluding ties.
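For reference, a minimal sketch of the F-measure is given below. It assumes the balanced variant (F1, the harmonic mean of precision and recall), which the text above does not state explicitly. The function name is chosen here for illustration.

```python
def f_measure(y_true, y_pred):
    """Balanced F-measure (F1) for binary labels given as 0/1 sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)      # true positives
    fp = sum(1 for t, p in pairs if not t and p)  # false positives
    fn = sum(1 for t, p in pairs if t and not p)  # false negatives
    if tp == 0:
        return 0.0  # no correct positive predictions
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

On imbalanced data this is stricter than plain accuracy. With 2 positives among 10 images, a classifier that always answers "not present" reaches 80% accuracy but an F-measure of 0.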
Comparison table using fc6 features
F-measure using fc6.
Comparison table using dense SIFT features on fixed and predicted templates
F-measure using gaze maps predicted using the saliency method of .
Comparison with spatial extent approach
Time comparison of our MT and MTP with SE.
On the y-axis is the average F-measure over the attributes
tested. Run1, run2, and run3 use different parameter configurations
for SE (each one requiring more processing time).
Our MT is more accurate than the cheaper SE versions and
as accurate as the most expensive one.
Representative predicted templates for "chubby" and "pointy". Red = most, blue = least salient.
Note: You can find additional results in our supplementary file.
Adaptation for scene attributes
The objects most often fixated per scene attribute.
Visualizing attribute models
Model visualizations for (a) the attribute "baby-faced", using whole image features (left) and our template masks (right), and (b) the attribute "big-nosed".
Using gaze to find schools of thought
Quantitative comparison of the original schools of thought approach and our gaze-based approach.
We showed an approach for learning more accurate attribute prediction models by using supervision from humans in the form of gaze locations.
Publication and dataset
Learning Attributes from Human Gaze. N. Murrugarra-Llerena and A. Kovashka. In Proceedings of IEEE Winter
Conference on Applications of Computer Vision (WACV), Santa Rosa, California, 2017. IEEE. [pdf] [supp] [shoes-faces gaze data] [scenes gaze data] [readme] [mat_file gaze]