Preserving Semantic Neighborhoods for
Robust Cross-modal Retrieval
Abstract
The abundance of multimodal data (e.g. social media posts with text and images) has inspired interest
in cross-modal retrieval methods. However, most prior methods have focused on the case where image and
text convey redundant information; in contrast, real-world image-text pairs convey complementary
information with little overlap. Popular approaches to cross-modal retrieval rely on a variety of metric
learning losses, which prescribe how close image and text should be in the learned space.
However, images in news articles and media portray topics in a visually diverse fashion; thus, we need
to take special care to ensure a meaningful image representation. We propose novel within-modality
losses which ensure that not only are paired images and texts close, but the expected image-image and
text-text relationships are also observed. Specifically, our method encourages semantic coherency in
both the text and image subspaces, and improves the results of cross-modal retrieval in three
challenging scenarios.
Method Overview
We propose a metric learning approach where we use the semantic relationships between
text segments to guide the embedding learned for the corresponding images.
In other words, to understand what an image "means", we look at what articles it appeared with.
Unlike prior approaches, we capture this information not only across modalities, but also
within the image modality itself.
If two texts are semantically similar, we learn an embedding in which we explicitly
encourage their paired images to be similar, using a new unimodal loss.
Note that, in general, these paired images need not be similar in the original visual space.
In addition, we encourage texts that were close in the unimodal text space to remain
close in the learned space.
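To make this concrete, one way to write such an image-side term is a triplet-style loss driven by
text similarity; the notation and mining strategy below are illustrative choices, not a full
specification of the objective used in our experiments:

\[
\mathcal{L}_{\text{img}} = \sum_{(a,\,p,\,n)} \Big[\, m + d\big(f(x_a), f(x_p)\big) - d\big(f(x_a), f(x_n)\big) \Big]_+ ,
\]

where \(f\) is the image encoder, \(d\) is a distance in the learned space, \(m\) is a margin, and the
triplet \((a, p, n)\) is mined so that the text paired with image \(x_p\) is semantically similar to the
text paired with \(x_a\), while the text paired with \(x_n\) is not. A symmetric term
\(\mathcal{L}_{\text{txt}}\) over the text embeddings preserves text-text neighborhoods, and both terms
are added to a standard cross-modal matching loss.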
Our novel loss formulation explicitly encourages within-modality semantic coherence.
We show how our method brings paired images and texts closer, while also preserving semantically
coherent regions; for example, in the graphic above, semantically similar texts remain close.
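As a minimal implementation sketch, the image-side loss above could be written in PyTorch as follows,
assuming L2-normalized embeddings and a simple similarity threshold to decide which paired texts count
as semantic neighbors; the function name, margin, threshold, and hard-example mining are assumptions
made for this example rather than the released implementation.

```python
# Sketch of a within-modality (image-image) triplet loss guided by paired-text
# similarity. Names, margin, and threshold are illustrative assumptions.
import torch
import torch.nn.functional as F


def within_modality_image_loss(img_emb, txt_feat, margin=0.2, sim_threshold=0.5):
    """Pull together images whose paired texts are semantically similar and
    push apart images whose paired texts are not.

    img_emb:  (N, D) L2-normalized image embeddings in the learned space.
    txt_feat: (N, D_t) L2-normalized features of the paired texts, used only
              to decide which images are semantic neighbors.
    """
    # Semantic similarity of the paired texts defines image "neighbors".
    txt_sim = txt_feat @ txt_feat.t()                    # (N, N)
    pos_mask = (txt_sim > sim_threshold).float()
    pos_mask.fill_diagonal_(0)                           # exclude self-pairs
    neg_mask = (txt_sim <= sim_threshold).float()

    # Cosine similarity between images in the learned embedding space.
    img_sim = img_emb @ img_emb.t()                      # (N, N)

    # Hardest positive: the neighbor the anchor is currently least similar to.
    # Hardest negative: the non-neighbor the anchor is currently most similar to.
    large = 1e6
    hardest_pos = (img_sim + large * (1 - pos_mask)).min(dim=1).values
    hardest_neg = (img_sim - large * (1 - neg_mask)).max(dim=1).values

    # Standard triplet hinge; only anchors with at least one neighbor contribute.
    loss = F.relu(margin - hardest_pos + hardest_neg)
    has_pos = pos_mask.sum(dim=1) > 0
    return loss[has_pos].mean() if has_pos.any() else img_emb.new_zeros(())


# Example usage with random data (shapes only):
# img_emb  = F.normalize(torch.randn(32, 256), dim=1)
# txt_feat = F.normalize(torch.randn(32, 300), dim=1)
# loss = within_modality_image_loss(img_emb, txt_feat)
```

An analogous text-side term can be formed by defining neighbors with similarities in the original text
feature space and applying the same pull/push to the learned text embeddings; both terms are added to
the usual cross-modal matching loss during training.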