Preserving Semantic Neighborhoods for
Robust Cross-modal Retrieval
Abstract
The abundance of multimodal data (e.g. social media posts with text and images) has inspired interest
in cross-modal retrieval methods. However, most prior methods have focused on the case where image and
text convey redundant information; in contrast, real-world image-text pairs convey complementary
information with little overlap. Popular approaches to cross-modal retrieval rely on a variety of metric
learning losses, which prescribe how close image and text should be in the learned space.
However, images in news articles and media portray topics in a visually diverse fashion; thus, we need
to take special care to ensure a meaningful image representation. We propose novel within-modality
losses which ensure that not only are paired images and texts close, but the expected image-image and
text-text relationships are also observed. Specifically, our method encourages semantic coherency in
both the text and image subspaces, and improves the results of cross-modal retrieval in three
challenging scenarios.
Method Overview
We propose a metric learning approach where we use the semantic relationships between
text segments to guide the embedding learned for the corresponding images.
In other words, to understand what an image "means", we look at what articles it appeared with.
Unlike prior approaches, we capture this information not only across modalities, but also
within the image modality itself.
If two texts are semantically similar, we learn an embedding in which we explicitly
encourage their paired images to be similar, using a new unimodal loss.
Note that, in general, these paired images need not be similar in the original visual space.
In addition, we encourage texts that were close in the unimodal text space to remain
close in the learned space.
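To make this concrete, one way to write such an image-side term is a triplet-style loss driven by
text similarity; the notation and mining strategy below are illustrative choices, not a full
specification of the objective used in our experiments:

\[
\mathcal{L}_{\text{img}} = \sum_{(a,\,p,\,n)} \Big[\, m + d\big(f(x_a), f(x_p)\big) - d\big(f(x_a), f(x_n)\big) \Big]_+ ,
\]

where \(f\) is the image encoder, \(d\) is a distance in the learned space, \(m\) is a margin, and the
triplet \((a, p, n)\) is mined so that the text paired with image \(x_p\) is semantically similar to the
text paired with \(x_a\), while the text paired with \(x_n\) is not. A symmetric term
\(\mathcal{L}_{\text{txt}}\) over the text embeddings preserves text-text neighborhoods, and both terms
are added to a standard cross-modal matching loss.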
Our novel loss formulation explicitly encourages within-modality semantic coherence.
We show how our method brings paired images and texts closer, while also preserving semantically
coherent regions; for example, in the graphic above, semantically similar texts remain close.
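As a minimal implementation sketch, the image-side loss above could be written in PyTorch as follows,
assuming L2-normalized embeddings and a simple similarity threshold to decide which paired texts count
as semantic neighbors; the function name, margin, threshold, and hard-example mining are assumptions
made for this example rather than the released implementation.

```python
# Sketch of a within-modality (image-image) triplet loss guided by paired-text
# similarity. Names, margin, and threshold are illustrative assumptions.
import torch
import torch.nn.functional as F


def within_modality_image_loss(img_emb, txt_feat, margin=0.2, sim_threshold=0.5):
    """Pull together images whose paired texts are semantically similar and
    push apart images whose paired texts are not.

    img_emb:  (N, D) L2-normalized image embeddings in the learned space.
    txt_feat: (N, D_t) L2-normalized features of the paired texts, used only
              to decide which images are semantic neighbors.
    """
    # Semantic similarity of the paired texts defines image "neighbors".
    txt_sim = txt_feat @ txt_feat.t()                    # (N, N)
    pos_mask = (txt_sim > sim_threshold).float()
    pos_mask.fill_diagonal_(0)                           # exclude self-pairs
    neg_mask = (txt_sim <= sim_threshold).float()

    # Cosine similarity between images in the learned embedding space.
    img_sim = img_emb @ img_emb.t()                      # (N, N)

    # Hardest positive: the neighbor the anchor is currently least similar to.
    # Hardest negative: the non-neighbor the anchor is currently most similar to.
    large = 1e6
    hardest_pos = (img_sim + large * (1 - pos_mask)).min(dim=1).values
    hardest_neg = (img_sim - large * (1 - neg_mask)).max(dim=1).values

    # Standard triplet hinge; only anchors with at least one neighbor contribute.
    loss = F.relu(margin - hardest_pos + hardest_neg)
    has_pos = pos_mask.sum(dim=1) > 0
    return loss[has_pos].mean() if has_pos.any() else img_emb.new_zeros(())


# Example usage with random data (shapes only):
# img_emb  = F.normalize(torch.randn(32, 256), dim=1)
# txt_feat = F.normalize(torch.randn(32, 300), dim=1)
# loss = within_modality_image_loss(img_emb, txt_feat)
```

An analogous text-side term can be formed by defining neighbors with similarities in the original text
feature space and applying the same pull/push to the learned text embeddings; both terms are added to
the usual cross-modal matching loss during training.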