Predicting the Politics of an Image
Using Webly Supervised Data
Christopher Thomas and Adriana Kovashka
University of Pittsburgh

Published in NeurIPS 2019
Abstract
The news media shape public opinion, and often, the visual bias they contain is evident to human observers. This bias can be inferred from how different media sources portray different subjects or topics. In this paper, we model visual political bias in contemporary media sources at scale, using webly supervised data. We collect a dataset of over one million unique images and associated news articles from left- and right-leaning news sources, and develop a method to predict an image’s political leaning. This problem is particularly challenging because of the enormous intra-class visual and semantic diversity of our data. We propose a two-stage method to tackle this problem. In the first stage, the model is forced to learn relevant visual concepts that, when joined with document embeddings computed from articles paired with the images, enable the model to predict bias. In the second stage, we remove the requirement of the text domain and train a visual classifier from the features of the former model. We show this two-stage approach facilitates learning and outperforms several strong baselines. We also present extensive qualitative results demonstrating the nuances of the data.
Download our NeurIPS 2019 Paper
Additional Resources:
Download poster
Download slides
Download supplementary material
Download code (899 MB)
Download dataset (123 GB)
Paper Overview
Method figure
We propose the problem of predicting the political bias of images, which we define as whether an image came from a left- or right-leaning media source. This requires understanding: 1) what visual concepts to look for in images, and 2) how these visual concepts are portrayed across the political spectrum. This is a very challenging task because many of the concepts we aim to learn show serious visual variability within both the left and the right. To address this problem, we propose a two-stage method. In stage 1, we learn visual features jointly with paired text for bias classification. In stage 2, we remove the text dependency by training a classifier on top of the prior model using purely visual features. We show that this approach significantly outperforms directly training a model to predict bias. Additionally, we make available a large dataset of biased images with paired text, along with a large set of diverse crowdsourced annotations regarding political bias. Finally, we perform a detailed quantitative and qualitative analysis of our method and dataset.
Dataset Release
Example images from the dataset for the immigration and abortion topics, with links to their original sources.
Because no dataset exists for this problem, we assembled a large dataset of images and text about contemporary politically charged topics such as abortion, immigration, LGBT rights, etc. We crawled media sources determined to be biased by Media Bias Fact Check. We used sources labeled left/right or extreme left/right and retrieved images using each topic as a query. We extracted the article text from the pages the images appeared on using Dragnet. In total, we obtained 1,861,336 images and 1,559,004 articles. We manually removed boilerplate text (headers, copyrights, etc.) that leaked into some articles. Because sources cover the same events, some images are published multiple times. To prevent models from "cheating" by memorization, all experiments are performed on a "deduplicated" subset of our data, which we also release in our dataset download.
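As a rough illustration of this post-processing, the sketch below extracts article text with Dragnet and flags near-duplicate images with a perceptual hash. It assumes the `dragnet` and `imagehash` Python packages and an arbitrary Hamming-distance threshold; the paper's actual deduplication procedure may differ.

```python
# Illustrative sketch only: extract article text and drop near-duplicate images.
# The hashing choice and threshold are assumptions, not the paper's exact pipeline.
from dragnet import extract_content   # main-content extraction from raw HTML
from PIL import Image
import imagehash

def article_text(html):
    """Return the boilerplate-free article text from a crawled news page."""
    return extract_content(html)

def deduplicate(image_paths, max_hamming=4):
    """Keep one representative per group of visually near-identical images."""
    kept_hashes, kept_paths = [], []
    for path in image_paths:
        h = imagehash.phash(Image.open(path))          # perceptual hash of the image
        if all(h - prev > max_hamming for prev in kept_hashes):
            kept_hashes.append(h)
            kept_paths.append(path)
    return kept_paths
```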
Human consensus vs no consensus
We treat the problem of predicting bias as a weakly supervised task; that is, we assume all image-text pairs have the political leaning of the source they come from. To better explore this assumption and understand human conceptions of bias, we also ran a large-scale crowdsourcing study on Amazon Mechanical Turk (MTurk). We asked workers to guess the political leaning of images by indicating whether each image favored the left, favored the right, or was unclear. In total, we showed 3,237 images to at least three workers each, at a total cost of $4,771. We show examples of different levels of agreement between workers above. Overall, 993 images received a clear left/right label from at least a majority of workers. Our collected annotations also include several types of additional data, such as the article text that best aligned with the image, the topic, and the political leaning of the image-text pair. We release our collected crowdsourced annotations with our dataset.
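For concreteness, the snippet below shows one way to aggregate per-image worker votes into a consensus label. The majority rule and label names are illustrative assumptions, not the exact aggregation used for the study.

```python
# Hypothetical aggregation of MTurk votes: an image receives a left/right label
# only if a strict majority of its (>= 3) annotators agree; otherwise it is unresolved.
from collections import Counter

def consensus(worker_labels):
    """worker_labels: list like ['left', 'right', 'unclear'] for one image."""
    label, votes = Counter(worker_labels).most_common(1)[0]
    if label in ('left', 'right') and votes > len(worker_labels) / 2:
        return label
    return None  # no clear majority for a political leaning

print(consensus(['left', 'left', 'unclear']))   # -> 'left'
print(consensus(['left', 'right', 'unclear']))  # -> None
```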
Method: Leveraging Paired Text as Privileged Information
Our method leverages text as privileged information
Predicting the political bias of images is a very challenging task because many of the concepts we must model show serious visual variability both across and within the left and right. Because of this, model training may fall into poor local minima due to the lack of a recurring discriminative signal. Further, it is not merely the presence or absence of objects that matters, but rather how they are portrayed, often in subtle ways. We hypothesize that the text paired with each image provides a useful cue to guide the training of our visual bias classifier towards relevant visual concepts. In stage 1 of our model, we take information flowing from the visual pipeline and fuse it with a document embedding of the paired text, which serves as an auxiliary source of information for bias classification. However, because we are primarily interested in visual political bias, we next remove the model's reliance on textual features while keeping all convolutional layers fixed. This preserves the semantics learned jointly with text in stage 1 while removing the requirement of text. Thus, in stage 2, we train a linear bias classifier on top of the first model, treating it as a feature extractor. At test time, our model therefore predicts the bias of an image without using any text. Our experimental results show that this approach significantly outperforms directly training a model to predict bias.
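The PyTorch sketch below illustrates the two-stage idea. The ResNet-50 backbone, the 300-dimensional document embedding, and fusion by concatenation are assumptions made for illustration; they are not the paper's exact architecture.

```python
# Minimal sketch of the two-stage training idea (dimensions and backbone are assumed).
import torch
import torch.nn as nn
import torchvision.models as models

class Stage1(nn.Module):
    """Stage 1: predict bias from image features fused with a paired document embedding."""
    def __init__(self, doc_dim=300):
        super().__init__()
        backbone = models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])  # conv layers + pooling
        self.classifier = nn.Linear(2048 + doc_dim, 2)                  # left vs. right

    def forward(self, image, doc_embedding):
        visual = self.backbone(image).flatten(1)
        return self.classifier(torch.cat([visual, doc_embedding], dim=1))

class Stage2(nn.Module):
    """Stage 2: freeze the stage-1 visual features and train an image-only classifier."""
    def __init__(self, stage1):
        super().__init__()
        self.backbone = stage1.backbone
        for p in self.backbone.parameters():
            p.requires_grad = False      # keep text-informed visual semantics fixed
        self.classifier = nn.Linear(2048, 2)

    def forward(self, image):
        with torch.no_grad():
            visual = self.backbone(image).flatten(1)
        return self.classifier(visual)   # no text needed at test time
```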
Generating Biased Faces
Generated biased faces
Many workers noted that how politicians were portrayed influenced their decision. To visualize the differences in how well-known individuals are portrayed within our dataset, we trained a generative model to modify a given Trump/Clinton/Obama face and make it appear as if it came from a left- or right-leaning source. We use a variation of an autoencoder-based model, which learns a distribution of facial attributes and latent features, and train it using the features from the original method on faces of Trump, Clinton, and Obama detected in our dataset. To modify an image, we condition the generator on the image’s embedding and shift the distribution of attributes/expressions for the image to match that person’s average portrayal on the left/right. We show example results above. Observe that Trump appears angry at the far-left end of the spectrum and Clinton at the far-right end; in contrast, all three appear happy and benevolent in sources supporting their own party. We also observe that Clinton appears younger in far-left sources, while in far-right sources, Obama appears confused or embarrassed. These results further underscore that our weakly supervised labels are accurate enough to extract a meaningful signal.
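The sketch below shows the attribute-shifting step in pseudocode-like Python. Here `encoder`, `decoder`, and `mean_attributes` are hypothetical stand-ins for components of a trained conditional autoencoder, not a released API.

```python
# Hypothetical sketch of the attribute-shifting step: move a face's attribute
# vector toward that person's average portrayal on the chosen side before decoding.
import torch

def stylize_face(face, person, side, encoder, decoder, mean_attributes, alpha=1.0):
    """Re-render `face` as if published by a `side` ('left'/'right') source."""
    latent, attrs = encoder(face)                   # identity-preserving code + attributes
    target = mean_attributes[(person, side)]        # average attributes for this person/side
    shifted = (1 - alpha) * attrs + alpha * target  # interpolate toward the target portrayal
    return decoder(latent, shifted)
```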
Closest Images Across Left / Right
Closest images across left / right
To illustrate the challenge of classifying images as left or right in visual space only, we compute the distance between images from the left and right and show L/R pairs within each topic that are close in feature space. For Black Lives Matter, for example, the left image is serious, while the right image is whimsical. For climate change, one image presents a more negative vision, while the other is picturesque. Both border control images show fire, but the left one shows a Trump effigy. For terrorism, the left image shows a white domestic terrorist while the right shows Middle-Eastern men. These pairs highlight how subtle the distinctions between left and right are for some images.
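Such pairs can be surfaced with a simple cross-leaning nearest-neighbor search. The sketch below assumes per-topic feature matrices extracted from the visual model and uses cosine distance; the exact distance used in the paper may differ.

```python
# Illustrative cross-leaning nearest-neighbor search within one topic.
import numpy as np

def closest_cross_pairs(left_feats, right_feats, k=5):
    """left_feats: (n_l, d) and right_feats: (n_r, d) image feature arrays."""
    l = left_feats / np.linalg.norm(left_feats, axis=1, keepdims=True)
    r = right_feats / np.linalg.norm(right_feats, axis=1, keepdims=True)
    dist = 1.0 - l @ r.T                     # cosine distance between every L/R pair
    flat = np.argsort(dist, axis=None)[:k]   # indices of the k smallest distances
    return [np.unravel_index(i, dist.shape) for i in flat]  # (left_idx, right_idx) pairs
```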
Acknowledgements
This material is based upon work supported by the National Science Foundation under CISE Award No. 1566270 and an NVIDIA hardware grant. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. We would also like to thank the reviewers for their constructive feedback.
Fair Use Notice
Our dataset and this site contain copyrighted material the use of which has not always been specifically authorized by the copyright owner. We make such material available in an effort to advance understanding of technological, scientific, and cultural issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided for in section 107 of the US Copyright Law. In accordance with Title 17 U.S.C. Section 107, the material on this site is distributed without profit to those who have expressed a prior interest in receiving the included information for non-commercial research and educational purposes. For more information on fair use please click here. If you wish to use copyrighted material on this site or in our dataset for purposes of your own that go beyond non-commercial research and academic purposes, you must obtain permission directly from the copyright owner.