CS3710: Vision-Language Models for Computer Vision (Advanced Topics in AI), Fall 2023

Location: Sennott Square 6516
Time: Monday and Wednesday, 9:30am - 10:45am
Instructor: Adriana Kovashka
Email: kovashka AT cs DOT pitt DOT edu (Note: When emailing, please use "CS3710" at the beginning of the subject line.)
Office: Sennott Square 5325
Office hours: by appointment

Overview

Course description: In this graduate-level seminar course, we will examine recent advances in computer vision, with a focus on vision-language models (VLMs). We will study VLMs for tasks such as object detection and visual reasoning, techniques such as prompting and classification by description, challenges such as domain shifts and bias (including across continents and cultures), and opportunities such as modalities beyond vision and language (e.g. audio). The course will primarily consist of student presentations on assigned conference and journal publications, in-class discussions, and a course project. We will analyze approaches to research problems in learning from vision-language data, and discuss future directions that these problems and approaches inspire. The goal of this course is for students to become knowledgeable about the state of the art and the challenges in vision-language models, and to develop the skills to read research critically and to write up and present research work clearly.

Prerequisites: Basic understanding of probability and linear algebra is required. Familiarity or experience with machine learning is recommended. Each student should speak with the instructor to determine if they have sufficient background for this course. Email the instructor if you have trouble signing up for the class.


Policies

Grading will be based on the following components:

Paper reviews/quizzes

At the beginning of each class, the instructor will call upon 3-5 randomly chosen students, with each answering one of the questions below (instructor's choice) about one of the papers from the readings for that day (student's choice). Answers should be limited to 1-2 minutes. Expect to be called upon 4-5 times during the semester.
  1. Summarize what this paper aims to do, and the very high-level idea of the proposed approach.
  2. What is one advantage of the proposed approach?
  3. What is one disadvantage or weakness of the approach or experimental validation?
Grading rubric: The student will earn between 5 and 10 points if they attempt to answer, and 0 if they refuse to answer (the lowest score will be dropped at the end of the semester). The instructor will take points off for incorrect statements, statements too vague to indicate understanding of the paper, and unclear statements.

In-class participation and discussion

Students should actively engage in in-class discussions. For example, this could take the form of asking questions or making meaningful remarks and comments about the work following a paper presentation, or responding to others' questions or comments. Given the course format, participation is essential. In particular, it is important that students who are not scheduled to present come prepared to discuss the strengths and weaknesses of the paper. Think of this as a debate, where you either provide arguments to question or defend the paper's contribution.

Paper presentations

Each class, a team of 2-3 students will present the papers listed for that class. Each student should sign up (here) to co-lead 3 paper presentations by end of day, Sept. 1. If more than two papers are listed for a given class, the presenters can choose whether to present all the papers or a subset (at least two).

The speaking time for co-presenters should be comparable. It is up to the presenters to split the presentation amongst themselves, e.g. each can cover one paper, or one can cover the methods and another the experiments. The presenters should also engage (drive and moderate) the rest of the class in a discussion. Plan to spend roughly 40 min presenting and 20 min discussing. Presentations should cover the following:

Presenters are required to send their slides to the instructor by end of day Thursday at the latest (if the presentation is the following Monday) or Sunday (if the presentation is on Wednesday of the same week). Use "CS3710 Presentation Slides" as the email subject. The instructor will send feedback, and the presenter is asked to adjust the final version according to this feedback.

Use many visuals on the slides, and use text sparingly, primarily in bulleted form. You are encouraged to browse the web for resources or slides related to this work, including original slides from the authors that include results not shown in the paper. However, always cite your sources for any slides that other authors created. Also, always use your own words on slides and during the presentation. Do not simply copy text from the paper or from other resources. Make sure to rehearse your presentation so that it is clear and polished.

Grading rubric: Your grade for a paper presentation will be based on: (1) whether you sent me a draft of your slides and met with me to discuss these by the deadline; (2) whether you adequately, clearly and correctly addressed and explained all important points of the paper in your presentation; (3) how clear and informative your classmates found the presentation; (4) how well-rehearsed your presentation was; and (5) how you moderated the discussion of the paper.

Final project

Students will complete a project that studies in more depth one of the topics we cover in class. Students should work in groups of two or three. A project can lead to a subsequent conference publication. These projects should focus on one of the following:

In the project proposal, students should include the following: a clear problem statement, an extensive literature review, a detailed outline of the approach, and the planned experimental setup. Students are encouraged to discuss a draft of the proposal with the instructor before the proposal is due. Proposals should be 3-5 pages long. When you submit your proposal, please also sign up for a presentation slot.

The mid-semester project status report will describe the students' progress on the project, and any problems encountered along the way. The status report should use a known conference format (e.g. CVPR/ICCV/ECCV), but can be more informal than a conference paper. The status report should include the following sections: Introduction, Related Work, Approach, and Results. In Results, include your experimental setup (this can change later). If you have results but they do not yet look great, include them anyway, and comment on any challenges encountered.

The project presentation will describe the students' approach and their experimental findings in a clear and engaging fashion. This will be a chance to get feedback from the class before final submission of your report. Presentations should be about 15 minutes long. Please submit a copy of your slides to Canvas on the same day as your presentation.

The project final report should be formatted like, and read like, a conference paper, with a clear problem definition and argumentation for why this problem is important, an overview of related work, a detailed explanation of the approach, and a well-motivated experimental evaluation.

All submissions (proposal, status report, final report) should include a description of each team member's contribution.

The grade breakdown and due dates for the project are:

Collaboration Policy and Academic Honesty

You will do your work (exams and homework) individually. The work you turn in must be your own. When in doubt about what you can or cannot use, ask the instructor! A first offense will result in 0% credit on the assignment, and a report will be filed with the school. A second offense will cause you to fail the class and receive a disciplinary penalty. Please consult SCI's Academic Integrity Code and Pitt's Academic Integrity Guidelines.

Note on Disabilities

If you have a disability for which you are or may be requesting an accommodation, you are encouraged to contact both your instructor and Disability Resources and Services (DRS), 140 William Pitt Union, (412) 648-7890, drsrecep@pitt.edu, (412) 228-5347 for P3 ASL users, as early as possible in the term. DRS will verify your disability and determine reasonable accommodations for this course.

Note on Medical Conditions

If you have a medical condition which will prevent you from doing a certain assignment, you must inform the instructor of this before the deadline. You must then submit documentation of your condition within a week of the assignment deadline.

Statement on Classroom Recording

To ensure the free and open discussion of ideas, students may not record classroom lectures, discussion and/or activities without the advance written permission of the instructor, and any such recording properly approved in advance can be used solely for the student's own private use.


Schedule

Date Topic Papers (Presenters) Due
8/28 1: Introduction slides
8/30
9/6
9/11 2: Vision-language models: Contrastive learning Radford et al. "Learning Transferable Visual Models From Natural Language Supervision." ICML 2021.
Jia et al. "Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision". ICML 2021.
9/13 Dwibedi et al. "With a Little Help from My Friends: Nearest-Neighbor Contrastive Learning of Visual Representations." ICCV 2021.
Zolfaghari et al. "CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video Representations." ICCV 2021.
9/18 3: Vision-language models: Grounding for object detection Ye et al. "Cap2Det: Learning to Amplify Weak Caption Supervision for Object Detection." ICCV 2019.
Gu et al. "Open-vocabulary Object Detection via Vision and Language Knowledge Distillation." ICLR 2022.
9/20 Wu et al. "Aligning Bag of Regions for Open-Vocabulary Object Detection." CVPR 2023.
Yao et al. "DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment." CVPR 2023.
Kim et al. "Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers." CVPR 2023.
Kirillov et al. "Segment Anything." ICCV 2023.
9/25 4: Vision-language models: Reasoning and visual question answering Ye and Kovashka. "A Case Study of the Shortcut Effects in Visual Commonsense Reasoning." AAAI 2021.
Mao et al. "The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision." ICLR 2019.
Due: project proposal (9/25)
9/27 Gupta and Kembhavi. "Visual Programming: Compositional visual reasoning without training." CVPR 2023.
Zhang et al. "Multimodal Chain-of-Thought Reasoning in Language Models." arXiv 2023.
Li et al. "BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models." ICML 2023.
10/2 5: Prompt learning Jia et al. "Visual Prompt Tuning." ECCV 2022.
Zhou et al. "Learning to Prompt for Vision-Language Models." IJCV 2022.
10/4 Khattak et al. "MaPLe: Multi-modal Prompt Learning." CVPR 2023.
Du et al. "Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model." CVPR 2022.
Liu et al. "Visual Instruction Tuning (LLaVA: Large Language and Vision Assistant)." arXiv 2023.
10/9 6: Classification by description Menon and Vondrick. "Visual Classification via Description from Large Language Models." ICLR 2023.
Pratt et al. "What does a platypus look like? Generating customized prompts for zero-shot image classification." ICCV 2023.
10/11 Udandarao et al. "SuS-X: Training-Free Name-Only Transfer of Vision-Language Models." ICCV 2023.
Mao et al. "Doubly Right Object Recognition: A Why Prompt for Visual Rationales." CVPR 2023.
10/16 7: Domain shifts Ganin and Lempitsky. "Unsupervised domain adaptation by backpropagation." ICML 2015.
Hoffman et al. "CyCADA: Cycle-consistent adversarial domain adaptation." ICML 2018.
Peng et al. "Moment matching for multi-source domain adaptation." ICCV 2019.
10/18 Vidit et al. "CLIP the Gap: A Single Domain Generalization Approach for Object Detection." CVPR 2023.
Ge et al. "Domain adaptation via prompt learning." arXiv 2022.
10/23 8: Geographic shifts Prabhu et al. "Can domain adaptation make object recognition work for everyone?" CVPRW 2022.
Kalluri et al. "GeoNet: Benchmarking Unsupervised Adaptation across Geographies." CVPR 2023.
10/25 Chen et al. "PaLI: A Jointly-Scaled Multilingual Language-Image Model." ICLR 2023.
Liu et al. "Visually Grounded Reasoning across Languages and Cultures." EMNLP 2021.
Kadar et al. "Lessons learned in multilingual grounded language learning." CoNLL 2018.
10/30 9: Other modalities (audio) Aytar et al. "SoundNet: Learning sound representations from unlabeled video." NeurIPS 2016.
Afouras et al. "Self-supervised object detection from audio-visual correspondence." CVPR 2022.
Due: project status report (10/30)
11/1 Gao et al. "Learning to Separate Object Sounds by Watching Unlabeled Video." ECCV 2018.
Tan et al. "Language-Guided Audio-Visual Source Separation via Trimodal Consistency." CVPR 2023.
11/6 10: Video, actions, procedures Miech et al. "End-to-end learning of visual representations from uncurated instructional videos." CVPR 2020.
Grauman et al. "Ego4D: Around the World in 3,000 Hours of Egocentric Video." CVPR 2022.
11/8 Papadopoulos et al. "Learning Program Representations for Food Images and Cooking Recipes." CVPR 2022.
Cheng et al. "VindLU: A Recipe for Effective Video-and-Language Pretraining." CVPR 2023.
Dessalene et al. "Therbligs in Action: Video Understanding through Motion Primitives." CVPR 2023.
11/13 11: Social relevance (persuasion, bias) Ye et al. "Interpreting the Rhetoric of Visual Advertisements." TPAMI 2019.
Akula et al. "MetaCLUE: Towards Comprehensive Visual Metaphors Research." CVPR 2023.
11/15 Hendricks et al. "Women also snowboard: Overcoming bias in captioning models." ECCV 2018.
Fung et al. "InfoSurgeon: Cross-media fine-grained information consistency checking for fake news detection." ACL 2021.
11/27 12: Generation Ramesh et al. "Zero-Shot Text-to-Image Generation." ICML 2021.
Rombach et al. "High-Resolution Image Synthesis with Latent Diffusion Models." CVPR 2022.
11/29 Gafni et al. "Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors." ECCV 2022.
Li et al. "GLIGEN: Open-Set Grounded Text-to-Image Generation." CVPR 2023.
Ramesh et al. "Hierarchical Text-Conditional Image Generation with CLIP Latents." arXiv 2022.
12/4 13: Embodied and assistive learning Jiang and Zhao. "Learning Visual Attention to Identify People with Autism Spectrum Disorder." ICCV 2017.
Zuo et al. "Natural Language-Assisted Sign Language Recognition." CVPR 2023.
Inan et al. "Modeling Intensification for Sign Language Generation: A Computational Approach." ACL Findings 2022.
12/6 Gadre et al. "CoWs on PASTURE: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation." CVPR 2023.
Szot et al. "Habitat 2.0: Training Home Assistants to Rearrange their Habitat." NeurIPS 2021.
Ahn et al. "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances." CoRL 2023.
Driess et al. "PaLM-E: An Embodied Multimodal Language Model." ICML 2023.
12/11 14: Project presentations
12/13 Due: project final report (12/15)
