CS1674: Homework 10 - Practice

Due: 4/19/2018, 11:59pm

This assignment is worth 20 points, and will be graded by the instructor.

In this assignment, you will "test" the state of the art in computer vision by using four web demos: two for object/concept recognition, one for image captioning, and one for visual question answering. For each task, you will find two images that make the demo produce a very good (accurate, human-like) response, and three that make it produce a very bad (incorrect, illogical, or non-fluent) response. You will then write a 5-to-10-sentence essay summarizing what you think computer vision currently can and cannot do well.

At most one of the images you find may overlap between Parts I, II, and III.


Part I: Object/concept recognition (5 points)

Examine the Clarifai demo and the Google Cloud Vision demo. Note that both provide object labels (for Google Cloud, click on the Labels tab) as well as other labels (e.g. faces). Focus on the semantic labels (such as objects and faces). Find two images for which the labels you get from either demo are accurate, and three for which the labels are grossly inaccurate in both demos. Submit your images, using the naming format below.
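If you would rather script your Part I experiments than upload images by hand, the same label detection behind the Google Cloud Vision demo is exposed through its public REST endpoint (images:annotate with a LABEL_DETECTION feature). A minimal sketch of building the request payload is below; the endpoint URL and the use of an API key are assumptions based on the public REST documentation, and you would need your own key to actually send the request.

```python
import base64

# Endpoint (assumed from the public REST API); append ?key=YOUR_API_KEY
VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"

def build_label_request(image_path, max_results=10):
    """Build the JSON payload for a LABEL_DETECTION request,
    which returns semantic labels like those shown in the demo."""
    with open(image_path, "rb") as f:
        # The API expects the image bytes as a base64-encoded string.
        content = base64.b64encode(f.read()).decode("utf-8")
    return {
        "requests": [{
            "image": {"content": content},
            "features": [{
                "type": "LABEL_DETECTION",
                "maxResults": max_results,
            }],
        }]
    }
```

The payload returned by build_label_request would then be POSTed (e.g. with the requests library) to VISION_ENDPOINT; the response lists labels with confidence scores, which you can compare against what the web demo shows.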


Part II: Image captioning (5 points)

Examine the Caption Bot demo by Microsoft. Find two images where the caption you get seems roughly accurate, and the produced phrase sounds like decent English. Find three images where the produced caption seems incorrect or too vague to be useful. Submit your images.


Part III: Visual question answering (5 points)

Examine this VQA demo by CloudCV. Find two image-question pairs for which the demo gives you accurate answers, and three image-question pairs for which the returned answers are incorrect. Submit your images and questions.


Part IV: Essay (5 points)

Write a 5-to-10-sentence essay describing the state of the art in computer vision, based on the experiments you conducted, and the good/bad results. Describe the following: For what kind of images (and questions) do the web demos work well? What kind of images (or questions) "break" the demos and cause them to fail? Based on your findings, what kind of tasks do we still need to solve in computer vision?


Submission:
Grading rubric: