The field of computer vision has made enormous progress in the last few years, largely due to convolutional neural networks. Despite success on traditional computer vision tasks, our systems are still a long way from the general visual intelligence of people. I will argue that an important facet of visual intelligence is composition: understanding of the whole derives from an understanding of the parts. To achieve the goal of compositional visual intelligence, we must explore new computer vision tasks, create new datasets, and develop new models that exploit compositionality. I will discuss the Visual Genome dataset, which we created in service of these goals, and three research directions enabled by this new data in which incorporating compositionality results in systems with richer visual intelligence.
I will first discuss image captioning: traditional systems generate short sentences describing images, but by decomposing images into regions and descriptions into phrases we can generate two types of richer descriptions: dense captions and paragraphs. Second, I will discuss visual question answering: existing datasets (including Visual Genome) consist primarily of short, simple questions; to study more complex questions requiring compositional reasoning, we built the CLEVR dataset and show that existing methods fall short on this new benchmark. We then propose an explicitly compositional model for visual question answering that internally converts questions to functional programs and executes these programs by composing neural modules. Third, I will discuss text-to-image synthesis: existing systems can generate simple images of a single object conditioned on text descriptions, but struggle with more complex descriptions. By replacing freeform natural language with compositional scene graphs of objects and relationships, we can generate complex images containing multiple objects. I will conclude by discussing future areas where compositionality can be used to enrich visual intelligence.
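
To make the module-composition idea concrete, here is a minimal sketch in PyTorch. The function names (filter_red, filter_cube, count) and the module architecture are hypothetical illustrations, not the talk's actual model; in particular, the learned sequence-to-sequence component that predicts a program from a question is omitted.

    import torch
    import torch.nn as nn

    class NeuralModule(nn.Module):
        # One small reusable network; every module maps features to features.
        def __init__(self, dim=64):
            super().__init__()
            self.net = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU())

        def forward(self, x):
            return self.net(x)

    # One module per primitive function in the program vocabulary
    # (hypothetical names; a real vocabulary is much larger).
    modules = nn.ModuleDict({
        name: NeuralModule() for name in ["filter_red", "filter_cube", "count"]
    })

    def execute_program(program, image_features):
        # Answer a question by chaining module outputs along the program.
        h = image_features
        for fn in program:
            h = modules[fn](h)
        return h  # a full system would end with a classifier over answers

    features = torch.randn(1, 64, 14, 14)  # stand-in for CNN image features
    program = ["filter_red", "filter_cube", "count"]  # e.g. "how many red cubes?"
    out = execute_program(program, features)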
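
Likewise, the scene graph conditioning the image-synthesis model can be a very small structure: object labels plus (subject, predicate, object) triples. A minimal sketch, with illustrative labels rather than the exact schema from the talk:

    # Objects plus relationship triples that index into the object list.
    scene_graph = {
        "objects": ["sky", "grass", "sheep", "sheep", "tree"],
        "relationships": [
            (0, "above", 1),        # sky above grass
            (2, "standing on", 1),  # sheep standing on grass
            (3, "by", 2),           # sheep by sheep
            (4, "behind", 2),       # tree behind sheep
        ],
    }
    # The synthesis model conditions on this structure instead of free text.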
Bio:
Justin is a PhD candidate in Computer Science at Stanford University, advised by Fei-Fei Li. His research interests lie at the intersection of computer vision and machine learning. Since 2015 he has co-taught a Stanford course on convolutional neural networks and deep learning with Andrej Karpathy, Serena Yeung, and Fei-Fei Li, which has been viewed hundreds of thousands of times online. He received his BS in Mathematics and Computer Science from the California Institute of Technology, and during his PhD he has spent time at Google Cloud AI, Facebook AI Research, and Yahoo Research.