Given only a single picture, people are capable of inferring a mental representation that encodes rich information about the underlying 3D scene. We acquire this skill not through massive labeled datasets of 3D scenes, but through self-supervised observation and interaction. Building machines that can infer similarly rich neural scene representations is critical if they are to one day parallel people’s ability to understand, navigate, and interact with their surroundings. In my talk, I will discuss how this motivates a 3D approach to self-supervised learning for vision. I will then present recent advances from my research group toward training self-supervised scene representation learning methods at scale, on uncurated video without pre-computed camera poses. I will further present recent advances toward modeling uncertainty in 3D scenes, as well as progress on endowing neural scene representations with more semantic, high-level information.
Bio: Vincent is an Assistant Professor at MIT EECS, where he leads the Scene Representation Group. Previously, he completed his Ph.D. at Stanford University. He is interested in the self-supervised training of 3D-aware vision models: his goal is to train models that, given a single image or short video, can reconstruct a representation of the underlying scene that encodes information about materials, affordances, geometry, lighting, etc., a task that is simple for humans but currently impossible for AI.
To request accommodations for a disability, please contact Emily Lawrence at emilyl@cs.princeton.edu at least one week prior to the event.
This talk will be recorded and live streamed via Zoom. Register for the webinar here.