An important open vision problem is to automatically describe what people are doing in a video sequence. This problem is difficult for several reasons. First, one needs to determine how many people (if any) are in each frame and estimate their configurations (where they are and what their arms and legs are doing). Finding people and localizing their limbs is hard because people (a) wear a variety of different clothes, (b) appear in a variety of poses, and (c) tend to partially occlude themselves and each other. Second, one must sew the configuration estimates from individual frames together into a motion path; this is tricky because people can move quickly and unpredictably. Finally, one must describe what each person is doing; this problem is poorly understood, not least because there is no natural or canonical set of categories into which to classify activities. In this talk I will discuss our progress on this problem.

We have developed a tracker that works in two stages: it first (a) builds a model of the appearance of each person in a video and then (b) tracks by detecting those models in each frame ("tracking by model-building and detection"). We then marry the tracker with a motion synthesis engine that re-assembles pre-recorded motion clips. The synthesis engine generates new motions that are human-like and close to the image measurements reported by the tracker. Because the motion clips are labeled, the synthesizer also generates an activity label for each image frame ("analysis by synthesis"). We have tested the system extensively, running it on hundreds of thousands of frames of unscripted indoor and outdoor activity, a feature-length film, and legacy sports footage.
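To make the two-stage idea concrete, here is a minimal, self-contained Python sketch, not the speaker's actual system: the appearance model is reduced to a simple RGB color histogram, and build_appearance_model, detect, and track are hypothetical helpers standing in for the real person detectors.

```python
# Toy illustration of "tracking by model-building and detection":
# stage (a) builds an appearance model (here, just a color histogram)
# from an initial person patch; stage (b) re-detects that model
# independently in every frame. All names here are illustrative
# assumptions, not the actual implementation.
import numpy as np

def build_appearance_model(patch, bins=8):
    """Stage (a): summarize a person patch as a normalized RGB histogram."""
    hist, _ = np.histogramdd(
        patch.reshape(-1, 3), bins=(bins, bins, bins), range=[(0, 256)] * 3
    )
    return hist / hist.sum()

def detect(frame, model, win=(40, 20), stride=10, bins=8):
    """Stage (b): slide a window over the frame and return the location
    whose histogram best matches the model (histogram intersection)."""
    h, w = win
    best_score, best_loc = -1.0, None
    for y in range(0, frame.shape[0] - h + 1, stride):
        for x in range(0, frame.shape[1] - w + 1, stride):
            cand = build_appearance_model(frame[y:y + h, x:x + w], bins)
            score = np.minimum(cand, model).sum()
            if score > best_score:
                best_score, best_loc = score, (y, x)
    return best_loc, best_score

def track(frames, init_box):
    """Fit the model once, then detect it per frame: because each frame is
    scored against the same fixed model, the tracker cannot drift the way
    frame-to-frame trackers can."""
    y, x, h, w = init_box
    model = build_appearance_model(frames[0][y:y + h, x:x + w])
    return [detect(f, model, win=(h, w))[0] for f in frames]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.integers(0, 256, size=(3, 120, 160, 3), dtype=np.uint8)
    print(track(frames, init_box=(40, 60, 40, 20)))
```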
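A correspondingly minimal sketch of the "analysis by synthesis" step, again under illustrative assumptions: each pre-recorded clip frame is reduced to a pose vector plus an activity label, and a greedy per-frame search (a stand-in for the actual engine) trades off agreement with the tracker's measurement against continuity with the previously chosen pose. The labels ride along with the selected poses.

```python
# Toy "analysis by synthesis": re-assemble labeled, pre-recorded poses so
# the result stays human-like (close to the previously chosen pose) while
# matching the tracker's per-frame measurements. The clip format, cost
# weights, and greedy search are all illustrative assumptions.
import numpy as np

def synthesize(measurements, clip_poses, clip_labels, continuity=0.5):
    """Pick one library pose per measured frame; return (poses, labels)."""
    chosen_poses, chosen_labels = [], []
    prev = None
    for m in measurements:
        # Cost: distance to the image measurement, plus a smoothness
        # penalty for jumping far from the previously selected pose.
        cost = np.linalg.norm(clip_poses - m, axis=1)
        if prev is not None:
            cost = cost + continuity * np.linalg.norm(clip_poses - prev, axis=1)
        i = int(np.argmin(cost))
        prev = clip_poses[i]
        chosen_poses.append(clip_poses[i])
        chosen_labels.append(clip_labels[i])
    return np.array(chosen_poses), chosen_labels

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # A tiny "motion library": 100 pose vectors, each tagged with an activity.
    clip_poses = rng.normal(size=(100, 6))
    clip_labels = ["walk" if i < 50 else "run" for i in range(100)]
    measurements = rng.normal(size=(5, 6))  # noisy tracker reports
    poses, labels = synthesize(measurements, clip_poses, clip_labels)
    print(labels)
```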
Date and Time
Wednesday, April 13, 2005, 4:00pm - 5:30pm
Location
Computer Science Small Auditorium (Room 105)
Speaker
Deva Ramanan, UC Berkeley
Host
Adam Finkelstein