Speaker: Jitendra Malik
UC Berkeley
Monday, February 10, 2003
11:00 am to 12:00 pm
AP&M Room 4301
ABSTRACT
The central problem in computer vision is that of "understanding" images and video. I will talk about recent
progress at UC Berkeley on two principal components: recognizing objects and recognizing actions.
The object recognition problem is that of finding instances of object classes in images or video sequences:
faces, giraffes, the digit 5, chairs, etc. This has to be accomplished while allowing for intra-class variation,
as well as changes in illumination and viewpoint. Belongie, Malik and Puzicha (2001) introduced a relational
descriptor for shapes represented as point sets, the "shape context". This enables one to compute similarity
measures between shapes which, together with similarity measures for texture and color, can be used to drive
object recognition. I will show further steps toward a complete theory of object recognition based on shape contexts.
These include (1) algorithmic speedups for finding likely matches at a computational complexity sublinear in the
number of models, (2) dealing with scene clutter, and (3) adaptive measures of shape distance for discriminative
categorization. I will show results on a variety of 2D and 3D recognition problems.
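For readers unfamiliar with the shape context, the following is a minimal sketch of the idea, not the speaker's implementation. It assumes the commonly described formulation: for each point, a log-polar histogram of the relative positions of the other points, compared with a chi-squared cost. The function names (`shape_contexts`, `chi2_cost`) and the bin settings (5 radial by 12 angular bins, radii from 1/8 to 2 times the mean pairwise distance) are illustrative assumptions, not details taken from this abstract.

```python
import numpy as np

def shape_contexts(points, n_r=5, n_theta=12, r_inner=0.125, r_outer=2.0):
    """Log-polar histograms of relative point positions, one row per point.

    Assumed (hypothetical) settings: n_r x n_theta bins, radii measured in
    units of the mean pairwise distance. Needs at least two distinct points.
    """
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    diff = pts[None, :, :] - pts[:, None, :]        # diff[i, j] = pts[j] - pts[i]
    dist = np.hypot(diff[..., 0], diff[..., 1])
    off_diag = ~np.eye(n, dtype=bool)
    dist = dist / dist[off_diag].mean()             # normalize for scale
    theta = np.arctan2(diff[..., 1], diff[..., 0]) % (2 * np.pi)

    # Radial bin k covers (r_edges[k-1], r_edges[k]]; bin 0 starts at radius 0.
    r_edges = np.logspace(np.log10(r_inner), np.log10(r_outer), n_r)
    r_bin = np.searchsorted(r_edges, dist)          # equals n_r beyond r_outer
    t_bin = np.minimum((theta / (2 * np.pi) * n_theta).astype(int), n_theta - 1)

    valid = off_diag & (r_bin < n_r)                # skip self and far points
    hist = np.zeros((n, n_r * n_theta), dtype=int)
    for i in range(n):
        keep = valid[i]
        idx = r_bin[i, keep] * n_theta + t_bin[i, keep]
        hist[i] = np.bincount(idx, minlength=n_r * n_theta)
    return hist

def chi2_cost(sc_a, sc_b):
    """Chi-squared cost between every descriptor in sc_a and every one in sc_b."""
    a = sc_a / np.maximum(sc_a.sum(axis=1, keepdims=True), 1)
    b = sc_b / np.maximum(sc_b.sum(axis=1, keepdims=True), 1)
    num = (a[:, None, :] - b[None, :, :]) ** 2
    den = a[:, None, :] + b[None, :, :]
    terms = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return 0.5 * terms.sum(axis=2)
```

A cost matrix such as `chi2_cost(shape_contexts(p), shape_contexts(q))` could then be passed to an assignment solver to find point correspondences between two shapes. Because only relative positions are histogrammed, the descriptor does not change when the shape is translated.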
The action recognition problem is that of finding instances of actions in video sequences: run, jump, kick, etc.
This has to be accomplished while allowing for variation in the person performing the action, clothing,
illumination and viewpoint. We have developed two approaches to recognition of actions. In low resolution data
("far field"), the approach is based on collecting low resolution optical flow measurements over a spatiotemporal
volume for each moving figure, constructing a robust descriptor from this volume, and then matching these descriptors
to stored sequences. We show generalization over person, clothing and illumination, while pose variations are dealt
with in a multiple-view framework. In high resolution data ("near field"), the approach is based on extracting stick
figures in each frame and relying on joint-level human body tracking to provide a complete intermediate
representation that is robust to lighting and clothing as well as pose.
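The far-field pipeline described above (flow over a spatiotemporal volume, a descriptor, then matching) might be sketched as follows. The abstract does not specify how the descriptor is built, so everything here is an assumption for illustration: optical flow is taken as already computed in a figure-centred window, the descriptor splits it into four non-negative half-wave rectified channels and normalizes to unit length, and matching uses a plain dot product against labelled stored descriptors. The names `motion_descriptor` and `best_match` are hypothetical.

```python
import numpy as np

def motion_descriptor(flow):
    """Unit-length descriptor from a (T, H, W, 2) stack of optical flow fields.

    Assumption: flow[..., 0] and flow[..., 1] hold horizontal and vertical
    flow for T frames of a window that tracks one moving figure.
    """
    fx, fy = flow[..., 0], flow[..., 1]
    # Half-wave rectification: keep positive and negative motion separately.
    channels = np.stack(
        [np.maximum(fx, 0), np.maximum(-fx, 0),
         np.maximum(fy, 0), np.maximum(-fy, 0)],
        axis=-1,
    )
    v = channels.reshape(-1)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def best_match(query, stored):
    """Return the best-scoring label and all scores (dot-product similarity)."""
    scores = {label: float(np.dot(query, d)) for label, d in stored.items()}
    return max(scores, key=scores.get), scores
```

In use, `stored` would map action labels to descriptors of previously seen sequences with the same volume shape, and a new clip would be labelled by `best_match(motion_descriptor(clip_flow), stored)`. A practical system would likely also blur the channels and compare at several temporal offsets; those steps are omitted here.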
This talk is based on joint work; please visit http://http.cs.berkeley.edu/projects/vision/vision_group.html for
pointers to publications.