
Visual Recognition: Computational Models and Human Psychophysics


Li, Fei-Fei (2005) Visual Recognition: Computational Models and Human Psychophysics. Dissertation (Ph.D.), California Institute of Technology. doi:10.7907/G5NP-KH67.


Object and scene recognition is one of the most essential functionalities of human vision. It is also of fundamental importance for machines to be able to learn and recognize meaningful objects and scenes. In this thesis, we explore the following four aspects of object and scene recognition.

It is well known that humans can be "blind" even to major aspects of natural scenes when we attend elsewhere. The only tasks that do not need attention appear to be carried out in the early stages of the visual system. Contrary to this common belief, we show that subjects can rapidly detect animals or vehicles in briefly presented novel natural scenes while simultaneously performing another attentionally demanding task. By comparison, they are unable to discriminate large T’s from L’s, or bisected two-color disks from their mirror images under the same conditions. We explore this phenomenon further by removing color from the natural scenes, or increasing the number of images presented peripherally. We find evidence suggesting that familiarity and meaningfulness might be among the factors that determine attentional requirements for both natural and synthetic stimuli.

So what exactly do we see when we glance at a natural scene? And does what we see change as the glance becomes longer? We asked naive subjects to report what they saw in nearly a hundred briefly presented photographs. After each presentation subjects reported what they had just seen as completely as possible. Afterward, another group of sophisticated individuals who were not aware of the goals of the experiment were instructed to score each of the descriptions produced by the subjects in the first stage. Individual scores were assigned to more than a hundred different attributes. Given the evaluation of the responses, we show that within a single glance, much object and scene level information is perceived by human subjects. But the richness of our perception seems asymmetrical: subjects tend to perceive natural scenes as outdoor rather than indoor.

In computer vision, learning visual models of object categories notoriously requires thousands of training examples. We show that it is possible to learn much information about a category from just one image, or a handful of images. The key insight is that, rather than learning from scratch, one can take advantage of knowledge coming from previously learnt categories, no matter how different these categories might be. We explore a Bayesian implementation of this idea. Object categories are represented by probabilistic models. Prior knowledge is represented as a probability density function on the parameters of these models. The posterior model for an object category is obtained by updating the prior in the light of one or more observations. We test a simple implementation of our algorithm on a database of 101 diverse object categories, comparing the category models it learns with models learnt by maximum likelihood (ML) and maximum a posteriori (MAP) methods. We find that the Bayesian approach produces informative models when the number of training examples is too small for the other methods to operate successfully.
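The prior-to-posterior update described above can be illustrated with a deliberately small toy problem. The sketch below is not the thesis's object model: it assumes a one-dimensional Gaussian with known noise variance and a Gaussian prior on its mean, and every name and number in it is hypothetical. It only shows how a prior pulls a one-example estimate away from the maximum likelihood value and how the remaining parameter uncertainty widens the prediction.

```python
import math

def ml_mean(observations):
    """Maximum likelihood estimate of the mean: the sample average."""
    return sum(observations) / len(observations)

def posterior_over_mean(prior_mean, prior_var, observations, noise_var):
    """Conjugate Gaussian update of the belief about the mean (noise variance assumed known)."""
    n = len(observations)
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mean = post_var * (prior_mean / prior_var + sum(observations) / noise_var)
    return post_mean, post_var

def predictive_log_density(x, post_mean, post_var, noise_var):
    """Log density of a new point, integrating over the uncertainty in the mean."""
    var = post_var + noise_var
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - post_mean) ** 2 / var)

# Hypothetical prior, standing in for knowledge from previously learnt categories.
prior_mean, prior_var = 0.0, 4.0
noise_var = 1.0
observations = [2.0]  # a single training example

ml_estimate = ml_mean(observations)
post_mean, post_var = posterior_over_mean(prior_mean, prior_var, observations, noise_var)
log_p = predictive_log_density(1.0, post_mean, post_var, noise_var)
```

In this conjugate toy case the MAP estimate of the mean coincides with the posterior mean, so the difference between MAP and the full Bayesian treatment shows up only in the predictive variance, which adds `post_var` to `noise_var`.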

We also propose a novel approach to learn and recognize natural scene categories. Unlike previous work, it does not require experts to annotate the training set. We represent the image of a scene by a collection of local regions, denoted as codewords obtained by unsupervised learning. Each region is represented as part of a "theme." In previous work, such themes were learnt from expert hand annotations, while our method learns the theme distributions as well as the codeword distribution over the themes without supervision. We report satisfactory categorization performance on a large set of 13 categories of complex scenes.
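As a rough illustration of the codeword representation only, the sketch below maps local region descriptors to their nearest codeword and summarizes an image as a normalized codeword histogram. It assumes a codebook has already been obtained by some unsupervised method; the two-dimensional descriptors, the codebook values, and the function names are all hypothetical, and the theme model described above is not implemented here.

```python
def nearest_codeword(descriptor, codebook):
    """Return the index of the codeword closest to the descriptor (squared Euclidean distance)."""
    best_index, best_dist = 0, float("inf")
    for index, codeword in enumerate(codebook):
        dist = sum((d - c) ** 2 for d, c in zip(descriptor, codeword))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

def codeword_histogram(descriptors, codebook):
    """Summarize an image as the relative frequency of each codeword among its regions."""
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        counts[nearest_codeword(descriptor, codebook)] += 1
    total = sum(counts)
    return [count / total for count in counts] if total else counts

# Hypothetical codebook and local region descriptors for one image.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
patches = [(0.1, 0.0), (0.9, 1.1), (0.2, 0.1), (0.1, 0.9)]
histogram = codeword_histogram(patches, codebook)
```

A histogram of this kind discards the spatial layout of the regions; a model over themes would sit on top of the codeword assignments rather than replace them.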

Item Type: Thesis (Dissertation (Ph.D.))
Subject Keywords: categorization; computer vision; human vision; machine learning; natural scene; object recognition; psychophysics
Degree Grantor: California Institute of Technology
Division: Engineering and Applied Science
Major Option: Electrical Engineering
Awards: Caltech Distinguished Alumni Award, 2020.
Thesis Availability: Public (worldwide access)
Research Advisor(s):
  • Perona, Pietro
Group: Caltech Distinguished Alumni Award
Thesis Committee:
  • Perona, Pietro (chair)
  • Koch, Christof
  • Biederman, Irving
  • Abu-Mostafa, Yaser S.
Defense Date: 27 February 2005
Record Number: CaltechETD:etd-06022005-150332
Persistent URL:
Author ORCID: Li, Fei-Fei, 0000-0002-7481-0810
Default Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 2390
Deposited By: Imported from ETD-db
Deposited On: 03 Jun 2005
Last Modified: 08 Nov 2023 00:44

Thesis Files

PDF (FeiFeiLi_phD_thesis_2005.pdf) - Final Version

