A Caltech Library Service

Advancing a Machine's Visual Awareness of People


Hall, David Christopher (2017) Advancing a Machine's Visual Awareness of People. Dissertation (Ph.D.), California Institute of Technology. doi:10.7907/Z9ST7MWC.


Methods to advance a machine's visual awareness of people with a focus on understanding 'who is where' in video are presented. 'Who' is used in a broad sense that includes not only the identity of a person but attributes of that person as well. Efforts are focused on improving algorithms in four areas of visual recognition: detection, tracking, fine-grained classification and person reidentification.

Each of these problems appears quite different on the surface; however, two broader questions are answered across the works. First, the machine is able to make better predictions when it has access to the extra information that is available in video. Second, it is possible to learn on-the-fly from single examples. How each work contributes to answering these over-arching questions, as well as its specific contributions to the relevant problem domain, is as follows:

The first problem studied is one-shot, real-time, instance detection. Given a single image of a person, the task for the machine is to learn a detector that is specific to that individual rather than to an entire category such as faces or pedestrians. In subsequent images, the individual detector indicates the size and location of that particular person in the image. The learning must be done in real-time. To solve this problem, the proposed method starts with a pre-trained boosted category detector from which an individual-object detector is trained, with near-zero computational cost, through elementary manipulations of the thresholds of the category detector. Experiments on two challenging pedestrian and face datasets indicate that it is indeed possible to learn identity classifiers in real-time; besides being faster-trained, the proposed classifier has better detection rates than previous methods.

The second problem studied is real-time tracking. Given the initial location of a target person, the task for the machine is to determine the size and location of the target person in subsequent video frames, in real-time. The method proposed for solving this problem treats tracking as a repeated detection problem where potential targets are identified with a pre-trained boosted person detector and identity across frames is established by individual-specific detectors. The individual-specific detectors are learnt using the method proposed to solve the first problem. The proposed algorithm runs in real-time and is robust to drift. The tracking algorithm is benchmarked against nine state-of-the-art trackers on two benchmark datasets. Results show that the proposed method is 10% more accurate and nearly as fast as the fastest of the competing algorithms, and it is as accurate but 20 times faster than the most accurate of the competing algorithms.
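The idea of tracking as repeated detection can be illustrated with a minimal sketch. Everything below is hypothetical and not taken from the thesis: candidates are assumed to come from some category detector, each carries a scalar `feature`, and `individual_score` stands in for an individual-specific detector. The threshold `min_score` is likewise an assumption.

```python
def track(frames, individual_score, min_score=0.5):
    """Pick, in each frame, the candidate that the individual scorer rates highest.

    frames: list of frames; each frame is a list of candidate dicts
            (assumed to be produced by a category-level person detector).
    individual_score: hypothetical function mapping a candidate to a score.
    Returns one box per frame, or None when no candidate scores at least min_score.
    """
    trajectory = []
    for candidates in frames:
        best = max(candidates, key=individual_score, default=None)
        if best is not None and individual_score(best) >= min_score:
            trajectory.append(best["box"])
        else:
            trajectory.append(None)  # target not found in this frame
    return trajectory


# Toy data: a scalar "feature" stands in for appearance; 0.9 is the target's template.
TEMPLATE = 0.9

def toy_score(candidate):
    return 1.0 - abs(candidate["feature"] - TEMPLATE)

frames = [
    [{"box": (10, 10, 20, 40), "feature": 0.9}, {"box": (50, 10, 20, 40), "feature": 0.2}],
    [{"box": (12, 10, 20, 40), "feature": 0.85}, {"box": (48, 11, 20, 40), "feature": 0.25}],
    [{"box": (47, 12, 20, 40), "feature": 0.3}],
]
trajectory = track(frames, toy_score)
```

In this toy run the target is selected in the first two frames and reported as missing in the third, where only the distractor is present.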

The third problem studied is the fine-grained classification of people. Given an image of a person, the task for the machine is to estimate characteristics of that person such as age, clothing style, sex, occupation, social status, ethnicity, emotional state and/or body type. Since fine-grained classification using the entire human body is a relatively unexplored area, a large video dataset was collected. To solve this problem, a method that uses deep neural networks and video of a person is proposed. Results show that the class average accuracy when combining information from a sequence of images of an individual and then predicting the label is 3.5-7.1% better than independently predicting the label of each image, when severely under-represented classes are ignored.
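The comparison between per-image prediction and sequence-level prediction can be made concrete with a small sketch. The abstract does not say how information from a sequence is combined, so averaging the per-frame class scores is an assumption made here for illustration only, and the scores are toy values rather than thesis data.

```python
from collections import defaultdict


def argmax(scores):
    return max(range(len(scores)), key=lambda i: scores[i])


def pool_sequence(frame_scores):
    """Average per-frame class scores over a sequence and return the top class.
    Averaging is an assumed pooling rule, not the thesis's method."""
    n_classes = len(frame_scores[0])
    mean = [sum(f[c] for f in frame_scores) / len(frame_scores) for c in range(n_classes)]
    return argmax(mean)


def class_average_accuracy(true_labels, predicted_labels):
    """Mean of per-class accuracies, so rare classes weigh as much as common ones."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(true_labels, predicted_labels):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy sequences: (true label, per-frame class scores for two classes).
sequences = [
    (0, [[0.6, 0.4], [0.45, 0.55], [0.7, 0.3]]),
    (1, [[0.55, 0.45], [0.2, 0.8], [0.3, 0.7]]),
]

# Independent prediction: one label per frame.
frame_true = [label for label, frames in sequences for _ in frames]
frame_pred = [argmax(f) for _, frames in sequences for f in frames]
frame_acc = class_average_accuracy(frame_true, frame_pred)

# Sequence prediction: pool the frames first, then predict one label.
seq_true = [label for label, _ in sequences]
seq_pred = [pool_sequence(frames) for _, frames in sequences]
pooled_acc = class_average_accuracy(seq_true, seq_pred)
```

On this toy input, one ambiguous frame in each sequence is misclassified when frames are judged independently, while pooling recovers the correct label for both sequences.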

The final problem studied is person reidentification. Given an image of a person, the task for the machine is to find images that match the identity of that person from a large set of candidate images. This is a challenging task since images of the same individual can vary significantly due to changes in clothing, viewpoint, pose, lighting and background. The method proposed for solving this problem is a two-stage deep neural network architecture that uses body part patches as inputs rather than an entire image of a person. Experiments show that rank-1 matching rates increase by 22-25.6% on benchmark datasets when compared to state-of-the-art methods.
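For readers unfamiliar with the metric, the rank-1 matching rate is the fraction of query images whose top-ranked candidate shares the query's identity. The sketch below computes it for hypothetical feature vectors using Euclidean distance; the features, identities, and distance choice are illustrative assumptions and say nothing about the network described above.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank1_rate(queries, gallery):
    """Fraction of queries whose nearest gallery entry has the same identity.

    queries, gallery: lists of (identity, feature_vector) pairs.
    """
    hits = 0
    for query_id, query_feat in queries:
        best_id, _ = min(gallery, key=lambda item: euclidean(query_feat, item[1]))
        hits += int(best_id == query_id)
    return hits / len(queries)


# Toy data: identity "c" in the query set lies closer to gallery entry "b".
gallery = [("a", [0.0, 0.0]), ("b", [1.0, 1.0]), ("c", [5.0, 5.0])]
queries = [("a", [0.1, -0.1]), ("b", [0.9, 1.2]), ("c", [1.4, 1.3])]
rate = rank1_rate(queries, gallery)
```

Here two of the three queries are matched correctly at rank 1; the third is confused with another identity, which is the kind of error that appearance changes can cause.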

Item Type: Thesis (Dissertation (Ph.D.))
Subject Keywords: computer vision, machine learning, tracking, instance detection, fine-grained classification, person re-identification
Degree Grantor: California Institute of Technology
Division: Engineering and Applied Science
Major Option: Electrical Engineering
Thesis Availability: Public (worldwide access)
Research Advisor(s):
  • Perona, Pietro
Thesis Committee:
  • Perona, Pietro (chair)
  • Abu-Mostafa, Yaser S.
  • Yue, Yisong
  • Kostina, Victoria
  • Munich, Mario E.
Defense Date: 25 May 2017
Record Number: CaltechTHESIS:05292017-224523022
Persistent URL:
Related URLs (URL, URL Type, Description):
  • adapted for Ch. 2.
  • adapted for Ch. 3.
  • adapted for Ch. 4.
ORCID: Hall, David Christopher, 0000-0003-3244-5744
Default Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 10218
Deposited By: David Hall
Deposited On: 07 Jun 2017 00:12
Last Modified: 16 Jun 2017 22:22

Thesis Files

PDF - Final Version
See Usage Policy.