A Caltech Library Service

Vision for Social Robots: Human Perception and Pose Estimation


Ronchi, Matteo Ruggero (2020) Vision for Social Robots: Human Perception and Pose Estimation. Dissertation (Ph.D.), California Institute of Technology. doi:10.7907/n2v1-1g79.


In order to extract the underlying meaning from a scene captured from the surrounding world in a single still image, social robots will need to learn the human ability to detect different objects, understand their arrangement and relationships relative both to their own parts and to each other, and infer the dynamics under which they are evolving. Furthermore, they will need to develop and hold a notion of context to allow assigning different meanings (semantics) to the same visual configuration (syntax) of a scene.

The underlying thread of this Thesis is the investigation of new ways for enabling interactions between social robots and humans, by advancing the visual perception capabilities of robots when they process images and videos in which humans are the main focus of attention.

First, we analyze the general problem of scene understanding, as social robots moving through the world need to be able to interpret scenes without having been assigned a specific preset goal. Throughout this line of research, i) we observe that human actions and interactions which can be visually discriminated from an image follow a very heavy-tailed distribution; ii) we develop an algorithm that can obtain a spatial understanding of a scene by only using cues arising from the effect of perspective on a picture of a person’s face; and iii) we define a novel taxonomy of errors for the task of estimating the 2D body pose of people in images to better explain the behavior of algorithms and highlight their underlying causes of error.

Second, we focus on the specific task of 3D human pose and motion estimation from monocular 2D images using weakly supervised training data, as accurately predicting human pose will open up the possibility of richer interactions between humans and social robots. We show that when 3D ground-truth data is only available in small quantities, or not at all, it is possible to leverage knowledge about the physical properties of the human body, along with additional constraints related to alternative types of supervisory signals, to learn models that can regress the full 3D pose of the human body and predict its motions from monocular 2D images.

Taken in its entirety, the intent of this Thesis is to highlight the importance of, and provide novel methodologies for, social robots' ability to interpret their surrounding environment, learn in a way that is robust to low data availability, and generalize previously observed behaviors to unknown situations in a similar way to humans.

Item Type:Thesis (Dissertation (Ph.D.))
Subject Keywords:Computer vision, social robots, machine learning, weakly-supervised learning, self-supervised learning, scene understanding, 3D pose estimation
Degree Grantor:California Institute of Technology
Division:Engineering and Applied Science
Major Option:Computer Science
Thesis Availability:Public (worldwide access)
Research Advisor(s):
  • Perona, Pietro
Thesis Committee:
  • Yue, Yisong (chair)
  • Perona, Pietro
  • Ames, Aaron D.
  • Bouman, Katherine L.
  • Papon, Jeremie
Defense Date:20 December 2019
Record Number:CaltechTHESIS:05212020-155425112
Persistent URL:
Related URLs:
  • Web page containing all the materials included in this thesis.
  • Adapted for Ch. 4.
  • Adapted for Ch. 5.
  • Adapted for Ch. 6.
  • Adapted for Ch. 7.
  • Adapted for Ch. 8.
  • Adapted for Ch. 9.
ORCID:Ronchi, Matteo Ruggero (0000-0002-4277-3314)
Default Usage Policy:No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code:13713
Deposited By: Matteo Ruggero Ronchi
Deposited On:08 Jun 2020 16:10
Last Modified:08 Nov 2023 00:44

Thesis Files

PDF - Final Version
See Usage Policy.

