CaltechTHESIS
  A Caltech Library Service

Kinematic Measurement and Feature Sets for Automatic Speech Recognition

Citation

Fain, Daniel Clark (2001) Kinematic Measurement and Feature Sets for Automatic Speech Recognition. Dissertation (Ph.D.), California Institute of Technology. doi:10.7907/9vse-8c78. https://resolver.caltech.edu/CaltechTHESIS:11192010-074900476

Abstract

This thesis examines the use of measured and inferred kinematic information in automatic speech recognition and lipreading, and investigates the relative information content and recognition performance of vowels and consonants. The kinematic information describes the motions of the organs of speech, the articulators. The contributions of this thesis include a new device and set of algorithms for lipreading (their design, construction, implementation, and testing); incorporation of direct articulator-position measurements into a speech recognizer; and reevaluation of some assumptions regarding vowels and consonants. The motivation for including articulatory information is to improve modeling of coarticulation and to reconcile multiple input modalities for lipreading. Coarticulation, a ubiquitous phenomenon, is the process by which speech sounds are modified by preceding and following sounds. To be useful in practice, a recognizer will have to infer articulatory information from sound, video, or both. Previous work made progress towards recovery of articulation from sound. The present project assumes that such recovery is possible; it examines the advantage of joint acoustic-articulatory representations over acoustic-only ones. Also reported is an approach to recovery from video in which camera placement (side view, head-mounted) and lighting are chosen to robustly obtain lip-motion information. Joint acoustic-articulatory recognition experiments were performed using the University of Wisconsin X-ray Microbeam Speech Production Database. Speaker-dependent monophone recognizers, based on hidden Markov models, were tested on paragraphs each lasting about 20 seconds. Results were evaluated at the phone level and tabulated by several classes (vowel, stop, and fricative). Measured articulator coordinates were transformed by principal components analysis, and velocity and acceleration were appended.
Concatenating the transformed articulatory information to a standard acoustic (cepstral) representation reduced the error rate by 7.4%, demonstrating across-speaker statistical significance (p = 0.018). Articulation improved recognition of male speakers more than female, and recognition of vowels more than fricatives or stops. The analysis of vowels, stops, and fricatives included both the articulatory recognizer of chapter 3 and other recognizers for comparison. The information content of the different classes was also estimated. The results contradict previous assumptions about recognition performance, and the findings on information content require consonants to be defined to include vowel-like sounds.
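The feature construction described in the abstract, principal components analysis of articulator coordinates with velocity and acceleration appended, then concatenated with cepstral features, can be sketched as follows. This is an illustrative sketch only: the dimensions, helper names, and use of `numpy.gradient` for the deltas are assumptions, not details taken from the thesis.

```python
import numpy as np

def pca_transform(X, n_components):
    """Project rows of X onto its top principal components.
    X: (frames, coordinates) matrix of measured articulator positions."""
    Xc = X - X.mean(axis=0)                       # center each coordinate
    # SVD of the centered data yields the principal directions in Vt.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def append_deltas(F):
    """Append first (velocity) and second (acceleration) differences
    along the time axis to each feature frame."""
    vel = np.gradient(F, axis=0)
    acc = np.gradient(vel, axis=0)
    return np.hstack([F, vel, acc])

# Toy data: 100 frames, 16 articulator coordinates (e.g. 8 pellets x 2D),
# and a 12-dimensional cepstral vector per frame.
rng = np.random.default_rng(0)
articulators = rng.standard_normal((100, 16))
cepstra = rng.standard_normal((100, 12))

art_feats = append_deltas(pca_transform(articulators, n_components=6))
# Joint acoustic-articulatory feature vector per frame.
features = np.hstack([cepstra, art_feats])
print(features.shape)  # (100, 12 + 6*3) -> (100, 30)
```

In an HMM recognizer of the kind described, each row of `features` would serve as one observation vector; the comparison in the thesis is between recognizers trained on the cepstral columns alone and on the concatenated representation.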

Item Type: Thesis (Dissertation (Ph.D.))
Subject Keywords: Computation and Neural Systems
Degree Grantor: California Institute of Technology
Division: Engineering and Applied Science
Major Option: Computation and Neural Systems
Thesis Availability: Public (worldwide access)
Research Advisor(s):
  • Barr, Alan H.
Thesis Committee:
  • Abu-Mostafa, Yaser S.
  • Arvo, James R.
  • Perona, Pietro
Defense Date: 14 May 2001
Record Number: CaltechTHESIS:11192010-074900476
Persistent URL: https://resolver.caltech.edu/CaltechTHESIS:11192010-074900476
DOI: 10.7907/9vse-8c78
Default Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 6182
Collection: CaltechTHESIS
Deposited By: Rita Suarez
Deposited On: 22 Nov 2010 18:13
Last Modified: 19 Apr 2021 22:40

Thesis Files

PDF - Final Version (4MB). See Usage Policy.
