Citation
Fain, Daniel Clark (2001) Kinematic Measurement and Feature Sets for Automatic Speech Recognition. Dissertation (Ph.D.), California Institute of Technology. doi:10.7907/9vse-8c78. https://resolver.caltech.edu/CaltechTHESIS:11192010-074900476
Abstract
This thesis examines the use of measured and inferred kinematic information in automatic speech recognition and lipreading, and investigates the relative information content and recognition performance of vowels and consonants. The kinematic information describes the motions of the organs of speech, the articulators. The contributions of this thesis include a new device and set of algorithms for lipreading (their design, construction, implementation, and testing); incorporation of direct articulator-position measurements into a speech recognizer; and reevaluation of some assumptions regarding vowels and consonants. The motivation for including articulatory information is to improve modeling of coarticulation and to reconcile multiple input modalities for lipreading. Coarticulation, a ubiquitous phenomenon, is the process by which speech sounds are modified by preceding and following sounds.

To be useful in practice, a recognizer will have to infer articulatory information from sound, video, or both. Previous work made progress toward recovery of articulation from sound. The present project assumes that such recovery is possible; it examines the advantage of joint acoustic-articulatory representations over acoustic-only ones. Also reported is an approach to recovery from video in which camera placement (side view, head-mounted) and lighting are chosen to robustly obtain lip-motion information.

Joint acoustic-articulatory recognition experiments were performed using the University of Wisconsin X-ray Microbeam Speech Production Database. Speaker-dependent monophone recognizers, based on hidden Markov models, were tested on paragraphs each lasting about 20 seconds. Results were evaluated at the phone level and tabulated by several classes (vowel, stop, and fricative). Measured articulator coordinates were transformed by principal components analysis, and velocity and acceleration were appended.

Concatenating the transformed articulatory information to a standard acoustic (cepstral) representation reduced the error rate by 7.4%, a reduction that was statistically significant across speakers (p = 0.018). Articulation improved recognition of male speakers more than female speakers, and recognition of vowels more than fricatives or stops. The analysis of vowels, stops, and fricatives included both the articulatory recognizer of chapter 3 and other recognizers for comparison. The information content of the different classes was also estimated. Some previous assumptions about recognition performance are shown to be false, and the findings on information content require consonants to be defined to include vowel-like sounds.
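The articulatory feature construction described in the abstract (principal components analysis of measured articulator coordinates, with velocity and acceleration appended) can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the array dimensions, the number of retained components, and the use of finite differences for the velocity and acceleration terms are all assumptions made for the example.

```python
import numpy as np

# Hypothetical data: T frames of articulator coordinates (e.g. x, y positions
# of pellets on the tongue, lips, and jaw, as in the X-ray Microbeam database).
rng = np.random.default_rng(0)
T, D = 200, 16
X = rng.standard_normal((T, D))          # raw articulator trajectories

# Principal components analysis: project the mean-centered data onto the
# top-k eigenvectors of its covariance matrix.
k = 6
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (T - 1)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
pcs = eigvecs[:, ::-1][:, :k]            # top-k principal directions
Z = Xc @ pcs                             # (T, k) transformed coordinates

# Append velocity and acceleration (first and second time differences).
vel = np.gradient(Z, axis=0)
acc = np.gradient(vel, axis=0)
feats = np.hstack([Z, vel, acc])         # (T, 3k) articulatory feature vector
print(feats.shape)                       # (200, 18)
```

In the experiments, feature vectors of this kind were concatenated frame by frame with a standard cepstral representation before being passed to the HMM-based recognizer.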
Item Type: | Thesis (Dissertation (Ph.D.)) |
---|---|
Subject Keywords: | Computation and Neural Systems |
Degree Grantor: | California Institute of Technology |
Division: | Engineering and Applied Science |
Major Option: | Computation and Neural Systems |
Thesis Availability: | Public (worldwide access) |
Research Advisor(s): | |
Thesis Committee: | |
Defense Date: | 14 May 2001 |
Record Number: | CaltechTHESIS:11192010-074900476 |
Persistent URL: | https://resolver.caltech.edu/CaltechTHESIS:11192010-074900476 |
DOI: | 10.7907/9vse-8c78 |
Default Usage Policy: | No commercial reproduction, distribution, display or performance rights in this work are provided. |
ID Code: | 6182 |
Collection: | CaltechTHESIS |
Deposited By: | Rita Suarez |
Deposited On: | 22 Nov 2010 18:13 |
Last Modified: | 19 Apr 2021 22:40 |
Thesis Files
PDF - Final Version, 4MB. See Usage Policy.