Data driven production models for speech processing

Citation

Roweis, Sam T. (1999) Data driven production models for speech processing. Dissertation (Ph.D.), California Institute of Technology. doi:10.7907/DP55-8897. https://resolver.caltech.edu/CaltechETD:etd-02272008-093303

Abstract

When difficult computations are to be performed on sensory data it is often advantageous to employ a model of the underlying process which produced the observations. Because such generative models capture information about the set of possible observations, they can help to explain complex variability naturally present in the data and are useful in separating signal from noise. In the case of neural and artificial sensory processing systems generative models are learned directly from environmental input although they are often rooted in the underlying physics of the modality involved. One effective use of learned models is made by performing model inversion or state inference on incoming observation sequences to discover the underlying state or control parameter trajectories which could have produced them. These inferred states can then be used as inputs to a pattern recognition or pattern completion module. In the case of human speech perception and production, the models in question are called articulatory models and relate the movements of a talker's mouth to the sequence of sounds produced. Linguistic theories and substantial psychophysical evidence argue strongly that articulatory model inversion plays an important role in speech perception and recognition in the brain. Unfortunately, despite potential engineering advantages and evidence for being part of the human strategy, such inversion of speech production models is absent in almost all artificial speech processing systems. This dissertation presents a series of experiments which investigate articulatory speech processing using real speech production data from a database containing simultaneous audio and mouth movement recordings. I show that it is possible to learn simple low dimensionality models which accurately capture the structure observed in such real production data. 
I discuss how these models can be used to learn a forward synthesis system which generates spectral sequences from articulatory movements. I also describe an inversion algorithm which estimates movements from an acoustic signal. Finally, I demonstrate the use of articulatory movements, both true and recovered, in a simple speech recognition task, showing the possibility of doing true articulatory speech recognition in artificial systems.

Item Type:	Thesis (Dissertation (Ph.D.))
Subject Keywords:	Computation and Neural Systems
Degree Grantor:	California Institute of Technology
Division:	Engineering and Applied Science
Major Option:	Computation and Neural Systems
Thesis Availability:	Public (worldwide access)
Research Advisor(s):	Abu-Mostafa, Yaser S. (advisor) Hopfield, John J. (advisor) Perona, Pietro (advisor)
Thesis Committee:	Unknown, Unknown
Defense Date:	13 January 1999
Record Number:	CaltechETD:etd-02272008-093303
Persistent URL:	https://resolver.caltech.edu/CaltechETD:etd-02272008-093303
DOI:	10.7907/DP55-8897
Default Usage Policy:	No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code:	786
Collection:	CaltechTHESIS
Deposited By:	Imported from ETD-db
Deposited On:	12 Mar 2008
Last Modified:	08 Nov 2023 00:44

Thesis Files

PDF (Roweis_st_1999.pdf) - Final Version
See Usage Policy.
13MB
