Roweis, Sam T. (1999) Data driven production models for speech processing. Dissertation (Ph.D.), California Institute of Technology. http://resolver.caltech.edu/CaltechETD:etd-02272008-093303
When difficult computations are to be performed on sensory data, it is often advantageous to employ a model of the underlying process that produced the observations. Because such generative models capture information about the set of possible observations, they can help to explain complex variability naturally present in the data and are useful in separating signal from noise. In the case of neural and artificial sensory processing systems, generative models are learned directly from environmental input, although they are often rooted in the underlying physics of the modality involved. One effective use of learned models is to perform model inversion or state inference on incoming observation sequences to discover the underlying state or control parameter trajectories that could have produced them. These inferred states can then be used as inputs to a pattern recognition or pattern completion module. In the case of human speech perception and production, the models in question are called articulatory models and relate the movements of a talker's mouth to the sequence of sounds produced. Linguistic theories and substantial psychophysical evidence argue strongly that articulatory model inversion plays an important role in speech perception and recognition in the brain. Unfortunately, despite potential engineering advantages and evidence for being part of the human strategy, such inversion of speech production models is absent in almost all artificial speech processing systems. This dissertation presents a series of experiments that investigate articulatory speech processing using real speech production data from a database containing simultaneous audio and mouth movement recordings. I show that it is possible to learn simple, low-dimensional models that accurately capture the structure observed in such real production data.
I discuss how these models can be used to learn a forward synthesis system that generates spectral sequences from articulatory movements. I also describe an inversion algorithm that estimates movements from an acoustic signal. Finally, I demonstrate the use of articulatory movements, both true and recovered, in a simple speech recognition task, showing the possibility of doing true articulatory speech recognition in artificial systems.
Item Type: Thesis (Dissertation (Ph.D.))
Subject Keywords: Computation and Neural Systems
Degree Grantor: California Institute of Technology
Division: Engineering and Applied Science
Major Option: Computation and Neural Systems
Thesis Availability: Public (worldwide access)
Defense Date: 13 January 1999
Default Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
Deposited By: Imported from ETD-db
Deposited On: 12 Mar 2008
Last Modified: 26 Dec 2012 02:32