Department of Computer Science and Engineering


Silent Speech Recognition from Articulatory Motion

Jun Wang, University of Nebraska-Lincoln

A DISSERTATION Presented to the Faculty of The Graduate School at the University of Nebraska in Partial Fulfillment of Requirements for the Degree of Doctor of Philosophy

Major: Computer Science

Under the Supervision of Professors Ashok Samal and Jordan R. Green

Lincoln, Nebraska
November 2011

Copyright (c) 2011 Jun Wang


Silent speech recognition is the process of converting motion data of the articulators (e.g., tongue, lips, and jaw) into text. The primary objective of this dissertation was to develop new approaches for silent speech recognition from segmented and continuous tongue and lip movement data at three levels of speech units of increasing conceptual complexity: phonemes, words, and sentences. At each level, unique theoretical issues were addressed and plans for use in specific applications were described. This dissertation is motivated by the need for (1) speech movement-based treatment options for people with speech and voice impairments and (2) computational approaches for recognizing speech when acoustic data are unavailable or extremely noisy.

Machine learning and statistical shape analysis were used to classify and quantify the articulatory distinctiveness of phonemes, words, and sentences. The approach is unique in that it maps motion data directly to speech units, rather than to intermediate articulatory features. Procrustes analysis, a statistical shape matching approach, provided an index of the articulatory distinctiveness of vowels and consonants, which was used to derive quantitative articulatory vowel and consonant spaces. The derived vowel space resembles long-standing descriptions of the articulatory vowel space. The theoretical properties of these spaces and their practical applications in speech pathology (e.g., tracking motor speech decline in amyotrophic lateral sclerosis) were also discussed. In addition, support vector machines, Procrustes analysis, and Eigenspace approaches were used to classify a set of phonetically balanced words and functional sentences from articulatory motion. The direct mapping approaches resulted in high classification accuracy, adequate for practical applications.
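The core of Procrustes analysis is removing translation, scale, and rotation from two point configurations before measuring their residual shape difference. A minimal sketch of this idea, using NumPy (the function name and its use as a pairwise "distinctiveness" score are illustrative, not the dissertation's exact implementation):

```python
import numpy as np

def procrustes_distance(X, Y):
    """Residual shape distance between two point sets of matched
    landmarks (n points x d dims), after removing translation,
    scale, and rotation -- the Procrustes alignment steps."""
    # Remove translation: center each configuration at the origin.
    X0 = X - X.mean(axis=0)
    Y0 = Y - Y.mean(axis=0)
    # Remove scale: normalize each to unit Frobenius norm.
    X0 = X0 / np.linalg.norm(X0)
    Y0 = Y0 / np.linalg.norm(Y0)
    # Remove rotation: the optimal orthogonal alignment comes from
    # the SVD of the cross-covariance (orthogonal Procrustes problem).
    U, _, Vt = np.linalg.svd(X0.T @ Y0)
    R = U @ Vt
    # What remains after alignment is pure shape difference.
    return np.linalg.norm(X0 @ R - Y0)
```

Applied to matched articulatory landmark trajectories for two phonemes, a larger residual indicates more distinct articulatory shapes; identical shapes differing only in position, size, and orientation score near zero.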

A near-real-time algorithm, Holistic Articulatory Recognition (HAR), was proposed and evaluated for recognizing whole words and sentences from continuous (unsegmented) articulatory motion. The accuracy and speed of HAR demonstrated its potential for practical applications. HAR is based on classification probabilities, so any classifier that can estimate them can be incorporated seamlessly. HAR can serve as the recognition component of an articulation-based silent speech interface that may provide an alternative oral communication modality for persons with speech impairments.
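The classifier-agnostic design can be illustrated with a toy sketch: a recognizer that slides a window over the unsegmented frame stream and emits a label whenever a supplied probability-estimating classifier is sufficiently confident. Everything here (function names, the windowing scheme, the threshold) is a hypothetical illustration of the probability-based idea, not the HAR algorithm itself:

```python
import numpy as np

def recognize_stream(frames, window, step, predict_proba, labels,
                     threshold=0.8):
    """Slide a fixed-size window over unsegmented motion frames.
    predict_proba is ANY classifier returning per-class probabilities
    for a window of frames; a hit is emitted when the top class
    clears the confidence threshold."""
    hits = []
    for start in range(0, len(frames) - window + 1, step):
        probs = predict_proba(frames[start:start + window])
        k = int(np.argmax(probs))
        if probs[k] >= threshold:
            hits.append((start, labels[k], float(probs[k])))
    return hits
```

Because the recognizer only consumes probability vectors, the underlying model (SVM with probability outputs, a Procrustes-distance-based scorer, etc.) can be swapped without changing the recognition loop.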

Advisors: Ashok Samal and Jordan R. Green