Shahsavarani, B. S. (2018). Speech Emotion Recognition using Convolutional Neural Networks (Master's thesis, The University of Nebraska-Lincoln).
Automatic speech recognition is an active field of study in artificial intelligence and machine learning that aims to build machines that communicate with people via speech. Speech is an information-rich signal carrying paralinguistic as well as linguistic information, and emotion is one key instance of paralinguistic information conveyed, in part, by speech. Machines that understand paralinguistic cues such as emotion make human-machine communication clearer and more natural. The current study investigates the efficacy of convolutional neural networks (CNNs) in recognizing emotion from speech. Wide-band spectrograms of the speech signals served as the input features of the networks. The networks were trained on speech produced by actors portraying specific emotions, and speech databases in different languages were used to train and evaluate the models. The training data in each database were augmented at two levels, and the dropout technique was used to regularize the networks. The results show that gender-independent, language-independent CNN models achieved state-of-the-art accuracy, outperformed previously reported results in the literature, and matched or even exceeded human performance on the benchmark databases. Future work is warranted to examine the capability of deep learning models for speech emotion recognition on everyday, real-life speech signals.
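The front end of the pipeline described above, converting a speech waveform into a wide-band spectrogram for the CNN, can be sketched as follows. This is a minimal illustration only: the sample rate, window length, and hop length are assumptions for demonstration, not values taken from the thesis (a "wide-band" spectrogram simply means a short analysis window, trading frequency resolution for fine time resolution).

```python
import numpy as np

def wideband_spectrogram(signal, fs=16000, win_ms=5, hop_ms=2):
    """Log-power wide-band spectrogram of a 1-D signal.

    A short window (here 5 ms, an illustrative choice) yields the
    wide-band variant; the result is a (freq_bins, time_frames) array
    suitable as a 2-D input feature for a CNN.
    """
    nperseg = int(fs * win_ms / 1000)   # samples per analysis window
    hop = int(fs * hop_ms / 1000)       # samples between frame starts
    window = np.hanning(nperseg)
    # Slice the signal into overlapping windowed frames.
    frames = [signal[i:i + nperseg] * window
              for i in range(0, len(signal) - nperseg + 1, hop)]
    # Power spectrum of each frame via the real FFT.
    power = np.abs(np.fft.rfft(np.array(frames), axis=1)) ** 2
    # Log compression, with a small floor to avoid log(0).
    return 10 * np.log10(power.T + 1e-10)

# Usage on one second of synthetic audio standing in for speech:
rng = np.random.default_rng(0)
audio = rng.standard_normal(16000)
spec = wideband_spectrogram(audio)   # shape: (41, 498) with these settings
```

In practice each spectrogram would be resized or cropped to a fixed shape before being stacked into training batches for the network.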
Advisor: Stephen D. Scott