Department of Computer Science and Engineering


First Advisor

Stephen D. Scott

Date of this Version

Spring 3-16-2018


Shahsavarani, B. S. (2018). Speech Emotion Recognition using Convolutional Neural Networks (Master's thesis, University of Nebraska-Lincoln).


A Thesis Presented to the Faculty of The Graduate College at the University of Nebraska in Partial Fulfillment of Requirements for the Degree of Master of Science, Major: Computer Science, Under the Supervision of Professor Stephen D. Scott. Lincoln, Nebraska: March 2018

Copyright (c) 2018 Somayeh Shahsavarani


Automatic speech recognition is an active field of study in artificial intelligence and machine learning whose aim is to build machines that communicate with people via speech. Speech is an information-rich signal that carries paralinguistic information alongside linguistic information, and emotion is one key instance of paralinguistic information that is, in part, conveyed by speech. Developing machines that understand paralinguistic information, such as emotion, facilitates human-machine communication by making it clearer and more natural. In the current study, the efficacy of convolutional neural networks (CNNs) in recognizing emotions from speech was investigated. Wide-band spectrograms of the speech signals were used as the input features of the networks. The networks were trained on speech signals produced by actors portraying specific emotions. Speech databases in different languages were used to train and evaluate our models, and the training data in each database were augmented at two levels. Dropout was used to regularize the networks. Our results showed that the gender-independent, language-independent CNN models achieved state-of-the-art accuracy, outperformed previously reported results in the literature, and matched or even exceeded human performance on the benchmark databases. Future work is warranted to examine the capability of deep learning models for speech emotion recognition on everyday, real-life speech.
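The abstract names wide-band spectrograms as the CNN input features. A wide-band spectrogram uses a short analysis window (on the order of a few milliseconds), trading frequency resolution for time resolution. The sketch below, in numpy, shows one minimal way such features could be computed from a raw waveform; the sampling rate, window length, and hop size here are illustrative assumptions, not the settings used in the thesis.

```python
import numpy as np

def wideband_spectrogram(signal, sr=16000, win_ms=5, hop_ms=2.5):
    """Magnitude spectrogram with a short (wide-band) analysis window.

    A ~5 ms window gives good time resolution, which is the defining
    property of the 'wide-band' setting. All parameter values are
    illustrative assumptions, not the thesis configuration.
    """
    win = int(sr * win_ms / 1000)   # samples per analysis window
    hop = int(sr * hop_ms / 1000)   # samples between successive windows
    window = np.hanning(win)
    # Slice the signal into overlapping, windowed frames
    frames = [signal[i:i + win] * window
              for i in range(0, len(signal) - win + 1, hop)]
    # One-sided FFT magnitude per frame -> (time, frequency) array
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))

# Example: 1 second of a 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
spec = wideband_spectrogram(np.sin(2 * np.pi * 440 * t), sr)
```

The resulting 2-D array (time frames by frequency bins) is the kind of image-like representation a CNN can consume directly.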
