Speech Emotion Recognition using Convolutional Neural Networks and Recurrent Neural Networks with Attention Model
Abstract— Speech emotion recognition is an essential step in advanced human-computer speech interaction. Most researchers process the entire speech sequence without specifically handling emotionally irrelevant speech frames. In this study, a novel deep recognition framework is proposed that uses an attention mechanism to focus on speech segments with salient emotion. The framework involves two stages. In the first stage, an unconstrained sparse auto-encoder is used to learn the convolution kernels, and local salient features are extracted using convolutional neural networks (CNNs). In the second stage, the local salient features are aggregated into a high-level representation using a bidirectional long short-term memory (BLSTM) network with an attention model. Experimental results on data sets in different languages show that the framework achieves higher accuracy, outperforming conventional methods by about 12.32%.
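The abstract's second stage aggregates frame-level features into a single utterance representation via attention. The excerpt does not give the attention equations, so the sketch below uses a common formulation (a tanh scoring layer followed by a softmax over frames); the matrices `W` and `w` and the random stand-in for BLSTM outputs are illustrative assumptions, not the paper's exact parameters.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool(H, W, w):
    """Aggregate frame features H (T x d) into one utterance vector.

    Sketch of attention pooling: score each frame, normalize the
    scores into weights, and take the weighted sum of frames.
    """
    scores = np.tanh(H @ W.T) @ w      # per-frame salience scores, shape (T,)
    alpha = softmax(scores)            # attention weights, sum to 1
    return alpha @ H, alpha            # (d,) utterance vector, (T,) weights

rng = np.random.default_rng(0)
T, d, k = 8, 16, 8                     # frames, feature dim, attention dim (assumed)
H = rng.standard_normal((T, d))        # stand-in for BLSTM frame outputs
W = rng.standard_normal((k, d))        # hypothetical learned projection
w = rng.standard_normal(k)             # hypothetical learned score vector
vec, alpha = attention_pool(H, W, w)
print(vec.shape, alpha.shape)          # → (16,) (8,)
```

Frames with higher scores contribute more to the utterance vector, which is how an attention model can emphasize emotionally salient segments over irrelevant ones.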
Index Terms— convolutional neural networks, bidirectional long short-term memory, auto-encoder, attention model, speech emotion recognition
Xi He, Liyong Ren, Yongbin He
University of Electronic Science and Technology of China, CHINA
Cite: Xi He, Liyong Ren, Yongbin He, "Speech Emotion Recognition using Convolutional Neural Networks and Recurrent Neural Networks with Attention Model," Proceedings of the 2019 9th International Workshop on Computer Science and Engineering, pp. 295-301, Hong Kong, 15-17 June, 2019.