Research on the Effect of Different Speech Segment Lengths on Speech Emotion Recognition Based on LSTM

WCSE 2019 SUMMER ISBN: 978-981-14-1684-2
DOI: 10.18178/wcse.2019.06.073

Zheng Liu, Fuji Ren, and Xin Kang

Abstract— The emergence and development of deep learning has made speech emotion recognition increasingly important. For neural network sequence models, speech segments of different lengths carry different amounts of information and therefore affect the model differently. However, there is no well-founded explanation of how speech should be segmented for input. In this work, we used the CASIA Chinese Emotional Corpus and divided it into 5 groups, each with a different segment length between 100 and 500 ms. Using features extracted with the OpenSmile toolkit, we calculated the standard deviation of each group and found that features of the same dimension are very evenly distributed in the 200 ms segments. We then trained an LSTM model with the different features as input and statistically analyzed the results, which verified that 200 ms is the most suitable input length for the sequence model.

Index Terms— speech emotion recognition, LSTM, CASIA, feature analysis
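
The segment-length comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the 16 kHz sample rate, the non-overlapping split, the exact lengths of 100, 200, 300, 400, and 500 ms, the synthetic noise signal, and the use of RMS energy in place of OpenSmile features are all assumptions made for illustration, and the helper names `split_into_segments` and `rms_energy` are hypothetical.

```python
import numpy as np


def split_into_segments(signal, sample_rate, segment_ms):
    """Cut a 1-D signal into non-overlapping segments of segment_ms milliseconds.

    Trailing samples that do not fill a whole segment are dropped.
    """
    seg_len = int(sample_rate * segment_ms / 1000)
    n_segments = len(signal) // seg_len
    return signal[: n_segments * seg_len].reshape(n_segments, seg_len)


def rms_energy(segments):
    """One stand-in feature value per segment (assumption: replaces OpenSmile features)."""
    return np.sqrt(np.mean(segments ** 2, axis=1))


# Assumption: 2 seconds of synthetic noise at 16 kHz instead of a CASIA utterance.
sample_rate = 16000
rng = np.random.default_rng(0)
signal = rng.standard_normal(2 * sample_rate)

# Standard deviation of the feature across segments, for each segment length.
feature_std = {}
for segment_ms in (100, 200, 300, 400, 500):
    segments = split_into_segments(signal, sample_rate, segment_ms)
    feature_std[segment_ms] = float(np.std(rms_energy(segments)))
    print(f"{segment_ms} ms: {segments.shape[0]} segments, feature std = {feature_std[segment_ms]:.4f}")
```

With real recordings, the same loop would run over every utterance and every feature dimension, so that the per-length standard deviations can be compared across groups.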

Zheng Liu, Fuji Ren, and Xin Kang
School of Information Faculty of Engineering, Tokushima University, JAPAN

Cite: Zheng Liu, Fuji Ren, and Xin Kang, "Research on the Effect of Different Speech Segment Lengths on Speech Emotion Recognition Based on LSTM," Proceedings of 2019 the 9th International Workshop on Computer Science and Engineering, pp. 491-499, Hong Kong, 15-17 June, 2019.
