Research on the Effect of Different Speech Segment Lengths on Speech Emotion Recognition Based on LSTM
Abstract— The emergence and development of deep learning has made speech emotion recognition
increasingly important. For neural network sequence models, speech segments of different lengths carry
different amounts of information and therefore affect the model differently. However, there is no
well-founded account of how speech should be segmented for input. In this work, we used the CASIA
Chinese Emotional Corpus and divided it into 5 groups, each with a different segment length between 100
and 500 ms. Using features extracted with the openSMILE toolkit, we calculated the standard deviation for
each group and found that features of the same dimension have a very even distribution in the 200 ms
segments. We then trained an LSTM model with the different features as input and statistically analyzed
the results, which verified that 200 ms is the most suitable segment length for the sequence model.
Index Terms— speech emotion recognition, LSTM, CASIA, feature analysis
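The segmentation and per-group standard deviation analysis described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: the five segment lengths are assumed to be 100, 200, 300, 400, and 500 ms, segments are assumed to be non-overlapping, a simple RMS energy value stands in for the openSMILE features, and the input is a synthetic signal rather than CASIA audio. The function names are hypothetical.

```python
import math
import statistics


def split_segments(samples, sample_rate, segment_ms):
    """Split samples into non-overlapping segments of segment_ms milliseconds.

    A trailing partial segment is dropped (an assumption of this sketch).
    """
    seg_len = sample_rate * segment_ms // 1000
    if seg_len <= 0:
        raise ValueError("segment length is too short for this sample rate")
    last_start = len(samples) - seg_len
    return [samples[i:i + seg_len] for i in range(0, last_start + 1, seg_len)]


def rms_energy(segment):
    """Root-mean-square energy; a stand-in for a single acoustic feature."""
    return math.sqrt(sum(x * x for x in segment) / len(segment))


def feature_std_by_length(samples, sample_rate,
                          lengths_ms=(100, 200, 300, 400, 500)):
    """Return {segment length in ms: std of the feature across segments}."""
    result = {}
    for ms in lengths_ms:
        feats = [rms_energy(s) for s in split_segments(samples, sample_rate, ms)]
        # Population standard deviation of the feature over all segments.
        result[ms] = statistics.pstdev(feats) if feats else float("nan")
    return result


# Synthetic 2-second, 16 kHz amplitude-modulated tone (not real speech).
sample_rate = 16000
signal = [
    math.sin(2 * math.pi * 220 * t / sample_rate)
    * (0.5 + 0.5 * math.sin(2 * math.pi * 1.5 * t / sample_rate))
    for t in range(2 * sample_rate)
]
stds = feature_std_by_length(signal, sample_rate)
for ms, sd in stds.items():
    print(f"{ms} ms segments: feature std = {sd:.4f}")
```

Comparing the resulting values across segment lengths mirrors the kind of comparison the abstract reports, although the numbers from this synthetic signal say nothing about the paper's findings.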
Zheng Liu, Fuji Ren, and Xin Kang
School of Information Faculty of Engineering, Tokushima University, JAPAN
Cite: Zheng Liu, Fuji Ren, and Xin Kang, "Research on the Effect of Different Speech Segment Lengths on Speech Emotion Recognition Based on LSTM," Proceedings of 2019 the 9th International Workshop on Computer Science and Engineering, pp. 491-499, Hong Kong, 15-17 June, 2019.