WCSE 2023
ISBN: 978-981-18-7950-0 DOI: 10.18178/wcse.2023.06.013

Lip Shape Classification of Sounds for Speech Therapy using SlowFast Networks

Penpicha Boonsri, Salita Eiamboonsert, Punnarai Siricharoen

Abstract—Patients with aphasia and dysarthria need speech and language therapy to practice breathing exercises, tongue strengthening exercises, and especially speech sounds such as short vowel sounds. A therapist must monitor these exercises to ensure that the pronunciation is clear and the mouth shape is correct. We propose an automated method using convolutional networks to identify the mouth motion during pronunciation of the 9 short vowel sounds required for speech exercises in the Thai language. First, videos of vowel sound pronunciation are captured and preprocessed with the Dlib library to crop only the mouth area. The cropped image sequence is then fed into audiovisual SlowFast Networks, which are based on convolutional networks and have Slow and Fast visual pathways to capture the spatial and temporal information of a video. We compared our selected model with TimeSformer, a state-of-the-art transformer-based model. Our proposed framework using SlowFast networks achieved an average accuracy of 97.3% for 9-class video classification of Thai vowel sounds. This result shows that the proposed framework has potential as a tool for speech sound self-exercises and therapy.
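
The two preprocessing ideas in the abstract, mouth cropping and the two-pathway frame sampling, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes facial landmarks are already available in dlib's 68-point layout, where indices 48 to 67 outline the mouth, and that the Slow pathway samples every `alpha`-th frame of the Fast pathway's input. The function names, the `margin` parameter, and the value `alpha=4` are hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, int]

# Assumption: landmarks follow dlib's 68-point layout,
# in which indices 48..67 outline the mouth.
MOUTH_START, MOUTH_END = 48, 68


def mouth_box(landmarks: Sequence[Point], margin: int,
              width: int, height: int) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) around the mouth, clamped to the frame."""
    mouth = landmarks[MOUTH_START:MOUTH_END]
    xs = [x for x, _ in mouth]
    ys = [y for _, y in mouth]
    left = max(min(xs) - margin, 0)
    top = max(min(ys) - margin, 0)
    right = min(max(xs) + margin, width)
    bottom = min(max(ys) + margin, height)
    return left, top, right, bottom


def split_pathways(frames: Sequence, alpha: int = 4) -> Tuple[List, List]:
    """Split a clip into (slow, fast) inputs.

    The Fast pathway keeps every frame; the Slow pathway keeps every
    alpha-th frame. alpha=4 is an illustrative value, not one from the paper.
    """
    if alpha < 1:
        raise ValueError("alpha must be a positive integer")
    fast = list(frames)
    slow = fast[::alpha]
    return slow, fast
```

With a 32-frame clip and `alpha=4`, `split_pathways` yields 8 frames for the Slow pathway and all 32 for the Fast pathway. The box returned by `mouth_box` would be used to crop each frame before sampling.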

Index Terms—Deep Learning, Speech Therapy, Stroke, SlowFast, TimeSformer, Video Classification

Penpicha Boonsri
Chulalongkorn University, THAILAND
Salita Eiamboonsert
King Mongkut's University of Technology Thonburi, THAILAND
Punnarai Siricharoen
Chulalongkorn University, THAILAND


Cite: Penpicha Boonsri, Salita Eiamboonsert, Punnarai Siricharoen, "Lip Shape Classification of Sounds for Speech Therapy using SlowFast Networks," Proceedings of 2023 the 13th International Workshop on Computer Science and Engineering (WCSE 2023), pp. 83-87, June 16-18, 2023.