Abusive Language and Hate Speech Detection for Javanese and Sundanese Languages in Tweets: Dataset and Preliminary Study
Abstract— As an archipelago with many tribes and local languages, Indonesia exhibits wide variation in communication styles. Every region in Indonesia has its own distinct culture, accents, and languages. These demographic conditions can influence the characteristics of the language used on social media such as Twitter, where Indonesians often use their own local languages to communicate and express their thoughts in tweets. Research on identifying hate speech and abusive language has become an attractive and developing topic, yet research on Indonesian local languages is still rare. This paper analyzes the use of machine learning approaches, namely Naïve Bayes (NB), Support Vector Machine (SVM), and Random Forest Decision Tree (RFDT), in detecting hate speech and abusive language in Sundanese and Javanese as Indonesian local languages. The classifiers were used with several term-weighting features, such as word n-grams and character n-grams. The experiments were evaluated using the F-measure, and the classifiers achieved over 60% for both local languages.
Index Terms— abusive language, hate speech, Twitter, Indonesian local language, Javanese, Sundanese
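As a rough illustration of the feature types and the metric named in the abstract, the following Python sketch extracts word and character n-gram counts from a text and computes the F-measure for one class. It is not the authors' implementation: the function names, the whitespace tokenization, the lowercasing, and the single-class (binary) F-measure are all assumptions made for this example, and the sample text is a neutral placeholder rather than data from the paper.

```python
from collections import Counter


def word_ngrams(text, n):
    """Count word n-grams of a lowercased, whitespace-tokenized text (assumed preprocessing)."""
    tokens = text.lower().split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def char_ngrams(text, n):
    """Count character n-grams of a lowercased text, spaces included."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def f_measure(y_true, y_pred, positive=1):
    """F-measure (harmonic mean of precision and recall) for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined), so report 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    sample = "placeholder tweet text"  # neutral stand-in, not from the dataset
    print(word_ngrams(sample, 2))
    print(char_ngrams(sample, 3).most_common(3))
    # Hypothetical labels: 1 = abusive/hate speech, 0 = neither.
    print(f_measure([1, 1, 0, 0], [1, 0, 1, 0]))
```

In a full pipeline, counts like these would typically be weighted and passed to a classifier such as NB, SVM, or a random forest; the weighting scheme and averaging strategy used in the paper are not specified in the abstract.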
Shofianina Dwi Ananda Putri, Muhammad Okky Ibrohim, Indra Budi
Universitas Indonesia, INDONESIA
Cite: Shofianina Dwi Ananda Putri, Muhammad Okky Ibrohim, Indra Budi, "Abusive Language and Hate Speech Detection for Javanese and Sundanese Languages in Tweets: Dataset and Preliminary Study," Proceedings of 2021 the 11th International Workshop on Computer Science and Engineering (WCSE 2021), pp. 65-69, February 25-27, 2021.