Microblog Text Clustering Based on BK-Means Algorithm
Abstract— In recent years, the growing popularity of social media platforms such as WeChat and Weibo has made communication between people easier. However, social short texts are large in scale, spread quickly, are often of low quality, and come in diverse modalities, so short-text clustering must cope with sparse features, high dimensionality, and noise. Traditional clustering methods based on the vector space model perform poorly on short text data. Building on an improved K-means algorithm, this paper proposes a short-text clustering algorithm named BK-means, which alleviates the effect of data sparseness. First, we preprocess the corpus through word segmentation, stop-word removal, and other operations. Next, we extract biterms and use the biterm topic model (BTM) to model the documents, obtaining the document-topic and topic-word distribution matrices. Finally, we use the proposed BK-means algorithm to cluster the short texts, each represented as a vector. Experiments on short text data from Sina Weibo show that the BK-means-based short-text clustering algorithm outperforms the traditional one, improving both F-measure and purity.
Index Terms— microblog text, BTM, BK-means algorithm, F-measure, purity.
Qianru Li, Xiuliang Mo, Chundong Wang
Tianjin Intelligent Computing and Software New Technology Key Laboratory, School of Computer Science and Engineering, Tianjin University of Technology, China
Cite: Qianru Li, Xiuliang Mo, Chundong Wang, "Microblog Text Clustering Based on BK-Means Algorithm," Proceedings of 2018 the 8th International Workshop on Computer Science and Engineering, pp. 245-250, Bangkok, 28-30 June, 2018.
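To make two of the terms in the abstract concrete, the following minimal Python sketch shows how biterms might be extracted from one preprocessed short text and how clustering purity is commonly computed. It is an illustration under stated assumptions, not the authors' implementation: the function names `extract_biterms` and `purity` are hypothetical, each short text is treated as a single context window, and the BK-means algorithm itself is not reproduced here because the abstract does not specify it.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered word pair (biterm) from one short document.

    Assumption: the whole short text is one context window, and each pair
    is stored in sorted order so (a, b) and (b, a) count as the same biterm.
    """
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def purity(cluster_ids, class_labels):
    """Fraction of documents that belong to the majority class of their cluster."""
    per_cluster = {}
    for cluster, label in zip(cluster_ids, class_labels):
        per_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(class_labels)


# Hypothetical toy data, not taken from the paper's Sina Weibo experiments.
biterms = extract_biterms(["weibo", "text", "cluster"])
score = purity([0, 0, 1, 1], ["a", "a", "a", "b"])
```

A short text with n tokens yields n(n-1)/2 biterms, which is why modeling word pairs across the whole corpus, rather than word counts per document, can ease the sparsity problem the abstract describes.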