WCSE 2017
ISBN: 978-981-11-3671-9 DOI: 10.18178/wcse.2017.06.006

A Sampling Strategy for Skewed Data Problem in MapReduce

Cheng Wenjuan, Tong Bing, Zhou Miaomiao, Zhu Junhong

Abstract— As an efficient and reliable parallel computing model, MapReduce is widely used across many fields. However, when MapReduce processes skewed data, the efficiency of the whole cluster is reduced, and load imbalance among reducer nodes often occurs once the map-stage results have been assigned. This paper uses a reservoir sampling algorithm, which samples every record with equal probability even when the data set is skewed and its size is unknown. From the sample we estimate the frequency of each key in the overall data, and then reallocate tasks among the processing nodes to achieve load balance. A comparison with the traditional sampling strategy shows experimentally that the proposed method is more effective on skewed data sets, and that its advantage becomes more pronounced as the data set grows.

Index Terms— MapReduce, Skewed data, Reservoir sampling, Load balancing
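
The pipeline outlined in the abstract (sample, estimate key frequencies, reallocate) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the reservoir step is the standard Algorithm R, while the function names, the linear scaling of sample counts, and the greedy largest-key-first assignment to reducers are all assumptions made here for illustration, since the abstract does not specify the reallocation rule.

```python
import random
from collections import Counter


def reservoir_sample(stream, k, rng=None):
    """Algorithm R: uniform sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def estimate_key_counts(sample, total):
    """Scale sample counts up to the full data size (assumed linear scaling)."""
    if not sample:
        return {}
    scale = total / len(sample)
    return {key: count * scale for key, count in Counter(sample).items()}


def assign_keys(est_counts, num_reducers):
    """Hypothetical balancing rule: heaviest key first, to the lightest reducer."""
    loads = [0.0] * num_reducers
    assignment = {}
    for key, count in sorted(est_counts.items(), key=lambda kv: -kv[1]):
        target = loads.index(min(loads))
        assignment[key] = target
        loads[target] += count
    return assignment, loads
```

In a real MapReduce job, a mapping such as `assignment` would typically feed a custom partitioner; a single key that dominates the data would still overload one reducer unless it is split further, which this sketch does not attempt.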

Cheng Wenjuan, Tong Bing, Zhou Miaomiao
School of Computer and Information, Hefei University of Technology, CHINA
Zhu Junhong
School of Management, Hefei University of Technology, CHINA


Cite: Cheng Wenjuan, Tong Bing, Zhou Miaomiao, Zhu Junhong, "A Sampling Strategy for Skewed Data Problem in MapReduce," Proceedings of 2017 the 7th International Workshop on Computer Science and Engineering, pp. 37-41, Beijing, 25-27 June, 2017.