Mining Web Content Outliers for Improving the Quality of Search Results by using Mathematical Approaches
Abstract— The main task of Web mining is to help users retrieve relevant information from the web effectively and efficiently. Irrelevant and duplicated web pages returned during a search lower the quality of search results and increase indexing space and time complexity, so providing high-quality, effective search results is a challenging task. Web content outlier mining focuses on identifying outliers, such as irrelevant and redundant pages, among the web pages of the same category. A mathematical approach, the Statistical Correlation Coefficient Approach combined with the Term Frequency Inverse Document Frequency (TF.IDF) technique and a domain dictionary, is used to remove irrelevant documents. Kendall's Tau rank correlation coefficient is then used to remove redundant web documents and to retrieve ranked, unique web documents. The proposed method yields a higher F1-measure and accuracy than existing methods.
Index Terms— web content outliers, TF.IDF, Statistical correlation coefficient, Kendall's Tau rank correlation
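As a rough illustration of the three measures named in the abstract, the following Python sketch computes TF.IDF weights, a Pearson-style correlation coefficient, and Kendall's Tau (the tau-a variant, which ignores ties). The function names, the length-normalized TF with logarithmic IDF, and the choice of Pearson as the correlation coefficient are assumptions made for illustration only; the abstract does not state the exact formulas or thresholds used by the proposed method.

```python
import math
from collections import Counter
from itertools import combinations


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting: (term count / document length) * log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return weights


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y) if norm_x and norm_y else 0.0


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant pairs - discordant pairs) / total pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical toy corpus: two tokenized pages from the same category.
docs = [["web", "mining", "web"], ["web", "outlier"]]
weights = tf_idf(docs)
```

Under these assumptions, a page whose term-weight vector correlates only weakly with the weights of the domain dictionary terms could be flagged as irrelevant, and a pair of pages whose term rankings give a Tau close to 1 could be flagged as redundant. Any cut-off values would have to be chosen empirically.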
Thinzar Tun
University of Information Technology, MYANMAR
Khin Mo Mo Tun
Faculty of Computing Department, University of Information Technology, MYANMAR
Cite: Thinzar Tun, Khin Mo Mo Tun, "Mining Web Content Outliers for Improving the Quality of Search Results by using Mathematical Approaches," Proceedings of 2019 the 9th International Workshop on Computer Science and Engineering WCSE_2019_SPRING, pp. 154-158, Yangon, Myanmar, February 27-March 1, 2019.