Predictive Analytics on High-Dimensional Big Data using Principal Component Regression (PCR)
Abstract— The increasing volume, format complexity, and delivery speed of “Big Data” from diverse application domains have exceeded the capabilities of traditional data management tools and technologies. Classical data analysis methods and algorithms need to be redesigned for parallel and distributed architectures so that they scale with data that is large not only in the number of samples but also in the number of dimensions. High-dimensional big datasets, which are both wide (many dimensions) and tall (many samples), pose particular challenges for extracting useful value. Principal Component Analysis (PCA) is an important machine learning algorithm for dimensionality reduction of highly correlated large-scale data. In this system, we will apply PCA to select the regressors for a multiple linear regression model, an approach called Principal Component Regression (PCR), with the aim of selecting effective and efficient features or dimensions for high-dimensional big data analytics. Additionally, we will develop a parallel and distributed version of PCA as the preliminary machine learning step for the multiple linear regression model, implemented on two widely used scalable distributed platforms, disk-based MapReduce and memory-based Spark, to address the scalability issue of big data. Large-scale OpenStreetMap (OSM) data, which offers real-world data relevant to the GIS market and the spatial domain, will be used for experimentation with the system.
Index Terms— Big Data, High-Dimensional, Principal Component Analysis, Multiple Linear Regression, Principal Component Regression, MapReduce, Spark, OpenStreetMap (OSM)
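The PCR idea summarized in the abstract (PCA to derive regressors, then multiple linear regression on them) can be illustrated with a minimal single-machine sketch. This is not the authors' MapReduce or Spark implementation. The function names `pcr_fit` and `pcr_predict`, the choice of `k`, and the synthetic data are all assumptions made for illustration only.

```python
import numpy as np

def pcr_fit(X, y, k):
    """Fit PCR: regress y on the first k principal component scores of X.

    Hypothetical helper for illustration; returns coefficients expressed
    in the original feature space, plus an intercept.
    """
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean                      # center the features before PCA
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V_k = Vt[:k].T                       # (n_features, k) loading matrix
    Z = Xc @ V_k                         # component scores used as regressors
    # Ordinary least squares of the centered response on the scores.
    gamma, *_ = np.linalg.lstsq(Z, y - y_mean, rcond=None)
    beta = V_k @ gamma                   # map back to original features
    intercept = y_mean - x_mean @ beta
    return beta, intercept

def pcr_predict(X, beta, intercept):
    """Predict with coefficients returned by pcr_fit."""
    return X @ beta + intercept

# Synthetic, highly correlated data (assumed): 10 features driven by 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 10)) + 0.01 * rng.normal(size=(200, 10))
y = latent @ np.array([1.5, -2.0]) + 0.1 * rng.normal(size=200)

beta, intercept = pcr_fit(X, y, k=2)
y_hat = pcr_predict(X, beta, intercept)
r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
```

Because the ten features here are driven by two latent factors, two components are enough to retain most of the predictive signal. On real data the number of components would have to be chosen, for example by cross-validation.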
Kyi Lai Lai Khine
Cloud Computing Lab, University of Computer Studies, MYANMAR
Thi Thi Soe Nyunt
Faculty of Computer Science, University of Computer Studies, MYANMAR
Cite: Kyi Lai Lai Khine, Thi Thi Soe Nyunt, "Predictive Analytics on High-Dimensional Big Data using Principal Component Regression (PCR)," Proceedings of 2019 the 9th International Workshop on Computer Science and Engineering WCSE_2019_SPRING, pp. 148-153, Yangon, Myanmar, February 27-March 1, 2019.