On the Use of Data Parallelism Technologies for Implementing Statistical Analysis Functions

WCSE 2024 ISBN: 978-981-94-1156-6
DOI: 10.18178/wcse.2024.06.015

Amirkia Rafiei Oskooei

Abstract— This study presents a comparative analysis of data parallelism technologies for implementing statistical analysis functions using the Apache Spark big data processing framework. As data volume and complexity continue to grow exponentially, selecting the right parallel processing framework is crucial for efficient big data analysis. Through a comprehensive methodology, we evaluate the performance and suitability of Spark's data parallelism capabilities for implementing descriptive, exploratory, and inferential statistical functions. By comparing Apache Spark with Hadoop MapReduce, the study highlights Spark's superior performance, especially in handling complex and iterative analytical tasks. The findings show significant performance gains with Spark, positioning it as the preferred framework for a variety of statistical analysis needs in the big data era, and offer valuable insights for researchers and practitioners looking to optimize their data analysis workflows and leverage the full potential of big data technologies.

Index Terms— Big Data Analytics, Statistical Functions, Apache Spark, Data Parallelism, MapReduce, Parallel Processing

Amirkia Rafiei Oskooei
R&D Center, Intellica Business Intelligence Consultancy, TURKEY
Yildiz Technical University, TURKEY

Cite: Amirkia Rafiei Oskooei, "On the Use of Data Parallelism Technologies for Implementing Statistical Analysis Functions," 2024 The 14th International Workshop on Computer Science and Engineering (WCSE 2024), pp. 94-102, Phuket Island, Thailand, June 19-21, 2024.
