WCSE 2024 ISBN: 978-981-94-1156-6
DOI: 10.18178/wcse.2024.06.018

Semantic-based Big Data Integration with Apache Spark

Nang Kham Soe, Myat Pwint Phyu

Abstract— Providing a consistent, unified view of all data is challenging in the big data context because each big data store has its own data model and permits a flexible schema. Moreover, advanced technologies (such as Hadoop MapReduce and Apache Spark) are needed to carry out the integration process at big data scale. Although Apache Spark can resolve differences in data models, it cannot by itself integrate data with different schemas. For these reasons, a semantic-based data integration approach is proposed to provide a unified view of data held in different big data stores. The approach generates a schema, in the form of a local ontology, for the data in each big data store and merges the extracted ontologies using the proposed alignment algorithm. The global (integrated) ontology is then converted to a schema using the Spark Dataset API, and that schema is used in the data integration step. The main steps of the proposed system are therefore to align the local ontologies to construct the global ontology, to convert the global (integrated) ontology to a schema in the form of an Apache Spark Dataset, and to integrate the data by applying that schema. The proposed approach is implemented on top of the Apache Spark framework, and the study uses Apache Cassandra and MongoDB as the big data stores. An experimental evaluation is conducted to verify the accuracy of the proposed approach.

Index Terms— Big Data Integration, Apache Spark, NoSQL databases, Ontology
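
The three steps named in the abstract (align local schemas, derive a global schema, apply it to the data) can be illustrated with a minimal plain-Python sketch. It is not the authors' alignment algorithm and does not use Spark: the field names, the hand-written alignment tables, and the helper functions `build_global_schema`, `to_global`, and `integrate` are all hypothetical, and each local ontology is reduced to a simple field-to-type mapping.

```python
from typing import Any, Dict, List, Tuple

Schema = Dict[str, str]      # field name -> type name
Alignment = Dict[str, str]   # local field name -> global concept name

# Hypothetical local schemas standing in for the local ontologies
# extracted from each store.
CASSANDRA_SCHEMA: Schema = {"cust_id": "int", "full_name": "string", "email": "string"}
MONGO_SCHEMA: Schema = {"customerId": "int", "name": "string", "phone": "string"}

# Hypothetical alignment results, written by hand here. In the paper these
# correspondences would come from the proposed alignment algorithm.
CASSANDRA_ALIGNMENT: Alignment = {"cust_id": "customer_id", "full_name": "name", "email": "email"}
MONGO_ALIGNMENT: Alignment = {"customerId": "customer_id", "name": "name", "phone": "phone"}


def build_global_schema(sources: List[Tuple[Schema, Alignment]]) -> Schema:
    """Merge aligned local schemas into one global schema."""
    global_schema: Schema = {}
    for schema, alignment in sources:
        for field, type_name in schema.items():
            concept = alignment[field]
            # Keep the first type seen for a concept; reject conflicting types.
            known = global_schema.setdefault(concept, type_name)
            if known != type_name:
                raise ValueError(f"type conflict for {concept}: {known} vs {type_name}")
    return global_schema


def to_global(record: Dict[str, Any], alignment: Alignment, global_schema: Schema) -> Dict[str, Any]:
    """Rename a record's fields to global concepts; absent concepts become None."""
    row: Dict[str, Any] = {concept: None for concept in global_schema}
    for field, value in record.items():
        row[alignment[field]] = value
    return row


def integrate(
    sources: List[Tuple[List[Dict[str, Any]], Alignment]], global_schema: Schema
) -> List[Dict[str, Any]]:
    """Apply the global schema to every source and concatenate the rows."""
    return [
        to_global(record, alignment, global_schema)
        for records, alignment in sources
        for record in records
    ]


GLOBAL_SCHEMA = build_global_schema(
    [(CASSANDRA_SCHEMA, CASSANDRA_ALIGNMENT), (MONGO_SCHEMA, MONGO_ALIGNMENT)]
)

# Made-up sample records, one per store.
cassandra_rows = [{"cust_id": 1, "full_name": "Aye Aye", "email": "aye@example.com"}]
mongo_rows = [{"customerId": 2, "name": "Ko Ko", "phone": "09-000"}]

integrated = integrate(
    [(cassandra_rows, CASSANDRA_ALIGNMENT), (mongo_rows, MONGO_ALIGNMENT)],
    GLOBAL_SCHEMA,
)
```

In this sketch every integrated row carries all global fields, with `None` where a source has no corresponding attribute. In a Spark implementation the global schema would play the role of the Dataset schema, and the concatenation would be a union of the renamed datasets.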

Nang Kham Soe
University of Information Technology, MYANMAR
Myat Pwint Phyu
University of Information Technology, MYANMAR


Cite: Nang Kham Soe, Myat Pwint Phyu, "Semantic-based Big Data Integration with Apache Spark," 2024 The 14th International Workshop on Computer Science and Engineering (WCSE 2024), pp. 121-126, Phuket Island, Thailand, June 19-21, 2024.