WCSE 2017
ISBN: 978-981-11-3671-9 DOI: 10.18178/wcse.2017.06.025

Kraken: A Continuous Incremental Data Acquisition System for GitHub and Git Repositories

Lingbin Zeng, Gang Yin, Tao Wang, Yue Yu, Qiang Fan, Zhi-Xing Li, Jie Yu, H. M. Wang

Abstract— With the rapid development of open source software, a large quantity of software is being produced in the open source community (OSC) [1]. Much research has been launched to study the internal patterns of OSC [2], [3]. GitHub is one of the best-known open source communities and hosts thousands of software projects. As a result, GitHub holds massive and abundant data on software development activities. To offer an accurate and efficient dataset of GitHub, this paper proposes Kraken, a continuous incremental data acquisition system for GitHub. Kraken contains three main modules that are independent of each other. Kraken obtains GitHub data in two ways: from git repositories and through the REST API. The final results show that Kraken can extract the commit information of git repositories and retrieve pull requests (PRs) and issues through the REST API. The commit information contains the detailed development history of the software, while PRs and issues capture the feedback and insight of software engineers.

Index Terms— GitHub, open source software, data extraction, REST API
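
The two acquisition paths named in the abstract (git repositories and the REST API) can be illustrated with a minimal sketch. This is not Kraken's implementation, which the abstract does not describe; it only assumes that "incremental" means fetching what changed since a remembered point. The function names (`parse_git_log`, `new_commits`, `issues_url`), the field separator, and the idea of storing a last-seen commit SHA and timestamp are all hypothetical. The sketch relies on `git log` with a revision range and on the `since` parameter of GitHub's issue-listing endpoint, which also returns pull requests.

```python
import subprocess
from urllib.parse import urlencode

# Assumption: the ASCII unit separator does not occur in commit subjects.
FIELD_SEP = "\x1f"


def parse_git_log(raw):
    """Turn `git log` output (one commit per line) into a list of dicts."""
    commits = []
    for line in raw.splitlines():
        if not line.strip():
            continue
        sha, author, date, subject = line.split(FIELD_SEP, 3)
        commits.append(
            {"sha": sha, "author": author, "date": date, "subject": subject}
        )
    return commits


def new_commits(repo_path, last_sha=None):
    """Commits reachable from HEAD but not from last_sha (all if None)."""
    rev_range = f"{last_sha}..HEAD" if last_sha else "HEAD"
    # %H = hash, %an = author name, %aI = ISO 8601 author date, %s = subject
    fmt = FIELD_SEP.join(["%H", "%an", "%aI", "%s"])
    out = subprocess.run(
        ["git", "-C", repo_path, "log", f"--pretty=format:{fmt}", rev_range],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return parse_git_log(out)


def issues_url(owner, repo, since=None, page=1):
    """REST URL for issues and PRs updated at or after `since` (ISO 8601)."""
    params = {"state": "all", "per_page": 100, "page": page}
    if since:
        params["since"] = since
    return f"https://api.github.com/repos/{owner}/{repo}/issues?{urlencode(params)}"
```

A collector built on this sketch would store the newest commit SHA and the latest update timestamp after each run, then pass them as `last_sha` and `since` on the next run so that only new data is transferred.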

Lingbin Zeng, Gang Yin, Tao Wang, Yue Yu, Qiang Fan, Zhi-Xing Li, Jie Yu, H. M. Wang
National Laboratory for Parallel and Distributed Processing, National University of Defense Technology, CHINA

Cite: Lingbin Zeng, Gang Yin, Tao Wang, Yue Yu, Qiang Fan, Zhi-Xing Li, Jie Yu, H. M. Wang, "Kraken: A Continuous Incremental Data Acquisition System for GitHub and Git Repositories," Proceedings of 2017 the 7th International Workshop on Computer Science and Engineering, pp. 144-149, Beijing, 25-27 June, 2017.