DocumentCode
3730521
Title
Differential snapshot algorithms based on Hadoop MapReduce
Author
Wei Du;Xianxia Zou
Author_Institution
Department of Computer Science, Guangdong Police College, Guangzhou 510232, China
fYear
2015
Firstpage
1203
Lastpage
1208
Abstract
Change Data Capture (CDC) from source systems is the first step in the incremental maintenance of data warehouses and business intelligence systems, and is a key component of ETL (Extract, Transform and Load). Several CDC methods are currently available, namely time stamps, differential snapshots, triggers, and archive logs. Differential snapshots do not rely on the implementation mechanism of the information sources and are therefore more universal and adaptable. When computing resources are limited, differential snapshot algorithms based on sort-merge and hash partitioning are sometimes error-prone and inefficient. This paper proposes a low-cost, high-efficiency differential snapshot method that combines an open source database with Hadoop MapReduce. Differential snapshots based on tuple summaries generated by the MD5 algorithm are very effective, but their I/O cost is heavy. The paper therefore proposes an SQL statement that generates the tuple summaries while querying the database, so that only a single I/O pass is needed. We implement the SQL statement on the open source database MySQL. In addition, MapReduce parallel programming is used to find the differences between database files, which improves efficiency and avoids the errors. Experiments compare the performance of the differential snapshot algorithms.
Keywords
"Databases","Particle separators","Algorithm design and analysis","Partitioning algorithms","Data mining","Syntactics","Data warehouses"
Publisher
ieee
Conference_Titel
Fuzzy Systems and Knowledge Discovery (FSKD), 2015 12th International Conference on
Type
conf
DOI
10.1109/FSKD.2015.7382113
Filename
7382113
Link To Document