Title :
IncMR: Incremental Data Processing Based on MapReduce
Author :
Yan, Cairong ; Yang, Xin ; Yu, Ze ; Li, Min ; Li, Xiaolin
Author_Institution :
Dept. of Comput. Sci. & Technol., Donghua Univ., Shanghai, China
Abstract :
The MapReduce programming model is widely used for large-scale, one-time, data-intensive distributed computing, but it lacks the flexibility and efficiency needed to process small incremental data. This paper proposes the IncMR framework for incrementally processing new data added to a large data set; it takes the saved state as implicit input and combines it with the new data. Map tasks are created only for the new splits rather than for all splits, while reduce tasks fetch their inputs, including the state and the intermediate results of the new map tasks, from designated nodes or local nodes. Data locality is considered one of the main optimization means for job scheduling. IncMR is implemented on Hadoop, is compatible with the original MapReduce interfaces, and is transparent to users. Experiments show that non-iterative algorithms running in the MapReduce framework can be migrated to IncMR directly, without any modification, to obtain efficient incremental and continuous processing. IncMR is competitive and, in all studied cases, runs faster than reprocessing the entire data set.
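The idea of combining saved state with newly arrived data can be illustrated with a minimal single-process sketch. This is not the authors' Hadoop implementation and uses no Hadoop API: the function names (map_words, reduce_counts, run_full, run_incremental) and the word-count workload are hypothetical, chosen only for illustration. The sketch also assumes a reduce function whose earlier output can be merged with new intermediate results, as summation can.

```python
from collections import defaultdict

def map_words(split):
    """Map phase: emit (word, 1) for every word in one input split."""
    return [(word, 1) for word in split.split()]

def reduce_counts(pairs, state=None):
    """Reduce phase: sum values per key, starting from saved state if given."""
    totals = defaultdict(int, state or {})
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def run_full(splits):
    """Baseline: map every split and reduce from scratch."""
    pairs = [pair for split in splits for pair in map_words(split)]
    return reduce_counts(pairs)

def run_incremental(state, new_splits):
    """Incremental run: map only the new splits, then fold them into the state."""
    new_pairs = [pair for split in new_splits for pair in map_words(split)]
    return reduce_counts(new_pairs, state)

# Hypothetical data: two splits processed earlier, one split that arrives later.
old_splits = ["a b a", "b c"]
new_splits = ["c c d"]

state = run_full(old_splits)                     # saved result of the first job
incremental = run_incremental(state, new_splits)
full = run_full(old_splits + new_splits)

print(incremental == full)
```

In this toy case the incremental run maps one split instead of three and still produces the same counts as reprocessing everything, which is the property the abstract describes for non-iterative algorithms.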
Keywords :
data mining; distributed programming; scheduling; Hadoop; IncMR framework; Map tasks; MapReduce framework; MapReduce interfaces; MapReduce programming model; continuous processing; data locality; incremental data processing; job scheduling; local nodes; noniterative algorithms; one-time data-intensive distributed computing; Algorithm design and analysis; Computational modeling; Data models; Data processing; Distributed databases; Parallel processing; Programming; Compatible; Data locality; Incremental data processing; MapReduce; State;
Conference_Title :
2012 IEEE 5th International Conference on Cloud Computing (CLOUD)
Conference_Location :
Honolulu, HI
Print_ISBN :
978-1-4673-2892-0
DOI :
10.1109/CLOUD.2012.67