DocumentCode
2111183
Title
Distributed log information processing with Map-Reduce: A case study from raw data to final models
Author
Luo, Mingyue ; Liu, Gang
Author_Institution
Sch. of Electron. Eng., Beijing Univ. of Posts & Telecommun., Beijing, China
fYear
2010
fDate
17-19 Dec. 2010
Firstpage
1143
Lastpage
1146
Abstract
With the rapid development of the Internet, e-commerce websites now routinely work with log datasets of up to a few terabytes in size. Removing messy data promptly and at low cost, and extracting useful information from what remains, is a problem we have to face. The mining process involves several steps, from pre-processing the raw data to establishing the final models. In this paper we describe our method for solving the problem with Map-Reduce. Hadoop is an open-source Map-Reduce implementation for reliable, scalable, distributed computing. Several applications that we propose (data extraction, a sum operation, a join operation, and a clustering algorithm) are implemented on Hadoop. They can be applied to data pre-processing and to detecting users with the same interests. In particular, we focus on clustering algorithms. A clustering algorithm that integrates SOM (Self-Organized Map) and fuzzy logic is combined with Map-Reduce; we call it MRSF. With a Hadoop cluster, large MRSF computing jobs can be accommodated easily by adding more nodes or computers to the cluster. Our experiment shows that MRSF scales well and can efficiently process and analyze extremely large datasets.
Keywords
Internet; Web sites; data analysis; data mining; distributed processing; electronic commerce; fuzzy logic; pattern clustering; public domain software; self-organising feature maps; Hadoop; Internet; clustering algorithm; data extraction; data mining; distributed computing; distributed log information processing; e-commerce Website; fuzzy logic; join operation; log dataset; map-reduce; open-source software; raw data processing; self-organized map; sum operation; user interest detection; Clustering algorithms; Computational modeling; Computers; Data mining; Data models; Distributed databases; Training; Distributed Data Mining; Map-Reduce; data pre-processing; join operation;
fLanguage
English
Publisher
ieee
Conference_Titel
Information Theory and Information Security (ICITIS), 2010 IEEE International Conference on
Conference_Location
Beijing
Print_ISBN
978-1-4244-6942-0
Type
conf
DOI
10.1109/ICITIS.2010.5689760
Filename
5689760