• DocumentCode
    2111183
  • Title

    Distributed log information processing with Map-Reduce: A case study from raw data to final models

  • Author

    Luo, Mingyue ; Liu, Gang

  • Author_Institution
    Sch. of Electron. Eng., Beijing Univ. of Posts & Telecommun., Beijing, China
  • fYear
    2010
  • fDate
    17-19 Dec. 2010
  • Firstpage
    1143
  • Lastpage
    1146
  • Abstract
    With the rapid development of the Internet, e-commerce websites now routinely have to work with log datasets of up to a few terabytes in size. How to remove messy data promptly and at low cost, and to extract useful information from what remains, is a problem we have to face. The mining process involves several steps, from pre-processing the raw data to establishing the final models. In this paper we describe our method for solving the problem with Map-Reduce. Hadoop is an open-source Map-Reduce implementation for reliable, scalable, distributed computing. Several applications that we have proposed (data extraction, sum operation, join operation, and a clustering algorithm) are implemented on Hadoop. We apply them to data pre-processing and to detecting users with the same interests. In particular, we focus on clustering algorithms. A clustering algorithm that integrates SOM (Self-Organized Map) and fuzzy logic is combined with Map-Reduce; we call it MRSF. With the help of a Hadoop cluster, large MRSF computing jobs can be accommodated easily by just adding more nodes or computers to the cluster. Our experiments show that MRSF scales well and can efficiently process and analyze extremely large datasets.
  • Keywords
    Internet; Web sites; data analysis; data mining; distributed processing; electronic commerce; fuzzy logic; pattern clustering; public domain software; self-organising feature maps; Hadoop; Internet; clustering algorithm; data extraction; data mining; distributed computing; distributed log information processing; e-commerce Website; fuzzy logic; join operation; log dataset; map-reduce; open-source software; raw data processing; self-organized map; sum operation; user interest detection; Clustering algorithms; Computational modeling; Computers; Data mining; Data models; Distributed databases; Training; Distributed Data Mining; Map-Reduce; data pre-processing; join operation;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information Theory and Information Security (ICITIS), 2010 IEEE International Conference on
  • Conference_Location
    Beijing
  • Print_ISBN
    978-1-4244-6942-0
  • Type

    conf

  • DOI
    10.1109/ICITIS.2010.5689760
  • Filename
    5689760