• DocumentCode
    179744
  • Title

    Automatic extraction of topics on big data streams through scalable advanced analysis

  • Author

    Romsaiyud, Walisa

  • Author_Institution
    Grad. Sch. of Inf. Technol., Siam Univ., Bangkok, Thailand
  • fYear
    2014
  • fDate
    July 30 2014-Aug. 1 2014
  • Firstpage
    255
  • Lastpage
    260
  • Abstract
    Extracting words, data patterns and topic models from streaming big data through real-time processing is a challenging task. Currently, many applied machine learning techniques in data mining aim to exploit online feedback by making model updates faster. However, existing solutions such as Mahout and Massive Online Analysis (MOA) do not support streaming machine learning and, consequently, are not suitable for scaling across multiple machines. This paper enhances previously proposed machine learning algorithms for extracting words and generating topic models from continuous data. One of the main advantages of the proposed algorithm is that it can be scaled across multiple machines, which makes it well suited to real-time processing of streaming data. The algorithm comprises two main methods: (a) the first method introduces a principled approach to pre-processing documents in an associated time sequence. It implements a class that detects identical files among the input files so as to reduce computation time. (b) The second method supports real-time monitoring and control of the process across multiple asynchronous text streams. In the experiment, these two methods were executed alternately, and monotonic convergence was guaranteed over the iterations. The experiments were conducted on a real-world dataset collected from the TREC KBA Stream Corpus in 2012. Finally, the results indicate that the proposed method is more robust to noise and reduces computation.
  • Keywords
    data mining; learning (artificial intelligence); MOA; automatic extraction; big data streams; data mining; data patterns; machine learning algorithms; machine learning streaming; machine learning techniques; massive online analysis; real-time processing; scalable advanced analysis; topic models; word extraction; Analytical models; Computer architecture; Data mining; Data models; Distributed databases; Machine learning algorithms; Real-time systems; Big Data; Data Streaming; Machine Learning; Scalable Advanced Massive Online Analysis (SAMOA);
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Science and Engineering Conference (ICSEC), 2014 International
  • Conference_Location
    Khon Kaen
  • Print_ISBN
    978-1-4799-4965-6
  • Type

    conf

  • DOI
    10.1109/ICSEC.2014.6978204
  • Filename
    6978204