DocumentCode :
252322
Title :
Big data topic modeling with mahout for managing business analysis services
Author :
Romsaiyud, W.
Author_Institution :
Walisa Romsaiyud is with the Grad. Sch. of Inf. Technol., Siam Univ., Bangkok, Thailand
fYear :
2014
fDate :
13-15 Dec. 2014
Firstpage :
514
Lastpage :
519
Abstract :
Topic modeling for big data gives data-driven businesses a way to deliver genuine value to business users by simplifying search and summarization across vast amounts of information. Many businesses already use the Hadoop paradigm to apply computational processing rapidly, merging data from several operational systems and analyzing large volumes of multi-structured data. In this paper, we extend the collapsed variational Bayesian (CVB) inference algorithm for Latent Dirichlet Allocation (LDA) to discover hidden topical patterns through statistical regularities and to eliminate noise on the Hadoop framework. The approach captures the evolution of topics in a sequentially organized corpus of documents in two main phases, mapping and reducing. In the mapping phase, the probability of each word in the collected documents is calculated in the collapsed space of latent variables and parameters in order to summarize the words in each topic. The reducing phase combines the results of the map phase to predict a new topic model from the given trained models. Experiments on the Reuters-21578 text categorization collection, run on a 64-node Hadoop cluster, aim to make the computation more efficient and accurate.
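The two phases described in the abstract can be illustrated with a small single-machine sketch. This is not the paper's code and not Mahout's API: the function names `map_document` and `reduce_counts`, the hyperparameters `alpha` and `eta`, and the simplified CVB0-style update are all assumptions made for illustration. Each map call soft-assigns the words of one document to topics using the previous model, and the reduce step sums those expected counts into a new topic-word model.

```python
def map_document(word_counts, topic_word, alpha=0.1, inner_iters=10):
    """Map step (hypothetical): expected topic counts for one document.

    word_counts: dict {word_id: count} for a single document.
    topic_word:  list of K rows, each a list of V positive counts
                 (the model from the previous pass).
    Returns a dict {(topic, word_id): expected_count}.
    """
    num_topics = len(topic_word)
    # p(word | topic) from the previous model; each row is normalised.
    phi = []
    for row in topic_word:
        row_total = sum(row)
        phi.append([c / row_total for c in row])

    total = sum(word_counts.values())
    doc_topic = [total / num_topics] * num_topics  # expected tokens per topic
    gamma = {}
    for _ in range(inner_iters):
        for w in word_counts:
            # Responsibility of topic k for word w: (n_dk + alpha) * p(w | k)
            scores = [(doc_topic[k] + alpha) * phi[k][w]
                      for k in range(num_topics)]
            norm = sum(scores)
            gamma[w] = [s / norm for s in scores]
        # Re-estimate the document's expected topic counts.
        doc_topic = [sum(word_counts[w] * gamma[w][k] for w in word_counts)
                     for k in range(num_topics)]

    # Emit key/value pairs, as a mapper would.
    return {(k, w): word_counts[w] * gamma[w][k]
            for w in word_counts for k in range(num_topics)}


def reduce_counts(partials, num_topics, vocab_size, eta=0.01):
    """Reduce step (hypothetical): sum mapper outputs into a new model.

    eta is a smoothing value added to every cell so no count is zero.
    """
    model = [[eta] * vocab_size for _ in range(num_topics)]
    for partial in partials:
        for (k, w), value in partial.items():
            model[k][w] += value
    return model
```

Running `map_document` over every document and passing the results to `reduce_counts` gives one training pass; repeating the pass with the new model as input corresponds to iterating the job. Because each word's responsibilities sum to one, the total mass emitted by a mapper equals the number of tokens in its document.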
Keywords :
Big Data; belief networks; business data processing; distributed processing; document handling; inference mechanisms; pattern clustering; CVB; Hadoop clustering; Hadoop framework; Hadoop paradigm; LDA; Mahout; Reuters-21578 text categorization collection corpus; big data topic modeling; business analysis service management; collapsed variational Bayesian inference algorithm; computational processing; data-driven businesses; hidden topical patterns; latent dirichlet allocation; map phase; multistructured data; operational systems; sequentially organized documents corpus; statistical regularities; Analytical models; Bayes methods; Data models; Hidden Markov models; Inference algorithms; Probabilistic logic; Text categorization;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
System Integration (SII), 2014 IEEE/SICE International Symposium on
Conference_Location :
Tokyo
Print_ISBN :
978-1-4799-6942-5
Type :
conf
DOI :
10.1109/SII.2014.7028092
Filename :
7028092