DocumentCode :
256041
Title :
Distributed pattern matching and document analysis in big data using Hadoop MapReduce model
Author :
Ramya, A.V. ; Sivasankar, E.
Author_Institution :
Dept. of Comput. Sci. & Eng., Nat. Inst. of Technol., Tiruchirappalli, India
fYear :
2014
fDate :
11-13 Dec. 2014
Firstpage :
312
Lastpage :
317
Abstract :
Sequential pattern mining and document analysis are important data mining problems in Big Data, with broad applications. This paper investigates a framework for managing distributed processing in the context of dataset pattern matching and document analysis. The MapReduce programming model on a Hadoop cluster is highly scalable and runs on commodity machines with integrated mechanisms for fault tolerance. We propose Knuth-Morris-Pratt based sequential pattern matching in a distributed environment, using the Hadoop Distributed File System, as an efficient way to mine sequential patterns. The paper also investigates the feasibility of partitioning and clustering text document datasets for document comparison. The approach simplifies the search space and achieves higher mining efficiency. Data mining tasks are decomposed into many map tasks and distributed to many task trackers. The map tasks compute intermediate results and send them to a reduce task, which consolidates the final result. Both theoretical analysis and experimental results, obtained with data and clusters of varying size, show the effectiveness of the MapReduce model, measured primarily by time requirements.
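The map/reduce decomposition described in the abstract can be illustrated with a small single-process sketch. This is not the authors' implementation and does not use Hadoop: the function names (`kmp_search`, `map_task`, `reduce_task`, `distributed_match`), the fixed-size chunking, and the overlap of pattern length minus one between chunks are all assumptions made for illustration.

```python
from collections import defaultdict


def build_failure(pattern):
    """KMP failure table: length of the longest proper prefix that is also a suffix."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return the start index of every (possibly overlapping) match of pattern in text."""
    if not pattern:
        return []
    fail = build_failure(pattern)
    hits, k = [], 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]
    return hits


def map_task(doc_id, offset, chunk, pattern):
    """Hypothetical map step: emit (doc_id, absolute position) for each match in one chunk."""
    return [(doc_id, offset + pos) for pos in kmp_search(chunk, pattern)]


def reduce_task(pairs):
    """Hypothetical reduce step: consolidate match positions per document."""
    merged = defaultdict(set)
    for doc_id, pos in pairs:
        merged[doc_id].add(pos)
    return {doc_id: sorted(positions) for doc_id, positions in merged.items()}


def distributed_match(docs, pattern, chunk_size=8):
    """Simulate the job locally: split each document, run map tasks, then reduce."""
    # Assumption: chunks overlap by len(pattern) - 1 characters so that
    # matches straddling a chunk boundary are not lost.
    overlap = max(len(pattern) - 1, 0)
    pairs = []
    for doc_id, text in docs.items():
        for start in range(0, len(text), chunk_size):
            chunk = text[start:start + chunk_size + overlap]
            pairs.extend(map_task(doc_id, start, chunk, pattern))
    return reduce_task(pairs)


if __name__ == "__main__":
    docs = {"a": "abababab", "b": "xxabxxabab"}
    print(distributed_match(docs, "abab", chunk_size=4))
```

In a real Hadoop job the splitting would be handled by the input format and the map and reduce functions would run on separate nodes; the sketch only shows how per-chunk KMP results can be merged into a per-document answer.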
Keywords :
Big Data; data mining; parallel processing; pattern clustering; pattern matching; text analysis; Hadoop Distributed File System; Hadoop MapReduce model; Hadoop cluster; Knuth Morris Pratt based sequential pattern matching; MapReduce programming model; distributed pattern matching; document analysis; map tasks; sequential pattern mining; task trackers; text document dataset clustering; text document dataset partitioning; Artificial intelligence; Conferences; Grid computing; Nickel; Niobium; Distributed Processing
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Parallel, Distributed and Grid Computing (PDGC), 2014 International Conference on
Conference_Location :
Solan
Print_ISBN :
978-1-4799-7682-9
Type :
conf
DOI :
10.1109/PDGC.2014.7030762
Filename :
7030762