DocumentCode
256041
Title
Distributed pattern matching and document analysis in big data using Hadoop MapReduce model
Author
Ramya, A.V. ; Sivasankar, E.
Author_Institution
Dept. of Comput. Sci. & Eng., Nat. Inst. of Technol., Tiruchirappalli, India
fYear
2014
fDate
11-13 Dec. 2014
Firstpage
312
Lastpage
317
Abstract
Sequential pattern mining and document analysis are important data mining problems in Big Data with broad applications. This paper investigates a framework for managing distributed processing in the context of dataset pattern matching and document analysis. The MapReduce programming model on a Hadoop cluster is highly scalable and runs on commodity machines, with integrated mechanisms for fault tolerance. We propose Knuth-Morris-Pratt-based sequential pattern matching in a distributed environment, using the Hadoop Distributed File System, for efficient mining of sequential patterns. We also investigate the feasibility of partitioning and clustering text document datasets for document comparison. The approach simplifies the search space and achieves higher mining efficiency. Data mining tasks are decomposed into many map tasks and distributed to many task trackers. The map tasks compute intermediate results and send them to the reduce task, which consolidates them into the final result. Both theoretical analysis and experimental results, obtained with data and clusters of varying size, show the effectiveness of the MapReduce model, judged primarily on time requirements.
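The map/reduce decomposition described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not use Hadoop: plain Python functions stand in for map and reduce tasks, and all names (`split_with_overlap`, `map_task`, `reduce_task`, `distributed_search`) are hypothetical. The abstract does not say how matches that straddle split boundaries are handled; overlapping each split by pattern length minus one characters is an assumption made here.

```python
def build_failure(pattern):
    """KMP failure table: fail[i] is the length of the longest proper
    prefix of pattern[:i + 1] that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return the start index of every (possibly overlapping) match."""
    if not pattern:
        return []
    fail = build_failure(pattern)
    hits, k = [], 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]  # keep scanning for overlapping matches
    return hits


def split_with_overlap(text, chunk_size, pattern_len):
    """Yield (offset, chunk) pairs. Each chunk runs pattern_len - 1
    characters into the next one (an assumption of this sketch) so a
    match crossing a boundary is still seen by exactly one map task."""
    overlap = pattern_len - 1
    for start in range(0, len(text), chunk_size):
        yield start, text[start:start + chunk_size + overlap]


def map_task(split, pattern):
    """Map: search one split and emit global match positions."""
    offset, chunk = split
    return [offset + i for i in kmp_search(chunk, pattern)]


def reduce_task(partials):
    """Reduce: consolidate the per-split results into one sorted list."""
    return sorted(pos for part in partials for pos in part)


def distributed_search(text, pattern, chunk_size):
    """Run the map tasks sequentially, then the single reduce task."""
    splits = split_with_overlap(text, chunk_size, len(pattern))
    return reduce_task(map_task(s, pattern) for s in splits)
```

In this sketch the map tasks run one after another; on a real cluster each split would be processed by a separate task, and the consolidated result should equal that of a single sequential `kmp_search` over the whole text.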
Keywords
Big Data; data mining; parallel processing; pattern clustering; pattern matching; text analysis; Hadoop Distributed File System; Hadoop MapReduce model; Hadoop cluster; Knuth Morris Pratt based sequential pattern matching; MapReduce programming model; distributed pattern matching; document analysis; map tasks; sequential pattern mining; task trackers; text document dataset clustering; text document dataset partitioning; Artificial intelligence; Conferences; Grid computing; Nickel; Niobium; Distributed Processing
fLanguage
English
Publisher
IEEE
Conference_Titel
Parallel, Distributed and Grid Computing (PDGC), 2014 International Conference on
Conference_Location
Solan
Print_ISBN
978-1-4799-7682-9
Type
conf
DOI
10.1109/PDGC.2014.7030762
Filename
7030762