Distributed pattern matching and document analysis in big data using Hadoop MapReduce model

Author

Ramya, A.V. ; Sivasankar, E.

Author_Institution

Department of Computer Science and Engineering, National Institute of Technology, Tiruchirappalli, India

fYear

2014

fDate

11-13 Dec. 2014

Firstpage

312

Lastpage

317

Abstract

Sequential pattern mining and document analysis are important data mining problems in Big Data with broad applications. This paper investigates a specific framework for managing distributed processing in the context of dataset pattern matching and document analysis. The MapReduce programming model on a Hadoop cluster is highly scalable and runs on commodity machines with integrated mechanisms for fault tolerance. In this paper, we propose Knuth-Morris-Pratt based sequential pattern matching in a distributed environment, with the help of the Hadoop Distributed File System, for efficient mining of sequential patterns. The paper also investigates the feasibility of partitioning and clustering text document datasets for document comparison. This approach reduces the search space and achieves higher mining efficiency. Data mining tasks are decomposed into many map tasks and distributed to many task trackers. The map tasks compute intermediate results and send them to the reduce task, which consolidates the final result. Both theoretical analysis and experimental results, obtained with data and clusters of varying size, show the effectiveness of the MapReduce model, primarily in terms of time requirements.
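The flow described in the abstract (map tasks run Knuth-Morris-Pratt matching on their own portion of the data, and a reduce task consolidates the intermediate results) can be illustrated with a minimal single-process sketch. This is not the authors' implementation and does not use Hadoop: the function names, the fixed `split_size`, and the choice to let each split overlap the next by `len(pattern) - 1` characters so that boundary-spanning matches are not lost are all assumptions made for illustration.

```python
from collections import defaultdict


def build_failure_table(pattern):
    """KMP failure table: fail[i] is the length of the longest proper
    prefix of pattern[:i + 1] that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return the start offsets of all (possibly overlapping) matches."""
    if not pattern:
        return []
    fail = build_failure_table(pattern)
    hits, k = [], 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]  # keep going to find overlapping matches
    return hits


def map_task(offset, chunk, pattern):
    """Hypothetical map step: emit (pattern, global_offset) per match."""
    return [(pattern, offset + pos) for pos in kmp_search(chunk, pattern)]


def reduce_task(pairs):
    """Hypothetical reduce step: merge offsets per pattern, sorted."""
    merged = defaultdict(set)
    for key, pos in pairs:
        merged[key].add(pos)
    return {key: sorted(positions) for key, positions in merged.items()}


def run_job(text, pattern, split_size):
    """Simulate the job locally: split the input, map each split, reduce."""
    overlap = len(pattern) - 1  # assumption: read past the split boundary
    intermediate = []
    for start in range(0, len(text), split_size):
        chunk = text[start:start + split_size + overlap]
        intermediate.extend(map_task(start, chunk, pattern))
    return reduce_task(intermediate)


if __name__ == "__main__":
    print(run_job("abababcabab", "abab", split_size=4))
```

In this sketch each split is searched independently, which is what would let the map tasks run on different nodes; on a real cluster the splits would come from HDFS blocks and the boundary handling would depend on the input format in use.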

Keywords

Big Data; data mining; parallel processing; pattern clustering; pattern matching; text analysis; Hadoop Distributed File System; Hadoop MapReduce model; Hadoop cluster; Knuth Morris Pratt based sequential pattern matching; MapReduce programming model; distributed pattern matching; document analysis; map tasks; sequential pattern mining; task trackers; text document dataset clustering; text document dataset partitioning; Artificial intelligence; Conferences; Grid computing; Nickel; Niobium; Distributed Processing

fLanguage

English

Publisher

IEEE

Conference_Title

2014 International Conference on Parallel, Distributed and Grid Computing (PDGC)

Conference_Location

Solan

Print_ISBN

978-1-4799-7682-9

Type

conf

DOI

10.1109/PDGC.2014.7030762

Filename

7030762