DocumentCode :
606243
Title :
Locality Sensitive Hashing based incremental clustering for creating affinity groups in Hadoop — HDFS - An infrastructure extension
Author :
Kala, K.A.; Chitharanjan, K.
Author_Institution :
Dept. of Comput. Sci. & Eng., Sree Chithra Thirunal Coll. of Eng., Thiruvananthapuram, India
fYear :
2013
fDate :
20-21 March 2013
Firstpage :
1243
Lastpage :
1249
Abstract :
Apache's Hadoop is an open source framework for large scale data analysis and storage, and an open source implementation of Google's Map/Reduce framework. It enables distributed, data intensive and parallel applications by decomposing a massive job into smaller tasks and a massive data set into smaller partitions, such that each task processes a different partition in parallel. For storage, Hadoop uses the Hadoop Distributed File System (HDFS), an open source implementation of the Google File System (GFS); Map/Reduce applications mainly use HDFS to store their data. HDFS is a very large distributed file system that assumes commodity hardware and provides high throughput and fault tolerance. HDFS stores files as a series of blocks, which are replicated for fault tolerance. The default block placement strategy does not consider the data characteristics and places the data blocks randomly; customized strategies can improve the performance of HDFS to a great extent. Applications using HDFS require streaming access to files, and performance can be increased if related files are placed on the same set of data nodes. This paper discusses a method for clustering streaming data onto the same set of data nodes using the technique of Locality Sensitive Hashing. The method uses compact bitwise representations of document vectors, called fingerprints, created using Locality Sensitive Hashing, to increase data processing speed and performance. The process does not affect the default fault tolerance properties of Hadoop and requires only minimal changes to the Hadoop framework.
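To make the abstract's core idea concrete, the sketch below shows one common way to build such fingerprints: random-hyperplane Locality Sensitive Hashing reduces each document vector to a compact bitwise fingerprint, and documents whose fingerprints share low-order bits are routed to the same affinity group of data nodes. This is a minimal illustration, not the paper's actual design; the class name, vector dimension, fingerprint width, and low-bit grouping rule are all assumptions.

import java.util.Random;

public class LshFingerprintSketch {
    private final double[][] hyperplanes; // one random hyperplane per fingerprint bit

    public LshFingerprintSketch(int numBits, int dim, long seed) {
        Random rnd = new Random(seed);
        hyperplanes = new double[numBits][dim];
        for (double[] h : hyperplanes)
            for (int i = 0; i < h.length; i++)
                h[i] = rnd.nextGaussian(); // random projection direction
    }

    // Bit b of the fingerprint is the sign of the dot product between the
    // document vector and hyperplane b; similar vectors agree on most bits.
    public long fingerprint(double[] vector) {
        long fp = 0L;
        for (int b = 0; b < hyperplanes.length; b++) {
            double dot = 0.0;
            for (int i = 0; i < vector.length; i++)
                dot += vector[i] * hyperplanes[b][i];
            if (dot >= 0) fp |= 1L << b;
        }
        return fp;
    }

    // Illustrative grouping rule (an assumption, not the paper's): the
    // low-order bandBits of the fingerprint pick the affinity group, so
    // near-duplicate documents usually land on the same set of data nodes.
    public int affinityGroup(long fp, int bandBits) {
        return (int) (fp & ((1L << bandBits) - 1));
    }

    public static void main(String[] args) {
        LshFingerprintSketch lsh = new LshFingerprintSketch(32, 8, 42L);
        double[] docA = {1, 0, 2, 0, 1, 1, 0, 3};
        double[] docB = {1, 0, 2, 0, 1, 1, 0, 2}; // near-duplicate of docA
        double[] docC = {0, 3, 0, 1, 0, 0, 2, 0}; // dissimilar document
        System.out.println("A -> group " + lsh.affinityGroup(lsh.fingerprint(docA), 3));
        System.out.println("B -> group " + lsh.affinityGroup(lsh.fingerprint(docB), 3));
        System.out.println("C -> group " + lsh.affinityGroup(lsh.fingerprint(docC), 3));
    }
}

Because the grouping is probabilistic, near-duplicates agree on most fingerprint bits and therefore tend to (but are not guaranteed to) map to the same group; in an HDFS setting such a group id could steer block placement toward a common set of data nodes.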
Keywords :
data analysis; data structures; distributed databases; document handling; network operating systems; pattern clustering; public domain software; random processes; software fault tolerance; GFS; Google MapReduce framework; Google file system; HDFS performance improvement; Hadoop distributed file system; affinity group creation; bitwise document vector representation; customized strategies; data intensive applications; data nodes; data processing performance; data processing speed; default block placement strategy; distributed applications; fault tolerance; fingerprints; incremental clustering; large distributed file system; large scale data analysis; large scale data storage; locality sensitive hashing; massive data set; open source framework; parallel applications; random data block placement; streaming data clustering; streaming file access; Fingerprint; HDFS; Hadoop; Locality Sensitive Hashing;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Circuits, Power and Computing Technologies (ICCPCT), 2013 International Conference on
Conference_Location :
Nagercoil
Print_ISBN :
978-1-4673-4921-5
Type :
conf
DOI :
10.1109/ICCPCT.2013.6528999
Filename :
6528999