DocumentCode
659549
Title
Efficient near-duplicate document detection using FPGAs
Author
Luo, Xi ; Najjar, Walid ; Hristidis, Vagelis
Author_Institution
Comput. Sci. & Eng., UC Riverside, Riverside, CA, USA
fYear
2013
fDate
6-9 Oct. 2013
Firstpage
54
Lastpage
61
Abstract
Detecting duplicate and near-duplicate documents is critical in applications such as Web crawling, since it saves document processing resources. Simhash is a state-of-the-art method that assigns a bit-string fingerprint to a document, such that similar documents have similar fingerprints. Finding the near-duplicates in a large collection of documents consists of two stages: (a) computing the simhash fingerprint of each document, and (b) finding pairs of similar fingerprints by computing their Hamming distance. Previous work has focused on optimizing the second stage, i.e., avoiding the quadratic number of comparisons needed to compute the all-to-all Hamming distance. However, our experiments show that the total time is dominated by the first stage (the fingerprint computation), which is the focus of this paper. We propose an implementation of simhash on Field Programmable Gate Arrays (FPGAs): a customized fingerprint computing engine in hardware that exploits parallelization and pipelining opportunities. We present a comprehensive experimental evaluation on large, diverse, real document datasets. Our experiments show a speedup of 362× in the simhash computation, and savings of up to 98% in overall near-duplicate detection execution time compared to using multi-core CPUs.
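The two stages named in the abstract can be illustrated with a minimal software sketch. This is not the authors' FPGA engine. The 64-bit fingerprint width, whitespace tokenization, unit token weights, the MD5-based token hash, and the names token_hash, simhash, hamming and near_duplicates are all assumptions made for illustration only.

```python
import hashlib

BITS = 64  # assumed fingerprint width; the record does not state one


def token_hash(token: str) -> int:
    """Hash a token to a BITS-wide integer (MD5 is an illustrative choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[: BITS // 8], "big")


def simhash(text: str) -> int:
    """Stage (a): compute a simhash fingerprint with unit token weights."""
    counts = [0] * BITS
    for token in text.lower().split():  # naive whitespace tokenization
        h = token_hash(token)
        for i in range(BITS):
            # Each token votes +1 for a set bit and -1 for a clear bit.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:  # the sign of the vote total decides the fingerprint bit
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def near_duplicates(docs, k=3):
    """Stage (b): naive all-to-all comparison, quadratic in len(docs)."""
    fps = [simhash(d) for d in docs]
    return [
        (i, j)
        for i in range(len(fps))
        for j in range(i + 1, len(fps))
        if hamming(fps[i], fps[j]) <= k
    ]
```

The threshold k is a hypothetical parameter; two documents are reported as near-duplicates when their fingerprints differ in at most k bits. The quadratic loop in near_duplicates is the cost that the previous work mentioned in the abstract tries to avoid, while simhash is the stage the paper moves to hardware.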
Keywords
Internet; document handling; field programmable gate arrays; multiprocessing systems; search engines; FPGA; Hamming distance; Web crawling; bit string fingerprint; document processing resources; fingerprint computing engine; multicore CPU; near duplicate document detection; quadratic number; simhash computation; Encyclopedias; Engines; Fingerprint recognition; Hardware; Logic gates; Software; document similarity; duplicate detection; hashing
fLanguage
English
Publisher
IEEE
Conference_Titel
Big Data, 2013 IEEE International Conference on
Conference_Location
Silicon Valley, CA
Type
conf
DOI
10.1109/BigData.2013.6691698
Filename
6691698