• DocumentCode
    659549
  • Title
    Efficient near-duplicate document detection using FPGAs
  • Author
    Xi Luo ; Najjar, Walid ; Hristidis, Vagelis
  • Author_Institution
    Comput. Sci. & Eng., UC Riverside, Riverside, CA, USA
  • fYear
    2013
  • fDate
    6-9 Oct. 2013
  • Firstpage
    54
  • Lastpage
    61
  • Abstract
    Detecting duplicate and near-duplicate documents is critical in applications such as Web crawling, since it saves document processing resources. Simhash is a state-of-the-art method that assigns a bit-string fingerprint to a document such that similar documents have similar fingerprints. Finding the near-duplicates in a large collection of documents consists of two stages: (a) computing the simhash fingerprint of each document, and (b) finding pairs of similar fingerprints by computing their Hamming distance. Previous work has focused on optimizing the second stage, i.e., avoiding the quadratic number of comparisons needed to compute the all-to-all Hamming distance. However, our experiments show that the total time is dominated by the first stage (the fingerprint computation), which is the focus of this paper. We propose an implementation of simhash on Field Programmable Gate Arrays (FPGAs) in the form of a customized fingerprint computing engine in hardware that exploits parallelization and pipelining opportunities. We present a comprehensive experimental evaluation on large, diverse, real document datasets. Our experiments show a speedup of 362× in the simhash computation, and savings of up to 98% in overall near-duplicate detection execution time compared to using multi-core CPUs.
  • Keywords
    Internet; document handling; field programmable gate arrays; multiprocessing systems; search engines; FPGA; Hamming distance; Web crawling; bit string fingerprint; document processing resources; field programmable gate arrays; fingerprint computing engine; multicore CPU; near duplicate document detection; quadratic number; simhash computation; Encyclopedias; Engines; Field programmable gate arrays; Fingerprint recognition; Hardware; Logic gates; Software; FPGA; document similarity; duplicate detection; hardware; hashing;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Big Data, 2013 IEEE International Conference on
  • Conference_Location
    Silicon Valley, CA
  • Type
    conf
  • DOI
    10.1109/BigData.2013.6691698
  • Filename
    6691698
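
The two stages named in the abstract, (a) computing a simhash fingerprint per document and (b) comparing fingerprints by Hamming distance, can be illustrated with a minimal software sketch. It is not the paper's FPGA engine. The 64-bit width, whitespace tokenization, MD5-based token hashing, unit token weights, the distance threshold of 3, and all function names are assumptions made for illustration; none of them comes from this record. Stage (b) is written as the naive all-pairs comparison, i.e. the quadratic step that the abstract says previous work tries to avoid.

```python
import hashlib
from itertools import combinations

BITS = 64  # assumed fingerprint width; the record does not state one


def token_hash(token: str) -> int:
    """Hash a token to a BITS-wide integer (MD5 is an arbitrary choice here)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:BITS // 8], "big")


def simhash(text: str) -> int:
    """Stage (a): compute a simhash fingerprint with unit token weights."""
    counts = [0] * BITS
    for token in text.lower().split():
        h = token_hash(token)
        for i in range(BITS):
            # Each token votes +1 where its hash has a 1 bit, -1 otherwise.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:  # a positive vote total sets fingerprint bit i
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def near_duplicates(docs, max_distance=3):
    """Stage (b): naive all-pairs comparison (quadratic in len(docs))."""
    fps = [simhash(d) for d in docs]
    return [
        (i, j)
        for i, j in combinations(range(len(fps)), 2)
        if hamming(fps[i], fps[j]) <= max_distance
    ]
```

Calling `near_duplicates(docs)` on a list of strings returns index pairs whose fingerprints differ in at most `max_distance` bits. Identical documents always produce identical fingerprints, so they are reported at any threshold.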