• DocumentCode
    249320
  • Title
    Practising Scalable Graph Similarity Joins in MapReduce
  • Author
    Yifan Chen ; Xiang Zhao ; Bin Ge ; Chuan Xiao ; Chi-Hung Chi
  • Author_Institution
    Sci. & Technol. on Inf. Syst. & Eng. Lab., Nat. Univ. of Defense Technol., Changsha, China
  • fYear
    2014
  • fDate
    June 27 2014-July 2 2014
  • Firstpage
    112
  • Lastpage
    119
  • Abstract
    With the emergence of massive graph-modeled data, graph similarity join has become an important problem because of its wide range of applications, including data cleaning and near-duplicate detection. This paper considers graph similarity joins with edit distance constraints, which return pairs of graphs whose edit distance is no larger than a given threshold. Leveraging the MapReduce programming model, we propose MGSJoin, a scalable algorithm that follows the filtering-verification framework for efficient graph similarity joins. It filters out non-promising candidates by counting overlapping graph signatures. Because the filtering phase can generate too many key-value pairs, spectral Bloom filters are introduced to reduce their number. Furthermore, we integrate the multiway join strategy to speed up the verification phase. Extensive experimental results demonstrate the superior efficiency and scalability of the proposed algorithms.
  • Keywords
    data analysis; data structures; information filtering; MGSJoin; MapReduce programming model; edit distance constraints; filtering-verification framework; multiway join strategy; overlapping graph signatures; scalable graph similarity joins; spectral Bloom filters; Abstracts; Big data; Complexity theory; Educational institutions; Filtering; Laboratories; Bloom filter; Graph similarity join; MapReduce; Multiway join
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Big Data (BigData Congress), 2014 IEEE International Congress on
  • Conference_Location
    Anchorage, AK
  • Print_ISBN
    978-1-4799-5056-0
  • Type
    conf
  • DOI
    10.1109/BigData.Congress.2014.25
  • Filename
    6906768