• DocumentCode
    140801
  • Title
    MassJoin: A MapReduce-based method for scalable string similarity joins
  • Author
    Dong Deng; Guoliang Li; Shuang Hao; Jiannan Wang; Jianhua Feng
  • Author_Institution
    Dept. of Comput. Sci., Tsinghua Univ., Beijing, China
  • fYear
    2014
  • fDate
    March 31 2014-April 4 2014
  • Firstpage
    340
  • Lastpage
    351
  • Abstract
    String similarity join is an essential operation in data integration. The era of big data calls for scalable algorithms to support large-scale string similarity joins. In this paper, we study scalable string similarity joins using MapReduce. We propose a MapReduce-based framework, called MASSJOIN, which supports both set-based and character-based similarity functions. We extend the existing partition-based signature scheme to support set-based similarity functions, and we use the signatures to generate key-value pairs. To reduce the transmission cost, we merge key-value pairs, which lowers their number from cubic to linear complexity without sacrificing pruning power. To improve performance, we incorporate “light-weight” filter units into the key-value pairs; these can prune a large number of dissimilar pairs without significantly increasing the transmission cost. Experimental results on real-world datasets show that our method significantly outperformed state-of-the-art approaches.
  • Keywords
    Big Data; computational complexity; cost reduction; data integration; string matching; MASSJOIN; MapReduce-based framework; character-based similarity functions; cubic complexity; key-value pairs; large-scale string similarity join; light-weight filter units; linear complexity; mapreduce-based method; partition-based signature scheme; scalable algorithm; scalable string similarity joins; set-based similarity functions; transmission cost reduction; Erbium; Filtering; Open systems
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Data Engineering (ICDE), 2014 IEEE 30th International Conference on
  • Conference_Location
    Chicago, IL
  • Type
    conf
  • DOI
    10.1109/ICDE.2014.6816663
  • Filename
    6816663