• DocumentCode
    661901
  • Title

    Prefix filtering with data partitioning for similarity join

  • Author

    Bhirakit, Methus ; Chongstitvatana, Jaruloj

  • Author_Institution
    Dept. of Math. & Comput. Sci., Chulalongkorn Univ., Bangkok, Thailand
  • fYear
    2013
  • fDate
    4-6 Sept. 2013
  • Firstpage
    163
  • Lastpage
    167
  • Abstract
    Many applications, such as data integration and data preparation, use similarity join as an important operation. In real-world applications, similarity joins become challenging when data sets are large. Filter-and-verify methods have been proposed to reduce the running time of similarity join. The prefix filtering algorithm, one of the filter-and-verify methods, filters out some dissimilar strings by examining only the prefixes of strings instead of the whole strings. In this paper, we propose data partitioning for the prefix filtering method used in similarity join. In our approach, the database is divided into partitions and prefix filtering is performed on each partition of data. The proposed algorithm supports parallelism because filtering can be done on each partition independently. Furthermore, when the dataset is partitioned into smaller sets, a proper prefix length can be determined for each data partition. This also improves the selection of candidate strings and reduces the verification time. An experiment is performed to compare the proposed algorithm with state-of-the-art algorithms. The experiment shows that our method achieves higher performance through a reduced number of candidate strings and parallel execution.
  • Keywords
    database indexing; parallel algorithms; string matching; text analysis; data integration; data partitioning; data preparation; database partitions; dissimilar strings; parallel algorithm; parallel execution; prefix filtering algorithm; prefix length; similarity join; string prefix; text database; text document indexing; verify method; Computer science; Conferences; Similarity join; parallel join; prefix filtering
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Science and Engineering Conference (ICSEC), 2013 International
  • Conference_Location
    Nakorn Pathom
  • Print_ISBN
    978-1-4673-5322-9
  • Type

    conf

  • DOI
    10.1109/ICSEC.2013.6694772
  • Filename
    6694772
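
The filter-and-verify idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes Jaccard similarity over whitespace tokens, a global token order by ascending frequency, and round-robin partitioning, none of which the record specifies. All function names are hypothetical, and the per-partition choice of prefix length mentioned in the abstract is not attempted here.

```python
import math
from collections import Counter, defaultdict


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def prefix(tokens, order, threshold):
    """First |x| - ceil(t * |x|) + 1 tokens of a record under the global order."""
    ranked = sorted(tokens, key=lambda t: order[t])
    # The small epsilon guards against float rounding making the prefix too short.
    keep = len(ranked) - math.ceil(threshold * len(ranked) - 1e-9) + 1
    return ranked[:keep]


def build_partition_index(ids, prefixes):
    """Inverted index (prefix token -> record ids) for one partition."""
    index = defaultdict(list)
    for rid in ids:
        for tok in prefixes[rid]:
            index[tok].append(rid)
    return index


def similarity_self_join(records, threshold, num_partitions=2):
    """Return sorted id pairs (i, j), i < j, with Jaccard >= threshold."""
    sets = [set(r) for r in records]
    # Assumed global order: rare tokens first, ties broken alphabetically.
    freq = Counter(t for s in sets for t in s)
    ranked_tokens = sorted(freq, key=lambda t: (freq[t], t))
    order = {t: i for i, t in enumerate(ranked_tokens)}
    prefixes = [prefix(s, order, threshold) for s in sets]

    # Assumed partitioning scheme: round-robin over record ids.
    indexes = [
        build_partition_index(range(p, len(sets), num_partitions), prefixes)
        for p in range(num_partitions)
    ]

    results = set()
    # Each partition index is probed independently of the others,
    # so this outer loop could be distributed across workers.
    for index in indexes:
        for rid, pre in enumerate(prefixes):
            # Filter step: keep only records sharing a prefix token.
            candidates = {
                other
                for tok in pre
                for other in index.get(tok, ())
                if other > rid
            }
            # Verify step: compute the exact similarity for each candidate.
            for other in candidates:
                if jaccard(sets[rid], sets[other]) >= threshold:
                    results.add((rid, other))
    return sorted(results)


if __name__ == "__main__":
    docs = [
        "data integration similarity join".split(),
        "data preparation similarity join".split(),
        "parallel prefix filtering".split(),
    ]
    print(similarity_self_join(docs, threshold=0.5))
```

Because every partition's index is built and probed without reference to the others, the outer loop is the natural place to introduce parallel workers; the sketch keeps it sequential for clarity.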