• DocumentCode
    2028496
  • Title

    Duplicate Records Cleansing with Length Filtering and Dynamic Weighting

  • Author

    Huang, Li ; Jin, Hai ; Yuan, Pingpeng ; Chu, Fan

  • Author_Institution
    Sch. of Comput. Sci. & Technol., Huazhong Univ. of Sci. & Technol., Wuhan, China
  • fYear
    2008
  • fDate
    3-5 Dec. 2008
  • Firstpage
    95
  • Lastpage
    102
  • Abstract
    Because of diverse data formats, missing properties, and imprecise records in heterogeneous literature databases, duplicate records arise when such databases are integrated. Duplicate records lower the efficiency of information retrieval. In this paper, we propose an approach named length filtering and dynamic weighting (LFDW) for duplicate records cleansing. LFDW has three steps. The first step is length filtering: record pairs whose lengths differ greatly are filtered out. Second, the approach detects duplicate records using dynamically weighted properties. In particular, since the author name is an important property of a literature record and one author's name may be written in different styles, a fuzzy name matching method is adopted to identify the same author across different name styles. Finally, to improve the performance of duplicate detection, we adopt a dynamic sliding-window algorithm when comparing records. The results indicate that LFDW outperforms traditional approaches in running time, recall, and precision.
  • Keywords
    data handling; distributed databases; fuzzy set theory; pattern matching; data format diversity; duplicate records cleansing; dynamic sliding-window algorithm; dynamic weighting; fuzzy name matching method; heterogeneous databases; information retrieval; length filtering; Computer science; Databases; Filtering algorithms; Filters; Grid computing; Heuristic algorithms; Information retrieval; Libraries; Runtime; Scalability;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Semantics, Knowledge and Grid, 2008. SKG '08. Fourth International Conference on
  • Conference_Location
    Beijing
  • Print_ISBN
    978-0-7695-3401-5
  • Electronic_ISBN
    978-0-7695-3401-5
  • Type

    conf

  • DOI
    10.1109/SKG.2008.88
  • Filename
    4725901
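
The three steps named in the abstract (length filtering, weighted property comparison, windowed comparison) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no formulas or thresholds, so every name and parameter here (`passes_length_filter`, `weighted_similarity`, `find_duplicates`, `max_ratio`, `window`, `threshold`) is hypothetical. Two simplifications are assumed: "dynamic weighting" is approximated by renormalizing the weights over the fields present in both records, and a fixed-size window stands in for the paper's dynamic sliding window. The standard-library `difflib.SequenceMatcher` stands in for the fuzzy name matching method.

```python
from difflib import SequenceMatcher


def record_length(record):
    """Total number of characters across all non-empty fields."""
    return sum(len(value) for value in record.values() if value)


def passes_length_filter(a, b, max_ratio=1.5):
    """Keep a pair only if the record lengths are reasonably close.

    max_ratio is a hypothetical threshold, not a value from the paper.
    """
    la, lb = record_length(a), record_length(b)
    if min(la, lb) == 0:
        return False
    return max(la, lb) / min(la, lb) <= max_ratio


def weighted_similarity(a, b, weights):
    """Weighted string similarity over fields present in both records.

    Assumption: weights are renormalized over the shared fields, so a
    missing property does not drag the score down.
    """
    shared = [f for f in weights if a.get(f) and b.get(f)]
    total = sum(weights[f] for f in shared)
    if total == 0:
        return 0.0
    score = sum(
        weights[f] * SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio()
        for f in shared
    )
    return score / total


def find_duplicates(records, weights, window=3, threshold=0.85):
    """Sort by title, then compare each record with its window neighbors.

    A fixed window is used here for brevity; the paper describes a
    dynamic one.
    """
    order = sorted(
        range(len(records)),
        key=lambda i: records[i].get("title", "").lower(),
    )
    pairs = []
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            a, b = records[i], records[j]
            if not passes_length_filter(a, b):
                continue  # cheap length check runs before the costly comparison
            if weighted_similarity(a, b, weights) >= threshold:
                pairs.append((min(i, j), max(i, j)))
    return pairs


records = [
    {"title": "Duplicate Records Cleansing", "author": "Huang, Li"},
    {"title": "Duplicate records cleansing", "author": "Huang, L."},
    {"title": "Grid computing", "author": "Jin, Hai"},
]
weights = {"title": 0.6, "author": 0.4}
duplicates = find_duplicates(records, weights)
```

With these sample records, the first two are reported as a duplicate pair, while the third is rejected by the length filter before any string comparison is made. Running the cheap length check first is what saves time: the more expensive field-by-field similarity is computed only for pairs that survive it.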