Title :
Duplicate Records Cleansing with Length Filtering and Dynamic Weighting
Author :
Huang, Li ; Jin, Hai ; Yuan, Pingpeng ; Chu, Fan
Author_Institution :
Sch. of Comput. Sci. & Technol., Huazhong Univ. of Sci. & Technol., Wuhan, China
Abstract :
Because of diverse data formats, missing properties, and imprecise records in heterogeneous literature databases, duplicate records arise when such databases are integrated. Duplicate records lower the efficiency of information retrieval. In this paper, we propose an approach named length filtering and dynamic weighting (LFDW) for duplicate records cleansing. LFDW has three steps. The first step is length filtering: record pairs whose lengths differ greatly are filtered out before comparison. Second, the approach detects duplicate records using dynamically weighted properties. In particular, since the author name is an important property of literature and one author may have several name styles, a fuzzy name matching method is adopted to identify the same author written in different name styles. Finally, to improve the performance of duplicate detection, we adopt a dynamic sliding-window algorithm when comparing records. The results indicate that LFDW achieves better running time, recall, and precision than traditional approaches.
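The abstract gives no thresholds, weights, or window rules, so the following is only a minimal sketch of how the steps could fit together, not the authors' implementation. The field names, `BASE_WEIGHTS`, `max_ratio`, `window`, and `threshold` are all hypothetical. Weights are renormalised over the fields present in both records, which is one possible reading of "dynamic weighting"; the window here is fixed-size rather than dynamic; and `difflib.SequenceMatcher` stands in for the paper's fuzzy name matching.

```python
from difflib import SequenceMatcher

# Hypothetical field weights; the abstract does not state the actual values.
BASE_WEIGHTS = {"title": 0.5, "author": 0.3, "year": 0.2}


def record_length(rec):
    """Total character count of all non-empty field values."""
    return sum(len(v) for v in rec.values() if v)


def passes_length_filter(a, b, max_ratio=0.5):
    """Keep a pair only if the relative length difference is small.

    max_ratio is an assumed threshold, not one taken from the paper.
    """
    la, lb = record_length(a), record_length(b)
    longest = max(la, lb)
    if longest == 0:
        return True
    return abs(la - lb) / longest <= max_ratio


def weighted_similarity(a, b, weights=BASE_WEIGHTS):
    """Weighted field similarity with weights renormalised over the
    fields that are non-empty in both records (an assumed scheme)."""
    shared = [f for f in weights if a.get(f) and b.get(f)]
    total = sum(weights[f] for f in shared)
    if total == 0:
        return 0.0
    score = 0.0
    for f in shared:
        sim = SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio()
        score += weights[f] / total * sim
    return score


def find_duplicates(records, window=3, threshold=0.85):
    """Sort by title, then compare each record with its next window-1 neighbours."""
    order = sorted(
        range(len(records)),
        key=lambda i: (records[i].get("title") or "").lower(),
    )
    pairs = []
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            a, b = records[i], records[j]
            # Cheap length check first, expensive string comparison second.
            if passes_length_filter(a, b) and weighted_similarity(a, b) >= threshold:
                pairs.append(tuple(sorted((i, j))))
    return pairs


records = [
    {"title": "Duplicate Records Cleansing", "author": "Huang, Li", "year": "2008"},
    {"title": "Duplicate records cleansing", "author": "Huang, Li", "year": ""},
    {"title": "Grid Computing Overview", "author": "Smith, J", "year": "2008"},
]
duplicate_pairs = find_duplicates(records)
```

Running the length check before the string comparison reflects the ordering described in the abstract: pairs with very different lengths are discarded without paying for a full field-by-field comparison.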
Keywords :
data handling; distributed databases; fuzzy set theory; pattern matching; data format diversity; duplicate records cleansing; dynamic sliding-window algorithm; dynamic weighting; fuzzy name matching method; heterogeneous databases; information retrieval; length filtering; Computer science; Databases; Filtering algorithms; Filters; Grid computing; Heuristic algorithms; Information retrieval; Libraries; Runtime; Scalability;
Conference_Title :
Semantics, Knowledge and Grid, 2008. SKG '08. Fourth International Conference on
Conference_Location :
Beijing
Print_ISBN :
978-0-7695-3401-5
Electronic_ISBN :
978-0-7695-3401-5
DOI :
10.1109/SKG.2008.88