DocumentCode
2028496
Title
Duplicate Records Cleansing with Length Filtering and Dynamic Weighting
Author
Huang, Li ; Jin, Hai ; Yuan, Pingpeng ; Chu, Fan
Author_Institution
Sch. of Comput. Sci. & Technol., Huazhong Univ. of Sci. & Technol., Wuhan, China
fYear
2008
fDate
3-5 Dec. 2008
Firstpage
95
Lastpage
102
Abstract
Because of diverse data formats, missing properties, and imprecise records in heterogeneous literature databases, duplicate records arise when such databases are integrated. Duplicate records lower the efficiency of information retrieval. In this paper, we propose an approach named length filtering and dynamic weighting (LFDW) for duplicate records cleansing. LFDW has three steps. The first step is length filtering: record pairs whose lengths differ greatly are filtered out. Second, the approach detects duplicate records using dynamically weighted properties. Specifically, since the author name is an important property of a literature record and one author may have different name styles, a fuzzy name matching method is adopted to identify the same author across different name styles. Finally, to improve the performance of duplicate detection, we adopt a dynamic sliding-window algorithm when comparing records. The results indicate that the time, recall, and precision of LFDW are better than those of traditional approaches.
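The length-filtering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the relative-difference threshold `max_ratio`, and the fixed window size are all assumptions made for illustration, and a plain fixed-size sliding window stands in for the paper's dynamic sliding-window algorithm, whose details the abstract does not give.

```python
def length_filter(a: str, b: str, max_ratio: float = 0.5) -> bool:
    """Return True if the pair should still be compared.

    Assumption: "big difference in length" is modelled as the relative
    length difference exceeding max_ratio (a hypothetical threshold).
    """
    longer = max(len(a), len(b))
    if longer == 0:
        return True  # two empty records trivially have equal length
    return abs(len(a) - len(b)) / longer <= max_ratio


def candidate_pairs(records, window=3, max_ratio=0.5):
    """Collect record pairs that survive length filtering.

    Records are sorted, and each record is paired only with the next
    (window - 1) records. This is a fixed-size window, used here as a
    simplified stand-in for the dynamic window named in the abstract.
    """
    ordered = sorted(records)
    pairs = []
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            if length_filter(rec, other, max_ratio):
                pairs.append((rec, other))
    return pairs
```

Only the pairs returned by `candidate_pairs` would then go on to the more expensive weighted property comparison, which is where the time saving claimed for length filtering would come from.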
Keywords
data handling; distributed databases; fuzzy set theory; pattern matching; data format diversity; duplicate records cleansing; dynamic sliding-window algorithm; dynamic weighting; fuzzy name matching method; heterogeneous databases; information retrieval; length filtering; Computer science; Databases; Filtering algorithms; Filters; Grid computing; Heuristic algorithms; Information retrieval; Libraries; Runtime; Scalability;
fLanguage
English
Publisher
IEEE
Conference_Titel
Semantics, Knowledge and Grid, 2008. SKG '08. Fourth International Conference on
Conference_Location
Beijing
Print_ISBN
978-0-7695-3401-5
Electronic_ISBN
978-0-7695-3401-5
Type
conf
DOI
10.1109/SKG.2008.88
Filename
4725901