DocumentCode
566440
Title
Using Random Forest classifiers to detect duplicate gazetteer records
Author
Martins, Bruno ; Galhardas, Helena ; Goncalves, Nuno
Author_Institution
INESC-ID, Tech. Univ. of Lisbon, Porto Salvo, Portugal
fYear
2012
fDate
20-23 June 2012
Firstpage
1
Lastpage
4
Abstract
This paper presents an approach for detecting duplicate records in digital gazetteers using a state-of-the-art machine learning technique. It reports a thorough evaluation of a classifier that labels pairs of gazetteer records as either duplicates or non-duplicates, built with Random Forests and using different combinations of similarity scores as feature vectors. Experimental results show that feature vectors combining multiple similarity scores, derived from place names, semantic relationships, place types and geospatial footprints, lead to an accuracy of 97.45%.
Keywords
geography; learning (artificial intelligence); pattern classification; duplicate records; feature vectors; gazetteer records; geospatial footprints; machine learning technique; place names; random forest classifiers; semantic relationships; similarity scores; Conferences; Data mining; Geospatial analysis; Machine learning; Manuals; Semantics; Support vector machine classification; Digital Gazetteers; Duplicate Detection; Random Forests; Supervised Machine Learning;
fLanguage
English
Publisher
IEEE
Conference_Titel
Information Systems and Technologies (CISTI), 2012 7th Iberian Conference on
Conference_Location
Madrid
ISSN
2166-0727
Print_ISBN
978-1-4673-2843-2
Type
conf
Filename
6263211