• DocumentCode
    566440
  • Title

    Using Random Forest classifiers to detect duplicate gazetteer records

  • Author

    Martins, Bruno ; Galhardas, Helena ; Goncalves, Nuno

  • Author_Institution
    INESC-ID, Tech. Univ. of Lisbon, Porto Salvo, Portugal
  • fYear
    2012
  • fDate
    20-23 June 2012
  • Firstpage
    1
  • Lastpage
    4
  • Abstract
    This paper presents an approach for detecting duplicate records in digital gazetteers using a state-of-the-art machine learning technique. It reports a thorough evaluation of a classifier that labels pairs of gazetteer records as either duplicates or non-duplicates, built with Random Forests and tested with different combinations of similarity scores as feature vectors. Experimental results show that feature vectors combining multiple similarity scores, derived from place names, semantic relationships, place types and geospatial footprints, lead to an accuracy of 97.45%.
  • Keywords
    geography; learning (artificial intelligence); pattern classification; duplicate records; feature vectors; gazetteer records; geospatial footprints; machine learning technique; place names; random forest classifiers; semantic relationships; similarity scores; Conferences; Data mining; Geospatial analysis; Machine learning; Manuals; Semantics; Support vector machine classification; Digital Gazetteers; Duplicate Detection; Random Forests; Supervised Machine Learning;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information Systems and Technologies (CISTI), 2012 7th Iberian Conference on
  • Conference_Location
    Madrid
  • ISSN
    2166-0727
  • Print_ISBN
    978-1-4673-2843-2
  • Type

    conf

  • Filename
    6263211
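
The abstract describes feature vectors built from several similarity scores per pair of gazetteer records. The following is a minimal sketch of that idea, not the authors' implementation: the record fields (`name`, `type`, `lat`, `lon`), the choice of `difflib.SequenceMatcher` for name similarity, the exponential distance decay, and the `scale_km` parameter are all assumptions made for illustration. The semantic-relationship score mentioned in the abstract is omitted because the record does not say how it is computed.

```python
import math
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two place names (assumed metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def type_similarity(a: str, b: str) -> float:
    """1.0 when both records share the same place type, else 0.0 (assumed metric)."""
    return 1.0 if a == b else 0.0


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating-point rounding before asin.
    return 2 * radius_km * math.asin(math.sqrt(min(1.0, h)))


def footprint_similarity(rec_a: dict, rec_b: dict, scale_km: float = 10.0) -> float:
    """Map centroid distance to (0, 1]; scale_km is a hypothetical tuning parameter."""
    d = haversine_km(rec_a["lat"], rec_a["lon"], rec_b["lat"], rec_b["lon"])
    return math.exp(-d / scale_km)


def feature_vector(rec_a: dict, rec_b: dict) -> list:
    """Combine the similarity scores for one pair of records into a feature vector."""
    return [
        name_similarity(rec_a["name"], rec_b["name"]),
        type_similarity(rec_a["type"], rec_b["type"]),
        footprint_similarity(rec_a, rec_b),
    ]
```

A vector of this kind, computed for each labelled pair of records, could then be passed to any Random Forest implementation to train a duplicate/non-duplicate classifier.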