• DocumentCode
    1388346
  • Title
    A Genetic Programming Approach to Record Deduplication
  • Author
    De Carvalho, Moisés G. ; Laender, Alberto H F ; Goncalves, Marcos André ; Da Silva, Altigran S.
  • Author_Institution
    Nokia INdT, Manaus, Brazil
  • Volume
    24
  • Issue
    3
  • fYear
    2012
  • fDate
    3/1/2012 12:00:00 AM
  • Firstpage
    399
  • Lastpage
    412
  • Abstract
    Several systems that rely on consistent data to offer high-quality services, such as digital libraries and e-commerce brokers, may be affected by the existence of duplicates, quasi replicas, or near-duplicate entries in their repositories. Because of that, private and government organizations have invested significantly in developing methods for removing replicas from their data repositories. Clean and replica-free repositories not only allow the retrieval of higher quality information but also lead to more concise data and to potential savings in the computational time and resources needed to process this data. In this paper, we propose a genetic programming approach to record deduplication that combines several different pieces of evidence extracted from the data content to find a deduplication function that is able to identify whether two entries in a repository are replicas or not. As shown by our experiments, our approach outperforms an existing state-of-the-art method found in the literature. Moreover, the suggested functions are computationally less demanding since they use less evidence. In addition, our genetic programming approach is capable of automatically adapting these functions to a given fixed replica identification boundary, freeing the user from the burden of having to choose and tune this parameter.
  • Keywords
    genetic algorithms; information retrieval; replicated databases; computational time; data repositories; database administration; database integration; digital libraries; e-commerce brokers; fixed replica identification boundary; genetic programming; record deduplication; replica removal; replica-free repositories; Data mining; Databases; Machine learning; Probabilistic logic; Training; evolutionary computing and genetic algorithms
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type
    jour
  • DOI
    10.1109/TKDE.2010.234
  • Filename
    5645623