• DocumentCode
    745231
  • Title

    A distance-based approach to entity reconciliation in heterogeneous databases

  • Author

    Dey, Debabrata ; Sarkar, Sumit ; De, Prabuddha

  • Author_Institution
    Dept. of Manage. Sci., Washington Univ., Seattle, WA, USA
  • Volume
    14
  • Issue
    3
  • fYear
    2002
  • Firstpage
    567
  • Lastpage
    582
  • Abstract
    In modern organizations, decision makers must often be able to quickly access information from diverse sources in order to make timely decisions. A critical problem facing many such organizations is the inability to easily reconcile the information contained in heterogeneous data sources. To overcome this limitation, an organization must resolve several types of heterogeneity problems that may exist across different sources. We examine one such problem called the entity heterogeneity problem, which arises when the same real-world entity type is represented using different identifiers in different applications. A decision-theoretic model to resolve the problem is proposed. Our model uses a distance measure to express the similarity between two entity instances. We have implemented the model and tested it on real-world data. The results indicate that the model performs quite well in terms of its ability to predict whether two entity instances should be matched or not. The model is shown to be computationally efficient. It also scales well to large relations from the perspective of the accuracy of prediction. Overall, the test results imply that this is certainly a viable approach in practical situations.
  • Keywords
    business data processing; decision theory; distributed databases; relational databases; decision makers; decision theory model; distance-based approach; entity heterogeneity problem; entity reconciliation; heterogeneous data sources; heterogeneous databases; organizations; relational database; Databases
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type

    jour

  • DOI
    10.1109/TKDE.2002.1000343
  • Filename
    1000343