  • DocumentCode
    755639
  • Title

    Data alignment and integration [US government]

  • Author

    Pantel, Patrick ; Philpot, Andrew ; Hovy, Eduard

  • Author_Institution
    Inf. Sci. Inst., Univ. of Southern California, Marina del Rey, CA, USA
  • Volume
    38
  • Issue
    12
  • fYear
    2005
  • Firstpage
    43
  • Lastpage
    50
  • Abstract
    A general-purpose solution to the problem of matching entities within or across heterogeneous data sources can't depend on the presence or reliability of auxiliary data such as structural information or metadata. Instead, it must leverage the available data (or observations) that describe the entities. Our technology, based on information theory principles, measures the importance of observations and then leverages them to quantify the similarity between entities, improving accuracy and reducing the time required to find related entities in a population. Applying this purely data-driven paradigm, we've built two systems: Guspin for automatically identifying equivalence classes or aliases, and Sift for automatically aligning data across databases. The key to our underlying technology is identifying the most informative observations and then matching entities that share them. Given the right types of observations, our model can potentially solve several serious and urgent problems that governments face, such as terrorist detection, identity theft, and data integration.
  • Keywords
    Internet; distributed databases; government data processing; information theory; Guspin system; Sift system; US government; data alignment; data integration; data-driven paradigm; entity matching problem; government problem; heterogeneous data sources; identity theft; information theory principle; metadata; terrorist detection; Air pollution; Automatic control; Control systems; Databases; Electronic mail; Merging; Monitoring; Protection; Terrorism; US Government; CARB; CEIDARS; Data sharing; Digital government; Facilities Registry System; Guspin; Information modeling; National Emission Inventory; Sift;
  • fLanguage
    English
  • Journal_Title
    Computer
  • Publisher
    IEEE
  • ISSN
    0018-9162
  • Type

    jour

  • DOI
    10.1109/MC.2005.406
  • Filename
    1556484