DocumentCode
755639
Title
Data alignment and integration [US government]
Author
Pantel, Patrick ; Philpot, Andrew ; Hovy, Eduard
Author_Institution
Inf. Sci. Inst., Univ. of Southern California, Marina del Rey, CA, USA
Volume
38
Issue
12
fYear
2005
Firstpage
43
Lastpage
50
Abstract
A general-purpose solution to the problem of matching entities within or across heterogeneous data sources can't depend on the presence or reliability of auxiliary data such as structural information or metadata. Instead, it must leverage the available data (or observations) that describe the entities. Our technology, based on information theory principles, measures the importance of observations and then leverages them to quantify the similarity between entities, improving accuracy and reducing the time required to find related entities in a population. Applying this purely data-driven paradigm, we've built two systems: Guspin for automatically identifying equivalence classes or aliases, and Sift for automatically aligning data across databases. The key to our underlying technology is identifying the most informative observations and then matching entities that share them. Given the right types of observations, our model can potentially solve several serious and urgent problems that governments face, such as terrorist detection, identity theft, and data integration.
Keywords
Internet; distributed databases; government data processing; information theory; Guspin system; Sift system; US government; data alignment; data integration; data-driven paradigm; entity matching problem; government problem; heterogeneous data sources; identity theft; information theory principle; metadata; terrorist detection; Air pollution; Automatic control; Control systems; Databases; Electronic mail; Merging; Monitoring; Protection; Terrorism; US Government; CARB; CEIDARS; Data sharing; Digital government; Facilities Registry System; Guspin; Information modeling; National Emission Inventory; Sift
fLanguage
English
Journal_Title
Computer
Publisher
IEEE
ISSN
0018-9162
Type
jour
DOI
10.1109/MC.2005.406
Filename
1556484