DocumentCode :
1688784
Title :
15th International Conference on Scientific and Statistical Database Management. SSDBM 2003
fYear :
2003
Abstract :
In most cases, unique identifiers are required to join data from different databases. If globally unique keys are absent or corrupted, supplementing data with records extracted from other sources becomes difficult. The main question is: does a given record relate to an entity that is identical to the entity corresponding to another record, or not? This leads to a classification problem with at least two classes: identical and not identical. Classifying pairs of records requires a three-step procedure. First, suitable common properties (attributes) of the data are defined for all sources. Second, to allow comparisons, the values of the records are transformed to these common properties. Finally, the classification is performed on an almost finite subset, the range of an appropriate comparison function. Different classification techniques can be applied, such as association rules, classification trees, neural networks, or record linkage techniques. The unknown parameters of the classification rules are computed by sampling and supervised learning. Unbiased error rates can be estimated, for instance, by cross-validation. Special attention must be paid to controlling the computational complexity of the identification process. The approach is illustrated with data from two library databases and from the planned German administrative record census, which will become a substitute for a regular census.
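The three-step procedure in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' method: the attribute names, the exact-match comparison function, and the majority-vote rule per comparison pattern are all assumptions chosen to keep the example small, and the range of the comparison function is assumed to be finite.

```python
from collections import Counter, defaultdict

# Step 1 (assumed): common attributes shared by both sources.
COMMON_ATTRIBUTES = ("title", "author", "year")


def normalize(record):
    """Step 2: transform a source record to the common attributes.

    Values are lower-cased and whitespace is collapsed so that
    records from different sources become comparable.
    """
    return {
        name: " ".join(str(record.get(name, "")).lower().split())
        for name in COMMON_ATTRIBUTES
    }


def compare(rec_a, rec_b):
    """Comparison function: one agreement flag (0 or 1) per attribute.

    Its range holds at most 2 ** len(COMMON_ATTRIBUTES) patterns.
    """
    a, b = normalize(rec_a), normalize(rec_b)
    return tuple(
        int(a[name] != "" and a[name] == b[name])
        for name in COMMON_ATTRIBUTES
    )


def fit(labeled_pairs):
    """Step 3: learn one majority-vote label per comparison pattern
    from a labeled sample of (record, record, is_identical) triples."""
    counts = defaultdict(Counter)
    for rec_a, rec_b, is_identical in labeled_pairs:
        counts[compare(rec_a, rec_b)][is_identical] += 1
    return {p: c.most_common(1)[0][0] for p, c in counts.items()}


def classify(rule, rec_a, rec_b, default=False):
    """Label a pair; patterns unseen in training fall back to `default`."""
    return rule.get(compare(rec_a, rec_b), default)


# Tiny hypothetical training sample (invented for illustration).
sample = [
    ({"title": "Data Mining", "author": "Han", "year": 2000},
     {"title": "data  mining", "author": "HAN", "year": "2000"}, True),
    ({"title": "Data Mining", "author": "Han", "year": 2000},
     {"title": "Databases", "author": "Date", "year": 2000}, False),
]
rule = fit(sample)
```

Because the rule is a lookup table over comparison patterns, classifying a pair costs one comparison and one dictionary access. The number of candidate pairs still grows with the product of the source sizes, which is where the complexity concern raised in the abstract comes in.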
Keywords :
classification; data analysis; database management systems; digital libraries; statistical analysis; Association Rules; German administrative record census; classification tree; computing complexity; data classification; data identification; data integration; data processing; data supplement; library database; neural network; record linkage; statistical method; statistics; unique identifier; unique key
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Scientific and Statistical Database Management, 2003. 15th International Conference on
Conference_Location :
Cambridge, MA, USA
ISSN :
1099-3371
Print_ISBN :
0-7695-1964-4
Type :
conf
DOI :
10.1109/SSDM.2003.1214940
Filename :
1214940