Abstract :
In most cases, unique identifiers are required to join data from different databases. If global unique keys are absent or corrupted, supplementing records with data extracted from other sources becomes difficult. The main question is: does a given record relate to an entity that is identical to the entity corresponding to another record, or not? This leads to a classification problem with at least two classes: identical and not identical. Classifying pairs of records requires a three-step procedure. First, suitable common properties (attributes) of the data are defined for all sources. Second, to allow comparisons, the values of the records are transformed to these common properties. Finally, the classification is performed on an almost finite subset, the range of an appropriate comparison function. Different classification techniques can be applied, such as association rules, classification trees, neural networks, or record linkage techniques. The unknown parameters of the classification rules are computed by sampling and supervised learning. Unbiased error rates can be estimated, for instance, by cross-validation. Special attention must be paid to controlling the computational complexity of the identification process. The approach is illustrated with data from two library databases and from the planned German administrative record census, which will become a substitute for a regular census.
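The three-step procedure can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the method used in the paper: the two source schemas, the field names, the exact-match comparison function, and the majority-vote rule per comparison pattern are all hypothetical choices made for the example.

```python
from collections import Counter, defaultdict

# Step 1: common attributes shared by all sources (hypothetical choice).
COMMON_ATTRIBUTES = ("title", "author", "year")


def normalize(text):
    """Lower-case, drop punctuation, and collapse whitespace."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


# Step 2: transform each source's records to the common attributes.
def from_source_a(rec):
    # Hypothetical schema A: {"Title": ..., "Author": "Last, First", "Year": 2000}
    last_name = rec["Author"].split(",")[0]
    return {
        "title": normalize(rec["Title"]),
        "author": normalize(last_name),
        "year": str(rec["Year"]),
    }


def from_source_b(rec):
    # Hypothetical schema B: {"ti": ..., "au": "First Last", "py": "2000"}
    last_name = rec["au"].split()[-1]
    return {
        "title": normalize(rec["ti"]),
        "author": normalize(last_name),
        "year": str(rec["py"]),
    }


# Step 3a: comparison function; its range is the finite set {0, 1}^3.
def compare(a, b):
    return tuple(int(a[key] == b[key]) for key in COMMON_ATTRIBUTES)


# Step 3b: supervised learning on a labelled sample of record pairs.
def fit(labelled_pairs):
    """Map each observed comparison pattern to its majority class."""
    counts = defaultdict(Counter)
    for a, b, is_identical in labelled_pairs:
        counts[compare(a, b)][is_identical] += 1
    return {pattern: c[True] > c[False] for pattern, c in counts.items()}


def classify(rule, a, b):
    # Assumption: patterns never seen in the sample default to "not identical".
    return rule.get(compare(a, b), False)
```

Because the classifier only sees the comparison pattern, the rule table has at most 2^3 entries here, whatever the number of records. Error rates for such a rule could be estimated by cross-validation over the labelled sample, and in practice candidate pairs would need to be restricted in some way, since comparing every record of one source with every record of the other grows with the product of the two database sizes.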
Keywords :
classification; data analysis; database management systems; digital libraries; statistical analysis; Association Rules; German administrative record census; classification tree; computing complexity; data classification; data identification; data integration; data processing; data supplement; library database; neural network; record linkage; statistical method; statistics; unique identifier; unique key