Title of article :
Estimating the Parameters for Linking Unstandardized References with the Matrix Comparator
Author/Authors :
Al-Sarkhi, Awaad (University of Arkansas at Little Rock); Talburt, John R. (University of Arkansas at Little Rock)
From page :
12
To page :
26
Abstract :
This paper discusses recent research on methods for estimating configuration parameters for the Matrix Comparator, which is used to link unstandardized or heterogeneously standardized references. The Matrix Comparator computes the aggregate similarity between the tokens (words) in a pair of references. Its two most critical parameters for obtaining the best linking results are the value of the similarity threshold and the list of stop words to exclude from the comparison. Earlier research has shown that the standard deviation of the token frequency distribution is strongly predictive of how useful stop words will be in improving linking performance. The results presented here demonstrate a method that uses statistics from the token frequency distribution to estimate the threshold value and the stop word selection likely to give the best linking results. The model was built using linear regression and validated on independent datasets.
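The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' Matrix Comparator or their regression model: the function names, the whitespace tokenizer, the character-level similarity from `difflib`, the greedy row-maximum scoring, and the "most frequent tokens" stop word rule are all assumptions made for illustration only.

```python
from collections import Counter
from difflib import SequenceMatcher
from statistics import pstdev


def tokenize(reference):
    """Split a reference into lowercase whitespace-delimited tokens (assumed tokenizer)."""
    return reference.lower().split()


def token_frequency_stats(references):
    """Return token counts over all references and the standard deviation of those counts."""
    counts = Counter(tok for ref in references for tok in tokenize(ref))
    return counts, pstdev(list(counts.values()))


def pick_stop_words(counts, top_n):
    """Hypothetical rule: treat the top_n most frequent tokens as stop words."""
    return {tok for tok, _ in counts.most_common(top_n)}


def aggregate_similarity(ref_a, ref_b, stop_words=frozenset()):
    """Simplified token-matrix score in [0, 1]; a stand-in, not the published comparator."""
    a = [t for t in tokenize(ref_a) if t not in stop_words]
    b = [t for t in tokenize(ref_b) if t not in stop_words]
    if not a or not b:
        return 0.0
    # Pairwise token similarity matrix: rows are tokens of a, columns are tokens of b.
    matrix = [[SequenceMatcher(None, x, y).ratio() for y in b] for x in a]
    # Greedy simplification: take each row's best match, normalize by the longer token list.
    return sum(max(row) for row in matrix) / max(len(a), len(b))


def is_link(ref_a, ref_b, threshold, stop_words=frozenset()):
    """Link two references when their aggregate similarity reaches the threshold."""
    return aggregate_similarity(ref_a, ref_b, stop_words) >= threshold


if __name__ == "__main__":
    refs = [
        "john a smith 123 oak st",
        "smith john 123 oak street",
        "mary jones 9 elm st",
    ]
    counts, sd = token_frequency_stats(refs)
    stops = pick_stop_words(counts, top_n=1)
    print("std dev of token frequencies:", sd)
    print("stop words:", stops)
    print("similarity:", aggregate_similarity(refs[0], refs[1], stops))
```

In this sketch the threshold and the number of stop words are passed in by hand; the paper's contribution, as the abstract describes it, is to estimate those settings from token frequency statistics with a regression model, which is not reproduced here.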
Keywords :
Entity resolution, Record linking, Matrix comparator, Stop words, Token frequency, F-measure
Journal title :
Journal of Information Technology Management (JITM)
Record number :
2510185