Title :
Diffusion Maps: A Superior Semantic Method to Improve Similarity Join Performance
Author :
Hawashin, Bilal ; Fotouhi, Farshad ; Grosky, William
Author_Institution :
Dept. of Comput. Sci., Wayne State Univ., Detroit, MI, USA
Abstract :
This paper applies the diffusion maps method to joining long string values, such as paper abstracts, movie summaries, product descriptions, and user feedback, in order to improve the performance of existing similarity join methods. We show that using long string attributes to detect similar records significantly improves overall similarity join performance. Most databases include long string attributes, yet existing similarity join methods are not efficient at finding similarity among the values of these attributes. We compare multiple methods according to their ability to join long string values semantically.
Keywords :
eigenvalues and eigenfunctions; learning (artificial intelligence); records management; string matching; SoftTfIdf; cosine similarity; diffusion maps; eigenvectors; latent semantic indexing; semantic method; similar records; similarity join methods; similarity join performance; string values; long attributes; semantic similarity join method; supervised learning
Conference_Title :
Data Mining Workshops (ICDMW), 2010 IEEE International Conference on
Conference_Location :
Sydney, NSW
Print_ISBN :
978-1-4244-9244-2
Electronic_ISBN :
978-0-7695-4257-7
DOI :
10.1109/ICDMW.2010.77