• DocumentCode
    2190625
  • Title

    Diffusion Maps: A Superior Semantic Method to Improve Similarity Join Performance

  • Author

    Hawashin, Bilal ; Fotouhi, Farshad ; Grosky, William

  • Author_Institution
    Dept. of Comput. Sci., Wayne State Univ., Detroit, MI, USA
  • fYear
    2010
  • fDate
    13 Dec. 2010
  • Firstpage
    9
  • Lastpage
    16
  • Abstract
    This paper adopts the diffusion maps method for joining long string values, such as paper abstracts, movie summaries, product descriptions, and user feedback, to improve the performance of existing similarity join methods. We show that using attributes with long string values to detect similar records can significantly improve overall similarity join performance. Most databases include attributes with long string values, and existing similarity join methods are not efficient at finding similarity among the values of these long attributes. Multiple methods are compared according to their ability to join long string values semantically.
  • Keywords
    eigenvalues and eigenfunctions; learning (artificial intelligence); records management; string matching; SoftTfIdf; cosine similarity; diffusion maps; eigenvectors; latent semantic indexing; semantic method; similar records; similarity join methods; similarity join performance; string values; long attributes; semantic similarity join method; supervised learning
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining Workshops (ICDMW), 2010 IEEE International Conference on
  • Conference_Location
    Sydney, NSW
  • Print_ISBN
    978-1-4244-9244-2
  • Electronic_ISBN
    978-0-7695-4257-7
  • Type

    conf

  • DOI
    10.1109/ICDMW.2010.77
  • Filename
    5693276