• DocumentCode
    3433706
  • Title

    A method for calculating term similarity on large document collections

  • Author

    Bein, Wolfgang W. ; Coombs, Jeffrey S. ; Taghva, Kazem

  • Author_Institution
    Sch. of Comput. Sci., Nevada Univ., Las Vegas, NV, USA
  • fYear
    2003
  • fDate
    28-30 April 2003
  • Firstpage
    199
  • Lastpage
    203
  • Abstract
    We present an efficient algorithm, the Quadtree Heuristic, for identifying a list of similar terms for each unique term in a large document collection. Term similarity is defined using the expected mutual information measure (EMIM). Because the similarity lists are intended to improve information retrieval (IR), we report an experiment comparing the performance of an IR engine designed to use them. Two methods were used to generate the similarity lists: a brute-force technique and the Quadtree Heuristic. The list generated by the Quadtree Heuristic performed comparably to the brute-force list.
  • Keywords
    information retrieval; quadtrees; EMIM; Expected Mutual Information Measure; IR engine; Quadtree Heuristic; brute force technique; large document collection; large document collections; similarity lists; term similarity; Computer science; Engines; Image retrieval; Information science; Magnetic fields; Mutual information; Optical character recognition software; Performance evaluation; Testing
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Information Technology: Coding and Computing [Computers and Communications], 2003. Proceedings. ITCC 2003. International Conference on
  • Print_ISBN
    0-7695-1916-4
  • Type

    conf

  • DOI
    10.1109/ITCC.2003.1197526
  • Filename
    1197526
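
The abstract defines term similarity through EMIM and names a brute-force technique as the baseline for building similarity lists. The sketch below illustrates that baseline only, under stated assumptions: it uses the standard textbook form of EMIM over binary term-occurrence variables (the sum over the four cells of a 2x2 contingency table of P(x, y) log(P(x, y) / (P(x) P(y)))), which may differ from the exact variant in the paper. The function names `emim` and `similarity_lists`, the parameter `k`, and the representation of documents as sets of terms are all hypothetical. The Quadtree Heuristic is not described in this record, so it is not reproduced here.

```python
import math
from itertools import combinations


def emim(docs, a, b):
    """EMIM between the occurrence of terms a and b.

    docs is a list of sets of terms (an assumed representation).
    Each term is treated as a binary variable: present or absent in a document.
    """
    n = len(docs)
    n_a = sum(1 for d in docs if a in d)
    n_b = sum(1 for d in docs if b in d)
    n_ab = sum(1 for d in docs if a in d and b in d)

    # Document counts for the four cells of the 2x2 contingency table.
    cells = {
        (1, 1): n_ab,
        (1, 0): n_a - n_ab,
        (0, 1): n_b - n_ab,
        (0, 0): n - n_a - n_b + n_ab,
    }
    marg_a = {1: n_a, 0: n - n_a}
    marg_b = {1: n_b, 0: n - n_b}

    total = 0.0
    for (i, j), count in cells.items():
        if count == 0:
            continue  # the limit of p * log(p) as p -> 0 is 0
        p_joint = count / n
        p_indep = (marg_a[i] / n) * (marg_b[j] / n)
        total += p_joint * math.log(p_joint / p_indep)
    return total


def similarity_lists(docs, k=3):
    """Brute force: score every term pair, keep the k best per term."""
    vocab = sorted(set().union(*docs))
    scored = {t: [] for t in vocab}
    for a, b in combinations(vocab, 2):
        s = emim(docs, a, b)
        scored[a].append((s, b))
        scored[b].append((s, a))
    # Sort by descending score; break ties alphabetically for determinism.
    return {
        t: [u for _, u in sorted(pairs, key=lambda p: (-p[0], p[1]))[:k]]
        for t, pairs in scored.items()
    }
```

The brute-force pass scores every pair of vocabulary terms, so its cost grows quadratically with vocabulary size, which is the motivation for a cheaper heuristic on large collections. Note that EMIM measures dependence in either direction: terms that systematically avoid each other also score high, so a practical system may want to filter on co-occurrence as well.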