A method for calculating term similarity on large document collections

Author

Bein, Wolfgang W. ; Coombs, Jeffrey S. ; Taghva, Kazem

Author_Institution

School of Computer Science, University of Nevada, Las Vegas, NV, USA

fYear

2003

fDate

28-30 April 2003

Firstpage

199

Lastpage

203

Abstract

We present an efficient algorithm, the Quadtree Heuristic, for identifying a list of similar terms for each unique term in a large document collection. Term similarity is defined using the expected mutual information measure (EMIM). Since our aim in defining the similarity lists is to improve information retrieval (IR), we report an experiment that measures the performance of an IR engine designed to use the similarity lists. Two methods were used to generate the lists: a brute-force technique and the Quadtree Heuristic. The performance of the list generated by the Quadtree Heuristic was commensurate with that of the brute-force list.
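
As an illustration of the measure named in the abstract, the following is a minimal sketch of EMIM computed from document-level co-occurrence counts, together with a naive all-pairs (brute-force) construction of similarity lists. It assumes the common textbook form of EMIM over the four presence/absence combinations of two terms; the exact variant used in the paper, and the Quadtree Heuristic itself, are not described in this record and are not reproduced here. The function names `emim` and `brute_force_similarity_lists` and the parameter `top_k` are hypothetical.

```python
import math
from itertools import combinations


def emim(n11, n1, n2, n):
    """EMIM between two terms from document counts (assumed textbook form).

    n11: documents containing both terms
    n1:  documents containing the first term
    n2:  documents containing the second term
    n:   total number of documents
    """
    # Each cell is (joint count, marginal count for term 1, marginal count for term 2)
    # for one presence/absence combination of the two terms.
    cells = [
        (n11, n1, n2),                        # both present
        (n1 - n11, n1, n - n2),               # first present, second absent
        (n2 - n11, n - n1, n2),               # first absent, second present
        (n - n1 - n2 + n11, n - n1, n - n2),  # both absent
    ]
    total = 0.0
    for joint, row, col in cells:
        if joint > 0:  # empty cells contribute nothing to the sum
            total += (joint / n) * math.log((joint * n) / (row * col))
    return total


def brute_force_similarity_lists(docs, top_k=5):
    """Score every pair of terms and keep the top_k highest-scoring partners per term."""
    n = len(docs)
    postings = {}
    for doc_id, terms in enumerate(docs):
        for term in set(terms):
            postings.setdefault(term, set()).add(doc_id)

    scores = {term: [] for term in postings}
    for a, b in combinations(sorted(postings), 2):
        s = emim(len(postings[a] & postings[b]), len(postings[a]), len(postings[b]), n)
        scores[a].append((s, b))
        scores[b].append((s, a))

    # Sort by descending score, breaking ties alphabetically.
    return {
        term: [other for _, other in sorted(pairs, key=lambda p: (-p[0], p[1]))[:top_k]]
        for term, pairs in scores.items()
    }
```

The all-pairs loop grows quadratically with the number of unique terms, which is the cost a heuristic for large collections would aim to avoid. Note also that this form of EMIM measures statistical dependence in either direction, so two terms that never co-occur can also receive a high score.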

Keywords

information retrieval; quadtrees; EMIM; Expected Mutual Information Measure; IR engine; Quadtree Heuristic; brute-force technique; large document collections; similarity lists; term similarity; Computer science; Engines; Image retrieval; Information science; Magnetic fields; Mutual information; Optical character recognition software; Performance evaluation; Testing

fLanguage

English

Publisher

IEEE

Conference_Titel

Information Technology: Coding and Computing [Computers and Communications], 2003. Proceedings. ITCC 2003. International Conference on

Print_ISBN

0-7695-1916-4

Type

conf

DOI

10.1109/ITCC.2003.1197526

Filename

1197526