Title :
Latent semantic indexing and large dataset: Study of term-weighting schemes
Author :
Zaman, A.N.K. ; Brown, Charles Grant
Author_Institution :
Comput. Sci. Program, Univ. of Northern British Columbia (UNBC), Prince George, BC, Canada
Abstract :
The primary purpose of an information retrieval (IR) system is to retrieve all documents relevant to the user's query. The ad hoc document retrieval task based on Latent Semantic Indexing/Analysis (LSI/LSA) investigates the performance of retrieval systems that search a static set of documents using new questions. Others have tested the performance of LSI on several smaller datasets (e.g., MED, CISI abstracts); however, LSI has not been tested on a large dataset, so we tested it on a very large one, the TREC-8 LA Times dataset. We applied three different term-weighting schemes and our own stop word list to judge the performance. Recall-precision graphs and the Coefficient of Variation (CV) were used to evaluate the retrieval performance of the LSI-based retrieval system. We found that the tf-idf term-weighting scheme performs better than the log-entropy and raw term frequency weighting schemes when the test collection becomes very large.
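The abstract names three term-weighting schemes, LSI, and the coefficient of variation without giving formulas. The following is a minimal sketch of how they might be computed, not the authors' implementation. It assumes common textbook variants: idf = log(N/df), log-entropy = log(1 + tf) times (1 + sum of p log p over log N), a dense SVD truncated to rank k, and CV = standard deviation over mean. All function names and the toy matrix are hypothetical, and a dense SVD is only practical for toy data, not for a collection the size of TREC-8 LA Times.

```python
import numpy as np

# Rows are terms, columns are documents (assumed layout).

def raw_tf(counts):
    """Raw term frequency: the counts themselves."""
    return np.asarray(counts, dtype=float)

def tf_idf(counts):
    """tf * log(N / df), one common tf-idf variant (assumption)."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[1]
    df = np.count_nonzero(counts, axis=1)
    idf = np.log(n_docs / np.maximum(df, 1))
    return counts * idf[:, None]

def log_entropy(counts):
    """log(1 + tf) * (1 + sum_j p_ij log p_ij / log N); needs N > 1."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[1]
    gf = counts.sum(axis=1, keepdims=True)  # global frequency per term
    p = np.divide(counts, gf, out=np.zeros_like(counts), where=gf > 0)
    safe_p = np.where(p > 0, p, 1.0)        # log(1) = 0 for empty cells
    global_w = 1.0 + (p * np.log(safe_p)).sum(axis=1) / np.log(n_docs)
    return np.log1p(counts) * global_w[:, None]

def lsi_scores(weighted, query, k):
    """Cosine similarity of a query to each document in a rank-k LSI space."""
    u, s, vt = np.linalg.svd(weighted, full_matrices=False)
    docs_k = (s[:k, None] * vt[:k, :]).T    # one row per document
    query_k = np.asarray(query, dtype=float) @ u[:, :k]
    norms = np.linalg.norm(docs_k, axis=1) * np.linalg.norm(query_k)
    return (docs_k @ query_k) / np.where(norms > 0, norms, 1.0)

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    values = np.asarray(values, dtype=float)
    return values.std() / values.mean()

if __name__ == "__main__":
    # Hypothetical toy counts: terms 0 and 2 co-occur in documents 0 and 2.
    counts = np.array([[2, 0, 1, 0],
                       [0, 3, 0, 1],
                       [1, 0, 2, 0],
                       [0, 1, 0, 2]])
    query = np.array([1, 0, 1, 0])
    for scheme in (raw_tf, tf_idf, log_entropy):
        scores = lsi_scores(scheme(counts), query, k=2)
        print(scheme.__name__, np.round(scores, 3))
```

Which values the paper computes the CV over is not stated in the abstract, so the helper simply accepts any sequence of numbers, for example precision values at fixed recall levels.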
Keywords :
database indexing; query processing; very large databases; TREC-8 LA Times dataset; ad hoc document retrieval task; coefficient of variation; information retrieval system; latent semantic indexing; recall-precision graph; tf-idf term weighting scheme; user query; very large dataset; Artificial neural networks; Decision support systems; recall-precision; retrieval performance; term-weighting
Conference_Title :
Digital Information Management (ICDIM), 2010 Fifth International Conference on
Conference_Location :
Thunder Bay, ON
Print_ISBN :
978-1-4244-7572-8
DOI :
10.1109/ICDIM.2010.5664669