Decomposition of term-document matrix representation for clustering analysis

Author

Yang, Jianxiong ; Watada, Junzo

Author_Institution

Grad. Sch. of Inf., Production & Syst., Waseda Univ., Kitakyushu, Japan

fYear

2011

fDate

27-30 June 2011

Firstpage

976

Lastpage

983

Abstract

Latent Semantic Indexing (LSI) is an information retrieval technique that uses a low-rank singular value decomposition (SVD) of the term-document matrix. The aim of the method is to reduce the dimension of the matrix by finding patterns of co-occurring terms in the document collection. The method is used to calculate term-document weights in the vector space model (VSM) for document clustering with a fuzzy clustering algorithm. LSI is an attempt to exploit the underlying semantic structure of word usage in documents. During the query-matching phase of LSI, a user's query is first projected into the term-document space and then compared to all terms and documents represented in the vector space. Using a similarity measure, the nearest (most relevant) terms and documents are identified and returned to the user. The current LSI query-matching method requires computing the similarity between the query and every term and document in the vector space. In this paper, the Maximal Tree Algorithm is used within a recent LSI implementation to reduce the computation time and computational complexity of query matching. The Maximal Tree data structure stores the term and document vectors in such a way that only the terms and documents most likely to qualify as nearest neighbors of the query are examined and retrieved. In short, this novel algorithm is suitable for improving the accuracy of data miners.
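
The baseline LSI pipeline summarized in the abstract (low-rank SVD, projection of the query, exhaustive similarity comparison) can be illustrated with a minimal NumPy sketch. The toy matrix, the rank k = 2, and the function names `lsi_fit`, `fold_in_query`, and `cosine_rank` are assumptions made for illustration and are not taken from the paper. The sketch shows only the brute-force matching step that the abstract describes as costly; it does not implement the Maximal Tree Algorithm.

```python
import numpy as np

def lsi_fit(A, k):
    """Rank-k truncated SVD of a term-document matrix A (terms x documents)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep the k largest singular triplets: U_k, singular values, V_k (docs x k).
    return U[:, :k], s[:k], Vt[:k, :].T

def fold_in_query(q, U_k, s_k):
    """Project a term-space query into the latent space: q^T U_k S_k^{-1}."""
    return (q @ U_k) / s_k

def cosine_rank(q_hat, V_k):
    """Brute-force cosine similarity of the query against every document."""
    norms = np.linalg.norm(V_k, axis=1) * np.linalg.norm(q_hat)
    sims = (V_k @ q_hat) / np.where(norms == 0, 1.0, norms)
    return np.argsort(-sims), sims

# Hypothetical toy collection: 4 terms x 4 documents.
# Documents 0-1 use terms 0-1; documents 2-3 use terms 2-3.
A = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
])

U_k, s_k, V_k = lsi_fit(A, k=2)

# Query containing only term 0.
q = np.array([1.0, 0.0, 0.0, 0.0])
q_hat = fold_in_query(q, U_k, s_k)

# Every document is scored, which is the exhaustive step the paper aims to avoid.
order, sims = cosine_rank(q_hat, V_k)
print("documents ranked by similarity:", order.tolist())
```

In this toy case the two documents that share vocabulary with the query rank ahead of the other two. An index structure such as the one the paper proposes would aim to reach the same neighbors without scoring every vector.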

Keywords

computational complexity; data mining; fuzzy set theory; indexing; pattern clustering; query processing; singular value decomposition; tree data structures; word processing; LSI query matching method; LSI technique; SVD; clustering analysis; document clustering; fuzzy clustering algorithm; information retrieval technique; latent semantic indexing; low-rank singular value decomposition; matrix dimension reduction; maximal tree algorithm; maximal tree data structure; nearest term identification; similarity measures; term-document matrix decomposition; term-document matrix representation; user query; vector space model; word usage; Accidents; Economics; Large scale integration; Matrix decomposition; Semantics; Fuzzy clustering; LSI

fLanguage

English

Publisher

IEEE

Conference_Title

Fuzzy Systems (FUZZ), 2011 IEEE International Conference on

Conference_Location

Taipei

ISSN

1098-7584

Print_ISBN

978-1-4244-7315-1

Electronic_ISBN

1098-7584

Type

conf

DOI

10.1109/FUZZY.2011.6007525

Filename

6007525