Document Clustering via Matrix Representation

Author

Wang, Xufei ; Tang, Jiliang ; Liu, Huan

Author_Institution

Arizona State Univ., Tempe, AZ, USA

fYear

2011

fDate

11-14 Dec. 2011

Firstpage

804

Lastpage

813

Abstract

The Vector Space Model (VSM) is widely used to represent documents and web pages. It is simple and computationally easy to work with, but it oversimplifies a document into a single vector, is susceptible to noise, and cannot explicitly represent the underlying topics of a document. This paper proposes a matrix representation of documents: rows represent distinct terms and columns represent cohesive segments. The matrix model views a document as a set of segments, and each segment is a probability distribution over a limited number of latent topics which can be mapped to clustering structures. Latent topic extraction based on the matrix representation of documents is formulated as a constraint optimization problem in which each matrix (i.e., a document) A_i is factorized into a common base determined by non-negative matrices L and R^T, and a non-negative weight matrix M_i, such that the total reconstruction error over all documents is minimized. Empirical evaluation demonstrates that it is feasible to use the matrix model for document clustering: (1) compared with the vector representation, the matrix representation consistently improves clustering quality, and the proposed approach achieves a relative accuracy improvement of up to 66% on the studied datasets, and (2) the proposed method outperforms baseline methods such as k-means and NMF, and complements state-of-the-art methods such as LDA and PLSI. Furthermore, the proposed matrix model allows more refined information retrieval at the segment level instead of the document level, which enables the return of more relevant documents in information retrieval tasks.
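
The factorization described in the abstract can be written out as an optimization problem. The following is only a sketch under stated assumptions, not the paper's exact formulation: the abstract does not say how reconstruction error is measured, so the squared Frobenius norm is assumed here, and the dimensions m (distinct terms), n (segments per document), k_1 and k_2 (sizes of the latent bases), and N (number of documents) are hypothetical names introduced for illustration.

```latex
% Assumed shapes (not stated in the abstract):
%   A_i \in \mathbb{R}_{\ge 0}^{m \times n}   term-by-segment matrix of document i
%   L   \in \mathbb{R}_{\ge 0}^{m \times k_1} common left base shared by all documents
%   R   \in \mathbb{R}_{\ge 0}^{n \times k_2} common right base shared by all documents
%   M_i \in \mathbb{R}_{\ge 0}^{k_1 \times k_2} weight matrix specific to document i
% Assumed error measure: squared Frobenius norm.
\min_{L,\; R,\; M_1, \dots, M_N}\;
  \sum_{i=1}^{N} \left\lVert A_i - L\, M_i\, R^{T} \right\rVert_F^{2}
\quad \text{subject to} \quad
  L \ge 0,\; R \ge 0,\; M_i \ge 0 \;\; (i = 1, \dots, N)
```

Under this reading, L and R^T are shared across the collection, so each document is characterized only by its small weight matrix M_i, which is the per-document quantity that clustering could operate on.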

Keywords

Web sites; constraint handling; document handling; information retrieval; matrix algebra; optimisation; pattern clustering; probability; LDA; NMF; PLSI; Web pages; clustering structures; cohesive segments; constraint optimization problem; document clustering; information retrieval tasks; k-means; matrix model; matrix representation; probability distribution; vector space model; Approximation methods; Bismuth; Clustering algorithms; Data mining; Matrix decomposition; Probability distribution; Vectors; Document Clustering; Document Representation; Matrix Representation; Non-Negative Matrix Approximation

fLanguage

English

Publisher

IEEE

Conference_Titel

Data Mining (ICDM), 2011 IEEE 11th International Conference on

Conference_Location

Vancouver, BC

ISSN

1550-4786

Print_ISBN

978-1-4577-2075-8

Type

conf

DOI

10.1109/ICDM.2011.59

Filename

6137285