Latent Ontological Feature Discovery for Text Clustering

Author

Duong, V.T.T.; Cao, Tru H.; Chau, Cuong K.; Quan, Tho T.

Author_Institution

Fac. of Inf. Technol. & Appl. Math., Ton Duc Thang Univ., Ho Chi Minh City, Vietnam

fYear

2009

fDate

13-17 July 2009

Firstpage

1

Lastpage

8

Abstract

The content of a text is mainly defined by the keywords and named entities occurring in it. For news articles in particular, named entities are usually important in defining the semantics. However, named entities have ontological features, namely their aliases, types, and identifiers, which are hidden from their textual appearance. In this paper, we explore weighted combinations of those latent named entity features with keywords for text clustering. To that end, the traditional vector space model is adapted with multiple vectors defined over spaces of entity names, types, name-type pairs, identifiers, and keywords. Clustering quality is evaluated with two types of measures: self purity-separation measures and relative comparison measures. Hard and fuzzy clustering experiments with the proposed model are conducted and evaluated on selected data subsets of Reuters-21578.
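
The multi-vector idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of raw term counts (rather than whatever term weighting the paper applies), the cosine measure, and the example weights are all assumptions made for illustration. Each document carries one bag of features per space, and the overall similarity is a weighted sum of the per-space similarities.

```python
import math
from collections import Counter

# Feature spaces named in the abstract; the dictionary keys are hypothetical labels.
SPACES = ("name", "type", "name_type", "identifier", "keyword")


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counter objects)."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)


def combined_similarity(doc_a, doc_b, weights):
    """Weighted sum of per-space cosine similarities.

    doc_a, doc_b: dicts mapping a space label to a list of features.
    weights: dict mapping a space label to a non-negative weight
             (assumed here to sum to 1, which is an illustrative choice).
    """
    total = 0.0
    for space, weight in weights.items():
        vec_a = Counter(doc_a.get(space, []))
        vec_b = Counter(doc_b.get(space, []))
        total += weight * cosine(vec_a, vec_b)
    return total


# Hypothetical documents: the entity identifiers and types are made up.
doc_1 = {
    "keyword": ["oil", "price", "rise"],
    "type": ["Organization"],
    "identifier": ["ORG_001"],
}
doc_2 = {
    "keyword": ["oil", "export", "fall"],
    "type": ["Organization"],
    "identifier": ["ORG_001"],
}
weights = {"keyword": 0.5, "type": 0.2, "identifier": 0.3}
similarity = combined_similarity(doc_1, doc_2, weights)
```

A similarity of this kind could then be passed to any hard or fuzzy clustering procedure; varying the weights shifts the emphasis between keywords and the latent entity features.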

Keywords

fuzzy set theory; pattern clustering; text analysis; Reuters-21578; clustering quality; fuzzy clustering; hard clustering; keyword; latent named entity feature; latent ontological feature discovery; news article; relative comparison type; self purity-separation type; semantics; text clustering; vector space model; Cities and towns; Clustering algorithms; Computer science; Entropy; Information retrieval; Information technology; Labeling; Mathematics; Ontologies; Vectors

fLanguage

English

Publisher

IEEE

Conference_Titel

Computing and Communication Technologies, 2009. RIVF '09. International Conference on

Conference_Location

Da Nang

Print_ISBN

978-1-4244-4566-0

Electronic_ISBN

978-1-4244-4568-4

Type

conf

DOI

10.1109/RIVF.2009.5174647

Filename

5174647