DocumentCode
2865875
Title
Categorization and keyword identification of unlabeled documents
Author
Kang, Ning ; Domeniconi, Carlotta ; Barbara, Daniel
Author_Institution
Dept. of ISE, George Mason Univ., Fairfax, VA, USA
fYear
2005
fDate
27-30 Nov. 2005
Abstract
In this paper, we first propose a global unsupervised feature selection approach for text, based on frequent itemset mining. As a result, each document is represented as a set of words that co-occur frequently in the given corpus of documents. We then introduce a locally adaptive clustering algorithm, designed to estimate (local) word relevance and, simultaneously, to group the documents. We present experimental results to demonstrate the feasibility of our approach. Furthermore, the analysis of the weights assigned to terms provides evidence that the identified keywords can guide the process of label assignment to clusters. We consider both spam email filtering and general classification datasets. Our analysis of the distribution of weights in the two cases provides insights into how the spam problem differs from the general classification case.
Keywords
classification; data mining; feature extraction; information filtering; pattern clustering; text analysis; unsolicited e-mail; adaptive clustering; frequent itemset mining; general classification dataset; global unsupervised feature selection; keyword identification; label assignment; spam email filtering; text mining; unlabeled document categorization; word relevance; Algorithm design and analysis; Clustering algorithms; Data analysis; Data mining; Dictionaries; Filtering; Functional analysis; Indexing; Itemsets; Predictive models
fLanguage
English
Publisher
IEEE
Conference_Titel
Data Mining, Fifth IEEE International Conference on
ISSN
1550-4786
Print_ISBN
0-7695-2278-5
Type
conf
DOI
10.1109/ICDM.2005.39
Filename
1565755