A Model for Term Selection in Text Categorization Problems

Author

Cannas, Laura Maria ; Dessì, Nicoletta ; Dessì, Stefania

Author_Institution

Dipt. di Mat. e Inf., Univ. degli Studi di Cagliari, Cagliari, Italy

fYear

2012

Firstpage

169

Lastpage

173

Abstract

Over the last ten years, automatic Text Categorization (TC) has attracted increasing interest from the research community, driven by the need to organize massive numbers of digital documents. Following a machine learning paradigm, this paper presents a model that regards TC as a classification task supported by a wrapper approach and combines a Genetic Algorithm (GA) with a filter. First, a filter weighs the relevance of terms in documents. Then, the top-ranked terms are grouped into several nested sets of relatively small size. These sets are explored by a GA, which extracts the subset of terms that best categorizes documents. Experimental results on the Reuters-21578 dataset demonstrate the effectiveness of the proposed model and its competitiveness with the learning approaches proposed in the TC literature.
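
The three steps named in the abstract (filter ranking, nested sets of top-ranked terms, GA search over each set) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the filter metric, the classifier, or the GA operators, so the document-frequency difference score, the toy voting classifier, the tournament selection, one-point crossover, bit-flip mutation, subset-size penalty, and all function names and parameter values below are assumptions chosen for brevity. The four tiny documents are made up for the example and are not drawn from Reuters-21578.

```python
import random

def signed_scores(docs, labels):
    """Filter step (assumed metric): document-frequency difference between classes."""
    pos = [d for d, y in zip(docs, labels) if y == 1]
    neg = [d for d, y in zip(docs, labels) if y == 0]
    vocab = sorted(set().union(*docs))
    def df(term, group):
        return sum(term in d for d in group) / max(len(group), 1)
    return {t: df(t, pos) - df(t, neg) for t in vocab}

def nested_sets(scores, sizes):
    """Rank terms by absolute score and return nested top-k lists."""
    ranked = sorted(scores, key=lambda t: (-abs(scores[t]), t))
    return [ranked[:k] for k in sizes]

def accuracy(selected, scores, docs, labels):
    """Wrapper evaluation with a toy classifier: sum the signed scores of the
    selected terms present in a document and predict class 1 if positive."""
    correct = 0
    for doc, y in zip(docs, labels):
        vote = sum(scores[t] for t in selected if t in doc)
        correct += int((vote > 0) == (y == 1))
    return correct / len(docs)

def ga_select(terms, scores, docs, labels, rng,
              pop_size=12, generations=15, p_mut=0.1):
    """Search bit masks over `terms` for a subset with high wrapper accuracy."""
    def decode(mask):
        return [t for t, bit in zip(terms, mask) if bit]
    def fitness(mask):
        # Assumed design choice: a small penalty favours smaller subsets.
        penalty = 0.01 * sum(mask) / len(terms)
        return accuracy(decode(mask), scores, docs, labels) - penalty
    pop = [[rng.randint(0, 1) for _ in terms] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        nxt = [pop[0][:]]  # elitism: carry the best individual over unchanged
        while len(nxt) < pop_size:
            a = max(rng.sample(pop, 2), key=fitness)  # tournament of size 2
            b = max(rng.sample(pop, 2), key=fitness)
            cut = rng.randrange(1, len(terms)) if len(terms) > 1 else 0
            child = a[:cut] + b[cut:]                 # one-point crossover
            child = [bit ^ (rng.random() < p_mut) for bit in child]  # mutation
            nxt.append(child)
        pop = nxt
    return decode(max(pop, key=fitness))

# Made-up documents, each represented as a set of terms; label 1 vs. 0.
docs = [{"oil", "price", "barrel"}, {"oil", "export", "opec"},
        {"wheat", "grain", "export"}, {"grain", "harvest", "price"}]
labels = [1, 1, 0, 0]

rng = random.Random(0)
scores = signed_scores(docs, labels)
candidates = nested_sets(scores, sizes=[2, 4, 6])
results = [ga_select(terms, scores, docs, labels, rng) for terms in candidates]
best_subset = max(results, key=lambda s: accuracy(s, scores, docs, labels))
print(best_subset, accuracy(best_subset, scores, docs, labels))
```

Running the GA on each nested set separately keeps every search space small (here 2, 4, and 6 bits), which matches the abstract's point about sets of relatively small size. A realistic setup would measure fitness on held-out documents rather than on the training documents used here.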

Keywords

genetic algorithms; information filtering; learning (artificial intelligence); natural language processing; pattern classification; text analysis; GA; TC; automatic text categorization problem; best categorize documents; classification task; digital documents; genetic algorithm; machine learning paradigm; natural language documents; term selection; text filter; top-ranked terms; Classification algorithms; Filtering algorithms; Genetic algorithms; Machine learning; Measurement; Support vector machines; Text categorization; genetic algorithm; hybrid model; term selection; text categorization;

fLanguage

English

Publisher

IEEE

Conference_Titel

Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on

Conference_Location

Vienna

ISSN

1529-4188

Print_ISBN

978-1-4673-2621-6

Type

conf

DOI

10.1109/DEXA.2012.41

Filename

6327421