DocumentCode :
1070793
Title :
A Communication Perspective on Automatic Text Categorization
Author :
Capdevila, Marta ; Florez, O.W.M.
Author_Institution :
Signal & Commun. Process. Dept., Univ. of Vigo, Vigo
Volume :
21
Issue :
7
fYear :
2009
fDate :
7/1/2009
Firstpage :
1027
Lastpage :
1041
Abstract :
The basic concern of a Communication System is to transfer information from its source to a destination some distance away. Textual documents also deal with the transmission of information. In particular, from the point of view of a text categorization system, the information encoded by a document is the topic or category it belongs to. Following this initial intuition, a theoretical framework is developed in which Automatic Text Categorization (ATC) is studied from a Communication System perspective. Under this approach, the problematic dimensionality reduction of the indexing feature space is tackled by a two-level supervised scheme, implemented as a filtering of noisy terms followed by a compression of redundant terms. Gaussian probabilistic categorizers are revisited and adapted to the sparsity that accompanies ATC. Experimental results on the 20 Newsgroups and Reuters-21578 collections validate the theoretical approaches. The noise filter and redundancy compressor allow an aggressive term vocabulary reduction (reduction factor greater than 0.99) with a minimal loss (lower than 3 percent) and, in some cases, a gain (greater than 4 percent) in final classification accuracy. The adapted Gaussian Naive Bayes classifier reaches classification results similar to those obtained by state-of-the-art Multinomial Naive Bayes (MNB) and Support Vector Machines (SVMs).
Keywords :
Gaussian processes; data reduction; indexing; probability; text analysis; Gaussian Naive Bayes classifier; Gaussian probabilistic categorization; automatic text categorization; communication perspective; dimensionality reduction; information encoding; problematic indexing feature space; support vector machine; textual document; two-level supervised scheme; vocabulary reduction; Data communications; classifier design and evaluation; clustering; data compaction and compression; feature evaluation and selection; text processing
fLanguage :
English
Journal_Title :
Knowledge and Data Engineering, IEEE Transactions on
Publisher :
IEEE
ISSN :
1041-4347
Type :
jour
DOI :
10.1109/TKDE.2009.22
Filename :
4752825