Regional Information Center for Science and Technology - Word co-occurrence features for text classification

Title of article :

Word co-occurrence features for text classification

Author/Authors :

Fábio Figueiredo, Leonardo Rocha Souza, Thierson Couto, Thiago Salles, Marcos André Gonçalves, Wagner Meira Jr.

Issue Information :

Journal with serial number, year 2011

Pages :

From page :

843

To page :

858

Abstract :

In this article we propose a data treatment strategy that generates new discriminative features, called compound-features (or c-features), for text classification. These c-features are composed of terms that co-occur in documents, with no restrictions on the order of or distance between terms within a document. The strategy precedes the classification task and enriches documents with discriminative c-features. The idea is that, when c-features are used in conjunction with single-features, the ambiguity and noise inherent in their bag-of-words representation are reduced. We use c-features composed of two terms to keep their usage computationally feasible while improving classifier effectiveness. We test this approach with several classification algorithms and single-label multi-class text collections. Experimental results show gains in almost all evaluated scenarios, from the simplest algorithms such as kNN (13% gain in micro-average F1 on the 20 Newsgroups collection) to the most complex one, the state-of-the-art SVM (10% gain in macro-average F1 on the OHSUMED collection).
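
The idea of a two-term c-feature can be illustrated with a minimal Python sketch. It only enumerates unordered term pairs that co-occur in a single document and appends them to the single-term features; it does not reproduce the authors' procedure for selecting which pairs are discriminative, which the abstract does not describe. The function name `add_c_features` and the `+` separator are assumptions made for illustration.

```python
from itertools import combinations


def add_c_features(tokens):
    """Return single-term features plus all unordered two-term pairs.

    Hypothetical helper: pairs are built from the set of distinct terms,
    so neither word order nor distance within the document matters.
    No discriminative filtering is applied here.
    """
    terms = sorted(set(tokens))  # distinct terms, in a canonical order
    pairs = [f"{a}+{b}" for a, b in combinations(terms, 2)]
    return terms + pairs


doc = "the bank raised the interest rate".split()
features = add_c_features(doc)
```

Because pairs are built from a sorted set, the documents "bank rate" and "rate bank" yield the same c-feature `bank+rate`. The number of candidate pairs grows quadratically with the number of distinct terms in a document, which is consistent with the abstract's remark that limiting c-features to two terms keeps the approach computationally feasible.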

Keywords :

Classification , Text Mining , feature extraction

Journal title :

Information Systems

Serial Year :

2011

Record number :

1230218

Link To Document :

https://search.isc.ac/dl/search/defaultta.aspx?DTC=10&DC=1230218