Improve VSM text classification by title vector based document representation method

Author

Tian Xia ; Yi Du

Author_Institution

Dept. of Comput. & Inf., Shanghai Second Polytech. Univ., Shanghai, China

fYear

2011

fDate

3-5 Aug. 2011

Firstpage

210

Lastpage

213

Abstract

Text classification is a daunting task because it is difficult to extract the semantics of natural language texts. Many problems must be resolved before natural-language processing techniques can be effectively applied to a large collection of texts. A significant one is extracting semantic information from a corpus in plain text. In the Vector Space Model, a document is conceptually represented by a vector of terms extracted from the document, with associated weights representing the importance of each term in the document and within the whole document collection. Likewise, an unclassified document is modeled as a list of terms with associated weights representing the importance of the terms in it. Many techniques introduce statistical information about terms to represent their semantic information. However, the document title is usually not given special consideration, even though it obviously contains much semantic information. This paper proposes the Title Vector to address this issue.
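
The Vector Space Model representation described in the abstract can be illustrated with a minimal Python sketch. It assumes whitespace tokenization and a plain TF-IDF weighting (term frequency times the log of inverse document frequency), which is one common choice for "importance in the document and within the whole collection"; the abstract does not state the exact weighting used, and the function names `tfidf_vectors` and `cosine` and the sample documents are hypothetical. The sketch does not implement the Title Vector method itself, whose details are not given here.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (assumed TF-IDF weighting)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            # Weight = relative frequency in this document
            #          * log(collection size / document frequency).
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy collection, for illustration only.
docs = [
    "stock market prices fall",
    "football match ends in draw",
    "stock prices rise after market rally",
]
vectors = tfidf_vectors(docs)
```

In this toy collection the first and third documents share terms, so `cosine(vectors[0], vectors[2])` is positive, while the first and second share none and their similarity is zero. A term that occurs in only one document, such as "fall", receives a higher weight than an equally frequent term that occurs in several.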

Keywords

classification; natural language processing; text analysis; vectors; VSM text classification; document representation method; natural-language processing; semantic information; statistical information; title vector; vector space model; Indexes; Semantics; Support vector machine classification; Testing; Text categorization; Training; Vectors; Text Classification; Title Vector; VSM; Vector Space Model;

fLanguage

English

Publisher

IEEE

Conference_Titel

Computer Science & Education (ICCSE), 2011 6th International Conference on

Conference_Location

Singapore

Print_ISBN

978-1-4244-9717-1

Type

conf

DOI

10.1109/ICCSE.2011.6028619

Filename

6028619