• DocumentCode
    3431925
  • Title
    Improve VSM text classification by title vector based document representation method
  • Author
    Tian Xia ; Yi Du
  • Author_Institution
    Dept. of Comput. & Inf., Shanghai Second Polytech. Univ., Shanghai, China
  • fYear
    2011
  • fDate
    3-5 Aug. 2011
  • Firstpage
    210
  • Lastpage
    213
  • Abstract
    Text classification is a daunting task because it is difficult to extract the semantics of natural language texts. Many problems must be resolved before natural-language processing techniques can be effectively applied to a large collection of texts. A significant one is extracting semantic information from a corpus in plain text. In the Vector Space Model, a document is conceptually represented by a vector of terms extracted from the document, with associated weights representing the importance of each term in the document and within the whole document collection. Likewise, an unclassified document is modeled as a list of terms with associated weights representing the importance of the terms in it. Many techniques introduce statistical information about terms to represent their semantic information. However, the document title is usually not given special consideration, even though it obviously contains much semantic information. This paper proposes the Title Vector to address this issue.
  • Keywords
    classification; natural language processing; text analysis; vectors; VSM text classification; document representation method; natural-language processing; semantic information; statistical information; title vector; vector space model; Indexes; Semantics; Support vector machine classification; Testing; Text categorization; Training; Vectors; Text Classification; Title Vector; VSM; Vector Space Model;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Science & Education (ICCSE), 2011 6th International Conference on
  • Conference_Location
    Singapore
  • Print_ISBN
    978-1-4244-9717-1
  • Type
    conf
  • DOI
    10.1109/ICCSE.2011.6028619
  • Filename
    6028619