• DocumentCode
    3500858
  • Title
    Using Linguistic Information to Classify Portuguese Text Documents
  • Author
    Goncalves, Tiago ; Quaresma, Paulo
  • Author_Institution
    Dept. de Inf., Univ. de Evora, Evora
  • fYear
    2008
  • fDate
    27-31 Oct. 2008
  • Firstpage
    94
  • Lastpage
    100
  • Abstract
    This paper examines the role of various linguistic structures in text classification, applying the study to the Portuguese language. Besides using a bag-of-words representation, where we evaluate different measures and use linguistic knowledge for term selection, we run several experiments using syntactic information, representing documents as strings of words and strings of syntactic parse trees. To build the classifier we use the support vector machine (SVM) algorithm, which is known to produce good results on text classification tasks, and apply the study to a dataset of articles from the Publico newspaper. The results show that sentences' syntactic structure is not useful for text classification (as initially expected), but part-of-speech information can be used as a term selection technique to construct the bag-of-words representation of documents.
  • Keywords
    document image processing; linguistics; support vector machines; text analysis; trees (mathematics); word processing; Portuguese language; Portuguese text documents; Publico newspaper; SVM; bag-of-words representation; linguistic information; part-of-speech information; support vector machine; syntactic parse trees; term selection technique; text classification; Artificial intelligence; Information filtering; Information filters; Information technology; Machine learning; Machine learning algorithms; Natural languages; Support vector machine classification; Support vector machines; Text categorization
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Artificial Intelligence, 2008. MICAI '08. Seventh Mexican International Conference on
  • Conference_Location
    Atizapan de Zaragoza
  • Print_ISBN
    978-0-7695-3441-1
  • Type
    conf
  • DOI
    10.1109/MICAI.2008.17
  • Filename
    4682449