• DocumentCode
    1350309
  • Title
    Improvement on the Porter's Stemming Algorithm for Portuguese
  • Author
    Soares, M.V.B. ; Prati, R.C. ; Monard, M.C.
  • Author_Institution
    Lab. de Intel. Computacional, Univ. de Sao Paulo, Sao Carlos, Brazil
  • Volume
    7
  • Issue
    4
  • fYear
    2009
  • Firstpage
    472
  • Lastpage
    477
  • Abstract
    The amount of textual information digitally stored is growing every day. However, our capacity to process and analyze that information is not growing at the same pace. To overcome this limitation, it is important to develop semi-automatic processes, such as the text mining process, to extract relevant knowledge from textual information. One of the main and most expensive stages of the text mining process is the text pre-processing stage, where the unstructured text is transformed into a structured format such as an attribute-value table. The stemming process, i.e. linguistic normalization, is usually used to find the attributes of this table. However, the stemming process is strongly dependent on the language in which the original textual information is written. Furthermore, for most languages, the stemming algorithms proposed in the literature are computationally expensive. In this work, several improvements to the well-known Porter stemming algorithm for the Portuguese language, which exploit the characteristics of this language, are proposed. Experimental results show that the proposed algorithm executes in far less time without affecting the quality of the generated stems.
  • Keywords
    computational linguistics; data mining; natural language processing; text analysis; Portuguese language; information analysis; information processing; knowledge extraction; linguistics normalization; porter stemming algorithm; text mining process; text pre-processing stage; textual information; Data mining; Electronic switching systems; Impedance; Information analysis; Single event transient; Text mining; Attribute Reduction; Stemming; Text Mining; Text Pre-Processing;
  • fLanguage
    English
  • Journal_Title
    Latin America Transactions, IEEE (Revista IEEE America Latina)
  • Publisher
    IEEE
  • ISSN
    1548-0992
  • Type
    jour
  • DOI
    10.1109/TLA.2009.5349047
  • Filename
    5349047