• DocumentCode
    917896
  • Title

    Assessing the effectiveness of feature groups in author recognition tasks with the SOM model

  • Author

    Tambouratzis, George

  • Author_Institution
    Inst. for Language & Speech Process., Athens, Greece
  • Volume
    36
  • Issue
    2
  • fYear
    2006
  • fDate
    3/1/2006
  • Firstpage
    249
  • Lastpage
    259
  • Abstract
    This paper studies the effectiveness of the self-organizing map (SOM) when applied to the task of categorizing a corpus of texts according to the style of their authors. This task is of particular importance for information retrieval applications using very large databases of documents. The aim is to determine the extent to which the SOM can analyze such data and uncover the stylistic differences among authors in an unsupervised manner. To that end, a variety of feature vectors are studied, each of which either 1) comprises a single category of linguistic features or 2) spans several different categories of linguistic features, in order to determine the effectiveness of each feature category. It is shown that the highest accuracy is achieved when using a vector covering multiple linguistic categories. A comparison of the results obtained with those of statistical methods indicates the ability of the SOM network to reveal the clustering potential of isolated parameter groups and its effectiveness in efficiently handling high-dimensional data vectors. Potential extensions to related text-organization techniques, such as the WEBSOM, thus become evident.
  • Keywords
    information retrieval; linguistics; pattern recognition; self-organising feature maps; text analysis; very large databases; SOM; author identification; author recognition task; document handling; feature vector; linguistic feature; self-organizing map; statistical method; text categorization; Data analysis; Frequency; Multi-layer neural network; Neural networks; Spatial databases; Speech analysis; Statistical analysis; Text recognition; Writing; stylometry
  • fLanguage
    English
  • Journal_Title
    Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1094-6977
  • Type

    jour

  • DOI
    10.1109/TSMCC.2004.843242
  • Filename
    1624550