• DocumentCode
    715337
  • Title

    Clustering technical documents by stylistic features for authorship analysis

  • Author

    Berry, Daniel ; Sazonov, Edward

  • Author_Institution
    Dept. of Inf. Syst., Stat., & Manage. Sci., Univ. of Alabama, Tuscaloosa, AL, USA
  • fYear
    2015
  • fDate
    9-12 April 2015
  • Firstpage
    1
  • Lastpage
    5
  • Abstract
    While previous research has demonstrated the ability to discriminate between authors using purely stylistic features, the majority of studies have been conducted on large corpora of non-technical literature. We investigate the ability of unsupervised methods to recover the authorial structure of a collection of technical documents labeled by primary author. Experiments were conducted using 23 submitted conference and journal papers, containing almost 100,000 words, from a local engineering research group, with papers authored by both the Principal Investigator and graduate students. Stylistic information was extracted from the body of each text to form a feature vector representing the document. Spectral clustering was applied to the feature vectors, and the resulting clustering had an Adjusted Rand Index of .306, which is significantly better than chance (p < .05).
  • Keywords
    feature extraction; natural language processing; pattern clustering; text analysis; unsupervised learning; adjusted Rand index; authorship analysis; conference papers; feature vectors; graduate students; journal papers; nontechnical literature; principal investigator; spectral clustering; stylistic features; stylistic information extraction; technical document clustering; unsupervised methods; Accuracy; Clustering algorithms; Data mining; Indexes; Plagiarism; Writing; technical writing
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    SoutheastCon 2015
  • Conference_Location
    Fort Lauderdale, FL
  • Type

    conf

  • DOI
    10.1109/SECON.2015.7132936
  • Filename
    7132936
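
The abstract reports clustering quality as an Adjusted Rand Index (ARI), which compares a predicted clustering against reference labels (here, primary author) and corrects for chance agreement. As a minimal sketch of how that index is computed, the function below implements the standard pair-counting ARI formula in plain Python. The function name `adjusted_rand_index` and the toy label lists are illustrative assumptions; this is not the authors' code, and the record does not say which implementation they used.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting Adjusted Rand Index between two labelings.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")

    n = len(labels_true)
    if n < 2:
        # With fewer than two items there are no pairs to compare.
        return 1.0

    # Contingency counts: how many items share each (true, predicted) pair.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Number of item pairs placed together in both, in true, and in predicted.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = 0.5 * (sum_rows + sum_cols)      # upper bound on agreement

    if max_index == expected:
        # Degenerate case (e.g. both labelings are a single cluster).
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    # Toy labels only: four documents, three reference authors, two clusters.
    reference = [0, 0, 1, 2]
    predicted = [0, 0, 1, 1]
    print(round(adjusted_rand_index(reference, predicted), 3))
```

An ARI of 1 means the clustering matches the reference labels exactly (up to renaming of clusters), values near 0 mean agreement no better than chance, and negative values mean agreement worse than chance.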