DocumentCode :
715337
Title :
Clustering technical documents by stylistic features for authorship analysis
Author :
Berry, Daniel ; Sazonov, Edward
Author_Institution :
Dept. of Inf. Syst., Stat., & Manage. Sci., Univ. of Alabama, Tuscaloosa, AL, USA
fYear :
2015
fDate :
9-12 April 2015
Firstpage :
1
Lastpage :
5
Abstract :
While previous research has demonstrated the ability to discriminate between authors using purely stylistic features, the majority of studies have been conducted on large corpora of non-technical literature. We investigate the ability of unsupervised methods to recover the authorial structure of a collection of technical documents labeled by primary author. Experiments were conducted using 23 submitted conference and journal papers, containing almost 100,000 words, from a local engineering research group, with papers authored by both the Principal Investigator and graduate students. Stylistic information was extracted from the body of each text, forming a feature vector representing the document. Spectral clustering was applied to the feature vectors, and the resulting clustering had an Adjusted Rand Index of .306, which is significantly better than chance (p < .05).
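The abstract reports its result as an Adjusted Rand Index (ARI), which compares a clustering against reference labels and corrects for agreement expected by chance. The following is a minimal sketch of the standard pair-counting ARI formula using only the Python standard library. It is not the authors' code; the function name `adjusted_rand_index` and the toy label lists are hypothetical and chosen purely for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting Adjusted Rand Index between two labelings.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # zero or one item: the partitions trivially agree

    # Contingency counts: items sharing (true label, predicted label),
    # plus the marginal counts for each labeling.
    cell_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of item pairs placed together in each grouping.
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / total_pairs  # chance-level agreement
    max_index = (sum_true + sum_pred) / 2         # best achievable agreement
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are identical
    return (sum_cells - expected) / (max_index - expected)


# Toy example with made-up labels (not data from the paper):
# four documents, two reference authors, three predicted clusters.
author_labels = [0, 0, 1, 1]
cluster_labels = [0, 0, 1, 2]
print(round(adjusted_rand_index(author_labels, cluster_labels), 3))
```

A value of 1 indicates that the clustering matches the reference labels exactly (up to renaming of clusters), while values near 0 indicate agreement no better than chance.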
Keywords :
feature extraction; natural language processing; pattern clustering; text analysis; unsupervised learning; adjusted Rand index; authorship analysis; conference papers; feature vectors; graduate students; journal papers; nontechnical literature; principal investigator; spectral clustering; stylistic features; stylistic information extraction; technical document clustering; unsupervised methods; Accuracy; Clustering algorithms; Data mining; Feature extraction; Indexes; Plagiarism; Writing; Natural language processing; adjusted Rand index; authorship analysis; spectral clustering; technical writing;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
SoutheastCon 2015
Conference_Location :
Fort Lauderdale, FL
Type :
conf
DOI :
10.1109/SECON.2015.7132936
Filename :
7132936