DocumentCode :
2053143
Title :
Stopword Graphs and Authorship Attribution in Text Corpora
Author :
Arun, R. ; Suresh, V. ; Madhavan, C. E Veni
Author_Institution :
Dept. of Comput. Sci. & Autom., Indian Inst. of Sci., Bangalore, India
fYear :
2009
fDate :
14-16 Sept. 2009
Firstpage :
192
Lastpage :
196
Abstract :
In this work we identify interactions of stopwords (noisewords) in text corpora as a fundamental feature for author classification. It is convenient to view such interactions as graphs in which the nodes are stopwords and the interaction between a pair of stopwords is represented as an edge weight. We define the interaction in terms of the distances between pairs of stopwords in text documents. Given a list of authors, a graph is computed for each author from their undisputed writings. Authorship of a test document is attributed based on the closeness of the graph derived from it to the authors' graphs. To this end, we define a closeness measure, based on the Kullback-Leibler divergence, for comparing such graphs. We illustrate the accuracy of our approach by applying it to examples drawn from the Gutenberg archives. Our results show that the proposed approach is effective not only in binary author classification but also in multiclass author classification for as many as 10 authors at a time, and that it compares favourably with the state of the art in author identification.
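The pipeline outlined in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the stopword list, the exact edge-weight formula, or the exact closeness measure. The tiny `STOPWORDS` tuple, the `window` cutoff, the inverse-distance weighting, the `eps` smoothing, and all function names here are hypothetical choices made only to show the general shape of the idea.

```python
import math
import re
from collections import Counter
from itertools import combinations

# Hypothetical, deliberately tiny stopword list (the paper's list is not given here).
STOPWORDS = ("the", "of", "and", "to", "a", "in", "that", "it")


def stopword_graph(text, stopwords=STOPWORDS, window=10):
    """Map each unordered stopword pair to an edge weight.

    Assumption: the weight is the sum of 1/distance over all co-occurrences
    of the two stopwords that lie at most `window` tokens apart.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    positions = [(i, t) for i, t in enumerate(tokens) if t in stopwords]
    graph = Counter()
    for a in range(len(positions)):
        i, u = positions[a]
        for b in range(a + 1, len(positions)):
            j, v = positions[b]
            if j - i > window:
                break  # positions are sorted, so later ones are even farther away
            if u != v:
                graph[tuple(sorted((u, v)))] += 1.0 / (j - i)
    return graph


def edge_distribution(graph, stopwords=STOPWORDS, eps=1e-6):
    """Normalise edge weights into a smoothed probability distribution
    over every possible stopword pair (eps avoids zero probabilities)."""
    pairs = list(combinations(sorted(stopwords), 2))
    total = sum(graph[p] + eps for p in pairs)
    return [(graph[p] + eps) / total for p in pairs]


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for two aligned distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def attribute(test_text, author_texts):
    """Return the author whose graph is closest to the test document's graph,
    together with the divergence score for every candidate author."""
    test = edge_distribution(stopword_graph(test_text))
    scores = {
        author: kl_divergence(test, edge_distribution(stopword_graph(text)))
        for author, text in author_texts.items()
    }
    return min(scores, key=scores.get), scores
```

Calling `attribute(test_text, {"author_a": known_text_a, "author_b": known_text_b})` returns the candidate with the smallest divergence. Because KL divergence is asymmetric, a symmetrised variant may be a closer match to the closeness measure the abstract refers to; that detail is left open here.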
Keywords :
classification; linguistics; text analysis; Gutenberg archive; Kullback-Leibler divergence; author identification; authorship attribution; binary author classification; closeness measure; noiseword; stopword graph; text corpora; text document; Automation; Computer science; Fingers; Forensics; Plagiarism; Speech; Testing; Uncertainty; Writing; authorship attribution KL divergence; stylometry; writer invariant;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Semantic Computing, 2009. ICSC '09. IEEE International Conference on
Conference_Location :
Berkeley, CA
Print_ISBN :
978-1-4244-4962-0
Electronic_ISBN :
978-0-7695-3800-6
Type :
conf
DOI :
10.1109/ICSC.2009.101
Filename :
5298613