Title :
Stopword Graphs and Authorship Attribution in Text Corpora
Author :
Arun, R. ; Suresh, V. ; Madhavan, C. E. Veni
Author_Institution :
Dept. of Comput. Sci. & Autom., Indian Inst. of Sci., Bangalore, India
Abstract :
In this work we identify interactions of stopwords (noisewords) in text corpora as a fundamental feature for author classification. It is convenient to view such interactions as graphs in which the nodes are stopwords and the interaction between a pair of stopwords is represented as an edge weight. We define the interaction in terms of the distances between pairs of stopwords in text documents. Given a list of authors, a graph for each author is computed from that author's undisputed writings. Authorship of a test document is attributed based on the closeness of the graph derived from it to the authors' graphs. To this end, we define a closeness measure for comparing such graphs based on the Kullback-Leibler divergence. We illustrate the accuracy of our approach by applying it to examples drawn from the Gutenberg archives. Our results show that the proposed approach is effective not only in binary author classification but also in multiclass author classification with as many as 10 authors at a time, and that it compares favourably with the state of the art in author identification.
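The pipeline described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact definition of the interaction or of the closeness measure, so everything specific here is an assumption made for illustration: the tiny `STOPWORDS` list, the `window` cutoff, the inverse-distance edge weight, the additive `smoothing`, and the function names `stopword_graph`, `kl_divergence`, and `attribute` are all hypothetical and may differ from the authors' method.

```python
import math
import re
from collections import Counter
from itertools import product

# Hypothetical, tiny stopword list used only for illustration.
STOPWORDS = ("the", "of", "and", "a", "to", "in", "is", "it", "that", "was")


def stopword_graph(text, stopwords=STOPWORDS, window=10, smoothing=1e-3):
    """Build a stopword graph as a probability distribution over ordered pairs.

    Assumption: the edge weight for (u, v) accumulates 1/d for every
    occurrence of v that follows u within `window` tokens, where d is the
    token distance. The paper's actual distance-based weight may differ.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    vocab = set(stopwords)
    positions = [(i, t) for i, t in enumerate(tokens) if t in vocab]

    weights = Counter()
    for k, (i, u) in enumerate(positions):
        for j, v in positions[k + 1:]:
            d = j - i
            if d > window:
                break  # positions are sorted, so later ones are farther away
            weights[(u, v)] += 1.0 / d

    # Additive smoothing keeps every edge probability positive, so the
    # KL divergence below is always finite.
    pairs = list(product(stopwords, repeat=2))
    total = sum(weights.values()) + smoothing * len(pairs)
    return {p: (weights[p] + smoothing) / total for p in pairs}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) over a shared edge set."""
    return sum(p[e] * math.log(p[e] / q[e]) for e in p)


def attribute(test_text, author_texts, **kwargs):
    """Return the author whose graph is closest to the test graph, plus all scores."""
    test_graph = stopword_graph(test_text, **kwargs)
    scores = {
        author: kl_divergence(test_graph, stopword_graph(text, **kwargs))
        for author, text in author_texts.items()
    }
    return min(scores, key=scores.get), scores
```

In use, `author_texts` would map each candidate author to the concatenation of that author's undisputed writings, and `attribute(test_document, author_texts)` would return the author with the smallest divergence. Because the Kullback-Leibler divergence is asymmetric, the direction chosen here (test graph against author graph) is itself a design assumption; a symmetrised variant would be an equally plausible reading of the abstract.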
Keywords :
classification; linguistics; text analysis; Gutenberg archive; Kullback-Leibler divergence; author identification; authorship attribution; binary author classification; closeness measure; noiseword; stopword graph; text corpora; text document; Automation; Computer science; Fingers; Forensics; Plagiarism; Speech; Testing; Uncertainty; Writing; KL divergence; stylometry; writer invariant
Conference_Title :
Semantic Computing, 2009. ICSC '09. IEEE International Conference on
Conference_Location :
Berkeley, CA
Print_ISBN :
978-1-4244-4962-0
Electronic_ISBN :
978-0-7695-3800-6
DOI :
10.1109/ICSC.2009.101