Title :
Who wrote this paper? Learning for authorship de-identification using stylometric features
Author :
Hurtado, Jose ; Taweewitchakreeya, Napat ; Zhu, Xingquan
Author_Institution :
Dept. of Comput. & Electr. Eng. & Comput. Sci., Florida Atlantic Univ., Boca Raton, FL, USA
Abstract :
In this paper, we propose to combine stylometric features and neural networks for authorship de-identification. Our research mainly focuses on scientific publications, because scholarly journals are publicly available and provide plenty of labeled data from which to learn an author's style or traits. The main challenge of authorship de-identification is to identify features that properly capture an author's writing style. In the proposed design, we choose a combination of stylometric features, including lexical, syntactic, structural and content-specific features, to represent each author's style and use them to build classification models. We manually collect publications from the computer science and biomedicine domains and validate our design using a number of classification methods. Our experiments show that, among four well-known classifiers, Multilayer Perceptron (MLP) classifiers achieve the best performance for authorship de-identification.
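To make the four feature families named in the abstract more concrete, the following is a minimal sketch of how one document might be turned into a stylometric feature vector. The abstract does not list the paper's actual features, so every feature here, along with the `FUNCTION_WORDS` and `CONTENT_TERMS` lists and the `stylometric_features` function name, is a hypothetical choice made for illustration only.

```python
import re
import string
from collections import Counter

# Hypothetical word lists; the abstract does not specify the ones used in the paper.
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "that", "we", "is")
CONTENT_TERMS = ("algorithm", "classifier", "protein", "gene")


def stylometric_features(text):
    """Return a small, illustrative stylometric feature dict for one document."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    n_words = max(len(words), 1)  # guard against empty input
    n_chars = max(len(text), 1)
    counts = Counter(words)

    features = {
        # Lexical: word length and vocabulary richness.
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "type_token_ratio": len(counts) / n_words,
        # Syntactic (rough proxy): punctuation characters per character.
        "punctuation_rate": sum(ch in string.punctuation for ch in text) / n_chars,
        # Structural: sentence length and paragraph count.
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "paragraph_count": float(len(paragraphs)),
    }
    # Syntactic: relative frequency of each function word.
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = counts[fw] / n_words
    # Content-specific: relative frequency of each domain term.
    for term in CONTENT_TERMS:
        features["term_" + term] = counts[term] / n_words
    return features


if __name__ == "__main__":
    sample = "We propose the model. The model works!\n\nIt is simple."
    for name, value in stylometric_features(sample).items():
        print(name, round(value, 3))
```

Vectors of this kind, computed per publication and labeled by author, could then be passed to any standard classifier, for example a multilayer perceptron such as scikit-learn's `MLPClassifier`. That pairing is an assumption about tooling and not something the abstract states.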
Keywords :
feature extraction; learning (artificial intelligence); multilayer perceptrons; pattern classification; text analysis; MLP classifier; author style learning; author trait learning; author writing style; authorship deidentification; biomedicine publications; classification method; classification model; computer science publications; content-specific features; feature identification; lexical features; multilayer perceptron classifier; neural networks; publicly available scholarly journals; scientific publications; structural features; stylometric features; syntactic features; Abstracts; Computer science; Feature extraction; Radio frequency; Support vector machines; Training data; Machine learning; artificial neural network; authorship de-identification; text classification;
Conference_Title :
Information Reuse and Integration (IRI), 2014 IEEE 15th International Conference on
DOI :
10.1109/IRI.2014.7051981