Title :
Machine Learning for Author Affiliation within Web Forums -- Using Statistical Techniques on NLP Features for Online Group Identification
Author :
Ellen, Jeffrey ; Parameswaran, Shibin
Author_Institution :
Space & Naval Warfare Syst. Center Pacific, United States Navy, San Diego, CA, USA
Abstract :
Although previous studies have attributed authorship to specific individuals, we find few efforts to group authors by their affiliations. This paper presents our work on classifying website forum posts by the author's group affiliation. Specifically, we seek to classify translated website forum posts by the (inferred) political affiliation of the author. The two datasets that we attempt to classify consist of real-world data discussing current issues: Israeli/Palestinian dialogue (Bitter Lemons corpus) and translated Extremist/Moderate forum entries (from internet websites). To achieve our goal of reliable author affiliation, we extract term frequency-based features (which are conventional in document classification) along with less commonly used linguistic style-based features. The resulting set of stylometric features is then used with two widely used supervised classification algorithms, namely k-Nearest Neighbor (k-NN) and Support Vector Machines (SVMs). Specifically, we use k-NN with cosine distance and SVMs with two different kernel functions. In addition to the popular RBF kernels, we also evaluate the applicability and performance of the recently introduced arc-cosine kernels for group affiliation. The results of our experiments show strong performance across a range of pertinent metrics.
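The two similarity measures named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, the toy posts, the labels `group_a` and `group_b`, and all function names are assumptions made for illustration, and the arc-cosine kernel shown is the degree-1 form from Cho and Saul's formulation, k(x, y) = (1/pi) |x| |y| (sin t + (pi - t) cos t), where t is the angle between x and y.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Raw term counts from whitespace tokens (a stand-in for real TF features)."""
    return Counter(text.lower().split())


def dot(a, b):
    """Dot product of two sparse count vectors."""
    return sum(value * b.get(term, 0) for term, value in a.items())


def norm(a):
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    na, nb = norm(a), norm(b)
    if na == 0 or nb == 0:
        return 0.0
    # Clamp so rounding error cannot push the value outside acos's domain.
    return max(-1.0, min(1.0, dot(a, b) / (na * nb)))


def cosine_distance(a, b):
    return 1.0 - cosine_similarity(a, b)


def arc_cosine_kernel(a, b):
    """Degree-1 arc-cosine kernel (assumed form, see the lead-in)."""
    na, nb = norm(a), norm(b)
    if na == 0 or nb == 0:
        return 0.0
    theta = math.acos(cosine_similarity(a, b))
    return (na * nb / math.pi) * (
        math.sin(theta) + (math.pi - theta) * math.cos(theta)
    )


def knn_predict(query, labeled, k=3):
    """Majority vote among the k training vectors nearest in cosine distance."""
    nearest = sorted(labeled, key=lambda item: cosine_distance(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy posts; the real corpora are not reproduced here.
training = [
    (term_frequencies("tax cuts budget vote"), "group_a"),
    (term_frequencies("budget vote senate tax"), "group_a"),
    (term_frequencies("match goal team score"), "group_b"),
    (term_frequencies("team score league goal"), "group_b"),
]
query = term_frequencies("senate budget tax")
prediction = knn_predict(query, training, k=3)
```

To use such a kernel with an SVM, one option is to evaluate `arc_cosine_kernel` on every pair of training vectors and hand the resulting Gram matrix to an SVM implementation that accepts precomputed kernels; that step is omitted here to keep the sketch dependency-free.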
Keywords :
Web sites; natural language processing; statistical analysis; support vector machines; NLP features; RBF kernels; Web forums; Web site forum; author affiliation; authorship affiliation; cosine distance; document classification; k-nearest neighbor algorithm; kernel functions; linguistic style-based features; machine learning; online group identification; statistical technique; stylometric features; supervised classification; term frequency-based features; Classification algorithms; Feature extraction; Kernel; Measurement; Pragmatics; Stylometrics; Text Classification; arccosine kernels; feature combination; k-nearest neighbor
Conference_Title :
Machine Learning and Applications and Workshops (ICMLA), 2011 10th International Conference on
Conference_Location :
Honolulu, HI
Print_ISBN :
978-1-4577-2134-2
DOI :
10.1109/ICMLA.2011.90