Title of article :
Ranking of predictor variables based on effect size criterion provides an accurate means of automatically classifying opinion column articles
Author/Authors :
Erika Fille Legara، نويسنده , , Christopher Monterola، نويسنده , , Cheryl Abundo، نويسنده ,
Issue Information :
روزنامه با شماره پیاپی سال 2011
Abstract :
We demonstrate an accurate procedure based on linear discriminant analysis that allows automatic authorship classification of opinion column articles. First, we extract the following stylometric features of 157 column articles from four authors: statistics on high frequency words, number of words per sentence, and number of sentences per paragraph. Then, by systematically ranking these features based on an effect size criterion, we show that we can achieve an average classification accuracy of 93% for the test set. In comparison, frequency size based ranking has an average accuracy of 80%. The highest possible average classification accuracy of our data merely relying on chance is 31%. By carrying out sensitivity analysis, we show that the effect size criterion is superior than frequency ranking because there exist low frequency words that significantly contribute to successful author discrimination. Consistent results are seen when the procedure is applied in classifying the undisputed Federalist papers of Alexander Hamilton and James Madison. To the best of our knowledge, the work is the first attempt in classifying opinion column articles, that by virtue of being shorter in length (as compared to novels or short stories), are more prone to over-fitting issues. The near perfect classification for the longer papers supports this claim. Our results provide an important insight on authorship attribution that has been overlooked in previous studies: that ranking discriminant variables based on word frequency counts is not necessarily an optimal procedure.
Journal title :
Physica A Statistical Mechanics and its Applications
Journal title :
Physica A Statistical Mechanics and its Applications