Title :
Feature selection techniques for gender prediction from blogs
Author :
Shahana, P.H. ; Omman, Bini
Author_Institution :
Dept. of Comput. Sci. & Eng., SCMS Sch. of Eng. & Technol., Ernakulam, India
Abstract :
The goal of this paper is to identify the gender of blog authors. The features considered are POS tags, unigrams (words and punctuation), bigrams and word classes. Features are selected and ranked using mutual information, chi-square and information gain. The dataset is a collection of 3227 blogs originally derived from a blog set, of which 1679 were written by male authors and 1548 by female authors. Classification is done using a Multinomial Naïve Bayes classifier and SVMs with different kernel functions: PolyKernel, Puk, Normalized PolyKernel and RBFkernel. The results were obtained using 10-fold cross-validation. Word unigrams gave the best accuracy, 78.81%, compared with the other features. We found that chi-square is the best method for ranking features.
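As an illustration of the chi-square ranking step mentioned in the abstract, the following is a minimal sketch in plain Python, not the authors' implementation (the keywords suggest they worked in WEKA). It assumes binary term presence per document, whitespace tokenization and exactly two classes. The function names `chi_square_scores` and `top_k` and the four toy documents are hypothetical and exist only for this example.

```python
from collections import Counter


def chi_square_scores(docs, labels):
    """Score each unigram by the chi-square statistic of its 2x2
    presence/absence-by-class contingency table (two classes only)."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    first = classes[0]
    n = len(docs)
    n_first = sum(1 for y in labels if y == first)
    n_second = n - n_first

    # Document frequency of each token, overall and within the first class.
    df_all, df_first = Counter(), Counter()
    for text, y in zip(docs, labels):
        tokens = set(text.lower().split())  # assumption: whitespace tokens
        df_all.update(tokens)
        if y == first:
            df_first.update(tokens)

    scores = {}
    for tok, df in df_all.items():
        a = df_first[tok]   # first-class documents containing the token
        b = df - a          # second-class documents containing the token
        c = n_first - a     # first-class documents without the token
        d = n_second - b    # second-class documents without the token
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        # A token present in every document carries no information: score 0.
        scores[tok] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def top_k(scores, k):
    """Return the k highest-scoring tokens, ties broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy corpus, not data from the paper.
docs = ["foo common", "foo common x", "bar common", "bar common y"]
labels = ["female", "female", "male", "male"]

scores = chi_square_scores(docs, labels)
print(top_k(scores, 2))
```

Tokens that occur in only one class score highest, while a token that occurs in every document scores zero. In a full pipeline the top-ranked unigrams would then be passed to the classifier and evaluated with 10-fold cross-validation.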
Keywords :
Bayes methods; Web sites; feature selection; gender issues; pattern classification; support vector machines; 10-cross fold validation; Chi-square; POS tags; RBFkernel; SVM; bigrams; blog authors; feature ranking; feature selection techniques; feature synthesis; gender prediction; information gain methods; multinomial naïve Bayes classifier; mutual information; normalized PolyKernel; unigram; word classes; Accuracy; Blogs; Feature extraction; Kernel; Speech; Writing; Classification; Multinomial Naïve Bayes; SVM classifier; WEKA classifier
Conference_Title :
Networks & Soft Computing (ICNSC), 2014 First International Conference on
Conference_Location :
Guntur
Print_ISBN :
978-1-4799-3485-0
DOI :
10.1109/CNSC.2014.6906657