Title : 
An extensive study of the Bag-of-Words approach for gender identification of Arabic articles
         
        
            Author : 
Alsmearat, Kholoud ; Al-Ayyoub, Mahmoud ; Al-Shalabi, Riyad
         
        
            Author_Institution : 
Jordan Univ. of Sci. & Technol., Irbid, Jordan
         
        
        
        
        
            Abstract : 
The prevalent use of Online Social Networks (OSN), and the anonymity and lack of accountability they inherit from being online, give rise to many problems related to connecting the massive amount of text data on OSN with the people who actually wrote it. Analyzing text data for such purposes is called authorship analysis. This work focuses on one specific type of authorship analysis: identifying the author's gender. Gender identification has various applications, from marketing to security. The focus of this work is on Arabic articles. The problem is essentially a classification problem, and current approaches differ in how they compute the features of each document. However, they all follow some “stylometric features” approach. Unlike these works, ours treats the problem as a variation of the Text Classification (TC) problem and follows the Bag-Of-Words (BOW) approach for feature selection. We perform an extensive set of experiments on the feature selection and classification phases, and the results show that such an approach yields surprisingly high results.
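
To make the Bag-Of-Words idea in the abstract concrete, here is a minimal Python sketch of treating author-label prediction as plain text classification. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not specify the classifiers or feature selection methods used, so a small multinomial Naive Bayes model stands in as the classifier, tokenization is plain whitespace splitting, and the documents, tokens (`w1` to `w7`), and labels (`F`, `M`) are hypothetical placeholders rather than data from the paper.

```python
import math
from collections import Counter, defaultdict


def bag_of_words(text):
    """Map a document to term counts, discarding word order."""
    return Counter(text.lower().split())


def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model on bag-of-words counts."""
    class_docs = Counter(labels)           # number of documents per class
    term_counts = defaultdict(Counter)     # per-class term frequencies
    for text, label in zip(docs, labels):
        term_counts[label].update(bag_of_words(text))
    vocab = {term for counts in term_counts.values() for term in counts}
    return class_docs, term_counts, vocab


def predict_nb(model, text):
    """Return the class with the highest log-posterior score."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label, n_docs in class_docs.items():
        total_terms = sum(term_counts[label].values())
        score = math.log(n_docs / total_docs)  # class prior
        for term, count in bag_of_words(text).items():
            if term not in vocab:
                continue  # skip terms never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            p = (term_counts[label][term] + 1) / (total_terms + len(vocab))
            score += count * math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy corpus: placeholder tokens, not real article text.
train_docs = ["w1 w2 w2 w3", "w1 w2 w4", "w5 w6 w6", "w5 w7 w6"]
train_labels = ["F", "F", "M", "M"]

model = train_nb(train_docs, train_labels)
print(predict_nb(model, "w2 w3 w1"))
print(predict_nb(model, "w6 w5"))
```

The point of the sketch is the contrast drawn in the abstract: no hand-crafted stylometric features are computed, and each document is represented only by its term counts.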
         
        
            Keywords : 
data analysis; feature selection; natural language processing; pattern classification; social networking (online); text analysis; Arabic articles; BOW approach; OSN; TC problem; authorship analysis; bag-of-words approach; classification phase; classification problem; gender identification; online social networks; stylometric feature approach; text classification problem; text data analysis; Algorithm design and analysis; Feature extraction; Principal component analysis; Support vector machines; Testing; Vectors; Writing
         
        
        
        
            Conference_Title : 
Computer Systems and Applications (AICCSA), 2014 IEEE/ACS 11th International Conference on
         
        
        
            DOI : 
10.1109/AICCSA.2014.7073254