Author/Authors :
El Abdouli, Abdeljalil Hassan II University of Casablanca - Ecole Superieure de Technologie, CED Engineering Sciences - RITM Laboratory, Morocco , Hassouni, Larbi Hassan II University of Casablanca - Ecole Superieure de Technologie, CED Engineering Sciences - RITM Laboratory, Morocco , Anoun, Houda Hassan II University of Casablanca - Ecole Superieure de Technologie, CED Engineering Sciences - RITM Laboratory, Morocco
Abstract :
Social networks are taking an increasingly important place in the field of communication within our society. The most used are Twitter, Facebook, Instagram, Tumblr, Dribbble, LinkedIn, and Google+. Twitter is a popular social network where connected users can publish short messages, called “tweets”, limited to 140 characters, in which they can share thoughts and post links or images. Twitter has gained wide popularity in the Arab world, and especially in Morocco, owing to its ease of use and the services its platform offers. This information revolution in our society leads to an accumulation of a vast quantity of data that may contain a lot of valuable information. Analyzing the tweets of Moroccan users comes with challenges because Moroccan users use a variety of languages and dialects, such as Standard Arabic, Moroccan Arabic called “Darija”, the Moroccan Amazigh dialect called “Tamazight”, French, English and more. In addition, the tweets of Moroccan users contain many abbreviations, #hashtags, URLs, spelling mistakes, and reduced syntactic structures. In this paper, we propose a new approach to determine, from the data sent on Twitter, the subjects that interest Moroccan society, and then to locate on the Moroccan map the areas from which the tweets related to these topics originate. Our proposed approach is based on a distributed system with four main components: the Hadoop framework, natural language processing, the k-means clustering algorithm, and a tool for plotting tweets graphically on the Moroccan map. The first task of this system is to automatically extract the tweets. Next, it stores them in a distributed file system using HDFS (Hadoop Distributed File System) of the Apache Hadoop framework. Then we process and analyze this raw data with a distributed program that uses MapReduce of the Hadoop framework, the Python language, and Natural Language Processing (NLP) techniques.
Afterward, we use a text mining technique, called TF-IDF (Term Frequency-Inverse Document Frequency), to convert the corpus generated by the previous step into a vector representation, where each dimension of the vector corresponds to a word, and then we implement the k-means algorithm to cluster all words into topics. Finally, we graphically plot the topics on the Moroccan map by using the coordinates extracted from tweets, in order to discover the relation between the discovered topics and the Moroccan areas where they were located.
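The TF-IDF and k-means steps described above can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the sample tweets, the function names (`tokenize`, `tfidf_vectors`, `kmeans`), the particular IDF formula, the choice to cluster tweet vectors, and the seeding of centroids with the first k vectors are all assumptions made for illustration, and the sketch runs on a single machine rather than on Hadoop.

```python
import math
import re
from collections import Counter


def tokenize(tweet):
    """Lowercase a tweet, drop URLs and @mentions, keep word characters only."""
    tweet = re.sub(r"https?://\S+|@\w+", " ", tweet.lower())
    return re.findall(r"\w+", tweet)  # '#' is dropped, the hashtag word is kept


def tfidf_vectors(docs):
    """Return (vocabulary, one TF-IDF vector per tokenized document)."""
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))  # document frequency
    # Assumed IDF variant: log(N / df) + 1.
    idf = {w: math.log(n / df[w]) + 1.0 for w in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append([tf[w] / len(d) * idf[w] if d else 0.0 for w in vocab])
    return vocab, vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Lloyd's algorithm; the first k vectors seed the centroids (assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical sample tweets, invented for this sketch.
tweets = [
    "Raja wins the derby tonight #football",
    "New train line opens in Casablanca http://example.com",
    "Great football derby, Raja fans celebrate",
    "Casablanca train line delays again @oncf",
]
docs = [tokenize(t) for t in tweets]
vocab, vectors = tfidf_vectors(docs)
labels = kmeans(vectors, k=2)
print(labels)
```

On this toy input the two football tweets share one cluster and the two train tweets share the other. In a distributed setting, the tokenization and term counting would be the natural candidates for the map and reduce phases, but how the paper divides that work is not stated here.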
NaturalLanguageKeyword :
Hadoop framework , HDFS , Distributed program , MapReduce , Python Language , Natural Language Processing , TF-IDF , K-means