Title :
Behavior extraction from tweets using character N-gram models
Author :
Yano, Yuji ; Hashiyama, Tomonori ; Ichino, Junko ; Tano, Shun'ichi
Author_Institution :
Dept. of Human Media Syst., Univ. of Electro-Commun., Chofu, Japan
Abstract :
Human daily activities are now recorded by ICT devices in various data representations, collectively called lifelogs. Because this raw data is hard to handle, there is strong demand for retrieving useful information from lifelogs. Extracting human activities from these logs is a promising way to enrich our lives, and context-aware services can be provided based on the user activities extracted. Recently, many people post messages called tweets on Twitter to show what they are doing, thinking, feeling, and so on. Because many people post tweets frequently every day, tweets have the potential to record human activities. This paper focuses on retrieving human behavior from tweets. The length of a tweet is limited to a short sentence, which causes some difficulties: users employ domain-specific terms and post grammatically incorrect sentences to fit within the limit. This makes tweets hard to analyze with grammar-based methods or dictionaries. To tackle these problems, we apply character n-gram tokenization and a naive Bayes classifier to extract appropriate behavioral information from tweets. With an n-gram tokenizer, domain-specific words can be identified and incorrect grammar can be handled. Our approach is examined using real tweets in Japanese, and experiments have been carried out to show its feasibility. The precision, recall, and F-measure indices show promising results. At this point, our system has been applied to Japanese tweets, but it is applicable to any other language.
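The pipeline named in the abstract (character n-gram tokenization, a naive Bayes classifier, and evaluation by precision, recall, and F-measure) can be illustrated with a minimal sketch. This is not the authors' implementation: the n-gram length of 2, the add-one smoothing, the multinomial model, the toy English training texts, and all function and class names below are assumptions made only for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return all overlapping character n-grams of text (no word segmentation needed)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NaiveBayes:
    """Multinomial naive Bayes over character n-grams with add-one smoothing (assumed)."""

    def __init__(self, n=2):
        self.n = n
        self.doc_counts = Counter()              # label -> number of training texts
        self.gram_counts = defaultdict(Counter)  # label -> n-gram -> count
        self.vocab = set()                       # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.doc_counts[label] += 1
            self.gram_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # log prior + sum of smoothed log likelihoods of each n-gram
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(self.gram_counts[label].values()) + len(self.vocab)
            for gram in char_ngrams(text, self.n):
                score += math.log((self.gram_counts[label][gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def precision_recall_f1(gold, predicted, positive):
    """Precision, recall, and F-measure for one target label."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical toy data; the paper itself uses real Japanese tweets.
train_texts = ["eating lunch now", "eating dinner",
               "running in park", "went running today"]
train_labels = ["eat", "eat", "run", "run"]
model = NaiveBayes(n=2).fit(train_texts, train_labels)
print(model.predict("eating ramen"))
```

Because the tokenizer works on raw characters, the same code runs unchanged on Japanese text, for example `char_ngrams("ラーメン", 2)`, which is one reason character n-grams suit text without word boundaries or with informal spelling.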
Keywords :
Bayes methods; behavioural sciences computing; information retrieval; natural language processing; pattern classification; social networking (online); F-measure index; ICT devices; Japanese tweets; Twitter; behavior extraction; character N-gram models; character n-gram tokenization; context-awareness services; data representations; human behavior retrieval; human daily activities; information retrieval; n-gram tokenizer; naive Bayes classifier; precision index; recall index; tweet message; Data mining; Dictionaries; Feature extraction; Grammar; Training; Training data; Twitter;
Conference_Title :
Fuzzy Systems (FUZZ-IEEE), 2014 IEEE International Conference on
Conference_Location :
Beijing
Print_ISBN :
978-1-4799-2073-0
DOI :
10.1109/FUZZ-IEEE.2014.6891784