Title :
Methodologies of Internet portals users' short messages texts authorship identification based on the methods of mathematical linguistics
Author :
Mikhail, Sukhoparov ; Ilya, Lebedev
Author_Institution :
Dept. of Secure Inf. Technol., SPb NRUITMO, St. Petersburg, Russia
Abstract :
The article addresses authorship attribution for short text messages posted on Internet portals, blogs and websites. It focuses on the possibility of finding people who operate several different accounts and send messages from them. The dependence of sentences on the number of words in portal users' comments is presented. A model of an Internet portal text message is provided. A method for identifying the authorship of portal users' short messages, based on the naive Bayesian classifier, is presented. The distinguishing feature of the proposed method is that it identifies users not only through frequency dictionary analysis of selected messages but also through their use of rules and connections derived from syntactic information about the language. Part-of-speech frequencies and the frequencies of connections between parts of speech are given. A communication graph of the connections between parts of speech in the limited natural language of commentaries is presented. The linguistic characteristics used to identify a portal user are given. On the basis of the communication graph between parts of speech, structures involving the prepositional-case forms of nouns in the limited natural language are singled out for use in identifying text authorship. An experiment is carried out that shows the achievable probability of identifying an Internet portal user as a function of the training sample. Probability diagrams of authorship identification based on the selected characteristics are presented.
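The approach outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes messages arrive already tagged as (word, part-of-speech) pairs, uses a multinomial naive Bayes model with Laplace smoothing, and treats adjacent tag pairs as a simple stand-in for the connections between parts of speech. The names `features`, `NaiveBayesAuthor`, the tag labels, and the toy data are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def features(tagged):
    """Turn one message, given as (word, pos) pairs, into a feature list.

    Assumption: three feature families stand in for the abstract's
    frequency dictionary, part-of-speech frequencies, and connections
    between parts of speech (approximated here by adjacent tag pairs).
    """
    words = [f"w={w.lower()}" for w, _ in tagged]
    tags = [f"t={t}" for _, t in tagged]
    links = [f"l={a}>{b}" for (_, a), (_, b) in zip(tagged, tagged[1:])]
    return words + tags + links


class NaiveBayesAuthor:
    """Multinomial naive Bayes over the features above (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                    # Laplace smoothing constant
        self.counts = defaultdict(Counter)    # author -> feature counts
        self.docs = Counter()                 # author -> number of messages
        self.vocab = set()                    # all features seen in training

    def fit(self, messages, authors):
        for tagged, author in zip(messages, authors):
            feats = features(tagged)
            self.counts[author].update(feats)
            self.docs[author] += 1
            self.vocab.update(feats)
        return self

    def log_scores(self, tagged):
        total_docs = sum(self.docs.values())
        vocab_size = len(self.vocab)
        scores = {}
        for author in self.docs:
            total = sum(self.counts[author].values())
            score = math.log(self.docs[author] / total_docs)  # log prior
            for f in features(tagged):
                if f not in self.vocab:
                    continue  # features never seen in training are ignored
                num = self.counts[author][f] + self.alpha
                den = total + self.alpha * vocab_size
                score += math.log(num / den)
            scores[author] = score
        return scores

    def predict(self, tagged):
        scores = self.log_scores(tagged)
        return max(scores, key=scores.get)


# Toy training sample: two hypothetical accounts with distinct habits.
train = [
    ([("well", "ADV"), ("i", "PRON"), ("think", "VERB"), ("so", "ADV")], "alice"),
    ([("well", "ADV"), ("i", "PRON"), ("agree", "VERB")], "alice"),
    ([("the", "DET"), ("server", "NOUN"), ("is", "VERB"), ("down", "ADV")], "bob"),
    ([("the", "DET"), ("patch", "NOUN"), ("is", "VERB"), ("ready", "ADJ")], "bob"),
]
model = NaiveBayesAuthor().fit([m for m, _ in train], [a for _, a in train])

unknown = [("well", "ADV"), ("i", "PRON"), ("disagree", "VERB")]
print(model.predict(unknown))
```

Scores are kept in log space so that products of many small probabilities do not underflow. With real data the tag pairs would presumably be replaced by the syntactic connections the paper derives from its communication graph, and the training sample size would be varied as in the reported experiment.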
Keywords :
Bayes methods; Internet; Web sites; graph theory; natural language processing; pattern classification; portals; text analysis; Internet portal text message; Internet portals user short message text authorship identification; Websites; blogs; commentaries; communication graph; connection frequency; frequency dictionary analysis; language syntactic information; limited natural language; linguistic characteristics; mathematical linguistics; message selection; naive Bayesian classifier; noun prepositional casal form; parts of speech connections; people searching; portal user comment; portal user identification; probability diagrams; short message text authorship determination; speech frequency; Natural languages; Pragmatics; Speech; Training; Authorship identification; Bayesian classifier; text information classification
Conference_Title :
Application of Information and Communication Technologies (AICT), 2014 IEEE 8th International Conference on
Conference_Location :
Astana
Print_ISBN :
978-1-4799-4120-9
DOI :
10.1109/ICAICT.2014.7035939