Title :
Vietnamese spam detection based on language classification
Author :
Tuan Anh, Nguyen ; Quang Anh, Tran ; Ngoc Binh, Nguyen
Author_Institution :
Libr. & Inf. Network Center, Hanoi Univ. of Technol., Hanoi
Abstract :
Language classification is the process of identifying the disposition of a presented text, such as classifying an email or a text document into a particular category. Classifying text can involve determining the genre of a book, categorizing a document, or, in our case, deciding whether an email is spam. The idea behind language classification is to teach the computer to be a filing clerk. Spam filters that employ language classification, using a Bayesian combination of the spam probabilities of individual words, read and filter your email by learning your personal email behavior (what you think is and isn't spam). Many spam filters based on this technology have been written and applied effectively to English and other languages. However, they perform poorly when applied directly to Vietnamese spam, because the token segmentation used by these Bayesian filters is not suited to the specific characteristics of Vietnamese. We therefore propose a Vietnamese segmentation method for token selection, and use it to build a Vietnamese spam filter based on language classification and Bayesian combination that adequately supports Vietnamese. The results are very satisfactory: with this technique, our filter for Vietnamese spam is 9% more accurate than other filters that use other segmentation techniques.
Keywords :
Bayes methods; natural language processing; Bayesian filter segmentation; Vietnamese spam detection; language classification; personal email behavior; text classification; text document; Bayesian methods; Books; Educational institutions; Humans; Information filtering; Information filters; Information technology; Libraries; Natural languages; Unsolicited electronic mail; Bayesian anti-spam; Language Classification; Spam; Vietnamese Segmentation;
Conference_Title :
Communications and Electronics, 2008. ICCE 2008. Second International Conference on
Conference_Location :
Hoi An
Print_ISBN :
978-1-4244-2425-2
Electronic_ISBN :
978-1-4244-2426-9
DOI :
10.1109/CCE.2008.4578936