DocumentCode :
169839
Title :
A Hadoop Extension to Process Mail Folders and its Application to a Spam Dataset
Author :
Las-Casas, Pedro H. B. ; Santos Dias, Vinicius ; Ferreira, Ricardo ; Meira, Wagner ; Guedes, Dorgival
Author_Institution :
Comput. Sci. Dept., Fed. Univ. of Minas Gerais, Belo Horizonte, Brazil
fYear :
2014
fDate :
22-24 Oct. 2014
Firstpage :
108
Lastpage :
113
Abstract :
Even as Web 2.0 grows, e-mail remains one of the most widely used forms of communication on the Internet and generates huge amounts of data. Spam traffic, for example, accounts for terabytes of data daily. Tools are therefore needed that can process these data efficiently, in large volumes, in order to understand their characteristics. Although mail servers can receive and store messages as they arrive, applying complex algorithms to a large set of mailboxes, whether for characterization, security, or data mining, is challenging. Big data processing environments such as Hadoop are useful for analyzing large data sets, but they were originally designed to handle text files in general. In this paper we present a Hadoop extension to process and analyze large sets of e-mail organized in mailboxes. To evaluate it, we used gigabytes of real spam traffic data collected around the world and showed that our approach processes large amounts of mail data efficiently.
Keywords :
Big Data; Internet; data mining; unsolicited e-mail; Hadoop extension; Web 2.0; big data processing; e-mail; mail folder; spam dataset; spam traffic; Educational institutions; Electronic mail; Postal services; Programming; Servers; hadoop; mail; spam
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computer Architecture and High Performance Computing Workshop (SBAC-PADW), 2014 International Symposium on
Conference_Location :
Paris
Type :
conf
DOI :
10.1109/SBAC-PADW.2014.25
Filename :
6972024