Regional Information Center for Science and Technology (RICEST) - A clustering based fast detection algorithm for large scale duplicate emails

DocumentCode :

2248100

Title :

A clustering based fast detection algorithm for large scale duplicate emails

Author :

Sun, Lin ; Liu, Bing-quan ; Wang, Bao-xun ; Wang, Xiao-long

Author_Institution :

MOE-MS Key Lab. of Natural Language Process. & Speech, Harbin Inst. of Technol., Harbin, China

Volume :

fYear :

2010

fDate :

11-14 July 2010

Firstpage :

3270

Lastpage :

3274

Abstract :

Duplicate emails, which are widespread on the Internet and are mainly caused by mailing lists, not only waste storage resources but also burden users with unwanted messages. In this paper, based on the structural and textual features of email, we put forward the concept of Mail-Duplicate-Degree and use it to define email duplication for the first time. Based on this definition, we develop a clustering-based algorithm to detect duplicate emails. By introducing a hash function provided by a TRIE tree to improve efficiency, the algorithm overcomes the slow processing speed of traditional clustering methods. Experimental results on a large-scale email collection show that the algorithm achieves high precision.
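
The abstract gives no implementation details, so the following is only a minimal sketch of the general idea it describes: a trie acts as a hash function that buckets emails cheaply, and clustering by similarity then runs only inside each bucket. Everything specific here is an assumption made for illustration: the names `Trie`, `normalize_subject`, `similarity` and `cluster_duplicates`, the choice of the normalized subject line as the trie key, the Jaccard word overlap (a stand-in, not the paper's Mail-Duplicate-Degree), and the 0.8 threshold.

```python
from collections import defaultdict


class Trie:
    """Maps each distinct string to a small integer id (trie used as a hash)."""

    def __init__(self):
        self.root = {}
        self.next_id = 0

    def key_id(self, key):
        node = self.root
        for ch in key:
            node = node.setdefault(ch, {})
        # "$id" marks the end of a stored key; edges are single characters,
        # so this three-character marker cannot collide with an edge label.
        if "$id" not in node:
            node["$id"] = self.next_id
            self.next_id += 1
        return node["$id"]


def normalize_subject(subject):
    # Hypothetical normalization: lowercase, drop reply/forward prefixes,
    # collapse runs of whitespace.
    s = subject.lower().strip()
    changed = True
    while changed:
        changed = False
        for prefix in ("re:", "fw:", "fwd:"):
            if s.startswith(prefix):
                s = s[len(prefix):].strip()
                changed = True
    return " ".join(s.split())


def similarity(body_a, body_b):
    # Jaccard overlap of word sets: an assumed stand-in,
    # NOT the Mail-Duplicate-Degree defined in the paper.
    a, b = set(body_a.lower().split()), set(body_b.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_duplicates(emails, threshold=0.8):
    """emails: list of (subject, body) pairs. Returns clusters of indices."""
    trie = Trie()
    buckets = defaultdict(list)
    for i, (subject, _) in enumerate(emails):
        buckets[trie.key_id(normalize_subject(subject))].append(i)

    clusters = []
    for indices in buckets.values():
        local = []  # clusters inside this bucket, each led by its first member
        for i in indices:
            for cluster in local:
                if similarity(emails[cluster[0]][1], emails[i][1]) >= threshold:
                    cluster.append(i)
                    break
            else:
                local.append([i])
        clusters.extend(local)
    return clusters
```

Because pairwise comparison happens only among emails that share a trie id, the number of similarity computations depends on bucket sizes rather than on the size of the whole collection, which is one plausible reading of how a hash step could speed up clustering.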

Keywords :

Internet; computer crime; cryptography; file organisation; optimisation; unsolicited e-mail; TRIE tree; clustering based fast detection algorithm; duplicate emails detection; hash function; mail-duplicate-degree; processing speed problem; users garbage; waste storage resource; Algorithm design and analysis; Clustering algorithms; Electronic mail; Feature extraction; Layout; Noise; Clustering; Duplicate email detection; Email

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Machine Learning and Cybernetics (ICMLC), 2010 International Conference on

Conference_Location :

Qingdao

Print_ISBN :

978-1-4244-6526-2

Type :

conf

DOI :

10.1109/ICMLC.2010.5580695

Filename :

5580695

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2248100