E-mail address categorization based on semantics of surnames

Author

Veluru, Suresh ; Rahulamathavan, Yogachandran ; Viswanath, Pramod ; Longley, Paul ; Rajarajan, Muttukrishnan

Author_Institution

Inf. Security Group, City Univ. London, London, UK

fYear

2013

fDate

16-19 April 2013

Firstpage

222

Lastpage

229

Abstract

Surname (family name) analysis is used in geography to understand population origins, migration, identity, social norms and cultural customs, some of which are thought to have evolved over generations. Surnames exhibit good statistical properties that can be used to extract information from names data sets, such as the automatic detection of ethnic or community groups. An e-mail address often contains a surname as a substring, and this containment may be full or partial. The objective of this paper is to categorize e-mail addresses based on the semantics of surnames. This is achieved in two phases. The first phase deals with surname representation and clustering: a vector space model is proposed, on which latent semantic analysis is performed, and clustering is done using the average-linkage method. In the second phase, an e-mail address is categorized as belonging to one of the categories discovered in the first phase. This requires substring matching, which is done efficiently using a suffix tree data structure. We perform an experimental evaluation on the 500 most frequently occurring surnames in India and the United Kingdom, and we categorize the e-mail addresses that contain these surnames as substrings.
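
The second phase described in the abstract (matching surnames as substrings of an e-mail address) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes an uncompressed suffix trie as a simpler stand-in for the suffix tree, handles only full containment, and breaks ties between several matching surnames by taking the longest one. The function names, the cluster labels and the sample addresses are all hypothetical.

```python
def build_suffix_trie(text):
    """Build an uncompressed suffix trie (nested dicts), one path per suffix.

    A real suffix tree compresses single-child chains and can be built in
    linear time; this quadratic version is only meant for short strings.
    """
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """Return True if pattern is a substring of the text indexed by trie."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


def categorize(email, surname_to_cluster):
    """Return the cluster of the longest surname fully contained in the
    local part of the address, or None if no surname matches.

    Choosing the longest match is an assumption made for this sketch.
    """
    local_part = email.split("@", 1)[0].lower()
    trie = build_suffix_trie(local_part)
    matches = [s for s in surname_to_cluster if contains(trie, s.lower())]
    if not matches:
        return None
    return surname_to_cluster[max(matches, key=len)]


# Hypothetical output of the first (clustering) phase, for illustration only.
surname_to_cluster = {
    "patel": "cluster_a",
    "sharma": "cluster_a",
    "smith": "cluster_b",
}

print(categorize("j.smith1980@example.com", surname_to_cluster))
print(categorize("rpatel@example.org", surname_to_cluster))
print(categorize("info@example.com", surname_to_cluster))
```

Indexing the address once and then walking each surname through the trie keeps each lookup proportional to the surname length, which is the property that motivates a suffix structure when many surnames are tested against the same address.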

Keywords

data analysis; electronic mail; pattern clustering; semantic networks; string matching; tree data structures; vectors; automatic detection; average-linkage method; community groups; cultural customs; e-mail address categorization; ethnic groups; family name analysis; geography; latent semantic analysis; names data set; population identity; population migration; population origins; social norms; statistical properties; substring matching; suffix tree data structure; surname clustering; surname representation; vector space model; Clustering algorithms; Clustering methods; Data mining; Matrix decomposition; Semantics; average link clustering method; suffix tree; surnames

fLanguage

English

Publisher

IEEE

Conference_Titel

Computational Intelligence and Data Mining (CIDM), 2013 IEEE Symposium on

Conference_Location

Singapore

Type

conf

DOI

10.1109/CIDM.2013.6597240

Filename

6597240