• DocumentCode
    3008459
  • Title

    E-mail address categorization based on semantics of surnames

  • Author

    Veluru, Suresh ; Rahulamathavan, Yogachandran ; Viswanath, Pramod ; Longley, Paul ; Rajarajan, Muttukrishnan

  • Author_Institution
    Inf. Security Group, City Univ. London, London, UK
  • fYear
    2013
  • fDate
    16-19 April 2013
  • Firstpage
    222
  • Lastpage
    229
  • Abstract
    Surname (family name) analysis is used in geography to understand population origins, migration, identity, social norms and cultural customs, some of which are thought to have evolved over generations. Surnames exhibit good statistical properties that can be used to extract information from names data sets, such as automatically detecting ethnic or community groups. An e-mail address often contains a surname as a substring, either in full or in part. The objective of this paper is to categorize e-mail addresses based on the semantics of surnames. This is achieved in two phases. The first phase deals with surname representation and clustering: a vector space model is proposed, latent semantic analysis is performed on it, and clustering is done using the average-linkage method. In the second phase, an e-mail address is assigned to one of the categories discovered in the first phase. This requires substring matching, which is done efficiently using a suffix tree data structure. We perform an experimental evaluation on the 500 most frequently occurring surnames in India and the United Kingdom, and we categorize the e-mail addresses that contain these surnames as substrings.
  • Keywords
    data analysis; electronic mail; pattern clustering; semantic networks; string matching; tree data structures; vectors; automatic detection; average-linkage method; community groups; cultural customs; e-mail address categorization; ethnic groups; family name analysis; geography; latent semantic analysis; names data set; population identity; population migration; population origins; social norms; statistical properties; substring matching; suffix tree data structure; surname clustering; surname representation; vector space model; Clustering algorithms; Clustering methods; Data mining; Electronic mail; Matrix decomposition; Semantics; Vectors; Vector space model; average link clustering method; latent semantic analysis; suffix tree; surnames;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computational Intelligence and Data Mining (CIDM), 2013 IEEE Symposium on
  • Conference_Location
    Singapore
  • Type

    conf

  • DOI
    10.1109/CIDM.2013.6597240
  • Filename
    6597240
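
The second phase described in the abstract (finding a known surname inside an e-mail address and returning its category) can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a naive, uncompressed suffix trie in place of a true suffix tree, handles only full containment of a surname, and the `SURNAME_CATEGORY` map, the category labels, and all function names are hypothetical stand-ins for the clusters that the first phase would produce.

```python
# Hypothetical surname-to-category map. In the paper the categories come from
# clustering; these labels are made up purely for illustration.
SURNAME_CATEGORY = {
    "patel": "cluster_A",
    "smith": "cluster_B",
    "jones": "cluster_B",
}


def build_suffix_trie(text):
    """Build an uncompressed suffix trie of `text` as nested dicts.

    Assumption: a plain trie (quadratic space) stands in for a compressed
    suffix tree, which is acceptable for short e-mail local parts.
    """
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """Return True if `pattern` is a substring of the indexed text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


def categorize_email(address, surname_category=SURNAME_CATEGORY):
    """Return (surname, category) for the longest surname fully contained in
    the local part of `address`, or None if no known surname matches."""
    local_part = address.split("@", 1)[0].lower()
    trie = build_suffix_trie(local_part)
    matches = [s for s in surname_category if contains(trie, s)]
    if not matches:
        return None
    best = max(matches, key=len)
    return best, surname_category[best]
```

Each lookup walks at most as many trie edges as the surname has characters, so the cost of a query depends on the surname length rather than on the length of the address. Partial containment, which the abstract also mentions, is left out of this sketch.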