• DocumentCode
    2008102
  • Title

    Highly Scalable SVM Modeling with Random Granulation for Spam Sender Detection

  • Author

    Tang, Yuchun ; He, Yuanchen ; Krasser, Sven

  • Author_Institution
    Secure Comput. Corp., Alpharetta, GA
  • fYear
    2008
  • fDate
    11-13 Dec. 2008
  • Firstpage
    659
  • Lastpage
    664
  • Abstract
    Spam sender detection based on email subject data is a complex, large-scale text mining task. The dataset consists of email subject lines and the corresponding IP address of each email sender. Such an application calls for a fast and accurate classifier. In this research, a highly scalable SVM modeling method, named Granular SVM with Random granulation (GSVM-RAND), is designed. GSVM-RAND applies bootstrapping to extract a number of subsets of samples from the original training dataset. Each training subset is then projected into a feature subspace randomly selected from the original feature space. We call such a subset of samples in such a feature subspace a granule. A local SVM is then modeled in each granule. A new sample is first projected into each granule, where the local SVM is fired to make a prediction. All SVM predictions are then aggregated by the Bayesian Sum Rule for a final decision. GSVM-RAND is easy to parallelize and hence efficient and highly scalable. GSVM-RAND is also effective because it integrates a large number of weak, low-correlated local SVMs.
  • Keywords
    Bayes methods; data mining; feature extraction; learning (artificial intelligence); pattern classification; random processes; sampling methods; support vector machines; text analysis; unsolicited e-mail; Bayesian sum rule; IP address; bootstrapping method; complex large-scale text mining; email subject data; feature subspace random selection; high scalable SVM modeling; random granulation; spam sender detection; subset extraction; support vector machine; Bayesian methods; Floods; Helium; Large-scale systems; Machine learning; Machine learning algorithms; Support vector machine classification; Support vector machines; Text mining; Unsolicited electronic mail; classification ensembling; data mining; email spam detection; granular computing; information security; machine learning; svm;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Machine Learning and Applications, 2008. ICMLA '08. Seventh International Conference on
  • Conference_Location
    San Diego, CA
  • Print_ISBN
    978-0-7695-3495-4
  • Type

    conf

  • DOI
    10.1109/ICMLA.2008.51
  • Filename
    4725045
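
  • Illustrative_Sketch
    The pipeline described in the Abstract (bootstrap a sample subset, project it into a random feature subspace, train a local SVM per granule, aggregate the predictions) can be sketched roughly as below. This is not the authors' implementation. It is a minimal pure-Python sketch under several assumptions not stated in the record: the local SVM is a linear SVM trained with Pegasos-style stochastic subgradient descent, each SVM score is mapped to a pseudo-posterior with an unfitted sigmoid, the sum rule is taken with equal class priors (so it reduces to averaging per-granule posteriors), and the data are synthetic Gaussian points rather than email subject features. All function names and parameter values (`fit_gsvm_rand`, `n_granules`, `n_samples`, `n_feats`, and so on) are hypothetical.

```python
import math
import random


def sigmoid(s):
    """Numerically stable logistic function (assumed score-to-posterior map)."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def train_linear_svm(X, y, lam=0.01, epochs=30, rng=random):
    """Pegasos-style SGD on the hinge loss; labels must be -1 or +1."""
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        order = list(range(len(X)))
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [(1.0 - eta * lam) * wj for wj in w]  # L2 shrinkage
            if margin < 1.0:  # sample violates the margin
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w


def project(x, feats):
    """Keep only the granule's features; append 1.0 as a bias input."""
    return [x[j] for j in feats] + [1.0]


def fit_gsvm_rand(X, y, n_granules=15, n_samples=60, n_feats=3, seed=0):
    """Train one local SVM per granule (bootstrap rows x random features)."""
    rng = random.Random(seed)
    granules = []
    for _ in range(n_granules):
        rows = [rng.randrange(len(X)) for _ in range(n_samples)]  # with replacement
        feats = rng.sample(range(len(X[0])), n_feats)  # random feature subspace
        Xg = [project(X[i], feats) for i in rows]
        yg = [y[i] for i in rows]
        granules.append((feats, train_linear_svm(Xg, yg, rng=rng)))
    return granules


def predict_proba(granules, x):
    """Equal-prior sum rule: average the per-granule pseudo-posteriors."""
    probs = [
        sigmoid(sum(wj * xj for wj, xj in zip(w, project(x, feats))))
        for feats, w in granules
    ]
    return sum(probs) / len(probs)


def make_toy_data(n, dim, seed):
    """Synthetic two-class Gaussian data (stand-in for real features)."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.choice([-1, 1])
        X.append([rng.gauss(2.0 * label, 1.0) for _ in range(dim)])
        y.append(label)
    return X, y


X_train, y_train = make_toy_data(200, 6, seed=1)
granules = fit_gsvm_rand(X_train, y_train)
X_test, y_test = make_toy_data(100, 6, seed=2)
pred = [1 if predict_proba(granules, x) >= 0.5 else -1 for x in X_test]
accuracy = sum(p == t for p, t in zip(pred, y_test)) / len(y_test)
print(f"granules={len(granules)} accuracy={accuracy:.2f}")
```

    Because each granule is trained independently of the others, the loop in `fit_gsvm_rand` could be distributed across workers, which is the property the Abstract points to when it calls the method easy to parallelize.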