Title :
Co-training based semi-supervised Web spam detection
Author :
Wei Wang ; Xiao-Dong Lee ; An-Lei Hu ; Guang-Gang Geng
Author_Institution :
Comput. Network Inf. Center, China Internet Network Inf. Center, Beijing, China
Abstract :
Traditional Web spam classifiers are trained only on labeled data (feature/label pairs). Labeled spam instances, however, are difficult, expensive, or time-consuming to obtain, because they require the effort of experienced human annotators. Meanwhile, unlabeled samples are relatively easy to collect. Semi-supervised learning addresses the classification problem by using a large amount of unlabeled data, together with the labeled data, to build better classifiers. This paper proposes two new semi-supervised learning algorithms to boost the performance of Web spam classifiers. The algorithms integrate traditional co-training with topological-dependency-based hyperlink learning. The proposed methods extend our previous work on self-training-based semi-supervised Web spam detection. Experimental results with 100/200 labeled samples on the standard WEBSPAM-UK2006 benchmark show that the algorithms are effective.
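For readers unfamiliar with co-training, the following is a minimal sketch of the generic two-view scheme that the abstract builds on. It is not the paper's algorithms and omits the topological dependency based hyperlink learning entirely. Every name in it (CentroidClassifier, co_train, view_a, view_b, the toy feature values) is a hypothetical illustration, and the nearest-centroid base learner is a stand-in chosen only to keep the example dependency-free. It assumes the initial labeled set contains both classes.

```python
# Generic co-training sketch; NOT the method proposed in the paper.
# All names and data below are hypothetical illustrations.
from math import dist


class CentroidClassifier:
    """Tiny nearest-centroid classifier used as a stand-in base learner."""

    def fit(self, rows, labels):
        self.centroids = {}
        for label in set(labels):
            members = [r for r, y in zip(rows, labels) if y == label]
            self.centroids[label] = [sum(c) / len(members) for c in zip(*members)]
        return self

    def predict_with_margin(self, row):
        # Margin = gap between the two smallest centroid distances;
        # a larger gap is treated as a more confident prediction.
        ranked = sorted((dist(row, c), y) for y, c in self.centroids.items())
        margin = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else 0.0
        return ranked[0][1], margin


def co_train(view_a, view_b, labels, rounds=5, per_round=2):
    """Co-train two classifiers; labels[i] is None for unlabeled samples."""
    labels = list(labels)
    clf_a, clf_b = CentroidClassifier(), CentroidClassifier()

    def fit_both():
        known = [i for i, y in enumerate(labels) if y is not None]
        clf_a.fit([view_a[i] for i in known], [labels[i] for i in known])
        clf_b.fit([view_b[i] for i in known], [labels[i] for i in known])

    for _ in range(rounds):
        if all(y is not None for y in labels):
            break
        fit_both()
        # Each view's classifier pseudo-labels the unlabeled samples it is
        # most confident about; those labels feed the next round of both.
        for clf, view in ((clf_a, view_a), (clf_b, view_b)):
            pending = [i for i, y in enumerate(labels) if y is None]
            scored = [(clf.predict_with_margin(view[i]), i) for i in pending]
            scored.sort(key=lambda item: -item[0][1])
            for (predicted, _margin), i in scored[:per_round]:
                labels[i] = predicted

    fit_both()  # final fit on original plus pseudo-labeled samples
    return clf_a, clf_b, labels


# Toy data: view_a stands in for content features, view_b for link features.
# Label 0 = normal, 1 = spam, None = unlabeled.
view_a = [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0],
          [0.05, 0.1], [0.95, 0.9], [0.1, 0.1], [0.9, 0.95]]
view_b = [[0.0], [0.1], [1.0], [0.9], [0.05], [0.95], [0.15], [0.85]]
initial = [0, 0, 1, 1, None, None, None, None]

clf_a, clf_b, final_labels = co_train(view_a, view_b, initial)
print(final_labels)
```

The two views are kept separate so that each classifier can contribute pseudo-labels the other would not have found confidently on its own, which is the core idea of co-training. In practice the base learner, the confidence measure, and the number of samples added per round would all need tuning.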
Keywords :
Internet; security of data; Web spam classifiers; classification problem; cotraining based semisupervised Web spam detection; human annotators; hyperlink learning; self-training based semisupervised Web spam detection; semisupervised learning algorithms; spam instances; topological dependency; Feature extraction; Information retrieval; Prediction algorithms; Semisupervised learning; Standards; Training; Unsolicited electronic mail;
Conference_Title :
Fuzzy Systems and Knowledge Discovery (FSKD), 2013 10th International Conference on
Conference_Location :
Shenyang
DOI :
10.1109/FSKD.2013.6816301