DocumentCode
23170
Title
Analysis on the content features and their correlation of web pages for spam detection
Author
Ji Hua ; Zhang Huaxiang
Author_Institution
Dept. of Comput. Sci., Shandong Normal Univ., Jinan, China
Volume
12
Issue
3
fYear
2015
fDate
Mar. 2015
Firstpage
84
Lastpage
94
Abstract
In the global information era, people acquire more and more information from the Internet, but the quality of search results is severely degraded by web spam. Web spam is a serious problem for search engines, and many methods have been proposed for detecting it. We examine the content features of non-spam pages in contrast to those of spam pages. The content features of non-spam pages exhibit many statistical regularities, whereas those of spam pages exhibit very few, because spam pages are generated randomly in order to increase their page rank. In this paper, we summarize the regular distributions of content features for non-spam pages and propose probability formulae for the entropy and for independent n-grams, respectively. Furthermore, we put forward formulae for calculating the correlation among multiple features. Among these, the notable content features may be used as auxiliary information for spam detection.
Keywords
Internet; content management; entropy; probability; search engines; Internet; Web page correlation; Web spam; content feature analysis; entropy; independent n-grams; multifeatures correlation; page rank; probability formulae; regularity distributions; search engines; search result quality; spam detection; statistical regularities; Electronic mail; Entropy; Feature extraction; Gaussian distribution; Probability distribution; Search engines; content features; feature correlation; spam detection; web spam;
fLanguage
English
Journal_Title
Communications, China
Publisher
IEEE
ISSN
1673-5447
Type
jour
DOI
10.1109/CC.2015.7084367
Filename
7084367