DocumentCode
3740494
Title
Blog, Forum or Newspaper? Web Genre Detection Using SVMs
Author
Philipp Berger;Patrick Hennig;Martin Schoenberg;Christoph Meinel
Author_Institution
Hasso-Plattner-Inst., Univ. of Potsdam, Potsdam, Germany
Volume
3
fYear
2015
Firstpage
64
Lastpage
68
Abstract
In recent years, blogs have become a very popular way to publish information, express opinions and hold discussions. Hence, researchers and industry are interested in analyzing the blogosphere. Due to the increasing diversity of blog usage, an initial categorization into web genres is a necessary first step before any analysis. In this research, we focus on the distinction between traditional blogs, news portals, forums and miscellaneous websites. In particular, the new distinction between news portals and blogs allows analyses to adapt to the network-specific characteristics of traditional media with high journalistic effort on the one hand, and of more personal weblogs and their authors on the other. We present a set of 80 features and experiment extensively with possible combinations and SVM parameters to identify the best configuration for the categorization into the four web genres. Our experiments show a maximum overall accuracy of 83.5%. This accuracy was reached using a combination of trained n-grams, structural properties (e.g. Twitter links) and quantitative properties such as the text's length and the number of dates.
Keywords
"Blogs","Portals","Feature extraction","Support vector machines","HTML","Media","Twitter"
Publisher
ieee
Conference_Titel
Web Intelligence and Intelligent Agent Technology (WI-IAT), 2015 IEEE / WIC / ACM International Conference on
Type
conf
DOI
10.1109/WI-IAT.2015.59
Filename
7397424
Link To Document