• DocumentCode
    3740494
  • Title
    Blog, Forum or Newspaper? Web Genre Detection Using SVMs
  • Author
    Philipp Berger;Patrick Hennig;Martin Schoenberg;Christoph Meinel
  • Author_Institution
    Hasso-Plattner-Inst., Univ. of Potsdam, Potsdam, Germany
  • Volume
    3
  • fYear
    2015
  • Firstpage
    64
  • Lastpage
    68
  • Abstract
    In recent years, blogs have become a very popular way to publish information, express opinions and hold discussions. Hence, researchers and industry are interested in analyzing the blogosphere. Due to the increasing diversity of blog usage, an initial categorization into web genres is a necessary first step before any analysis. In this research, we focus on the distinction between traditional blogs, news portals, forums and miscellaneous websites. In particular, the new distinction between news portals and blogs allows analyses to adapt to the network-specific characteristics of traditional media, with their high journalistic effort, and of more personal weblogs and their authors. We present a set of 80 features and experiment extensively with possible combinations and SVM parameters to identify the best configuration for categorizing pages into the four web genres. Our experiments show a maximum overall accuracy of 83.5%. This accuracy was reached using a combination of trained n-grams, structural properties (e.g. Twitter links) and quantitative properties such as the text's length and the number of dates.
  • Keywords
    "Blogs","Portals","Feature extraction","Support vector machines","HTML","Media","Twitter"
  • Publisher
    IEEE
  • Conference_Titel
    Web Intelligence and Intelligent Agent Technology (WI-IAT), 2015 IEEE / WIC / ACM International Conference on
  • Type
    conf
  • DOI
    10.1109/WI-IAT.2015.59
  • Filename
    7397424