Title :
Adaptive Post Recognition
Author :
Berger, Philipp ; Hennig, Patrick ; Petrick, Dominic ; Pursche, Marcel ; Meinel, Christoph
Author_Institution :
Hasso-Plattner-Inst., Univ. of Potsdam, Potsdam, Germany
Abstract :
Blogs, news portals, and discussion forums are of high interest for today's social interaction research, but automatically extracting information from the raw HTML pages of these media channels remains a well-known problem. We introduce a novel approach that infers website templates from feeds, the syndication format of blogs and news portals. In contrast to related approaches that infer templates by clustering generic pages, we do not rely on a manually annotated training set. Instead, we use the feeds and their linked articles as a training set to identify characteristic XPaths. These paths identify the exact article content and article properties such as title, author, and publishing date. Furthermore, we can use these paths to detect article pages that are no longer linked from feeds. We show the precision gain by comparing our article content extraction with an alternative approach, e.g., boilerplate.
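The idea of using feed entries as a training set for characteristic XPaths can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes well-formed XHTML pages, matches feed text to page elements by exact normalized text (a simplification), and uses hypothetical helper names (`element_paths`, `best_path`, `infer_template`, `extract`) that do not come from the paper.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def element_paths(root):
    """Yield (path, element) pairs; class attributes become path predicates."""
    def walk(elem, prefix):
        step = elem.tag
        cls = elem.get("class")
        if cls:
            step += f"[@class='{cls}']"
        path = f"{prefix}/{step}"
        yield path, elem
        for child in elem:
            yield from walk(child, path)
    yield from walk(root, "")


def normalized_text(elem):
    """Concatenate all text below elem and collapse whitespace."""
    return " ".join("".join(elem.itertext()).split())


def best_path(page_xhtml, feed_text):
    """Path of the deepest element whose text equals the feed entry's text."""
    root = ET.fromstring(page_xhtml)
    target = " ".join(feed_text.split())
    matches = [p for p, e in element_paths(root) if normalized_text(e) == target]
    return max(matches, key=lambda p: p.count("/")) if matches else None


def infer_template(training_pairs):
    """Vote over (page, feed_text) pairs; return the most frequent path."""
    votes = Counter()
    for page, feed_text in training_pairs:
        path = best_path(page, feed_text)
        if path is not None:
            votes[path] += 1
    return votes.most_common(1)[0][0] if votes else None


def extract(page_xhtml, path):
    """Apply an inferred path to a page, e.g. one no longer linked from a feed."""
    root = ET.fromstring(page_xhtml)
    for candidate, elem in element_paths(root):
        if candidate == path:
            return normalized_text(elem)
    return None
```

In this sketch, the path that wins the vote across several feed-linked articles serves as the template for the article content. Calling `extract` with that path on a page that is absent from the feed returns the text at the same position, if the page follows the same template.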
Keywords :
Web sites; hypermedia markup languages; information retrieval; pattern clustering; portals; Website template; adaptive post recognition; article page detection; automatic information extraction; blogs; characteristic XPath identification; discussion forum; feeds; generic page clustering; media channels; news portal; raw HTML page; social interaction research; syndication format; training set; Blogs; Containers; Data mining; Feeds; HTML; Web pages; XML;
Conference_Title :
Advances in Social Networks Analysis and Mining (ASONAM), 2014 IEEE/ACM International Conference on
DOI :
10.1109/ASONAM.2014.7092993