DocumentCode :
3587592
Title :
Adaptive Post Recognition
Author :
Berger, Philipp ; Hennig, Patrick ; Petrick, Dominic ; Pursche, Marcel ; Meinel, Christoph
Author_Institution :
Hasso-Plattner-Inst., Univ. of Potsdam, Potsdam, Germany
fYear :
2014
Firstpage :
1
Lastpage :
8
Abstract :
Blogs, news portals and discussion forums are of high interest for today's social interaction research. However, automatic information extraction from the raw HTML pages of these media channels remains a well-known problem. We introduce a novel approach that infers website templates from feeds, the syndication format of blogs and news portals. In contrast to related approaches that infer templates by clustering generic pages, we do not rely on a manually annotated training set. Instead, we use the feeds and their linked articles as the training set to identify characteristic XPaths. These paths identify the exact article content and article properties such as title, author and publishing date. Furthermore, we can use these paths to detect article pages that are no longer linked from feeds. We show the precision gain by comparing our article content extraction with an alternative approach, e.g. boilerplate.
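The idea of using feed entries as labels for their linked pages can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`element_paths`, `path_for_text`, `characteristic_path`) are hypothetical, paths are simplified to tag names only, and pages are assumed to be well-formed XHTML so that the standard-library XML parser can read them. Real pages would likely need a tolerant HTML parser.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def element_paths(node, prefix=""):
    """Yield (path, element) pairs in document order.

    The path is a simplified, tag-only XPath such as /html/body/div/h1
    (an assumption of this sketch; no indices or attributes are used).
    """
    path = prefix + "/" + node.tag
    yield path, node
    for child in node:
        yield from element_paths(child, path)


def path_for_text(page_xhtml, feed_text):
    """Return the path of the first element whose own text equals feed_text."""
    root = ET.fromstring(page_xhtml)
    target = " ".join(feed_text.split())  # normalize whitespace
    for path, node in element_paths(root):
        own_text = " ".join((node.text or "").split())
        if own_text and own_text == target:
            return path
    return None


def characteristic_path(pages_and_titles):
    """Pick the path that most often locates the feed title across pages."""
    votes = Counter()
    for page, title in pages_and_titles:
        path = path_for_text(page, title)
        if path is not None:
            votes[path] += 1
    return votes.most_common(1)[0][0] if votes else None
```

Given two article pages that share a layout and the titles taken from their feed entries, `characteristic_path` returns the one path that locates the title on both pages. That path could then be applied to pages no longer linked from the feed.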
Keywords :
Web sites; hypermedia markup languages; information retrieval; pattern clustering; portals; Website template; adaptive post recognition; article page detection; automatic information extraction; blogs; characteristic XPath identification; discussion forum; feeds; generic page clustering; media channels; news portal; raw HTML page; social interaction research; syndication format; training set; Blogs; Containers; Data mining; Feeds; HTML; Web pages; XML;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Advances in Social Networks Analysis and Mining (ASONAM), 2014 IEEE/ACM International Conference on
Type :
conf
DOI :
10.1109/ASONAM.2014.7092993
Filename :
7092993