Regional Information Center for Science and Technology - Learning to Extract Content from News Webpages

DocumentCode :

2308290

Title :

Learning to Extract Content from News Webpages

Author :

Spengler, Alex ; Gallinari, Patrick

Author_Institution :

Lab. d'Inf., Univ. Pierre et Marie Curie, Paris

fYear :

2009

fDate :

26-29 May 2009

Firstpage :

709

Lastpage :

714

Abstract :

We consider the problem of content extraction from online news Web pages. To explore to what extent the syntactic markup and the visual structure of a Web page facilitate the extraction of its content, we compare two state-of-the-art classifiers as first instantiations of a general framework that allows for proper model comparison. To this end, we introduce the publicly available NEWS600 corpus, a set of 604 real world news Web pages which have been annotated with 30 semantic labels. An empirical analysis of the two models on this dataset shows that the inclusion of structural information is indeed advantageous.

Keywords :

Web sites; classification; information retrieval; random processes; support vector machines; automatic content extraction; empirical analysis; multiclass support vector machine; online news Web pages; sequential conditional random field; state-of-the-art classifier; syntactic markup; visual structure; Content based retrieval; Content management; Data mining; Information analysis; Information retrieval; Navigation; Scattering; Speech; Support vector machine classification; Support vector machines; conditional random fields; web content extraction; web content mining

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Advanced Information Networking and Applications Workshops, 2009. WAINA '09. International Conference on

Conference_Location :

Bradford

Print_ISBN :

978-1-4244-3999-7

Electronic_ISBN :

978-0-7695-3639-2

Type :

conf

DOI :

10.1109/WAINA.2009.97

Filename :

5136732

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2308290