DocumentCode
2492387
Title
ECON: An Approach to Extract Content from Web News Page
Author
Guo, Yan ; Tang, Huifeng ; Song, Linhai ; Wang, Yu ; Ding, Guodong
Author_Institution
Key Lab. of Network Sci. & Technol., Chinese Acad. of Sci., Beijing, China
fYear
2010
fDate
6-8 April 2010
Firstpage
314
Lastpage
320
Abstract
This paper presents ECON, a simple but effective approach for fully automatic extraction of content from Web news pages. ECON represents a Web news page as a DOM tree and leverages the substantial features of that tree. ECON first finds a snippet-node, a node that wraps part of the news content, and then backtracks from the snippet-node until it finds a summary-node, which wraps the entire news content. During backtracking, ECON removes noise. Experimental results show that ECON achieves high accuracy and fully satisfies the requirements for scalable extraction. Moreover, ECON can be applied to Web news pages written in many popular languages, such as Chinese, English, French, German, Italian, Japanese, Portuguese, Russian, Spanish, and Arabic. ECON is also easy to implement.
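The snippet-node and summary-node idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not state how ECON picks the snippet-node, when it stops backtracking, or what it treats as noise, so everything concrete here is an assumption made for illustration: the `Node` class, the `NOISE_TAGS` set, the "most punctuation" rule for the snippet-node, and the "parent adds no new text" stopping rule are hypothetical stand-ins, not the authors' criteria.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional

# Assumed noise markers and punctuation set; neither comes from the paper.
NOISE_TAGS = {"script", "style", "nav", "aside", "footer"}
PUNCTUATION = set(",.;:!?，。；：！？")


@dataclass
class Node:
    """Hypothetical, simplified DOM node: a tag, its own text, and children."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def walk(node: Node) -> Iterator[Node]:
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def full_text(node: Node) -> str:
    """Concatenate the text of a node and of its whole subtree."""
    parts = [node.text] + [full_text(c) for c in node.children]
    return " ".join(p for p in parts if p)


def find_snippet_node(root: Node) -> Node:
    """Assumed heuristic: the node whose own text has the most punctuation."""
    return max(walk(root), key=lambda n: sum(ch in PUNCTUATION for ch in n.text))


def extract_content(root: Node) -> str:
    """Backtrack from the snippet-node to an assumed summary-node."""
    node = find_snippet_node(root)
    while node.parent is not None:
        parent = node.parent
        # Noise removal during backtracking: drop noisy siblings of the path.
        parent.children = [
            c for c in parent.children if c is node or c.tag not in NOISE_TAGS
        ]
        # Assumed stopping rule: the parent contributes no additional text,
        # so the current node already wraps the whole content.
        if len(full_text(parent)) == len(full_text(node)):
            break
        node = parent
    return full_text(node)


# Small hand-built page standing in for a parsed news page.
page = Node("body")
main = page.add(Node("div"))
main.add(Node("nav", "Home | World | Sports"))
article = main.add(Node("div"))
article.add(Node("p", "The council met on Monday, and the vote passed."))
article.add(Node("p", "Officials said the plan, which was delayed, starts in May."))
main.add(Node("footer", "Copyright notice"))

print(extract_content(page))
```

In this toy page the second paragraph is chosen as the snippet-node, the climb stops at the inner `div`, and the `nav` and `footer` siblings are pruned on the way up. A real implementation would build the tree from an HTML parser and would need the selection and stopping criteria described in the paper itself.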
Keywords
Web sites; content management; trees (mathematics); Arabic language; Chinese language; DOM tree; ECON; English language; French language; German language; Italian language; Japanese language; Portuguese language; Russian language; Spanish language; Web news page; backtracking process; content extract; noise removal; snippet-node; summary-node; Computers; Data mining; Humans; Induction generators; Information retrieval; Laboratories; Natural languages; Paper technology; Programmable logic arrays; Web mining; Web content extraction; information extraction
fLanguage
English
Publisher
IEEE
Conference_Titel
Web Conference (APWEB), 2010 12th International Asia-Pacific
Conference_Location
Busan
Print_ISBN
978-0-7695-4012-2
Electronic_ISBN
978-1-4244-6600-9
Type
conf
DOI
10.1109/APWeb.2010.11
Filename
5474120