ECON: An Approach to Extract Content from Web News Page

Author

Guo, Yan ; Tang, Huifeng ; Song, Linhai ; Wang, Yu ; Ding, Guodong

Author_Institution

Key Lab. of Network Sci. & Technol., Chinese Acad. of Sci., Beijing, China

fYear

2010

fDate

6-8 April 2010

Firstpage

314

Lastpage

320

Abstract

This paper presents ECON, a simple but effective approach for fully automatic extraction of content from Web news pages. ECON represents a Web news page as a DOM tree and leverages the key features of that tree. It first finds a snippet-node, which wraps part of the news content, and then backtracks from the snippet-node until it finds a summary-node, which wraps the entire news content. During backtracking, ECON removes noise. Experimental results show that ECON achieves high accuracy and fully satisfies the requirements for scalable extraction. Moreover, ECON can be applied to Web news pages written in many popular languages, such as Chinese, English, French, German, Italian, Japanese, Portuguese, Russian, Spanish, and Arabic. ECON is also easy to implement.
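The snippet-node / summary-node procedure described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the snippet-node is chosen, when backtracking stops, or how noise is detected, so everything concrete here is an assumption made for illustration and not the authors' method: the snippet-node is taken to be the element with the most direct text, noise is taken to be any sibling subtree dominated by link text, and backtracking stops when a parent adds little new text. The names `Node`, `TreeBuilder`, `is_noise`, `find_snippet`, `find_summary`, `extract`, `max_link_density`, and `min_gain` are all hypothetical.

```python
from html.parser import HTMLParser

class Node:
    """Minimal DOM element; children are Node objects or plain strings."""
    def __init__(self, tag, parent=None, attrs=None):
        self.tag, self.parent, self.attrs = tag, parent, attrs or {}
        self.children = []

    def text(self):
        parts = (c if isinstance(c, str) else c.text() for c in self.children)
        return " ".join(p for p in parts if p)

    def own_chars(self):
        # Characters of text held directly by this element.
        return sum(len(c) for c in self.children if isinstance(c, str))

    def link_chars(self):
        # Characters of text that sit inside <a> elements.
        if self.tag == "a":
            return len(self.text())
        return sum(c.link_chars() for c in self.children if isinstance(c, Node))

    def walk(self):
        yield self
        for c in self.children:
            if isinstance(c, Node):
                yield from c.walk()

class TreeBuilder(HTMLParser):
    VOID = {"br", "hr", "img", "input", "link", "meta"}
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.root = self.cur = Node("root")

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        node = Node(tag, self.cur, dict(attrs))
        self.cur.children.append(node)
        self.cur = node

    def handle_endtag(self, tag):
        n = self.cur
        while n is not self.root and n.tag != tag:
            n = n.parent
        if n is not self.root:
            self.cur = n.parent

    def handle_data(self, data):
        data = data.strip()
        if data and self.cur.tag not in self.SKIP:
            self.cur.children.append(data)

def is_noise(node, max_link_density=0.5):
    # Assumed noise test: most of the subtree's text is link text.
    total = len(node.text())
    return total > 0 and node.link_chars() / total > max_link_density

def find_snippet(root):
    # Assumed heuristic: the non-link element with the most direct text.
    return max((n for n in root.walk() if n.tag != "a"), key=Node.own_chars)

def find_summary(snippet, min_gain=0.5):
    # Backtrack towards the root, pruning noisy siblings at each step.
    base, node = len(snippet.text()), snippet
    while node.parent is not None:
        parent = node.parent
        parent.children = [c for c in parent.children
                           if isinstance(c, str) or c is node or not is_noise(c)]
        # Assumed stopping rule: the parent adds too little new text.
        if len(parent.text()) - len(node.text()) < min_gain * base:
            break
        node = parent
    return node

def extract(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return find_summary(find_snippet(builder.root))

PAGE = """<html><head><title>T</title><script>var x = 1;</script></head><body>
<div id="nav"><a href="/">Home</a> <a href="/w">World</a> <a href="/s">Sports</a></div>
<div id="story">
<p>First paragraph of the story, with enough words to count as text.</p>
<p>Second paragraph, which is the longest one here, and it keeps going for a while to win.</p>
<p>Third paragraph closes the story. <a href="/more">Related</a></p>
</div>
<div id="footer">Copyright 2010</div>
</body></html>"""

summary = extract(PAGE)
print(summary.attrs.get("id"))
print(summary.text())
```

On the hypothetical `PAGE` above, the longest paragraph serves as the snippet-node, and backtracking stops at the enclosing story element once the navigation block has been pruned. The thresholds are arbitrary and would need tuning on real pages.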

Keywords

Web sites; content management; trees (mathematics); Arabic language; Chinese language; DOM tree; ECON; English language; French language; German language; Italian language; Japanese language; Portuguese language; Russian language; Spanish language; Web news page; backtracking process; content extract; noise removal; snippet-node; summary-node; Computers; Data mining; Humans; Induction generators; Information retrieval; Laboratories; Natural languages; Paper technology; Programmable logic arrays; Web mining; Web content extraction; information extraction

fLanguage

English

Publisher

IEEE

Conference_Titel

Web Conference (APWEB), 2010 12th International Asia-Pacific

Conference_Location

Busan

Print_ISBN

978-0-7695-4012-2

Electronic_ISBN

978-1-4244-6600-9

Type

conf

DOI

10.1109/APWeb.2010.11

Filename

5474120