Title :
ECON: An Approach to Extract Content from Web News Page
Author :
Guo, Yan ; Tang, Huifeng ; Song, Linhai ; Wang, Yu ; Ding, Guodong
Author_Institution :
Key Lab. of Network Sci. & Technol., Chinese Acad. of Sci., Beijing, China
Abstract :
This paper presents ECON, a simple but effective approach for fully automatic content extraction from Web news pages. ECON represents a Web news page as a DOM tree and leverages substantial features of that tree. It first finds a snippet-node, which wraps part of the news content, and then backtracks from the snippet-node until it reaches a summary-node, which wraps the entire news content. Noise is removed during backtracking. Experimental results show that ECON achieves high accuracy and fully satisfies the requirements of scalable extraction. Moreover, ECON can be applied to Web news pages written in many popular languages, including Chinese, English, French, German, Italian, Japanese, Portuguese, Russian, Spanish, and Arabic. ECON is also easy to implement.
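The snippet-node / summary-node procedure described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's actual selection or stopping criteria, so everything below is an assumption made for illustration: the `Node` class, the "longest direct text" rule for picking the snippet-node, the `growth` threshold used to stop backtracking, and the tag-based `is_noise` test are all hypothetical stand-ins, not ECON's published rules.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional

@dataclass
class Node:
    """Hypothetical, simplified DOM node (not a real parser's node type)."""
    tag: str
    text: str = ""  # text directly inside this node
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child

def walk(node: Node) -> Iterator[Node]:
    """Depth-first traversal of the subtree rooted at node."""
    yield node
    for child in node.children:
        yield from walk(child)

def full_text(node: Node) -> str:
    """All text wrapped by node, in document order."""
    return " ".join(n.text for n in walk(node) if n.text)

def is_noise(node: Node) -> bool:
    # Assumed heuristic: treat these subtrees as noise.
    return node.tag in {"script", "style", "nav", "aside"}

def find_snippet_node(root: Node) -> Node:
    # Assumed heuristic: the node with the longest direct text
    # is taken to wrap part of the news content.
    return max(walk(root), key=lambda n: len(n.text))

def find_summary_node(snippet: Node, growth: float = 1.2) -> Node:
    """Backtrack from the snippet-node, pruning noisy siblings on the way up."""
    current = snippet
    while current.parent is not None:
        parent = current.parent
        parent.children = [c for c in parent.children
                           if c is current or not is_noise(c)]
        # Assumed stopping rule: stop once the parent adds little new text.
        if len(full_text(parent)) < growth * len(full_text(current)):
            break
        current = parent
    return current

# Usage on a tiny hand-built page.
root = Node("body")
nav = root.add(Node("nav"))
nav.add(Node("a", "Home"))
nav.add(Node("a", "Sports"))
content = root.add(Node("div"))
content.add(Node("h1", "Title here"))
content.add(Node("p", "First paragraph of the story, reasonably long text."))
content.add(Node("p", "Second paragraph, the longest one in the article body by a clear margin."))
content.add(Node("p", "Third paragraph closes the story."))
root.add(Node("div", "Copyright"))

snippet = find_snippet_node(root)
summary = find_summary_node(snippet)
print(full_text(summary))
```

On this toy page the sketch climbs from the longest paragraph to the enclosing `div` and stops there, because the `body` element adds only a short footer once the navigation subtree has been pruned. A real page would need an HTML parser and more careful noise criteria.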
Keywords :
Web sites; content management; trees (mathematics); Arabic language; Chinese language; DOM tree; ECON; English language; French language; German language; Italian language; Japanese language; Portuguese language; Russian language; Spanish language; Web news page; backtracking process; content extract; noise removal; snippet-node; summary-node; Computers; Data mining; Humans; Induction generators; Information retrieval; Laboratories; Natural languages; Paper technology; Programmable logic arrays; Web mining; Web content extraction; information extraction
Conference_Title :
Web Conference (APWEB), 2010 12th International Asia-Pacific
Conference_Location :
Busan
Print_ISBN :
978-0-7695-4012-2
Electronic_ISBN :
978-1-4244-6600-9
DOI :
10.1109/APWeb.2010.11