• DocumentCode
    2492387
  • Title
    ECON: An Approach to Extract Content from Web News Page
  • Author
    Guo, Yan ; Tang, Huifeng ; Song, Linhai ; Wang, Yu ; Ding, Guodong
  • Author_Institution
    Key Lab. of Network Sci. & Technol., Chinese Acad. of Sci., Beijing, China
  • fYear
    2010
  • fDate
    6-8 April 2010
  • Firstpage
    314
  • Lastpage
    320
  • Abstract
    This paper presents a simple but effective approach, named ECON, for fully automatic extraction of content from Web news pages. ECON represents a Web news page as a DOM tree and leverages the substantial features of that tree. ECON first finds a snippet-node, which wraps part of the news content, and then backtracks from the snippet-node until it finds a summary-node, which wraps the entire news content. During backtracking, ECON removes noise. Experimental results show that ECON achieves high accuracy and fully satisfies the requirements of scalable extraction. Moreover, ECON can be applied to Web news pages written in many popular languages, such as Chinese, English, French, German, Italian, Japanese, Portuguese, Russian, Spanish, and Arabic. ECON is also easy to implement.
  • Keywords
    Web sites; content management; trees (mathematics); Arabic language; Chinese language; DOM tree; ECON; English language; French language; German language; Italian language; Japanese language; Portuguese language; Russian language; Spanish language; Web news page; backtracking process; content extract; noise removal; snippet-node; summary-node; Computers; Data mining; Humans; Induction generators; Information retrieval; Laboratories; Natural languages; Paper technology; Programmable logic arrays; Web mining; Web content extraction; information extraction
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Web Conference (APWEB), 2010 12th International Asia-Pacific
  • Conference_Location
    Busan
  • Print_ISBN
    978-0-7695-4012-2
  • Electronic_ISBN
    978-1-4244-6600-9
  • Type
    conf
  • DOI
    10.1109/APWeb.2010.11
  • Filename
    5474120
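
The procedure outlined in the Abstract (find a snippet-node, then backtrack to a summary-node) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the Abstract does not state how nodes are scored or when backtracking stops, so the sketch assumes a punctuation-count heuristic for both. The names `Node`, `TreeBuilder`, `punct`, `extract`, and `SAMPLE` are hypothetical, and noise removal is reduced to ignoring `script` and `style` text.

```python
from html.parser import HTMLParser

PUNCT = set(",.;!?")                      # assumed scoring signal, not from the paper
VOID = {"br", "img", "hr", "meta", "link", "input"}
SKIP = {"script", "style"}                # text here is treated as noise

class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag, self.attrs, self.parent = tag, dict(attrs or []), parent
        self.children = []                # child elements
        self.text = ""                    # text directly inside this element

    def full_text(self):
        # Direct text followed by descendant text (not strict document order).
        return self.text + "".join(c.full_text() for c in self.children)

class TreeBuilder(HTMLParser):
    """Builds a very small DOM-like tree from HTML."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        node = Node(tag, attrs, self.cur)
        self.cur.children.append(node)
        self.cur = node

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        n = self.cur                      # close the nearest open element with this tag
        while n is not self.root and n.tag != tag:
            n = n.parent
        if n is not self.root:
            self.cur = n.parent

    def handle_data(self, data):
        if self.cur.tag not in SKIP:
            self.cur.text += data

def punct(s):
    return sum(ch in PUNCT for ch in s)

def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)

def extract(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    # Snippet-node (assumed): the element whose own text has the most punctuation.
    snippet = max(walk(builder.root), key=lambda n: punct(n.text))
    # Backtrack (assumed stop rule): climb while the parent adds punctuated text.
    summary = snippet
    while summary.parent is not None and \
            punct(summary.parent.full_text()) > punct(summary.full_text()):
        summary = summary.parent
    return snippet, summary

SAMPLE = """<html><body>
<div id="nav"><a>Home</a><a>World</a></div>
<div id="story">
<p>The council met on Monday, and the vote was close.</p>
<p>Officials said the plan, first proposed in May, will proceed.</p>
</div>
<div id="footer"><a>Contact us</a></div>
</body></html>"""

snippet, summary = extract(SAMPLE)
print(snippet.tag, summary.attrs.get("id"))
```

On this sample the snippet-node is the second paragraph, and backtracking stops at the enclosing `div`, because the navigation and footer blocks contribute no punctuated text. A real page would need a more robust stop rule and proper noise filtering.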