• DocumentCode
    2698236
  • Title

    A generic Web news extraction approach

  • Author

    Dong, Yongquan ; Li, Qingzhong ; Yan, Zhongmin ; Ding, Yanhui

  • Author_Institution
    Sch. of Comput. Sci. & Technol., Shandong Univ., Jinan
  • fYear
    2008
  • fDate
    20-23 June 2008
  • Firstpage
    179
  • Lastpage
    183
  • Abstract
    With the development of the Internet, the Web is becoming the largest data repository ever available in the history of humankind. Major efforts have been made to provide efficient access to relevant information within Web pages. Most previous work relies on site-specific templates. When information such as news must be extracted from many different sites, a template has to be created for every site, which costs considerable time and effort. In this paper, we present a generic news extraction method that easily identifies news content using a set of combined heuristics and extracts every part of a news item according to a predefined schema. Experimental results indicate that our approach is effective in extracting news across Web sites.
  • Keywords
    Internet; humanities; information retrieval; Web sites; generic Web news extraction approach; news content identification; Automation; Color; Computer science; Costs; Data mining; History; Navigation; Publishing; Web pages
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information and Automation, 2008. ICIA 2008. International Conference on
  • Conference_Location
    Changsha
  • Print_ISBN
    978-1-4244-2183-1
  • Electronic_ISBN
    978-1-4244-2184-8
  • Type

    conf

  • DOI
    10.1109/ICINFA.2008.4607992
  • Filename
    4607992