• DocumentCode
    1911902
  • Title

    A Template Independent Method for Large Online News Content Extraction

  • Author

    Wu, Yu-Chieh ; Yang, Jie-Chi

  • Author_Institution
    Dept. of Commun. & Manage., Ming-Chuan Univ., Taipei, Taiwan
  • fYear
    2012
  • fDate
    20-22 Sept. 2012
  • Firstpage
    254
  • Lastpage
    257
  • Abstract
    Online news provides a convenient way for users to read the latest news. Building an online news corpus is important for many text mining and data mining tasks. Creating web news data requires constructing a set of HTML parsing rules to identify the content text. When a website changes its layout style, the parsing rules (the wrapper) must be reconstructed. In this paper, we address this issue and propose a news content recognition algorithm that is portable across different languages and domains. Our method first scans the entire HTML document and detects a set of candidate blocks. Second, the proposed weighted scoring function, which combines stop word language models and HTML penalty functions, is used to rank the importance of each candidate. We then check the score of the highest-ranked block against a predefined threshold value. To validate the approach, we conduct experiments using 533 online news HTML files from 24 web sites. The empirical study shows that our method achieves a macro F-measure of ~95% in recognizing news content.
  • Keywords
    data mining; grammars; hypermedia markup languages; information resources; text analysis; HTML document; HTML parsing rules; HTML penalty functions; Web site; data mining; large online news content extraction; layout style; macro F-measure rate; news content recognition; online news corpus; stop word language models; template independent method; text mining; weighted scoring function; Data mining; Equations; HTML; Mathematical model; Testing; Web sites; content text recognition; information extraction; language model; text corpus construction; text mining;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Advanced Applied Informatics (IIAI-AAI), 2012 IIAI International Conference on
  • Conference_Location
    Fukuoka
  • Print_ISBN
    978-1-4673-2719-0
  • Type

    conf

  • DOI
    10.1109/IIAI-AAI.2012.58
  • Filename
    6337198
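
The three steps named in the abstract (detect candidate blocks, rank them with a weighted score, check the best one against a threshold) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the scoring formula, so a plain stop-word count stands in for the stop word language models and a link-text count stands in for the HTML penalty functions. The names STOP_WORDS, BLOCK_TAGS, BlockCollector, score_block, extract_content, link_weight and threshold are all hypothetical, as are the default values.

```python
from html.parser import HTMLParser

# Assumption: a tiny English stop-word list; the paper's language models are not specified here.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "that", "for", "on", "was", "with"}
# Assumption: tags whose boundaries split the page into candidate blocks.
BLOCK_TAGS = {"div", "td", "p", "article", "section"}


class BlockCollector(HTMLParser):
    """Split a page into flat candidate blocks and count link words per block."""

    def __init__(self):
        super().__init__()
        self.blocks = [{"words": [], "link_words": 0}]
        self.link_depth = 0   # > 0 while inside <a>
        self.skip_depth = 0   # > 0 while inside <script> or <style>

    def _new_block(self):
        # Start a new block only if the current one already holds text.
        if self.blocks[-1]["words"]:
            self.blocks.append({"words": [], "link_words": 0})

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._new_block()
        elif tag == "a":
            self.link_depth += 1
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._new_block()
        elif tag == "a":
            self.link_depth = max(0, self.link_depth - 1)
        elif tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)

    def handle_data(self, data):
        if self.skip_depth:
            return
        words = data.split()
        block = self.blocks[-1]
        block["words"].extend(words)
        if self.link_depth:
            block["link_words"] += len(words)


def score_block(block, link_weight=1.0):
    """Stand-in score: stop-word count minus a weighted link-text penalty."""
    stop_count = sum(w.lower().strip(".,;:!?\"'") in STOP_WORDS for w in block["words"])
    return stop_count - link_weight * block["link_words"]


def extract_content(html, threshold=3.0):
    """Return the text of the best-scoring block, or None if it is below the threshold."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    best = max(parser.blocks, key=score_block)
    if score_block(best) < threshold:
        return None
    return " ".join(best["words"])
```

Calling extract_content on a page returns the text of the single highest-scoring block, or None when no block clears the threshold. Navigation bars tend to score low under this stand-in because their text is mostly link text with few stop words.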