• DocumentCode
    3677891
  • Title

    Extraction of Web News from Web Pages Using a Ternary Tree Approach

  • Author

    Debina Laishram;Merin Sebastian

  • Author_Institution
    Dept. of Comput. Sci. &
  • fYear
    2015
  • fDate
    5/1/2015 12:00:00 AM
  • Firstpage
    628
  • Lastpage
    633
  • Abstract
    With the spread of information available on the World Wide Web, the pursuit of quality data appears effortless and simple, yet it remains a significant matter of concern. Various extractors and wrapper systems with advanced techniques have been studied that retrieve the desired data from a collection of web pages. In this paper, we propose a method for extracting news content from multiple news web sites by considering the similar patterns that occur in their representation, such as the date, place, and content of the news. The method overcomes the cost and space constraints observed in previous studies, which work on a single web document at a time. It is an unsupervised web extraction technique that builds a pattern representing the structure of the pages, using extraction rules learned from the web pages by creating a ternary tree that expands when a series of common tags is found in the web pages. The pattern can then be used to extract news from other, new web pages. The analysis and results on real-time web sites validate the effectiveness of our approach.
  • Keywords
    "Data mining","HTML","Web pages","Noise","Head","Business","Semantics"
  • Publisher
    IEEE
  • Conference_Titel
    Advances in Computing and Communication Engineering (ICACCE), 2015 Second International Conference on
  • Print_ISBN
    978-1-4799-1733-4
  • Type

    conf

  • DOI
    10.1109/ICACCE.2015.38
  • Filename
    7306759