Extraction of Web News from Web Pages Using a Ternary Tree Approach

Author

Debina Laishram;Merin Sebastian

Author_Institution

Dept. of Comput. Sci. &

fYear

2015

fDate

5/1/2015 12:00:00 AM

Firstpage

628

Lastpage

633

Abstract

Given the spread of information available on the World Wide Web, the pursuit of quality data appears effortless and simple, yet it remains a significant matter of concern. Various extractors and wrapper systems with advanced techniques have been studied that retrieve the desired data from a collection of web pages. In this paper we propose a method for extracting news content from multiple news web sites by exploiting the similar patterns that occur in their representation, such as the date, place, and content of the news. The method overcomes the cost and space constraints observed in previous studies, which work on a single web document at a time. It is an unsupervised web extraction technique that learns extraction rules from the web pages by creating a ternary tree, which expands when a series of common tags is found in the pages, and uses these rules to build a pattern representing the structure of the pages. The pattern can then be used to extract news from other, new web pages. The analysis and the results on real-time web sites validate the effectiveness of our approach.

Keywords

"Data mining","HTML","Web pages","Noise","Head","Business","Semantics"

Publisher

IEEE

Conference_Titel

Advances in Computing and Communication Engineering (ICACCE), 2015 Second International Conference on

Print_ISBN

978-1-4799-1733-4

Type

conf

DOI

10.1109/ICACCE.2015.38

Filename

7306759