DocumentCode
3677891
Title
Extraction of Web News from Web Pages Using a Ternary Tree Approach
Author
Debina Laishram;Merin Sebastian
Author_Institution
Dept. of Comput. Sci. &
fYear
2015
fDate
5/1/2015 12:00:00 AM
Firstpage
628
Lastpage
633
Abstract
Given the spread of information available on the World Wide Web, the pursuit of quality data appears effortless and simple, yet it remains a significant matter of concern. Various extractors and wrapper systems with advanced techniques have been studied that retrieve the desired data from a collection of web pages. In this paper, we propose a method for extracting news content from multiple news web sites by considering the occurrence of similar patterns in their representation, such as the date, place, and content of the news. The method overcomes the cost and space constraints observed in previous studies, which work on a single web document at a time. It is an unsupervised web extraction technique that builds a pattern representing the structure of the pages, using extraction rules learned from the web pages by creating a ternary tree that expands when a series of common tags is found in the web pages. The pattern can then be used to extract news from other, new web pages. The analysis and the results on real-time web sites validate the effectiveness of our approach.
Keywords
"Data mining","HTML","Web pages","Noise","Head","Business","Semantics"
Publisher
IEEE
Conference_Titel
Advances in Computing and Communication Engineering (ICACCE), 2015 Second International Conference on
Print_ISBN
978-1-4799-1733-4
Type
conf
DOI
10.1109/ICACCE.2015.38
Filename
7306759