Title :
Extraction of Web News from Web Pages Using a Ternary Tree Approach
Author :
Debina Laishram;Merin Sebastian
Author_Institution :
Dept. of Comput. Sci. &
fDate :
5/1/2015 12:00:00 AM
Abstract :
With the spread of information available on the World Wide Web, the pursuit of quality data appears effortless and simple, yet it remains a significant matter of concern. Various extractors and wrapper systems with advanced techniques have been studied that retrieve the desired data from a collection of web pages. In this paper we propose a method for extracting news content from multiple news web sites by considering the similar patterns that occur in their representation, such as the date, the place and the content of the news. The method overcomes the cost and space constraints observed in previous studies, which work on a single web document at a time. It is an unsupervised web extraction technique that builds a pattern representing the structure of the pages using extraction rules learned from the web pages, by creating a ternary tree that expands when a series of common tags is found in the web pages. The pattern can then be used to extract news from new web pages. Analysis and results on real-time web sites validate the effectiveness of our approach.
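The idea of learning a shared tag pattern from several pages can be illustrated with a minimal sketch. The abstract does not specify how the ternary tree is built, so this sketch does not reproduce it; as a plainly simpler stand-in, it compares the opening-tag sequences of two pages with Python's standard `difflib.SequenceMatcher` and keeps the runs of tags they share. The names `TagCollector`, `tag_sequence` and `common_tag_runs`, the `min_len` parameter, and the two sample pages are all hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records the sequence of opening tags seen in a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of opening tag names in document order."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def common_tag_runs(seq_a, seq_b, min_len=2):
    """Return the runs of consecutive tags shared by both sequences.

    Assumption: a shared run of at least `min_len` tags is treated as part
    of the common page structure. This is a stand-in for the paper's
    ternary tree, not a reconstruction of it.
    """
    matcher = SequenceMatcher(None, seq_a, seq_b, autojunk=False)
    return [
        seq_a[block.a : block.a + block.size]
        for block in matcher.get_matching_blocks()
        if block.size >= min_len
    ]


# Two made-up news pages with the same article layout but different chrome.
page_a = (
    "<html><body><div><h1>Title A</h1><span>1 May 2015</span>"
    "<p>Story A</p></div></body></html>"
)
page_b = (
    "<html><body><ul><li>Menu</li></ul><div><h1>Title B</h1>"
    "<span>2 May 2015</span><p>Story B</p></div></body></html>"
)

runs = common_tag_runs(tag_sequence(page_a), tag_sequence(page_b))
```

Here the navigation list that appears only in the second page drops out, while the shared `div`, `h1`, `span`, `p` run survives as a candidate pattern for locating the headline, date and story text on further pages.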
Keywords :
"Data mining","HTML","Web pages","Noise","Head","Business","Semantics"
Conference_Titel :
Advances in Computing and Communication Engineering (ICACCE), 2015 Second International Conference on
Print_ISBN :
978-1-4799-1733-4
DOI :
10.1109/ICACCE.2015.38