DocumentCode :
1998530
Title :
LBDA: A novel framework for extracting content from web pages
Author :
Vijendran, Anna Saro ; Deepa, C.
Author_Institution :
Dept. of MCA, SNR Sons Coll., Coimbatore, India
fYear :
2013
fDate :
19-21 Dec. 2013
Firstpage :
1
Lastpage :
7
Abstract :
The internet presents an enormous amount of useful information, usually formatted for web users, but extracting the relevant data from diverse web sources is a complex task. Many approaches to data extraction from web pages have been proposed recently, each with its own merits and limitations. This paper presents a simple but effective approach, named the layout based detachment approach (LBDA). The proposed approach extracts the main content from a web page and removes irrelevant information such as header and footer contents, navigation bars, advertisements and other noisy images. The methodology uses the following techniques: tag tree parsing to get the structure for analysis, a block acquiring page segmentation method to remove unwanted tags, and data extraction to retrieve the necessary contents. It eliminates noise, extracts the main content blocks from the web page effectively, and displays the essential content to users. Performance is evaluated using the following metrics: precision, recall, accuracy, execution time and memory usage. The implementation results show that the proposed LBDA approach performs better than the existing heuristic approach.
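The three stages named in the abstract (tag tree parsing, removal of unwanted blocks, and content extraction) can be illustrated with a minimal sketch. This is not the authors' LBDA implementation, which the record does not describe in detail. The noise rule used here (dropping any subtree rooted at a tag in `NOISE_TAGS`) and all names (`Node`, `TagTreeBuilder`, `extract_content`) are assumptions made for illustration only.

```python
from html.parser import HTMLParser

# Assumption: these tags mark noise blocks. The paper's actual
# segmentation rules are not given in the abstract.
NOISE_TAGS = {"header", "footer", "nav", "aside", "script", "style"}
# Void elements never have children, so the builder must not descend into them.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """One element of the tag tree (hypothetical helper type)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # child Nodes and text strings, in document order


class TagTreeBuilder(HTMLParser):
    """Stage 1: parse HTML into a simple tag tree."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.children.append(text)


def collect_text(node):
    """Stages 2 and 3: skip noise subtrees, gather the remaining text."""
    parts = []
    for child in node.children:
        if isinstance(child, str):
            parts.append(child)
        elif child.tag not in NOISE_TAGS:
            parts.extend(collect_text(child))
    return parts


def extract_content(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return " ".join(collect_text(builder.root))


page = (
    "<html><body><header>Site</header><nav><a>Home</a></nav>"
    "<div><p>Main story.</p><p>More text.</p></div>"
    "<footer>Copyright</footer></body></html>"
)
print(extract_content(page))  # prints "Main story. More text."
```

A tag-name rule like this only catches noise that is marked up semantically. Layout-based methods such as the one the abstract describes would typically also need to segment generic containers such as `div` and `table` blocks.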
Keywords :
Internet; information retrieval; LBDA approach; Web page; accuracy metric; block acquiring page segmentation method; content extraction; content retrieval; data extraction; execution time metric; layout based detachment approach; memory usage metric; precision metric; recall metric; tag tree parsing technique; Data mining; HTML; Layout; Noise; Visualization; Web pages; DOM tree analysis; Web mining; Web page content extraction; Web structure mining;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Advanced Computing and Communication Systems (ICACCS), 2013 International Conference on
Conference_Location :
Coimbatore
Type :
conf
DOI :
10.1109/ICACCS.2013.6938748
Filename :
6938748