DocumentCode
1998530
Title
LBDA: A novel framework for extracting content from web pages
Author
Vijendran, Anna Saro ; Deepa, C.
Author_Institution
Dept. of MCA, SNR Sons Coll., Coimbatore, India
fYear
2013
fDate
19-21 Dec. 2013
Firstpage
1
Lastpage
7
Abstract
The internet offers an enormous amount of useful information, usually formatted for web users, but extracting the relevant data from various web sources is a complex task. Recently, many approaches for data extraction from web pages have been proposed, each with its own merits and limitations. This paper presents a simple but effective approach, named the layout based detachment approach (LBDA). The proposed approach extracts the main content from a web page and removes irrelevant information such as header and footer contents, navigation bars, advertisements, and other noisy images. The methodology uses the following techniques: tag tree parsing to obtain the page structure for analysis, a block acquiring page segmentation method to remove unwanted tags, and data extraction to retrieve the necessary contents. It can eliminate noise, extract the main content blocks from a web page effectively, and display the essential content to users. Performance is evaluated using the following metrics: precision, recall, accuracy, execution time, and memory usage. The implementation results show that the proposed LBDA approach performs better than the existing heuristic approach.
Keywords
Internet; information retrieval; LBDA approach; Web page; accuracy metric; block acquiring page segmentation method; content extraction; content retrieval; data extraction; execution time metric; layout based detachment approach; memory usage metric; precision metric; recall metric; tag tree parsing technique; Data mining; HTML; Layout; Noise; Visualization; Web pages; DOM tree analysis; Web mining; Web page content extraction; Web structure mining;
fLanguage
English
Publisher
IEEE
Conference_Titel
Advanced Computing and Communication Systems (ICACCS), 2013 International Conference on
Conference_Location
Coimbatore
Type
conf
DOI
10.1109/ICACCS.2013.6938748
Filename
6938748