Title :
Noise removing from Web pages using neural network
Author :
Htwe, Thanda ; Hla, Khin Haymar Saw
Author_Institution :
Univ. of Comput. Studies, Yangon, Myanmar
Abstract :
With the amount of information available on the Internet growing exponentially, users urgently need an effective technique for separating useful information from unnecessary information. Cleaning Web pages before Web data extraction is critical for improving the performance of information retrieval and information extraction. We therefore investigate removing the various noise patterns in Web pages, rather than extracting the relevant content, in order to obtain the main content information. In this paper, we propose an approach that detects multiple noise patterns and removes them from the Web pages of any Web site. The method first builds a DOM tree for the Web page. Our approach is based on the basic idea of Case-Based Reasoning (CBR): it finds noise patterns in the current Web page by matching them against similar noise patterns kept in the case base. We also apply a back propagation neural network algorithm to classify the stored noise patterns by matching similar noise data in the current Web page. We have implemented our method on several commercial Web sites and news Web sites to evaluate its performance and improvement. Experimental results show that the approach improves accuracy and effectiveness.
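The pipeline described in the abstract (build a DOM tree, then match subtrees against stored noise cases) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the token-set features, the Jaccard similarity, the 0.5 threshold, and every name (DomBuilder, features, is_noise, remove_noise, CASE_BASE) are assumptions made for illustration, and the back propagation classifier is left out entirely.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """One element of a simplified DOM tree."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []  # child Node objects and text strings


class DomBuilder(HTMLParser):
    """Builds a simplified DOM tree from an HTML string."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never get children
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())


def features(node):
    """Hypothetical feature set: tag, class/id tokens, and child tags."""
    tokens = {"tag:" + node.tag}
    for key in ("class", "id"):
        for value in (node.attrs.get(key) or "").split():
            tokens.add(key + ":" + value.lower())
    for child in node.children:
        if isinstance(child, Node):
            tokens.add("child:" + child.tag)
    return tokens


def jaccard(a, b):
    """Jaccard similarity of two token sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_noise(node, case_base, threshold=0.5):
    """A subtree counts as noise if it resembles any stored case."""
    feats = features(node)
    return any(jaccard(feats, case) >= threshold for case in case_base)


def clean_text(node, case_base, threshold):
    """Collect text in document order, skipping noise subtrees."""
    parts = []
    for child in node.children:
        if isinstance(child, str):
            parts.append(child)
        elif not is_noise(child, case_base, threshold):
            parts.extend(clean_text(child, case_base, threshold))
    return parts


def remove_noise(html, case_base, threshold=0.5):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return " ".join(clean_text(builder.root, case_base, threshold))


# Hypothetical case base: one token set per known noise pattern.
CASE_BASE = [
    {"tag:div", "class:nav", "child:a"},
    {"tag:div", "class:ads", "child:img"},
]

PAGE = """
<html><body>
<div class="nav"><a href="/">Home</a><a href="/news">News</a></div>
<div id="story"><h1>Headline</h1><p>Main article text.</p></div>
<div class="ads"><img src="banner.png"><a href="/buy">Buy now</a></div>
</body></html>
"""

print(remove_noise(PAGE, CASE_BASE))  # prints: Headline Main article text.
```

In this sketch the navigation and advertisement blocks are dropped because their token sets are close to a stored case, while the story block is kept. A trained classifier, as the abstract proposes, would replace the fixed similarity threshold.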
Keywords :
Internet; backpropagation; case-based reasoning; information retrieval; interference suppression; neural nets; DOM tree; Web data extraction; Web pages noise removing; Web sites; back propagation neural network algorithm; multiple noise pattern detection; noise case based reasoning; Cleaning; Computer networks; Content based retrieval; Data mining; IP networks; Navigation; Neural networks; Pattern matching; Web pages; DOM; Noise detection; information extraction; neural network; noise elimination
Conference_Title :
Computer and Automation Engineering (ICCAE), 2010 The 2nd International Conference on
Conference_Location :
Singapore
Print_ISBN :
978-1-4244-5585-0
Electronic_ISBN :
978-1-4244-5586-7
DOI :
10.1109/ICCAE.2010.5451952