DocumentCode :
691461
Title :
Improving web data extraction by noise removal
Author :
Narwal, Neetu
fYear :
2013
fDate :
20-21 Sept. 2013
Firstpage :
388
Lastpage :
395
Abstract :
The Internet is the largest information repository, holding a large volume of data on almost every subject. As websites grow more complex, adapting a web page to the web user and the hardware platform has become more challenging. Various web extraction systems exist, but these varied requirements challenge researchers to devise new methodologies that perform information extraction more accurately and more specifically to the needs of the user. Our work focuses on noise removal from web pages, since removing the noise improves the performance of web mining techniques such as classification, clustering, and search engine web crawling many times over. We have developed an algorithm that extracts the Visual Blocks of a web page using the DOM and Visual Characteristics and then converts them to a Pattern Tree. The Pattern Trees of different web pages of a single website are mapped to find the similarity pattern among the pages. For each node, a Node Importance Measure is calculated, which is used to discriminate between noise and the main elements of the web page. It is generally observed that web pages of a single website often follow a similar layout pattern and that the noise elements are repeated in almost all of the pages.
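The closing observation of the abstract (noise elements repeat across almost all pages of a site) can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not define the Node Importance Measure or the Pattern Tree construction, so the sketch substitutes a simple cross-page repetition count. The tuple-based page representation, the function names, and the `noise_ratio` threshold are all assumptions made for illustration.

```python
from collections import Counter

# Assumption: a page is a simplified DOM node of the form (tag, text, [children]).
# This stands in for a real DOM; visual characteristics are not modelled here.

def block_signatures(node, path=""):
    """Yield a (tag path, text) signature for every node in one page tree."""
    tag, text, children = node
    here = f"{path}/{tag}"
    yield (here, text.strip())
    for child in children:
        yield from block_signatures(child, here)

def similarity_counts(pages):
    """Count in how many pages each signature occurs (at most once per page)."""
    counts = Counter()
    for page in pages:
        counts.update(set(block_signatures(page)))
    return counts

def main_content(page, pages, noise_ratio=0.8):
    """Keep text whose signature appears in fewer than noise_ratio of the pages.

    noise_ratio is a hypothetical threshold, not a value from the paper.
    """
    counts = similarity_counts(pages)
    threshold = noise_ratio * len(pages)
    return [text for path, text in block_signatures(page)
            if text and counts[(path, text)] < threshold]

def make_page(article):
    """Build a toy page with a shared navigation bar and footer."""
    return ("html", "", [
        ("div", "Home | About | Contact", []),
        ("div", article, []),
        ("div", "Copyright 2013", []),
    ])

pages = [make_page("Story A"), make_page("Story B"), make_page("Story C")]
print(main_content(pages[0], pages))
```

In this toy site the navigation bar and footer occur on every page and are dropped as noise, while the per-page article text is kept.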
Keywords :
Web sites; data mining; noise; DOM; Internet; Web crawling; Web data extraction; Web extraction systems; Web mining technique; Web page; classification; clustering; information extraction; information repository; layout pattern; node importance measure; noise elements; noise removal; pattern tree; search engine; similarity pattern; visual blocks; visual characteristics; Node Importance; Similarity Count; Style Importance
fLanguage :
English
Publisher :
IET
Conference_Title :
Fifth International Conference on Advances in Recent Technologies in Communication and Computing (ARTCom 2013)
Conference_Location :
Bangalore
Print_ISBN :
978-1-84919-842-4
Type :
conf
DOI :
10.1049/cp.2013.2241
Filename :
6843017