DocumentCode
691461
Title
Improving web data extraction by noise removal
Author
Narwal, Neetu
fYear
2013
fDate
20-21 Sept. 2013
Firstpage
388
Lastpage
395
Abstract
The Internet is the largest information repository, holding a large volume of data on almost every subject. As websites grow more complex, adapting a web page to the user and the hardware platform has become more challenging. Various web extraction systems exist, but these varied requirements challenge researchers to devise new methods that extract information more accurately and more specifically to the user's needs. Our work focuses on removing noise from web pages, since noise removal improves the performance of web mining techniques such as classification, clustering, and search engine web crawling many times over. We have developed an algorithm that extracts the Visual Blocks of a web page using the DOM and visual characteristics and then converts them into a Pattern Tree. The Pattern Trees of different pages from a single website are mapped against each other to find similarity patterns among the pages. For each node, a Node Importance Measure is calculated and used to discriminate noise from the main elements of the page. The approach relies on the general observation that pages of a single website often follow a similar layout pattern and that noise elements are repeated in almost all of its pages.
Keywords
Web sites; data mining; noise; DOM; Internet; Web crawling; Web data extraction; Web extraction systems; Web mining technique; Web page; classification; clustering; information extraction; information repository; layout pattern; node importance measure; noise elements; noise removal; pattern tree; search engine; similarity pattern; visual blocks; visual characteristics; Node Importance; Similarity Count; Style Importance
fLanguage
English
Publisher
IET
Conference_Title
Communication and Computing (ARTCom 2013), Fifth International Conference on Advances in Recent Technologies in
Conference_Location
Bangalore
Print_ISBN
978-1-84919-842-4
Type
conf
DOI
10.1049/cp.2013.2241
Filename
6843017