DocumentCode
166059
Title
Structural analysis and regular expressions based noise elimination from web pages for web content mining
Author
Dutta, Arin ; Paria, Sudipta ; Golui, Tanmoy ; Kole, Dipak Kumar
Author_Institution
Dept. of Inf. Technol., St. Thomas Coll. of Eng. & Technol., Kolkata, India
fYear
2014
fDate
24-27 Sept. 2014
Firstpage
1445
Lastpage
1451
Abstract
Commercial websites usually contain noisy information blocks alongside the main content. This noise degrades the performance of web content mining, which is used to discover useful knowledge or information from web pages. In this paper, we propose a noise elimination method that applies tag-based filtering followed by structural analysis of the web page. The tag-based filtering is implemented with regular expressions. First, the filter removes several predefined HTML tags present in the web page. The resulting concise page is then passed to structural analysis to remove the remaining noise. Noisy blocks usually share the same contents and the same layouts or presentation styles across the pages of a website. In the structural analysis phase, we compare the HTML contents of the web pages crawled from a website to capture common blocks that have the same contents and layouts or presentation styles, and remove them. Because the filtering step eliminates a considerable amount of noisy content before structural analysis, the noise in the crawled web pages is reduced significantly, and the overall space and time complexity is lower than that of other noise elimination approaches. Experiments conducted on several popular commercial websites demonstrate the efficiency of the proposed method.
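The two phases described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tag list `NEGATIVE_TAGS`, the line-based notion of a "block", and the function names `filter_tags`, `split_blocks`, and `remove_common_blocks` are all assumptions made for illustration, since the abstract does not specify them.

```python
import re
from collections import Counter

# Hypothetical "negative" tags; the abstract does not list the predefined tags.
NEGATIVE_TAGS = ("script", "style", "noscript", "iframe")


def filter_tags(html):
    """Phase 1: remove each negative tag, its content, and HTML comments."""
    for tag in NEGATIVE_TAGS:
        pattern = re.compile(
            r"<{0}\b[^>]*>.*?</{0}\s*>".format(tag),
            re.IGNORECASE | re.DOTALL,
        )
        html = pattern.sub("", html)
    return re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)


def split_blocks(html):
    """Treat every non-empty line as one block (a deliberately crude assumption)."""
    return [line.strip() for line in html.splitlines() if line.strip()]


def remove_common_blocks(pages, min_share=1.0):
    """Phase 2: drop blocks that recur in at least min_share of the pages."""
    cleaned = [split_blocks(filter_tags(page)) for page in pages]
    counts = Counter()
    for blocks in cleaned:
        counts.update(set(blocks))  # count each block once per page
    threshold = min_share * len(pages)
    return [[b for b in blocks if counts[b] < threshold] for blocks in cleaned]
```

Called on two or more pages crawled from the same site, `remove_common_blocks` returns one list of remaining blocks per page. With a single page every block counts as common, so the sketch is only meaningful for multi-page input. Regular expressions cannot handle arbitrarily nested or malformed markup, so a real system would likely need a more robust block segmentation.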
Keywords
Internet; Web sites; data mining; information filtering; HTML tag based filtering; Web content mining; Web pages; Web sites; information blocks; noise elimination method; regular expressions; structural analysis; Data mining; HTML; Information filtering; Noise; Noise measurement; Web pages; Crawling; Filtering; Negative Tags; Noise; Regular Expression; Structural Analysis; Web Mining;
fLanguage
English
Publisher
IEEE
Conference_Titel
Advances in Computing, Communications and Informatics (ICACCI), 2014 International Conference on
Conference_Location
New Delhi
Print_ISBN
978-1-4799-3078-4
Type
conf
DOI
10.1109/ICACCI.2014.6968377
Filename
6968377