Title :
Structural analysis and regular expressions based noise elimination from web pages for web content mining
Author :
Dutta, Arin ; Paria, Sudipta ; Golui, Tanmoy ; Kole, Dipak Kumar
Author_Institution :
Dept. of Inf. Technol., St. Thomas Coll. of Eng. & Technol., Kolkata, India
Abstract :
Commercial websites usually contain noisy information blocks alongside their main content. This noise degrades the performance of web content mining, which is used to discover useful knowledge or information from web pages. In this paper, we propose a noise elimination method that applies tag-based filtering followed by structural analysis of the web page. The proposed tag-based filtering is implemented with regular expressions. First, the filter removes several predefined HTML tags present in the web page. The resulting concise page is then passed to structural analysis to remove the remaining noise. Noisy blocks usually share the same content and layout or presentation style across the pages of a website. In the structural analysis phase, we compare the HTML contents of the pages crawled from a website to capture common blocks that have the same content and layout or presentation style, and remove them. The filtering step eliminates a considerable amount of noisy content before structural analysis, and the noisy content in the crawled web pages is reduced significantly. The overall space and time complexity is lower than that of other noise elimination approaches. Experiments were conducted on several popular commercial websites, and the results demonstrate the efficiency of the proposed method.
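The two phases described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the list of negative tags, the line-based notion of a "block", the function names, and the `min_share` parameter are all assumptions made for illustration, since the abstract does not specify them.

```python
import re
from collections import Counter

# Hypothetical set of "negative" tags; the abstract does not enumerate them.
NEGATIVE_TAGS = ("script", "style", "noscript", "iframe", "form")


def filter_tags(html: str) -> str:
    """Phase 1: remove each negative tag together with its content via regex."""
    for tag in NEGATIVE_TAGS:
        pattern = re.compile(
            rf"<{tag}\b[^>]*>.*?</{tag}\s*>", re.IGNORECASE | re.DOTALL
        )
        html = pattern.sub("", html)
    return html


def split_blocks(html: str) -> list[str]:
    """Assumed block model: every non-empty line of markup is one block."""
    return [line.strip() for line in html.splitlines() if line.strip()]


def remove_common_blocks(pages: list[str], min_share: float = 1.0) -> list[str]:
    """Phase 2: drop blocks that recur in at least min_share of the pages."""
    block_lists = [split_blocks(filter_tags(page)) for page in pages]
    if len(pages) < 2:
        # Nothing to compare against, so only the tag filter is applied.
        return ["\n".join(blocks) for blocks in block_lists]
    # Count in how many pages each distinct block appears.
    counts = Counter(b for blocks in block_lists for b in set(blocks))
    threshold = min_share * len(pages)
    return [
        "\n".join(b for b in blocks if counts[b] < threshold)
        for blocks in block_lists
    ]


if __name__ == "__main__":
    page_a = (
        "<div id='nav'>Home | About</div>\n"
        "<script>track();</script>\n"
        "<p>Article one</p>\n"
        "<div id='footer'>Copyright</div>"
    )
    page_b = (
        "<div id='nav'>Home | About</div>\n"
        "<style>p{color:red}</style>\n"
        "<p>Article two</p>\n"
        "<div id='footer'>Copyright</div>"
    )
    for cleaned in remove_common_blocks([page_a, page_b]):
        print(cleaned)
```

Running the tag filter first shrinks each page before the cross-page comparison, which mirrors the ordering the abstract credits for the reduced space and time cost. A real system would likely compare DOM subtrees rather than raw lines; the line-based comparison here is only a stand-in.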
Keywords :
Internet; Web sites; data mining; information filtering; HTML tag based filtering; Web content mining; Web pages; Web sites; information blocks; noise elimination method; regular expressions; structural analysis; Data mining; HTML; Information filtering; Noise; Noise measurement; Web pages; Crawling; Filtering; Negative Tags; Noise; Regular Expression; Structural Analysis; Web Mining;
Conference_Title :
Advances in Computing, Communications and Informatics (ICACCI), 2014 International Conference on
Conference_Location :
New Delhi
Print_ISBN :
978-1-4799-3078-4
DOI :
10.1109/ICACCI.2014.6968377