• DocumentCode
    166059
  • Title

    Structural analysis and regular expressions based noise elimination from web pages for web content mining

  • Author

    Dutta, Arin ; Paria, Sudipta ; Golui, Tanmoy ; Kole, Dipak Kumar

  • Author_Institution
    Dept. of Inf. Technol., St. Thomas Coll. of Eng. & Technol., Kolkata, India
  • fYear
    2014
  • fDate
    24-27 Sept. 2014
  • Firstpage
    1445
  • Lastpage
    1451
  • Abstract
    Commercial websites usually contain noisy information blocks alongside the main content. This noise degrades the performance of web content mining, which is used to discover useful knowledge or information from web pages. In this paper, we propose a noise elimination method that applies tag-based filtering followed by structural analysis of the web page. The proposed tag-based filtering is implemented with regular expressions. First, the filtering method removes several predefined HTML tags from the web page. The resulting concise web page is then passed to structural analysis to remove the remaining noise. Noisy blocks usually share the same contents and layouts or presentation styles across the web pages of a website. In the structural analysis phase, we compare the HTML contents of the web pages crawled from a website to capture common blocks with the same contents and layouts or presentation styles, and remove them. The filtering method eliminates a considerable amount of noisy content before structural analysis. Noisy content in the crawled web pages is reduced significantly. The overall space and time complexity is lower than that of other noise elimination approaches. The experiment is conducted on several popular commercial websites, and the results demonstrate the efficiency of the proposed method.
  • Keywords
    Internet; Web sites; data mining; information filtering; HTML tag based filtering; Web content mining; Web pages; Web sites; information blocks; noise elimination method; regular expressions; structural analysis; Data mining; HTML; Information filtering; Noise; Noise measurement; Web pages; Crawling; Filtering; Negative Tags; Noise; Regular Expression; Structural Analysis; Web Mining;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Advances in Computing, Communications and Informatics (ICACCI), 2014 International Conference on
  • Conference_Location
    New Delhi
  • Print_ISBN
    978-1-4799-3078-4
  • Type

    conf

  • DOI
    10.1109/ICACCI.2014.6968377
  • Filename
    6968377
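
The two phases described in the abstract (regular-expression tag filtering, then removal of blocks shared across pages of one site) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The tag list in NEGATIVE_TAGS, the function names, the min_share parameter, and the treatment of a "block" as a flat, non-nested div or table element are all assumptions made for illustration; the record does not specify them.

```python
import re
from collections import Counter

# Hypothetical set of "negative" tags; the abstract does not list the real ones.
NEGATIVE_TAGS = ("script", "style", "noscript", "iframe")

_PAIRED = re.compile(
    r"<(%s)\b[^>]*>.*?</\1\s*>" % "|".join(NEGATIVE_TAGS),
    re.IGNORECASE | re.DOTALL,
)
_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# Assumption: blocks are flat (non-nested) div or table elements.
_BLOCK = re.compile(r"<(div|table)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL)


def filter_tags(html):
    """Phase 1: strip HTML comments and the predefined negative tags."""
    return _PAIRED.sub("", _COMMENT.sub("", html))


def split_blocks(html):
    """Split a filtered page into candidate blocks, whitespace-normalized."""
    return [re.sub(r"\s+", " ", m.group(0)).strip() for m in _BLOCK.finditer(html)]


def remove_common_blocks(pages, min_share=1.0):
    """Phase 2: drop blocks that occur on at least min_share of the pages.

    Meaningful only for two or more pages from the same website.
    """
    blocks_per_page = [split_blocks(filter_tags(p)) for p in pages]
    counts = Counter()
    for blocks in blocks_per_page:
        counts.update(set(blocks))  # count each block once per page
    threshold = min_share * len(pages)
    return [[b for b in blocks if counts[b] < threshold] for blocks in blocks_per_page]


page_one = '<script>x()</script><div class="nav">Home | About</div><div class="main">Article one</div>'
page_two = '<style>a{}</style><div class="nav">Home | About</div><div class="main">Article two</div>'
cleaned = remove_common_blocks([page_one, page_two])
```

Here the navigation block appears on both pages and is discarded, while the page-specific main blocks survive. Exact string matching after whitespace normalization is the simplest possible comparison; a real system would likely need a parser for nested markup and a looser similarity measure for layout or presentation style.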