• DocumentCode
    260930
  • Title

    Learning based web crawl forum

  • Author

    Hemakumar, K. ; Prakash, B.

  • fYear
    2014
  • fDate
    27-28 Feb. 2014
  • Firstpage
    1
  • Lastpage
    7
  • Abstract
    The main objective of this project is to crawl relevant forum content from the web with minimal overhead. Forum threads usually contain the information content that forum crawlers target. The proposed system learns URL patterns across multiple sites and automatically finds a forum's entry page given any page from that forum. Forums have different layouts and styles, and a generic crawler that blindly follows duplicate links and uninformative pages will crawl duplicate pages. Test results show that the proposed system achieves effectiveness and coverage on a large set of test forums.
  • Keywords
    data mining; social networking (online); URL patterns; duplicate links; forum content; forum crawlers; forum entry page; forum layouts; forum styles; forum threads; generic crawler; information content; learning based Web crawl forum; uninformative page; Crawlers; Educational institutions; Feature extraction; Indexes; Internet; Uniform resource locators; EIT path; ITF regex; URL type; forum crawling; page classification; page type
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information Communication and Embedded Systems (ICICES), 2014 International Conference on
  • Conference_Location
    Chennai
  • Print_ISBN
    978-1-4799-3835-3
  • Type

    conf

  • DOI
    10.1109/ICICES.2014.7033889
  • Filename
    7033889