• DocumentCode
    499027
  • Title
    Learnable topical crawler through online semi-supervised clustering
  • Author
    Wu, Qing-Yao ; Ye, Yunming ; Fu, Jian
  • Author_Institution
    Shenzhen Grad. Sch., Harbin Inst. of Technol., Harbin, China
  • Volume
    1
  • fYear
    2009
  • fDate
    12-15 July 2009
  • Firstpage
    231
  • Lastpage
    236
  • Abstract
    The performance of a traditional topical crawler depends heavily on the quality and comprehensiveness of its initial training samples. However, good samples are often unavailable in real applications, since preparing them is difficult and time-consuming. Ideally, a topical crawler would learn about its target topics from the ever-changing environment and adapt to these changes over successive crawls. In this paper, we present a semi-supervised clustering method for building a learnable topical crawler. Our approach employs a constrained k-means clustering algorithm to detect new samples among the crawled pages; these samples are fed to the page classifier and the link predictor to update the learned models. This gives topical crawling systems an incremental learning capability, which in turn improves crawling performance. We compare our approach experimentally with a traditional sample generation approach based on relevance scores. The experimental results show that our approach achieves better performance.
  • Keywords
    data mining; learning (artificial intelligence); pattern clustering; constrained k-means clustering algorithm; incremental learning capability; learnable topical crawler; link predictor; online semisupervised clustering; page classifier; relevance score based sample generation; topical crawling system; Crawlers; Cybernetics; Machine learning; Constrained k-means; sample generation; semi-supervised clustering; topical crawler;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Machine Learning and Cybernetics, 2009 International Conference on
  • Conference_Location
    Baoding
  • Print_ISBN
    978-1-4244-3702-3
  • Electronic_ISBN
    978-1-4244-3703-0
  • Type
    conf
  • DOI
    10.1109/ICMLC.2009.5212484
  • Filename
    5212484
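
The abstract names constrained k-means as the step that detects new samples, but it does not spell out the algorithm. As a rough illustration only, the sketch below assumes the common seeded variant of constrained k-means, in which labelled seed points initialise the centroids and never change cluster. The function name `constrained_kmeans`, its parameters, and the toy 2-D vectors are hypothetical and are not taken from the paper.

```python
from math import dist  # Euclidean distance, Python 3.8+


def constrained_kmeans(points, seed_labels, k, max_iter=100):
    """Seeded constrained k-means (an assumed variant, not the paper's exact method).

    points      -- list of equal-length tuples of floats (e.g. page feature vectors)
    seed_labels -- dict {point index: cluster id in range(k)}; every cluster
                   needs at least one seed
    Returns (labels, centroids).
    """
    def mean(vectors):
        return tuple(sum(col) / len(vectors) for col in zip(*vectors))

    # Initialise each centroid as the mean of that cluster's seed points.
    centroids = []
    for c in range(k):
        members = [points[i] for i, lab in seed_labels.items() if lab == c]
        if not members:
            raise ValueError(f"cluster {c} has no seed")
        centroids.append(mean(members))

    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: seeds keep their label (the constraint);
        # every other point goes to its nearest centroid.
        new_labels = [
            seed_labels[i] if i in seed_labels
            else min(range(k), key=lambda c: dist(p, centroids[c]))
            for i, p in enumerate(points)
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each cluster contains at least its seeds, so it is never empty.
        centroids = [
            mean([p for p, lab in zip(points, labels) if lab == c])
            for c in range(k)
        ]
    return labels, centroids


# Toy data: cluster 0 stands for "on-topic", cluster 1 for "off-topic" (hypothetical).
pages = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8), (0.7, 0.3), (0.3, 0.7)]
seeds = {0: 0, 2: 1}  # page 0 is a known on-topic sample, page 2 a known off-topic one
labels, centroids = constrained_kmeans(pages, seeds, k=2)
new_samples = [i for i, lab in enumerate(labels) if lab == 0 and i not in seeds]
```

Under this assumption, the unlabelled pages that land in the on-topic cluster (`new_samples` here) would be the candidates passed on to the page classifier and link predictor as additional training samples.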