• DocumentCode
    2540253
  • Title

    A cognitive crawler using structure pattern for incremental crawling and content extraction

  • Author

    Xi, Shijia ; Sun, Fuchun ; Wang, Jianmin

  • Author_Institution
    Tsinghua Univ., Beijing, China
  • fYear
    2010
  • fDate
    7-9 July 2010
  • Firstpage
    238
  • Lastpage
    244
  • Abstract
    In this paper, we design a cognitive crawler that dramatically reduces the cost of crawling a website and extracts useful content from web pages in an unsupervised procedure. The main idea for reducing the crawling cost is to retrieve only recently modified and newly added pages. In practice, however, a traditional crawler cannot judge whether a page has been modified or newly added without crawling the whole site. We propose a method to predict recently modified and newly added pages without doing any actual crawling, and we identify a feasible and stable feature, the "structure pattern", that better indicates the probability that a given page has been modified. We also develop a hybrid clustering method that combines K-means with agglomerative hierarchical clustering to automatically find all the structure patterns in a given website. Using structure patterns, we develop an unsupervised algorithm to generate a website's templates; using these templates, the crawler can extract useful information from web pages more easily and precisely. We also introduce feasible formulas to predict pages' modification probabilities and crawling time intervals. To evaluate the performance of an incremental crawling algorithm, we propose three new indicators. Using the proposed algorithm, we can extract page content with high performance. The experimental results show that the structure pattern is very useful, that the performance of this cognitive crawler is quite promising, and that it can save a large amount of bandwidth and is suitable for websites of various scales.
  • Keywords
    Internet; Web sites; information retrieval; pattern clustering; unsupervised learning; Web pages; Website crawling; Website templates; agglomerative hierarchical clustering; cognitive crawler; content extraction; hybrid clustering method; incremental crawling algorithm; k-means clustering; structure pattern; unsupervised algorithm; Cognitive informatics; Sun; Incremental crawler; hybrid clustering; template generation
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cognitive Informatics (ICCI), 2010 9th IEEE International Conference on
  • Conference_Location
    Beijing
  • Print_ISBN
    978-1-4244-8041-8
  • Type

    conf

  • DOI
    10.1109/COGINF.2010.5599733
  • Filename
    5599733