DocumentCode
2540253
Title
A cognitive crawler using structure pattern for incremental crawling and content extraction
Author
Xi, Shijia ; Sun, Fuchun ; Wang, Jianmin
Author_Institution
Tsinghua Univ., Beijing, China
fYear
2010
fDate
7-9 July 2010
Firstpage
238
Lastpage
244
Abstract
In this paper, we design a cognitive crawler that dramatically reduces the cost of crawling a website and extracts useful content from web pages in an unsupervised manner. The main idea for reducing the crawling cost is to retrieve only recently modified and newly added pages. In practice, however, a traditional crawler cannot judge whether a page has been modified or newly added without crawling the whole site. We propose a method to predict recently modified and newly added pages without doing any actual crawling; we also identify a feasible and stable feature, the "structure pattern", that better indicates the probability that a given page has been modified. In addition, we develop a hybrid clustering method that combines K-means and agglomerative hierarchical clustering to automatically find all the structure patterns in a given website. Using structure patterns, we develop an unsupervised algorithm to generate a website's templates; using these templates, the crawler can extract useful information from web pages much more easily and precisely. We also introduce feasible formulas to predict pages' modification probabilities and crawling time intervals. To evaluate the performance of an incremental crawling algorithm, we propose three new indicators. Using the proposed algorithm, we can extract page content with high performance. The experimental results show that the structure pattern is very useful, that the performance of this cognitive crawler is promising, that it can save a large amount of bandwidth, and that it is suitable for websites of various scales.
Keywords
Internet; Web sites; information retrieval; pattern clustering; unsupervised learning; Web pages; Website crawling; Website templates; agglomerative hierarchical clustering; cognitive crawler; content extraction; hybrid clustering method; incremental crawling algorithm; k-means clustering; structure pattern; unsupervised algorithm; Cognitive informatics; Sun; Incremental crawler; content extraction; hybrid clustering; structure pattern; template generation;
fLanguage
English
Publisher
IEEE
Conference_Titel
Cognitive Informatics (ICCI), 2010 9th IEEE International Conference on
Conference_Location
Beijing
Print_ISBN
978-1-4244-8041-8
Type
conf
DOI
10.1109/COGINF.2010.5599733
Filename
5599733