Title :
Blog extraction with template-independent wrapper
Author :
Zhang, Zhixuan ; Zhang, Chuang ; Lin, Zhiqing ; Xiao, Bo
Author_Institution :
Pattern Recognition & Intell. Syst. Lab. (PRIS), Beijing Univ. of Posts & Telecommun., Beijing, China
Abstract :
With the growth of the blogosphere, millions of users around the world contribute rich information to blogs. However, little work has been done on blog extraction so far. Unlike traditional template-dependent wrappers, the template-independent wrapper proposed in this paper extracts not only blog articles but also blogrolls. In our method, blog extraction is formalized as a machine learning problem, and a template-independent wrapper is learned from labeled blog pages from a single site. Test pages are obtained from 10 popular Chinese blog sites. Experimental results on 300 real blog pages indicate that the proposed method extracts data from blogs with an accuracy of 90% or above.
Keywords :
Web sites; data mining; learning (artificial intelligence); Chinese blog sites; blog extraction; blogsphere; labeled blog pages; machine learning; template-independent wrapper; Feature extraction; Information services; Internet; Testing; Visualization; Web pages; data extraction; template-independent; web mining
Conference_Title :
Network Infrastructure and Digital Content, 2010 2nd IEEE International Conference on
Conference_Location :
Beijing
Print_ISBN :
978-1-4244-6851-5
DOI :
10.1109/ICNIDC.2010.5657967