DocumentCode :
2965042
Title :
Conditional Random Fields Model for Web Content Extraction
Author :
Fu, Lei ; Xia, YingJu ; Meng, Yao ; Yu, Hao
Author_Institution :
Fujitsu R&D Center Co., Ltd., Beijing, China
fYear :
2010
fDate :
20-25 Sept. 2010
Firstpage :
30
Lastpage :
34
Abstract :
The web contains an abundance of semi-structured information, but not all of it is useful to users: pages typically contain a great deal of noise, such as advertisements and navigation information. Identifying which parts of a web page contain the target content and classifying them into the right category (such as title, author, time, and content) has become a significant problem that must be solved. One kind of approach to this problem relies on manually written rules or scripts to extract the content. This kind of method requires considerable time and effort, and hand-built rules are brittle: they often fail in some cases and break down when the structure of the web page changes. The other kind of approach is based on DOM tree analysis of the web page and depends heavily on the structure of the DOM tree. Furthermore, with both kinds of methods it is difficult to assign a suitable category label to the extracted content. In this paper, to address the drawbacks of these traditional methods, we use a Conditional Random Fields sequence labeling model to extract the real content of the web page and simultaneously assign a suitable category label to each part of the content. The accuracy of extraction and labeling can exceed 96%.
Keywords :
Internet; content management; trees (mathematics); DOM tree analysis; Web content extraction; Web page; advertisement; conditional random fields model; conditional random fields sequence labeling model; navigation information; semistructured information; Data mining; Feature extraction; Labeling; Mathematical model; Noise; Training; Web pages; Conditional Random Fields; DOM tree; Web Content Extraction; sequence labeling;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Computing in the Global Information Technology (ICCGI), 2010 Fifth International Multi-Conference on
Conference_Location :
Valencia
Print_ISBN :
978-1-4244-8068-5
Electronic_ISBN :
978-0-7695-4181-5
Type :
conf
DOI :
10.1109/ICCGI.2010.9
Filename :
5628933