DocumentCode
2374960
Title
Automatic web page segmentation and information extraction using conditional random fields
Author
Gong, Yunfei ; Liu, Qiang
Author_Institution
Sch. of Software, Tsinghua Univ., Beijing, China
fYear
2012
fDate
23-25 May 2012
Firstpage
334
Lastpage
340
Abstract
With the rapid development of the Internet, Web pages have become increasingly complex, and useful information is mixed with a large amount of redundant information. Most current Web information extraction systems rely on manual or semi-manual methods. To improve the efficiency of information extraction, automatic methods of Web information extraction need further research. First, we analyze the basic objects of a Web page according to the Function-based Object Model. We then give an automatic method that segments the Web page into semantic blocks using conditional random fields (CRFs). To further improve semantic block segmentation, we give an optimization algorithm for the semantic blocks that combines the DOM structure with tree edit distance. Finally, we present an automatic Web information extraction tool and use it in experiments that evaluate the efficiency of information extraction. Compared with DOM-based Web information extraction systems, the experimental results show an increase in accuracy and recall rate.
Keywords
Internet; Web sites; information retrieval; statistical analysis; Web information extraction systems; automatic Web page segmentation; conditional random fields; document object model; functional-based object model; optimization algorithm; semantic block segmentation; semimanual methods; tree edit distance; Data mining; Educational institutions; Web pages; CRFs; DOM; Function-based Object Model; information extraction
fLanguage
English
Publisher
IEEE
Conference_Titel
Computer Supported Cooperative Work in Design (CSCWD), 2012 IEEE 16th International Conference on
Conference_Location
Wuhan
Print_ISBN
978-1-4673-1211-0
Type
conf
DOI
10.1109/CSCWD.2012.6221840
Filename
6221840