Title :
A CRF-based approach for web object extraction
Author :
Liu, Rui ; Xiong, Rui ; Gao, Kun
Author_Institution :
State Key Lab. of Software Dev. Environ., Beihang Univ., Beijing, China
Abstract :
This paper presents a method for extracting Web objects. First, Web object blocks are obtained by partitioning the Web page into blocks and calculating their information entropy. The method then uses a Conditional Random Field (CRF) as the probabilistic model and builds a series of feature templates according to the characteristics of the objects themselves. Feature functions are generated from the results of Chinese word segmentation and the feature templates. The model parameters are estimated with a limited-memory BFGS algorithm, and the property sequences of Web object blocks are labeled with the Viterbi algorithm. Experimental results show that the proposed method is an effective way to extract scientific data.
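Two of the steps named in the abstract, the entropy calculation and Viterbi labeling, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how entropy is computed or how the CRF scores are laid out, so the sketch assumes Shannon entropy over the token distribution of a single block and additive (log-space) emission and transition scores for a linear-chain model. The function names block_entropy and viterbi, the label names, and all numeric scores are hypothetical.

```python
import math
from collections import Counter


def block_entropy(tokens):
    """Shannon entropy (in bits) of the token distribution in one page block.

    Assumption: entropy is taken over token frequencies within the block;
    the abstract does not specify the exact definition.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence for a linear-chain model.

    emissions[t][y]     : score of label index y at position t
    transitions[y0][y1] : score of moving from label index y0 to y1
    Scores are additive (log-space); in a trained CRF they would come from
    the weighted feature functions. Here they are supplied directly.
    """
    if not emissions:
        return []
    n = len(labels)
    score = list(emissions[0])          # best score ending in each label
    backptr = []                        # backptr[t-1][y] = best previous label
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for y in range(n):
            best_prev = max(range(n), key=lambda p: score[p] + transitions[p][y])
            new_score.append(score[best_prev] + transitions[best_prev][y]
                             + emissions[t][y])
            ptr.append(best_prev)
        score = new_score
        backptr.append(ptr)
    # Trace back from the best final label.
    y = max(range(n), key=lambda k: score[k])
    path = [y]
    for ptr in reversed(backptr):
        y = ptr[y]
        path.append(y)
    path.reverse()
    return [labels[i] for i in path]


# Hypothetical labels and scores, for illustration only.
LABELS = ["TITLE", "AUTHOR"]
EMISSIONS = [[2.0, 0.0], [0.0, 1.0], [0.0, 1.5]]
TRANSITIONS = [[0.0, 0.5], [-1.0, 0.5]]
decoded = viterbi(EMISSIONS, TRANSITIONS, LABELS)
```

With these made-up scores, decoded holds one label per position in the sequence. Parameter estimation with limited-memory BFGS is omitted here; in practice it would produce the weights from which the emission and transition scores are computed.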
Keywords :
Internet; Viterbi detection; entropy; information retrieval; object detection; random processes; word processing; Chinese word segmentation; Viterbi algorithm; Web object extraction; conditional random field model; feature templates; information entropy; Feature extraction; Laboratories; Conditional Random Field; Information Extraction; Machine Learning; Web Object
Conference_Title :
Computer Science and Information Technology (ICCSIT), 2010 3rd IEEE International Conference on
Conference_Location :
Chengdu
Print_ISBN :
978-1-4244-5537-9
DOI :
10.1109/ICCSIT.2010.5563787