DocumentCode
1932709
Title
Segmenting the Web document with document object model
Author
Luo, Jianli ; Shen, Jie ; Xie, Cuihua
Author_Institution
Dept. of Comput. Sci., Yangzhou Univ., Jiangsu, China
fYear
2004
fDate
15-18 Sept. 2004
Firstpage
449
Lastpage
452
Abstract
We present a model for DOM-based Web document segmentation that uses the semi-structured information of Web pages. The model builds a DOM tree of the Web page by parsing the HTML tags that organize the page's structure. We extend traditional plain-text segmentation algorithms to suit Web text segmentation. Then, using the boundaries between nodes in the DOM tree, the precision of the segmentation results can be increased further.
Keywords
Internet; grammars; hypermedia markup languages; information retrieval; text analysis; tree data structures; DOM tree; HTML tag; Web document segmentation; Web page; document object model; text segmentation algorithm; HTML; Image segmentation; Indexing; Information filtering; Information filters; Information retrieval; Internet telephony; Natural language processing; Personal digital assistants; Web pages
fLanguage
English
Publisher
ieee
Conference_Titel
Services Computing, 2004. (SCC 2004). Proceedings. 2004 IEEE International Conference on
Print_ISBN
0-7695-2225-4
Type
conf
DOI
10.1109/SCC.2004.1358040
Filename
1358040