DocumentCode
1825893
Title
Combining DOM tree and geometric layout analysis for online medical journal article segmentation
Author
Zou, Jie ; Le, Daniel ; Thoma, George R.
fYear
2006
fDate
38869
Firstpage
119
Lastpage
128
Abstract
We describe an HTML Web page segmentation algorithm and apply it to online medical journal articles (regular HTML and PDF-converted HTML files). The Web page content is modeled by a zone tree structure based primarily on the geometric layout of the Web page. For a given journal article, a zone tree is generated by combining DOM tree analysis with the recursive X-Y cut algorithm. The page is then segmented into homogeneous regions with the help of other visual cues, such as background color, font size, and font color. Evaluation is conducted with 104 articles from 11 journals. Out of 9726 ground-truth zones, 9376 are correctly segmented, for an accuracy of 96.40%. Segmenting the entire Web page into zones can significantly expedite and increase the accuracy of the subsequent information retrieval steps.
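The recursive X-Y cut named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the page has already been reduced to rectangular boxes given as (left, top, right, bottom), it omits the DOM tree analysis and the visual cues entirely, and it returns a flat list of leaf zones rather than a zone tree. The names `split_on_gaps`, `xy_cut`, and the `min_gap` threshold are hypothetical.

```python
from typing import List, Tuple

# Assumed box format: (left, top, right, bottom) in page coordinates.
Box = Tuple[float, float, float, float]


def split_on_gaps(boxes: List[Box], axis: str, min_gap: float) -> List[List[Box]]:
    """Group boxes separated by empty gaps wider than min_gap along one axis."""
    lo, hi = (0, 2) if axis == "x" else (1, 3)
    ordered = sorted(boxes, key=lambda b: b[lo])
    groups = [[ordered[0]]]
    reach = ordered[0][hi]  # farthest extent covered so far on this axis
    for box in ordered[1:]:
        if box[lo] - reach > min_gap:
            groups.append([box])      # wide empty strip: start a new group
        else:
            groups[-1].append(box)    # overlaps or nearly touches: same group
        reach = max(reach, box[hi])
    return groups


def xy_cut(boxes: List[Box], min_gap: float = 10.0) -> List[List[Box]]:
    """Recursively cut along horizontal, then vertical gaps; return leaf zones."""
    if len(boxes) <= 1:
        return [boxes] if boxes else []
    for axis in ("y", "x"):
        groups = split_on_gaps(boxes, axis, min_gap)
        if len(groups) > 1:
            zones: List[List[Box]] = []
            for group in groups:
                zones.extend(xy_cut(group, min_gap))
            return zones
    return [boxes]  # no gap wide enough on either axis: one homogeneous zone


# Hypothetical layout: a full-width title above two side-by-side columns.
title = (0.0, 0.0, 100.0, 10.0)
left_col = (0.0, 30.0, 45.0, 60.0)
right_col = (55.0, 30.0, 100.0, 60.0)
zones = xy_cut([title, left_col, right_col], min_gap=5.0)
```

In this example the first cut separates the title from the body along the vertical axis, and the second cut separates the two columns along the horizontal axis.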
Keywords
Internet; bibliographic systems; hypermedia markup languages; information retrieval; medical information systems; HTML Web page; X-Y cut algorithm; document object model tree analysis; geometric layout analysis; information retrieval; online medical journal article segmentation; zone tree structure; Algorithm design and analysis; Content based retrieval; Government; HTML; Information analysis; Information retrieval; Software libraries; Storage automation; Text analysis; Web pages; HTML document segmentation; document layout analysis; document object model (DOM); web information retrieval
fLanguage
English
Publisher
IEEE
Conference_Titel
Digital Libraries, 2006. JCDL '06. Proceedings of the 6th ACM/IEEE-CS Joint Conference on
Conference_Location
Chapel Hill, NC
Print_ISBN
1-59593-354-9
Type
conf
DOI
10.1145/1141753.1141777
Filename
4119108