• DocumentCode
    1825893
  • Title

    Combining DOM tree and geometric layout analysis for online medical journal article segmentation

  • Author

    Zou, Jie ; Le, Daniel ; Thoma, George R.

  • fYear
    2006
  • fDate
    June 2006
  • Firstpage
    119
  • Lastpage
    128
  • Abstract
    We describe an HTML Web page segmentation algorithm and apply it to segment online medical journal articles (regular HTML and PDF-converted HTML files). The Web page content is modeled by a zone tree structure based primarily on the geometric layout of the Web page. For a given journal article, a zone tree is generated by combining DOM tree analysis with a recursive X-Y cut algorithm. Using additional visual cues, such as background color, font size, and font color, the page is then segmented into homogeneous regions. The evaluation uses 104 articles from 11 journals. Out of 9726 ground-truth zones, 9376 are correctly segmented, for an accuracy of 96.40%. Segmenting the entire Web page into zones can significantly expedite the subsequent information retrieval steps and increase their accuracy.
  • Keywords
    Internet; bibliographic systems; hypermedia markup languages; information retrieval; medical information systems; HTML Web page; X-Y cut algorithm; document object model tree analysis; geometric layout analysis; online medical journal article segmentation; zone tree structure; Algorithm design and analysis; Content based retrieval; Government; HTML; Information analysis; Software libraries; Storage automation; Text analysis; Web pages; HTML document segmentation; document layout analysis; document object model (DOM); web information retrieval
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Digital Libraries, 2006. JCDL '06. Proceedings of the 6th ACM/IEEE-CS Joint Conference on
  • Conference_Location
    Chapel Hill, NC
  • Print_ISBN
    1-59593-354-9
  • Type

    conf

  • DOI
    10.1145/1141753.1141777
  • Filename
    4119108
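
The abstract names a recursive X-Y cut as one ingredient of the zone tree. As a rough illustration only, the sketch below applies a plain recursive X-Y cut to a small occupancy grid: it trims a region to its content, splits at the widest empty horizontal gap, then at the widest empty vertical gap, and stops when no gap remains. The grid representation, the `min_gap` parameter, and the function names `trim`, `widest_gap`, and `xy_cut` are all assumptions made for this example. This is not the authors' implementation, and it omits the DOM tree analysis and visual cues (background color, font size, font color) that the paper combines with the cut.

```python
from typing import List, Optional, Tuple

Grid = List[List[int]]           # hypothetical page model: 1 = content, 0 = whitespace
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


def trim(grid: Grid, box: Box) -> Optional[Box]:
    """Shrink box to the tight bounds of its nonzero cells; None if it is empty."""
    top, left, bottom, right = box
    rows = [r for r in range(top, bottom) if any(grid[r][left:right])]
    if not rows:
        return None
    cols = [c for c in range(left, right)
            if any(grid[r][c] for r in range(top, bottom))]
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def widest_gap(occupied: List[bool], min_gap: int) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the longest run of empty positions >= min_gap."""
    best: Optional[Tuple[int, int]] = None
    start: Optional[int] = None
    for i, filled in enumerate(occupied + [True]):  # sentinel closes a trailing run
        if not filled and start is None:
            start = i
        elif filled and start is not None:
            if i - start >= min_gap and (best is None or i - start > best[1] - best[0]):
                best = (start, i)
            start = None
    return best


def xy_cut(grid: Grid, box: Box, min_gap: int = 1) -> List[Box]:
    """Recursively split box at empty gaps; return the leaf zones in reading order."""
    trimmed = trim(grid, box)
    if trimmed is None:
        return []
    top, left, bottom, right = trimmed

    # Try a horizontal cut first (an empty band of rows).
    row_occ = [any(grid[r][left:right]) for r in range(top, bottom)]
    gap = widest_gap(row_occ, min_gap)
    if gap is not None:
        return (xy_cut(grid, (top, left, top + gap[0], right), min_gap)
                + xy_cut(grid, (top + gap[1], left, bottom, right), min_gap))

    # Otherwise try a vertical cut (an empty band of columns).
    col_occ = [any(grid[r][c] for r in range(top, bottom)) for c in range(left, right)]
    gap = widest_gap(col_occ, min_gap)
    if gap is not None:
        return (xy_cut(grid, (top, left, bottom, left + gap[0]), min_gap)
                + xy_cut(grid, (top, left + gap[1], bottom, right), min_gap))

    return [trimmed]  # no gap in either direction: this region is a leaf zone


page = [
    [1, 1, 1, 1, 1],  # a full-width heading
    [0, 0, 0, 0, 0],  # blank band
    [1, 1, 0, 1, 1],  # two columns separated by a gutter
    [1, 1, 0, 1, 1],
]
zones = xy_cut(page, (0, 0, 4, 5))
```

On this toy page the heading row becomes one zone and the two columns below it become two more. A larger `min_gap` would keep narrowly separated regions together, which is one simple way to control how fine the resulting zones are.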