• DocumentCode
    1932709
  • Title
    Segmenting the Web document with document object model
  • Author
    Luo, Jianli ; Shen, Jie ; Xie, Cuihua
  • Author_Institution
    Dept. of Comput. Sci., Yangzhou Univ., Jiangsu, China
  • fYear
    2004
  • fDate
    15-18 Sept. 2004
  • Firstpage
    449
  • Lastpage
    452
  • Abstract
    We present a model for DOM-based Web document segmentation that uses the semi-structured information of Web pages. The model builds a DOM tree of the Web page by parsing the HTML tags that organize the page's structure. We extend traditional plain-text segmentation algorithms to suit Web text segmentation. The boundaries between nodes in the DOM tree are then used to further increase the precision of the segmentation results.
  • Keywords
    Internet; grammars; hypermedia markup languages; information retrieval; text analysis; tree data structures; DOM tree; HTML tag; Web document segmentation; Web page; document object model; text segmentation algorithm; HTML; Image segmentation; Indexing; Information filtering; Information filters; Information retrieval; Internet telephony; Natural language processing; Personal digital assistants; Web pages;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Services Computing, 2004. (SCC 2004). Proceedings. 2004 IEEE International Conference on
  • Print_ISBN
    0-7695-2225-4
  • Type
    conf
  • DOI
    10.1109/SCC.2004.1358040
  • Filename
    1358040
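
The abstract describes using boundaries between DOM nodes to segment Web text. The following is a minimal sketch of that general idea, not the authors' algorithm: it does not build an explicit DOM tree, but uses the tag events from Python's standard `html.parser` as a stand-in for node boundaries. The names `BlockSegmenter`, `segment_html`, `BLOCK_TAGS`, and `SKIP_TAGS`, and the choice of which tags count as boundaries, are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumption: these tags are treated as segment boundaries. The abstract
# does not say which nodes the paper actually uses.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "li", "td", "section", "article"}
# Assumption: text inside these tags is not page content and is dropped.
SKIP_TAGS = {"script", "style"}


class BlockSegmenter(HTMLParser):
    """Collects text and cuts a new segment at every block-level tag."""

    def __init__(self):
        super().__init__()
        self.segments = []      # finished text segments
        self._buffer = []       # text pieces of the segment being built
        self._skip_depth = 0    # > 0 while inside <script> or <style>

    def _flush(self):
        # Join the buffered pieces and collapse runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.segments.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()  # emit any trailing text after the last tag


def segment_html(html):
    """Return the list of text segments found in an HTML string."""
    parser = BlockSegmenter()
    parser.feed(html)
    parser.close()
    return parser.segments


if __name__ == "__main__":
    page = "<div><h1>News</h1><p>First <b>bold</b> item.</p><p>Second item.</p></div>"
    for segment in segment_html(page):
        print(segment)
```

Inline tags such as `<b>` do not trigger a cut, so text inside one block stays in a single segment. A plain-text segmentation algorithm could then be run within each segment, which is the combination the abstract alludes to.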