• DocumentCode
    2611291
  • Title

    HTML Tree Parsing Algorithm Based on Pre-extracted Data

  • Author

    Song, Mingqiu ; Zhang, Ruixue ; Gang, Duo

  • Author_Institution
    Inst. of Syst. Eng., Dalian Univ. of Technol., Dalian, China
  • fYear
    2009
  • fDate
    27-28 June 2009
  • Firstpage
    249
  • Lastpage
    254
  • Abstract
    This paper proposes a new method for extracting an HTML tree from web pages. The main idea is that the parts of a web page that are hard to parse, including tags and attributes, are handled in advance; the remaining parts are then tidied and parsed, and finally both sets of extracted parts are stored in the tree. Because it integrates the tidying process with the parsing process, the new method not only preserves web data integrity but also reduces the complexity of the algorithms. Tests show that it can parse all kinds of web pages and provides concrete fault-tolerance mechanisms.
  • Keywords
    Internet; hypermedia markup languages; program compilers; tree data structures; HTML tree parsing algorithm; Web data integrity; Web pages; fault tolerance mechanisms; parsing process; preextracted data; Data engineering; Data mining; Displays; HTML; Information resources; Mobile handsets; SGML; Systems engineering and theory; Tree data structures; Web pages; HTML parsing; information extracting; web pages tidying;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Mobile Business, 2009. ICMB 2009. Eighth International Conference on
  • Conference_Location
    Dalian
  • Print_ISBN
    978-0-7695-3691-0
  • Type

    conf

  • DOI
    10.1109/ICMB.2009.50
  • Filename
    5169267
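
The pipeline outlined in the abstract (pre-extract the hard-to-parse parts, tidy and parse the remainder, then store the extracted parts in the tree) might be sketched as follows. This is only an illustration under stated assumptions, not the authors' algorithm: the record does not say which parts are pre-extracted or how tidying works, so the sketch assumes that comments and script/style blocks are the hard parts, and that tidying means tolerating unclosed or stray tags. The names Node, TreeBuilder, pre_extract, parse and the x-stash placeholder tag are hypothetical.

```python
import re
from html.parser import HTMLParser

# Assumption: comments and <script>/<style> blocks are the "hard" parts.
HARD = re.compile(r"<!--.*?-->|<(script|style)\b[^>]*>.*?</\1\s*>", re.S | re.I)
VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags with no closing tag


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""


def pre_extract(html):
    """Step 1: cut hard regions out, leaving numbered placeholder tags."""
    stash = []

    def swap(match):
        stash.append(match.group(0))
        return f"<x-stash data-id='{len(stash) - 1}'></x-stash>"

    return HARD.sub(swap, html), stash


class TreeBuilder(HTMLParser):
    """Steps 2 and 3: tolerant parse, then reattach the stashed regions."""

    def __init__(self, stash):
        super().__init__()
        self.stash = stash
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag == "x-stash":
            # Put the pre-extracted region back, verbatim, as a raw leaf.
            node.text = self.stash[int(node.attrs["data-id"])]
            node.tag, node.attrs = "#raw", {}
        elif tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Tidying: close the nearest matching open tag; ignore stray closers.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            leaf = Node("#text", parent=self.current)
            leaf.text = data.strip()
            self.current.children.append(leaf)


def parse(html):
    cleaned, stash = pre_extract(html)
    builder = TreeBuilder(stash)
    builder.feed(cleaned)
    builder.close()  # flush any buffered text
    return builder.root
```

Calling parse("<div><p>Hi<script>if (a < b) { x = '</p>'; }</script><b>bold</div>") shows the intent: the script body, which contains text that looks like a closing tag, never reaches the parser, and the unclosed p and b elements are closed implicitly when the div ends.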