DocumentCode :
2611291
Title :
HTML Tree Parsing Algorithm Based on Pre-extracted Data
Author :
Song, Mingqiu ; Zhang, Ruixue ; Gang, Duo
Author_Institution :
Inst. of Syst. Eng., Dalian Univ. of Technol., Dalian, China
fYear :
2009
fDate :
27-28 June 2009
Firstpage :
249
Lastpage :
254
Abstract :
This paper proposes a new method for extracting an HTML tree from web pages. The main idea is to handle first the parts of a web page that are hard to parse, including tags and attributes; the remaining parts are then tidied and parsed, and both the pre-extracted parts and the parsed remainder are stored in the tree. Because it integrates the tidying process with the parsing process, the new method not only preserves the integrity of the web data but also reduces the complexity of the algorithms. Tests show that it can parse all kinds of web pages and provides concrete fault-tolerance mechanisms.
Keywords :
Internet; hypermedia markup languages; program compilers; tree data structures; HTML tree parsing algorithm; Web data integrity; Web pages; fault tolerance mechanisms; parsing process; preextracted data; Data engineering; Data mining; Displays; HTML; Information resources; Mobile handsets; SGML; Systems engineering and theory; Tree data structures; Web pages; HTML parsing; information extracting; web pages tidying;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Mobile Business, 2009. ICMB 2009. Eighth International Conference on
Conference_Location :
Dalian
Print_ISBN :
978-0-7695-3691-0
Type :
conf
DOI :
10.1109/ICMB.2009.50
Filename :
5169267