An Approach of Extracting Web Information Based on HTMLParser

Author

Lin, Shan ; Hu, Yanzhong

Author_Institution

Sch. of Comput. Sci. & Technol., Hubei Inst. of Technol., Wuhan, China

fYear

2010

fDate

24-25 July 2010

Firstpage

284

Lastpage

287

Abstract

Many applications now need to analyze the detailed content of web pages, so extracting web information quickly and effectively has become very important. Web information is primarily expressed in HTML. HTMLParser is an open-source project hosted on SourceForge.net that can parse HTML in either a linear or a nested fashion. This paper analyzes the principle of extracting web information based on HTMLParser and gives an approach to implementing web information extraction with the classes and methods that HTMLParser provides. Finally, it demonstrates the detailed extraction process with an example.

Keywords

Internet; data handling; program compilers; HTMLParser; SourceForge.net project; Web information extraction; linear parsing; nested parsing; Data mining; Filtering theory; HTML; Information filters; Transforms; Web pages; filter; visitor

fLanguage

English

Publisher

IEEE

Conference_Titel

Information Technology and Computer Science (ITCS), 2010 Second International Conference on

Conference_Location

Kiev

Print_ISBN

978-1-4244-7293-2

Electronic_ISBN

978-1-4244-7294-9

Type

conf

DOI

10.1109/ITCS.2010.76

Filename

5557131

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=1806909