Using weight-controlled token matching to extract data from HTML files

Author

Xu, Yan ; Ling, Tok Wang

Author_Institution

Sch. of Comput., Nat. Univ. of Singapore, Singapore

Volume

1

fYear

2001

fDate

3-6 Dec. 2001

Firstpage

341

Abstract

Most of the data stored in HTML files on the Web are semistructured. Extracting data and packing them into semistructured data models has received a lot of attention recently. We introduce a method that generates wrappers automatically for HTML files. The wrapper is generated from labeled training examples. We use weight-controlled token matching to locate the delimiters of the data of interest to the users. A list of tokens near the data is evaluated and each token is given a weight. We define a list of tokens to be the delimiter if the tokens are so important that the sum of the weights is larger than a threshold. A prototype is designed and a GUI is used to help build wrappers and extract data from the Web. Our method requires a small number of training examples and is flexible enough to deal with missing and misordered items. Compared to other approaches that may be too restrictive, our approach tolerates small modifications of HTML files.
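
The thresholding idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how tokens are produced or how weights are assigned, so the tokenizer, the integer weights, the position-by-position comparison, and the names `tokenize`, `match_score`, and `find_delimiter` are all assumptions made for illustration.

```python
import re

def tokenize(html):
    # Assumed tokenization: each tag is one token, and each run of
    # non-space text between tags is one token.
    return re.findall(r"<[^>]+>|[^<\s]+", html)

def match_score(window, delimiter):
    # Sum the weights of the delimiter tokens that appear at the same
    # position in the window (a simplifying assumption).
    return sum(weight for (token, weight), seen in zip(delimiter, window)
               if token == seen)

def find_delimiter(tokens, delimiter, threshold):
    # Return the index of the token just after the first window whose
    # matched weight exceeds the threshold, or None if there is none.
    n = len(delimiter)
    for i in range(len(tokens) - n + 1):
        if match_score(tokens[i:i + n], delimiter) > threshold:
            return i + n
    return None

# Hypothetical weights, as if learned from a labeled example; the
# label text "Price:" is treated as more important than the markup.
DELIMITER = [("<b>", 2), ("Price:", 5), ("</b>", 1), ("</td>", 1), ("<td>", 1)]
THRESHOLD = 6

def extract_after(html):
    # Return the token that follows the delimiter, or None.
    tokens = tokenize(html)
    pos = find_delimiter(tokens, DELIMITER, THRESHOLD)
    return tokens[pos] if pos is not None and pos < len(tokens) else None
```

With these assumed weights, `extract_after("<tr><td><i>Price:</i></td><td>$9.99</td></tr>")` still finds the price even though `<b>` was changed to `<i>`, because the remaining matched tokens sum to 7, which exceeds the threshold of 6. A row labeled `Name:` matches only the markup tokens (weight 5) and is rejected. This is the sense in which a weighted match tolerates small modifications of the HTML, where an exact string delimiter would fail.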

Keywords

graphical user interfaces; hypermedia markup languages; information resources; information retrieval; GUI; HTML files; Web data; automatic wrapper generation; data extraction; delimiters data; labeled training examples; semistructured data model; weight-controlled token matching; Books; Data mining; Data models; Databases; HTML; Marine vehicles; Search engines; Uniform resource locators; World Wide Web; XML

fLanguage

English

Publisher

IEEE

Conference_Titel

Web Information Systems Engineering, 2001. Proceedings of the Second International Conference on

Print_ISBN

0-7695-1393-X

Type

conf

DOI

10.1109/WISE.2001.996495

Filename

996495