DocumentCode
379248
Title
Using weight-controlled token matching to extract data from HTML files
Author
Xu, Yan ; Ling, Tok Wang
Author_Institution
Sch. of Comput., Nat. Univ. of Singapore, Singapore
Volume
1
fYear
2001
fDate
3-6 Dec. 2001
Firstpage
341
Abstract
Most of the data stored in HTML files on the Web are semistructured. Extracting data and packing them into semistructured data models has received a lot of attention recently. We introduce a method that generates wrappers automatically for HTML files. The wrapper is generated from labeled training examples. We use weight-controlled token matching to locate the delimiters of the data of interest to the users. A list of tokens near the data is evaluated and each token is given a weight. We define a list of tokens to be the delimiter if the tokens are so important that the sum of the weights is larger than a threshold. A prototype is designed and a GUI is used to help build wrappers and extract data from the Web. Our method requires a small number of training examples and is flexible enough to deal with missing and misordered items. Compared to other approaches that may be too restrictive, our approach tolerates small modifications of HTML files.
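The thresholded matching described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how weights are assigned or how tokens are aligned, so the weight values, the threshold, the position-by-position comparison, and the names `match_score` and `find_delimiter` are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


def match_score(window: List[str], delimiter: List[str],
                weights: Dict[str, float]) -> float:
    """Sum the weights of delimiter tokens found at the same position in window.

    Position-by-position comparison is an assumption of this sketch.
    """
    return sum(weights.get(d, 0.0) for d, w in zip(delimiter, window) if d == w)


def find_delimiter(tokens: List[str], delimiter: List[str],
                   weights: Dict[str, float], threshold: float) -> Optional[int]:
    """Return the index of the first token after the first window whose
    matched weight exceeds the threshold, or None if no window qualifies."""
    n = len(delimiter)
    for i in range(len(tokens) - n + 1):
        if match_score(tokens[i:i + n], delimiter, weights) > threshold:
            return i + n
    return None


# Hypothetical delimiter learned from a labeled example, with made-up weights:
# the text tokens matter more than the surrounding formatting tags.
delimiter = ["<b>", "Price", ":", "</b>"]
weights = {"<b>": 0.1, "Price": 0.6, ":": 0.2, "</b>": 0.1}
threshold = 0.7

original = ["<tr>", "<td>", "<b>", "Price", ":", "</b>", "$", "12", "</td>"]
# Same page after a small markup change (<b> replaced by <i>).
modified = ["<tr>", "<td>", "<i>", "Price", ":", "</i>", "$", "12", "</td>"]
unrelated = ["<tr>", "<td>", "Name", "</td>"]

pos_original = find_delimiter(original, delimiter, weights, threshold)
pos_modified = find_delimiter(modified, delimiter, weights, threshold)
pos_unrelated = find_delimiter(unrelated, delimiter, weights, threshold)
```

Under these assumed weights the high-weight tokens alone clear the threshold, so the delimiter is still located after the tags change. This is one way to read the abstract's claim that the approach tolerates small modifications of HTML files.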
Keywords
graphical user interfaces; hypermedia markup languages; information resources; information retrieval; GUI; HTML files; Web data; automatic wrapper generation; data extraction; delimiters data; labeled training examples; semistructured data model; weight-controlled token matching; Books; Data mining; Data models; Databases; HTML; Marine vehicles; Search engines; Uniform resource locators; World Wide Web; XML
fLanguage
English
Publisher
IEEE
Conference_Titel
Web Information Systems Engineering, 2001. Proceedings of the Second International Conference on
Print_ISBN
0-7695-1393-X
Type
conf
DOI
10.1109/WISE.2001.996495
Filename
996495