DocumentCode :
379248
Title :
Using weight-controlled token matching to extract data from HTML files
Author :
Xu, Yan ; Ling, Tok Wang
Author_Institution :
Sch. of Comput., Nat. Univ. of Singapore, Singapore
Volume :
1
fYear :
2001
fDate :
3-6 Dec. 2001
Firstpage :
341
Abstract :
Most of the data stored in HTML files on the Web are semistructured. Extracting data and packing them into semistructured data models has received a lot of attention recently. We introduce a method that generates wrappers automatically for HTML files. The wrapper is generated from labeled training examples. We use weight-controlled token matching to locate the delimiters of the data of interest to the users. A list of tokens near the data is evaluated and each token is given a weight. We define a list of tokens to be the delimiter if the tokens are so important that the sum of the weights is larger than a threshold. A prototype is designed and a GUI is used to help build wrappers and extract data from the Web. Our method requires a small number of training examples and is flexible enough to deal with missing and misordered items. Compared to other approaches that may be too restrictive, our approach tolerates small modifications of HTML files.
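The delimiter test described in the abstract (weights summed over a list of tokens and compared against a threshold) can be illustrated with a minimal sketch. This is not the authors' implementation: the tokenizer, the position-by-position comparison, the integer weights, the threshold value, and the names `tokenize`, `match_score`, and `find_delimiter` are all assumptions made for illustration.

```python
import re


def tokenize(html):
    """Split HTML into tag tokens and whitespace-separated text tokens (assumed scheme)."""
    return re.findall(r"<[^>]+>|[^<\s]+", html)


def match_score(window, pattern):
    """Sum the weights of pattern tokens that appear at the same offset in the window."""
    return sum(weight for (token, weight), actual in zip(pattern, window) if token == actual)


def find_delimiter(tokens, pattern, threshold):
    """Return the index of the first token after a matched delimiter, or None.

    A window counts as the delimiter when the summed weight of its matching
    tokens is larger than the threshold, so low-weight tokens may differ.
    """
    n = len(pattern)
    for i in range(len(tokens) - n + 1):
        if match_score(tokens[i:i + n], pattern) > threshold:
            return i + n
    return None


# Hypothetical weighted pattern, as if learned from a labeled example such as
# "<tr><td><b>Price:</b> ...". The weights and threshold are illustrative only.
PATTERN = [("<tr>", 2), ("<td>", 2), ("<b>", 3), ("Price:", 6), ("</b>", 3)]
THRESHOLD = 9

# The page uses <i> instead of <b>, a small modification of the training example.
page = tokenize("<tr><td><i>Price:</i> $12.50</td></tr>")
start = find_delimiter(page, PATTERN, THRESHOLD)
value = page[start] if start is not None else None
```

In this sketch the changed formatting tags cost only their own weights, so the heavily weighted `Price:` token together with the table tags still pushes the sum past the threshold, which mirrors the tolerance to small HTML modifications claimed in the abstract.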
Keywords :
graphical user interfaces; hypermedia markup languages; information resources; information retrieval; GUI; HTML files; Web data; automatic wrapper generation; data extraction; delimiters data; labeled training examples; semistructured data model; weight-controlled token matching; Books; Data mining; Data models; Databases; HTML; Marine vehicles; Search engines; Uniform resource locators; World Wide Web; XML
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Web Information Systems Engineering, 2001. Proceedings of the Second International Conference on
Print_ISBN :
0-7695-1393-X
Type :
conf
DOI :
10.1109/WISE.2001.996495
Filename :
996495