DocumentCode :
1738220
Title :
Scriptor: using deictics, dialog, and supervised learning to convey instructions
Author :
Nuri, M. ; Hughes, Stephen ; Lewis, Michael
Author_Institution :
Carnegie Mellon Univ., Pittsburgh, PA, USA
Volume :
2
fYear :
2000
fDate :
2000
Firstpage :
1128
Abstract :
HTML pages are designed to convey semantic information to human users through visual emphases, demarcations, spatial cues, and repeating patterns that act as “perceptual markup”. This human-centric syntax is not easy for machines to identify. Naturally occurring HTML, especially the machine-generated variety, rarely follows strict markup rules and provides no semantic cues. The visual cues humans use to extract information from a Web page, however, must be reflected in the page's markup. If a human could convey the relationship between visual cues, available to the program as markup patterns, and semantic categories, passed to the program as user-supplied labels, the program would have been instructed in “how to extract information from that page”. Scriptor is a program that, run in tandem with a Web browser, allows a user to interactively design a data extraction script for a Web site. It is intended for highly structured, repetitive information such as that found in classified listings, online stores, tables for weather, stock, or airline schedules, course listings, and other similar sources. Scriptor interleaves a variety of learning methods so that extraction rules can be specified by extremely simple means. These methods consist of repeating pattern recognition, supervised learning, deictics through highlighting, and dialogs in which the user selects the desired result for a set of possible extraction rules. Learning is augmented by direct instructions such as: “label text following '~' as 'Author'”. Performance data for the authors and naive subjects are presented for a collection of Web pages, showing the potential of this form of highly interactive instruction. Our results demonstrate that very simple programming-by-example techniques can generate effective parse rules in highly repetitive domains.
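The direct instruction quoted in the abstract ("label text following '~' as 'Author'") can be illustrated with a minimal Python sketch. This is not Scriptor's implementation, which the abstract does not describe; the function name `label_following`, its parameters, and the sample listing lines are all hypothetical and chosen only to show what such a rule might do on repetitive text.

```python
def label_following(marker, label, lines):
    """Apply a rule of the form: label text following `marker` as `label`.

    Hypothetical helper for illustration only; not taken from Scriptor.
    """
    records = []
    for line in lines:
        # partition() splits at the first occurrence of the marker;
        # `found` is empty when the marker does not occur in the line.
        _before, found, after = line.partition(marker)
        value = after.strip()
        if found and value:
            records.append((label, value))
    return records


# Made-up listing lines standing in for text extracted from a repetitive page.
listing = [
    "Intro to Robotics ~ A. Example",
    "No marker on this line",
    "Web Mining ~ B. Sample",
]

print(label_following("~", "Author", listing))
```

Lines without the marker are skipped rather than labelled, which matches the idea that the rule only fires where the user-indicated cue is present.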
Keywords :
authoring systems; automatic programming; data mining; hypermedia markup languages; information resources; learning (artificial intelligence); pattern recognition; HTML pages; Scriptor; Web browser; World Wide Web pages; classified listings; course listings; data extraction script; deictics; demarcations; dialogue; extraction rule specification; highlighting; highly structured repetitive information; human-centric syntax; information extraction; interactive instruction; markup rules; online stores; perceptual markup; performance; repeating pattern recognition; semantic categories; semantic cues; semantic information; spatial cues; supervised learning; tables; text labelling; user-supplied labels; visual cues; visual emphases; Data mining; HTML; Humans; Learning systems; Pattern recognition; Supervised learning; Temperature; Weather forecasting; Web page design; Web pages;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Systems, Man, and Cybernetics, 2000 IEEE International Conference on
Conference_Location :
Nashville, TN
ISSN :
1062-922X
Print_ISBN :
0-7803-6583-6
Type :
conf
DOI :
10.1109/ICSMC.2000.886003
Filename :
886003