Title :
Improving Rule Generation Precision for Domain Knowledge based Wrappers
Author :
Jeong, Chang-Hoo ; Jhun, Sung-Jin ; Lim, Myung-Eun ; Myaeng, Sung Hyon
Author_Institution :
Korea Inst. of Sci. & Technol., Seoul
Abstract :
Wrappers play an important role in extracting specified information from various sources. The wrapper rules by which information is extracted are often created from domain-specific knowledge. Domain-specific knowledge helps recognize the meaning of text representing various entities and values and detect their formats. However, such domain knowledge becomes ineffective when value-representing data are not labeled with appropriate textual descriptions, or when there is nothing but a hyperlink where certain text labels or values are expected. To alleviate these problems, we propose a probabilistic method for recognizing the entity type, i.e., generating wrapper rules, when there is no label associated with value-representing text. In addition, we have devised a method for using the information reachable by following hyperlinks when textual data are not immediately available on the target Web page. Our experimental work shows that the proposed methods help increase the precision of the resulting wrapper, particularly in extracting title information, the most important entity on a Web page. The proposed methods can be useful in building a more efficient and accurate information extraction system for various sources of information without user intervention.
Keywords :
Internet; Web sites; information retrieval systems; knowledge based systems; Web page; domain knowledge based wrapper; information extraction system; probabilistic method; rule generation precision; value-representing data; Data mining; Humans; Information resources; Information retrieval; Learning systems; Search engines; Text recognition; Web and internet services; Web pages; Web services;
Conference_Title :
Computational Intelligence for Modelling, Control and Automation, 2005 and International Conference on Intelligent Agents, Web Technologies and Internet Commerce, International Conference on
Conference_Location :
Vienna
Print_ISBN :
0-7695-2504-0
DOI :
10.1109/CIMCA.2005.1631292