Title :
Research on complex structure-oriented accurate web information extraction rules
Author :
Xie, Tao ; Shi, Shengsheng ; Quan, Fuliang ; Yuan, Chunfeng ; Huang, Yihua
Author_Institution :
Dept. of Comput. Sci. & Technol., Nanjing Univ., Nanjing, China
Abstract :
With the rapid growth of web information, there is an increasing need to acquire accurate information easily and efficiently from the massive and heterogeneous web. Web information extraction is the research area that addresses this need. In this paper, we analyze the shortcomings of related research and systems and find that few systems can extract accurate web information with complex structures without placing too heavy a burden on users. To overcome this shortcoming, this paper proposes a comprehensive model and framework that combines automatic web data analysis and extraction with user interaction-based semi-supervised web data extraction. The new model and framework strikes a good trade-off between the automatic generation of extraction rules and their expressive capability for accurate information extraction. Based on this, we further present a multi-functional data extraction rule system that uses a variety of structural and textual extraction rules with different functions to achieve powerful expressive capability. Furthermore, to provide a powerful expression mechanism for data extraction, this paper describes a well-designed, XML-based data extraction language that works well for rule generation based on both automatic web structure analysis and user interaction.
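As a rough illustration of how structural and textual rules might be combined in an XML-based rule file, the following minimal sketch may help. It is not the paper's extraction language: the element and attribute names (`rules`, `record`, `field`, `path`, `pattern`), the sample page, and the `extract` function are all hypothetical, and the page is assumed to be well-formed so that Python's standard XML parser can read it.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rule file. Element and attribute names are illustrative
# assumptions, not the paper's actual extraction language.
RULES_XML = r"""
<rules>
  <record path=".//div[@class='item']">
    <field name="title" path="./h2"/>
    <field name="price" path="./span[@class='price']" pattern="([0-9]+\.[0-9]{2})"/>
  </record>
</rules>
"""

# Assumed sample page, kept well-formed so the stdlib XML parser accepts it.
PAGE = """
<html><body>
  <div class="item"><h2>Book A</h2><span class="price">USD 12.50</span></div>
  <div class="item"><h2>Book B</h2><span class="price">USD 8.00</span></div>
</body></html>
"""


def extract(rules_xml, page):
    """Apply structural (path) rules, then optional textual (regex) rules."""
    rules = ET.fromstring(rules_xml)
    doc = ET.fromstring(page)
    results = []
    for record_rule in rules.findall("record"):
        # Structural rule: locate every repeated record node in the page.
        for node in doc.findall(record_rule.get("path")):
            row = {}
            for field in record_rule.findall("field"):
                # Structural rule: locate the field relative to its record.
                target = node.find(field.get("path"))
                text = (target.text or "").strip() if target is not None else ""
                # Textual rule: narrow the located text with a regex, if given.
                pattern = field.get("pattern")
                if pattern:
                    match = re.search(pattern, text)
                    text = match.group(1) if match else ""
                row[field.get("name")] = text
            results.append(row)
    return results


if __name__ == "__main__":
    for row in extract(RULES_XML, PAGE):
        print(row)
```

In this sketch the `path` attributes play the role of structural rules and the optional `pattern` attribute plays the role of a textual rule. A rule file of this shape could in principle be generated automatically or refined through user interaction, which is the combination the abstract describes.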
Keywords :
Internet; information retrieval; user interfaces; Web information extraction; XML-based data extraction language; semi-supervised Web data extraction; structure-oriented information extraction; user interaction; Databases; HTML; Variable speed drives; accuracy web information extraction; extraction language; extraction rule model;
Conference_Title :
Progress in Informatics and Computing (PIC), 2010 IEEE International Conference on
Conference_Location :
Shanghai
Print_ISBN :
978-1-4244-6788-4
DOI :
10.1109/PIC.2010.5687442