Title :
Structured web information extraction using repetitive subject pattern
Author :
Thamviset, Wachirawut ; Wongthanavasu, Sartra
Author_Institution :
Dept. of Comput. Sci., Khon Kaen Univ., Khon Kaen, Thailand
Abstract :
Data records on a dynamic web page are often generated from databases by server-side scripts using fixed templates or layouts. Generally, each data record on the web page has a subject item that can be used to identify that record. This paper reports a novel semi-supervised information extraction system that requires end-users to supply only one subject item from a sample data record. The system then builds a wrapper and extracts the relevant data records automatically. The proposed system relies on three techniques: a repetitive subject pattern for discovering data records, a subject tree clustering algorithm for clustering target data records, and a subject tree alignment for aligning data items and creating an extraction pattern. For performance evaluation, the proposed system was empirically tested on twelve popular real-world websites in both Thai and English. It achieved 100 percent accuracy in terms of correctly extracted records. In addition, the proposed system is more user-friendly than other similar systems.
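As a rough illustration of the idea behind a repetitive subject pattern, the sketch below takes one sample subject item, looks up the tag path of the text node that contains it, and returns every text node sharing that path. This is a simplified assumption about how such a pattern could work, not the authors' algorithm; the names `PathCollector`, `find_subjects`, `VOID_TAGS`, and the sample `PAGE` are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Record (tag path, text) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.nodes = []   # list of (tag path tuple, stripped text)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append((tuple(self.stack), text))


def find_subjects(html, sample_subject):
    """Return all text nodes that share the sample subject's tag path."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    # Tag path of the first text node equal to the user-supplied subject.
    target = next(
        (path for path, text in collector.nodes if text == sample_subject),
        None,
    )
    if target is None:
        return []
    return [text for path, text in collector.nodes if path == target]


# Hypothetical template-generated page with three records.
PAGE = """
<ul>
  <li><a>Book A</a><span>$10</span></li>
  <li><a>Book B</a><span>$12</span></li>
  <li><a>Book C</a><span>$9</span></li>
</ul>
"""

subjects = find_subjects(PAGE, "Book B")
```

Here one labelled subject ("Book B") is enough to locate the subject of each of the three records, because the template repeats the same tag path for every record. A real system would also need the clustering and alignment steps named in the abstract to separate target records from other repeated regions and to extract the remaining data items.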
Keywords :
Web sites; data mining; information retrieval; learning (artificial intelligence); pattern clustering; trees (mathematics); English; Thai; automatic data record extraction; data record discovery; dynamic Web page; real world Web sites; repetitive subject pattern; sample data records; semisupervised information extraction system; server-side scripts; structured Web information extraction; subject tree alignment; subject tree clustering algorithm; target data record clustering; Accuracy; Data mining; HTML; Layout; USA Councils; Web pages; Wrapper induction; information extraction from WWW; web technologies;
Conference_Title :
Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), 2012 9th International Conference on
Conference_Location :
Phetchaburi
Print_ISBN :
978-1-4673-2026-9
DOI :
10.1109/ECTICon.2012.6254247