Title :
Tree-Structured Template Generation for Web Pages
Author :
Chuang, Shui-Lung ; Hsu, Jane Yung-jen
Author_Institution :
Academia Sinica, Taiwan
Abstract :
As the Web becomes an increasingly important source of information, tools for modeling, searching, and extracting information from Web pages are indispensable. By modeling the structure of a Web page as defined by its markup tags, one can easily extract target information using structural templates. This paper introduces the Tree Template Automatic Generator (TTAG), which learns tree-structured templates from training Web pages. TTAG was applied to both query-based and frequently updated Web sites, and produced effective templates from a small number of examples. The experiments show that TTAG is a powerful extraction tool for semi-structured information sources.
Keywords :
Automata; Data mining; Databases; HTML; Information resources; Information science; Internet; Power generation; Seminars; Web pages
Conference_Title :
Web Intelligence, 2004. WI 2004. Proceedings. IEEE/WIC/ACM International Conference on
Print_ISBN :
0-7695-2100-2
DOI :
10.1109/WI.2004.10101