Title :
The XML-based Information Extraction on Data-intensive Page
Author_Institution :
Dalian Maritime Univ., Dalian
Abstract :
This paper presents an XML-based information extraction method that uses XSLT and XPath to construct extraction rules. The method aims to extract useful information from data-intensive pages. The paper first analyzes the traits of data-intensive pages. Based on those traits, we propose a path induction method that infers the record pattern of a page, obtains the path expressions of the useful information, and finally constructs the extraction rules. The paper also presents a method for optimizing the extraction rules to make them more robust.
Keywords :
XML; knowledge acquisition; XML-based information extraction; XPath; XSLT; data-intensive page; data-intensive pages; extraction rules; path expression; path induction method; record pattern; Computer networks; Concurrent computing; Data mining; Databases; HTML; Optimization methods; Parallel processing; Robustness; Web pages
Conference_Title :
Network and Parallel Computing Workshops, 2007. NPC Workshops. IFIP International Conference on
Conference_Location :
Liaoning
Print_ISBN :
978-0-7695-2943-1
DOI :
10.1109/NPC.2007.153