Title :
Information Extraction from Web Documents Based on Unranked Tree Automaton Inference
Author :
Huang Zhaohua ; Yang Fan
Author_Institution :
Sch. of Software, East China Jiao Tong Univ., Nanchang, China
Abstract :
Information extraction (IE) aims to extract specific information from a collection of documents. Much previous work on IE from semi-structured documents (XML or HTML) uses string-based learning techniques. Some recent work converts the document to a ranked tree and applies tree automaton induction. This paper introduces an algorithm that induces an automaton from unranked trees. Experiments show that this approach gives the best results obtained so far among learning-based methods for IE from semi-structured documents.
Keywords :
Internet; XML; automata theory; document handling; inference mechanisms; information retrieval; learning (artificial intelligence); trees (mathematics); HTML; IE; Web documents; document collection; information extraction; learning techniques; ranked tree; semi-structured documents; tree automaton induction; unranked tree automaton inference; Multimedia communication; Security; (k,l)-contextual tree language; automaton; grammatical inference
Conference_Title :
Multimedia Information Networking and Security (MINES), 2012 Fourth International Conference on
Conference_Location :
Nanjing
Print_ISBN :
978-1-4673-3093-0
DOI :
10.1109/MINES.2012.128