DocumentCode :
588761
Title :
Information Extraction from Web Documents Based on Unranked Tree Automaton Inference
Author :
Huang Zhaohua ; Yang Fan
Author_Institution :
Sch. of Software, East China Jiao Tong Univ., Nanchang, China
fYear :
2012
fDate :
2-4 Nov. 2012
Firstpage :
195
Lastpage :
198
Abstract :
Information extraction (IE) aims at extracting specific information from a collection of documents. A lot of previous work on IE from semi-structured documents (in XML or HTML) uses learning techniques based on strings. Some recent work converts the document to a ranked tree and uses tree automaton induction. This paper introduces an algorithm that uses unranked trees to induce an automaton. Experiments show that this gives the best results obtained so far for IE from semi-structured documents based on learning.
Keywords :
Internet; XML; automata theory; document handling; inference mechanisms; information retrieval; learning (artificial intelligence); trees (mathematics); HTML; IE; Web documents; document collection; information extraction; learning techniques; ranked tree; semi-structured documents; tree automaton induction; unranked tree automaton inference; Multimedia communication; Security; (k,l)-contextual tree language; automaton; grammatical inference
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Multimedia Information Networking and Security (MINES), 2012 Fourth International Conference on
Conference_Location :
Nanjing
Print_ISBN :
978-1-4673-3093-0
Type :
conf
DOI :
10.1109/MINES.2012.128
Filename :
6405661