Information Extraction from Web Documents Based on Unranked Tree Automaton Inference

Author

Huang Zhaohua; Yang Fan

Author_Institution

Sch. of Software, East China Jiao Tong Univ., Nanchang, China

fYear

2012

fDate

2-4 Nov. 2012

Firstpage

195

Lastpage

198

Abstract

Information extraction (IE) aims to extract specific information from a collection of documents. Much previous work on IE from semi-structured documents (in XML or HTML) uses learning techniques based on strings. Some recent work converts each document to a ranked tree and applies tree automaton induction. This paper introduces an algorithm that induces an automaton from unranked trees. Experiments show that this approach gives the best results obtained so far for learning-based IE from semi-structured documents.

Keywords

Internet; XML; automata theory; document handling; inference mechanisms; information retrieval; learning (artificial intelligence); trees (mathematics); HTML; IE; Web documents; document collection; information extraction; learning techniques; ranked tree; semi-structured documents; tree automaton induction; unranked tree automaton inference; Multimedia communication; Security; (k, l)-contextual tree language; automaton; grammatical inference

fLanguage

English

Publisher

IEEE

Conference_Title

Multimedia Information Networking and Security (MINES), 2012 Fourth International Conference on

Conference_Location

Nanjing

Print_ISBN

978-1-4673-3093-0

Type

conf

DOI

10.1109/MINES.2012.128

Filename

6405661