Regional Information Center for Science and Technology - A new method of information extraction from PDF files

DocumentCode :

2327576

Title :

A new method of information extraction from PDF files

Author :

Yuan, Fang ; Bo Lu

Author_Institution :

Coll. of Math. & Comput. Sci., Hebei Univ., Baoding, China

Volume :

fYear :

2005

fDate :

18-21 Aug. 2005

Firstpage :

1738

Abstract :

With the rapid growth in the number of PDF files on the Internet, managing and searching them efficiently has become an urgent problem. The most important step in solving it is extracting information from the PDF files. This paper presents a new method for doing so. The method first parses a PDF file to obtain its text and format information, then injects tags into the text to transform it into semi-structured text, and finally applies a pattern-matching algorithm based on a tree model to extract the desired information. An experiment showed that the method is effective.
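The three stages named in the abstract (parse, inject tags, match against a tree pattern) can be illustrated with a minimal sketch. This is not the authors' algorithm, which the record does not describe in detail. It assumes a hypothetical parser has already produced text runs with a font size and a bold flag; the tag names, the font-size threshold, the `var` attribute, and the simple in-order matching rule are all illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Assumed output of a hypothetical PDF parser: (text, font_size, is_bold).
spans = [
    ("A new method of information extraction from PDF files", 18.0, True),
    ("Yuan, Fang ; Bo Lu", 11.0, False),
    ("Abstract", 11.0, True),
    ("With the rapid increase of the PDF files in Internet ...", 10.0, False),
]

def tag_for(font_size, is_bold):
    """Map format information to a tag name (illustrative rules only)."""
    if font_size >= 16:
        return "title"
    if is_bold:
        return "heading"
    return "text"

def inject_tags(spans):
    """Wrap each text run in a tag, giving a semi-structured tree."""
    root = ET.Element("doc")
    for text, size, bold in spans:
        ET.SubElement(root, tag_for(size, bold)).text = text
    return root

def match(pattern, node):
    """Match a pattern tree against a document tree.

    Tags must be equal, and the pattern's children must match an
    in-order subsequence of the node's children (first fit, no
    backtracking). Returns the captured fields, or None on failure.
    """
    if pattern.tag != node.tag:
        return None
    found = {}
    if "var" in pattern.attrib:
        found[pattern.get("var")] = (node.text or "").strip()
    candidates = iter(node)  # shared iterator keeps matches in order
    for sub in pattern:
        for child in candidates:
            result = match(sub, child)
            if result is not None:
                found.update(result)
                break
        else:
            return None  # ran out of children before matching `sub`
    return found

# Pattern tree: a title, then a text run, then a heading, then a text run.
pattern = ET.fromstring(
    '<doc><title var="title"/><text var="authors"/>'
    '<heading/><text var="abstract"/></doc>'
)

if __name__ == "__main__":
    print(match(pattern, inject_tags(spans)))
```

The matcher takes the first child that fits each pattern node, so a real implementation would likely need backtracking and more robust format rules than a single font-size threshold.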

Keywords :

document image processing; feature extraction; pattern matching; text analysis; tree data structures; PDF file; information extraction; tree model; Computer science; Data mining; Educational institutions; Electronic mail; Engineering management; Information science; Information technology; Internet; Mathematics; PDF; Pattern match algorithm based on tree model; Semi-structured data

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Machine Learning and Cybernetics, 2005. Proceedings of 2005 International Conference on

Conference_Location :

Guangzhou, China

Print_ISBN :

0-7803-9091-1

Type :

conf

DOI :

10.1109/ICMLC.2005.1527225

Filename :

1527225

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2327576