Title :
A new method of information extraction from PDF files
Author :
Yuan, Fang ; Bo Lu
Author_Institution :
Coll. of Math. & Comput. Sci., Hebei Univ., Baoding, China
Abstract :
With the rapid growth in the number of PDF files on the Internet, managing and searching them efficiently has become an urgent problem. The most important step in solving this problem is extracting information from the PDF files. This paper presents a new method for doing so. The method first parses a PDF file to obtain its text and format information, then injects tags into the text to transform it into semi-structured text, and finally applies a pattern-matching algorithm based on a tree model to obtain the extraction result. An experiment showed that the method is effective.
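The abstract does not give the algorithm's details, so the following is only a minimal sketch of the three stages it names (parse, tag injection, tree-based pattern matching), not the authors' implementation. The parser output format (`RUNS`), the font-size thresholds, the tag names, and the helpers `tag_for`, `inject_tags`, and `matches` are all hypothetical, and the ordered-subsequence matching rule is an assumed stand-in for the paper's tree-model matcher.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Assumed output of the parsing stage: (text, font_size) runs in reading order.
RUNS = [
    ("A new method of information extraction from PDF files", 18.0),
    ("Abstract", 12.0),
    ("This paper presents a new method.", 10.0),
]


def tag_for(font_size):
    """Map a font size to a tag name (thresholds are illustrative assumptions)."""
    if font_size >= 16:
        return "title"
    if font_size >= 12:
        return "heading"
    return "para"


def inject_tags(runs):
    """Wrap each text run in a tag derived from its format information."""
    body = "".join(
        f"<{tag_for(size)}>{escape(text)}</{tag_for(size)}>" for text, size in runs
    )
    return f"<doc>{body}</doc>"


def matches(node, pattern):
    """Return True if the pattern tree matches the subtree rooted at node.

    A pattern is (tag, [child patterns]). The node must carry the same tag,
    and the child patterns must match, in order, a subsequence of its children.
    """
    tag, child_patterns = pattern
    if node.tag != tag:
        return False
    children = iter(node)  # shared iterator enforces left-to-right order
    return all(any(matches(c, p) for c in children) for p in child_patterns)


# Hypothetical pattern: a document whose title is later followed by a heading.
PATTERN = ("doc", [("title", []), ("heading", [])])

root = ET.fromstring(inject_tags(RUNS))
if matches(root, PATTERN):
    extracted_title = root.find("title").text
```

Once the pattern matches, the fields of interest are read from the matched nodes; here `extracted_title` holds the text of the first `title` element. Real PDF input would need an actual parser in place of the hard-coded `RUNS`.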
Keywords :
document image processing; feature extraction; pattern matching; text analysis; tree data structures; PDF file; information extraction; tree model; Computer science; Data mining; Educational institutions; Electronic mail; Engineering management; Information science; Information technology; Internet; Mathematics; PDF; Pattern match algorithm based on tree model; Semi-structured data
Conference_Title :
Machine Learning and Cybernetics, 2005. Proceedings of 2005 International Conference on
Conference_Location :
Guangzhou, China
Print_ISBN :
0-7803-9091-1
DOI :
10.1109/ICMLC.2005.1527225