Text and Layout Information Extraction from Document Files of Various Formats Based on the Analysis of Page Description Language

Author

Hirano, Takashi ; Okano, Yuichi ; Okada, Yasuhiro ; Yoda, Fumio

Author_Institution

Mitsubishi Electric Corp., Kamakura

Volume

1

fYear

2007

fDate

23-26 Sept. 2007

Firstpage

262

Lastpage

266

Abstract

We propose a document analysis method that extracts text and layout information from document files of various formats. The method analyzes the page description language (PDL) data generated when a document is printed. Because any printable document can be converted to PDL data, the method can handle various document formats. Graphic elements in the PDL data, such as text objects, image objects, and path objects, are analyzed to extract text and layout information (character size, character position, and table position). Applying OCR to the image objects and path objects converts text images in source documents and vectorized font characters in engineering drawings to text. Tables in various documents are detected by analyzing path objects. It is therefore possible to extract the full content information from document files of various formats, as long as the document is printable.
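As an illustration of the table-detection step only, the following is a minimal sketch of how ruled tables might be located from path objects. It is not the authors' algorithm: the `Segment` type, the `detect_table` function, the tolerance value, and the rule "at least two horizontal and two vertical lines that cross each other" are all assumptions made for this example, and the path objects are assumed to have already been reduced to straight, axis-aligned line segments.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Segment:
    """A straight path object given by its two end points (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float


def split_rules(segments: List[Segment], tol: float = 0.5):
    """Separate axis-aligned segments into horizontal and vertical rules."""
    horizontal = [s for s in segments
                  if abs(s.y0 - s.y1) <= tol and abs(s.x0 - s.x1) > tol]
    vertical = [s for s in segments
                if abs(s.x0 - s.x1) <= tol and abs(s.y0 - s.y1) > tol]
    return horizontal, vertical


def crosses(h: Segment, v: Segment, tol: float = 0.5) -> bool:
    """True if horizontal rule h and vertical rule v touch or cross."""
    hx_lo, hx_hi = sorted((h.x0, h.x1))
    vy_lo, vy_hi = sorted((v.y0, v.y1))
    return (hx_lo - tol <= v.x0 <= hx_hi + tol
            and vy_lo - tol <= h.y0 <= vy_hi + tol)


def detect_table(segments: List[Segment],
                 tol: float = 0.5) -> Optional[Tuple[float, float, float, float]]:
    """Return the bounding box (x_min, y_min, x_max, y_max) of a ruled table.

    Assumption for this sketch: a table consists of at least two horizontal
    and two vertical rules, each crossing at least two rules of the other
    orientation. Returns None when no such grid is found.
    """
    horizontal, vertical = split_rules(segments, tol)
    h_kept = [h for h in horizontal
              if sum(crosses(h, v, tol) for v in vertical) >= 2]
    v_kept = [v for v in vertical
              if sum(crosses(h, v, tol) for h in horizontal) >= 2]
    if len(h_kept) < 2 or len(v_kept) < 2:
        return None
    kept = h_kept + v_kept
    xs = [x for s in kept for x in (s.x0, s.x1)]
    ys = [y for s in kept for y in (s.y0, s.y1)]
    return (min(xs), min(ys), max(xs), max(ys))
```

Calling `detect_table` on the line segments of one page yields a single bounding box for the ruled grid, or `None` if the page has no grid. Isolated lines such as underlines are ignored because they do not cross at least two rules of the other orientation. A page with several tables would need the segments grouped into connected components first, which this sketch does not attempt.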

Keywords

information retrieval; optical character recognition; text analysis; OCR; document analysis method; document format handling; graphic elements; image object; layout information extraction; page description language analysis; path object; text information extraction; text object; Data analysis; Data mining; Graphics; Image analysis; Image converters; Information analysis; Layout; Optical character recognition software; Page description languages; Text analysis

fLanguage

English

Publisher

IEEE

Conference_Titel

Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on

Conference_Location

Curitiba, Paraná, Brazil

ISSN

1520-5363

Print_ISBN

978-0-7695-2822-9

Type

conf

DOI

10.1109/ICDAR.2007.4378716

Filename

4378716