DocumentCode
2021945
Title
Text and Layout Information Extraction from Document Files of Various Formats Based on the Analysis of Page Description Language
Author
Hirano, Takashi ; Okano, Yuichi ; Okada, Yasuhiro ; Yoda, Fumio
Author_Institution
Mitsubishi Electr. Corp., Kamakura
Volume
1
fYear
2007
fDate
23-26 Sept. 2007
Firstpage
262
Lastpage
266
Abstract
We propose a document analysis method that extracts text and layout information from document files of various formats. The method analyzes the page description language (PDL) data generated when a document is printed. Because any printable document can be converted to PDL data, the method can handle various document formats. Graphic elements in the PDL data, such as text objects, image objects, and path objects, are analyzed to extract text and layout information (character size, character position, and table position). By applying OCR to the image objects and the path objects, text images in source documents and vectorized font characters in engineering drawings are converted to text. Moreover, tables in various documents are detected by analyzing path objects. As a result, the full content information can be extracted from document files of various formats as long as the document is printable.
Keywords
information retrieval; optical character recognition; text analysis; OCR; document analysis method; document format handling; graphic elements; image object; layout information extraction; page description language analysis; path object; text information extraction; text object; Data analysis; Data mining; Graphics; Image analysis; Image converters; Information analysis; Layout; Optical character recognition software; Page description languages; Text analysis
fLanguage
English
Publisher
IEEE
Conference_Titel
Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on
Conference_Location
Parana
ISSN
1520-5363
Print_ISBN
978-0-7695-2822-9
Type
conf
DOI
10.1109/ICDAR.2007.4378716
Filename
4378716