Title :
Text and Layout Information Extraction from Document Files of Various Formats Based on the Analysis of Page Description Language
Author :
Hirano, Takashi ; Okano, Yuichi ; Okada, Yasuhiro ; Yoda, Fumio
Author_Institution :
Mitsubishi Electric Corp., Kamakura
Abstract :
We propose a document analysis method that extracts text and layout information from document files of various formats. The method analyzes the page description language (PDL) data generated when a document is printed. Because any printable document can be converted to PDL data, the method is not tied to a particular file format. Graphic elements in the PDL data, such as text objects, image objects, and path objects, are analyzed to extract text and layout information (character size, character position, and table position). Applying OCR to the image objects and path objects converts text images in source documents and vectorized font characters in engineering drawings to text. Tables are detected by analyzing path objects. It is therefore possible to extract the full content information from document files of various formats, as long as the document is printable.
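The object-level analysis described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `PageObject` record, the `is_ruling` thickness heuristic, the `analyze_page` function, and the `ocr` callback are all hypothetical names introduced here, and the table rule (at least two horizontal and two vertical thin paths) is an assumed stand-in for whatever path analysis the paper actually uses.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page units


@dataclass
class PageObject:
    """Hypothetical stand-in for one graphic element parsed from PDL data."""
    kind: str                   # "text", "image", or "path"
    bbox: BBox
    text: Optional[str] = None  # set only for text objects


def is_ruling(obj: PageObject, max_thickness: float = 2.0) -> Optional[str]:
    """Return "h" or "v" if a path looks like a thin straight rule, else None.

    Assumed heuristic: a rule is thin in one direction and long in the other.
    """
    x0, y0, x1, y1 = obj.bbox
    width, height = x1 - x0, y1 - y0
    if height <= max_thickness < width:
        return "h"
    if width <= max_thickness < height:
        return "v"
    return None


def analyze_page(
    objects: List[PageObject],
    ocr: Callable[[PageObject], str],
) -> Tuple[List[Tuple[str, BBox]], Optional[BBox]]:
    """Collect (text, position) pairs and one candidate table region."""
    texts: List[Tuple[str, BBox]] = []
    rules = {"h": [], "v": []}

    for obj in objects:
        if obj.kind == "text":
            # Text objects already carry character codes and positions.
            texts.append((obj.text or "", obj.bbox))
        elif obj.kind == "image":
            # Text embedded in images has to be recognized by OCR.
            recognized = ocr(obj)
            if recognized:
                texts.append((recognized, obj.bbox))
        elif obj.kind == "path":
            direction = is_ruling(obj)
            if direction is not None:
                rules[direction].append(obj.bbox)
            else:
                # Other paths may be vectorized characters, so try OCR.
                recognized = ocr(obj)
                if recognized:
                    texts.append((recognized, obj.bbox))

    table: Optional[BBox] = None
    if len(rules["h"]) >= 2 and len(rules["v"]) >= 2:
        # Assumed rule: the union of all rulings bounds the table.
        boxes = rules["h"] + rules["v"]
        table = (
            min(b[0] for b in boxes),
            min(b[1] for b in boxes),
            max(b[2] for b in boxes),
            max(b[3] for b in boxes),
        )
    return texts, table
```

In this sketch the `ocr` argument is supplied by the caller, so any OCR engine (or a stub, for testing) can be plugged in without changing the routing logic.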
Keywords :
information retrieval; optical character recognition; text analysis; OCR; document analysis method; document format handling; graphic elements; image object; layout information extraction; page description language analysis; path object; text information extraction; text object; Data analysis; Data mining; Graphics; Image analysis; Image converters; Information analysis; Layout; Optical character recognition software; Page description languages
Conference_Title :
Ninth International Conference on Document Analysis and Recognition (ICDAR 2007)
Conference_Location :
Parana
Print_ISBN :
978-0-7695-2822-9
DOI :
10.1109/ICDAR.2007.4378716