DocumentCode
2830213
Title
Text extraction on Windows®-based documents
Author
Ray, Brian ; Chiang, Chia-Chu ; Melescue, Jim
Author_Institution
Dept. of Comput. Sci., Arkansas Univ., AR, USA
fYear
2005
fDate
16-18 Aug. 2005
Firstpage
205
Lastpage
210
Abstract
Syntel LLC is the developer of a mail presorting application called AutoMail®, which needs to alter bank statements that are being printed. For this and other applications, it is sometimes impossible to exert any control over the document creation software, but changes to the printed documents must nevertheless be made. The purpose of this project is to retrieve data which has been sent to the Microsoft Windows® printing subsystem, parse the data, modify sections of text contained within each document, and continue the print process, leaving the document unmolested except for the altered sections of text. This is done by processing enhanced metafile (EMF) documents, and generating XML documents formatted to be easily read by the software modules responsible for actually altering the text data. During some phase of the print process on Microsoft Windows operating systems, each page will exist as an EMF document. Each EMF document consists of a number of entries describing drawing operations. Those drawing operations which are found to pertain to text output in the important spatial regions of the document are converted to plain text. This text, along with certain formatting and positioning information, is written to the XML file. All other drawing operations are included in the XML file as "black box" entities, so that the document can be repackaged after processing. Repackaging is accomplished by creating new text drawing operations, reinserting the other drawing operations, and using the Windows® API to print the resulting EMF document.
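The extraction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the page is available as raw EMF bytes, that text is drawn with EMR_EXTTEXTOUTW records (type 84), and that all other records can be stored as hex. The element names `page`, `text`, and `opaque` and the helper `make_text_record` are hypothetical and chosen only for illustration.

```python
import struct
from xml.etree import ElementTree as ET

EMR_EOF = 14          # end-of-file record type
EMR_EXTTEXTOUTW = 84  # Unicode text output record type
TEXT_FIXED_SIZE = 76  # bytes before the string in a minimal EMR_EXTTEXTOUTW


def iter_records(data):
    """Yield (type, raw_bytes) for each record; every record starts with Type and Size."""
    offset = 0
    while offset + 8 <= len(data):
        rec_type, size = struct.unpack_from("<II", data, offset)
        if size < 8 or offset + size > len(data):
            raise ValueError("corrupt record at offset %d" % offset)
        yield rec_type, data[offset:offset + size]
        offset += size
        if rec_type == EMR_EOF:
            break


def decode_text_record(record):
    """Return (x, y, text) from an EMR_EXTTEXTOUTW record.

    The embedded EmrText object starts at byte 36: reference point (x, y),
    character count, then the string offset measured from the record start.
    """
    x, y, n_chars, off_string = struct.unpack_from("<iiII", record, 36)
    raw = record[off_string:off_string + 2 * n_chars]
    return x, y, raw.decode("utf-16-le")


def emf_to_xml(data):
    """Build an XML tree: text records become <text>, all others stay opaque."""
    root = ET.Element("page")  # hypothetical element names throughout
    for rec_type, record in iter_records(data):
        if rec_type == EMR_EXTTEXTOUTW:
            x, y, text = decode_text_record(record)
            node = ET.SubElement(root, "text", x=str(x), y=str(y))
            node.text = text
        else:
            # "Black box" entry: keep the bytes so the page can be repackaged.
            node = ET.SubElement(root, "opaque", type=str(rec_type))
            node.text = record.hex()
    return root


def make_text_record(x, y, text):
    """Hypothetical helper: build a synthetic text record for demonstration only."""
    encoded = text.encode("utf-16-le")
    padding = (-len(encoded)) % 4  # records are padded to a 4-byte boundary
    size = TEXT_FIXED_SIZE + len(encoded) + padding
    fixed = struct.pack(
        "<II4iIffiiIII4iI",
        EMR_EXTTEXTOUTW, size,
        0, 0, 0, 0,             # bounds
        1, 1.0, 1.0,            # graphics mode, x scale, y scale
        x, y,                   # reference point
        len(encoded) // 2,      # number of UTF-16 code units
        TEXT_FIXED_SIZE,        # offset of the string from the record start
        0,                      # options
        0, 0, 0, 0,             # clipping/opaquing rectangle
        0,                      # no spacing array in this synthetic record
    )
    return fixed + encoded + b"\x00" * padding
```

A synthetic stream such as `make_text_record(100, 200, "Balance")` followed by an EMR_EOF record can be passed to `emf_to_xml`, and the result serialized with `ET.tostring(root)`. Region filtering, font state, and the repackaging step mentioned in the abstract are omitted from this sketch.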
Keywords
XML; data structures; operating systems (computers); postal services; text analysis; Microsoft Windows operating system; Microsoft Windows printing subsystem; Windows API; Windows-based documents; XML documents; XML file; data parsing; data retrieval; document processing; enhanced metafile documents; text data alteration; text drawing operation; text extraction; Application software; Computer science; Information retrieval; Operating systems; Postal services; Printers; Printing; Programming; Systems engineering and theory
fLanguage
English
Publisher
IEEE
Conference_Titel
Systems Engineering, 2005. ICSEng 2005. 18th International Conference on
Print_ISBN
0-7695-2359-5
Type
conf
DOI
10.1109/ICSENG.2005.80
Filename
1562853