Title :
Automatic Content Extraction on Semi-structured Documents
Author :
Santos, José Eduardo Bastos dos
Author_Institution :
Perceptive Software, Shawnee, KS, USA
Abstract :
Extracting specific content from certain types of documents can be very challenging, especially when the solution is meant to be general rather than tailored and refrains from using explicit contextual information. In this paper, we address the problem of automatically extracting data from semi-structured documents through an unsupervised process based on an analysis of the document's own morphological composition. We also discuss how this approach can be applied to different types of documents, with special attention paid to college transcripts. The success of our method is supported by extensive tests, from which we have drawn some authentic examples.
Keywords :
content management; document handling; authentic example; automatic content extraction; automatic data extraction; college transcripts; contextual information; morphological composition; semistructured document; unsupervised process; Accuracy; Conferences; Educational institutions; Feature extraction; Layout; Text analysis; automatic zoning; data extraction; document image understanding; geometric and logical layout analysis; invoices; page decomposition
Conference_Title :
Document Analysis and Recognition (ICDAR), 2011 International Conference on
Conference_Location :
Beijing
Print_ISBN :
978-1-4577-1350-7
ISSN :
1520-5363
DOI :
10.1109/ICDAR.2011.249