Automatic Content Extraction on Semi-structured Documents

Author

Santos, José Eduardo Bastos dos

Author_Institution

Perceptive Software, Shawnee, OK, USA

fYear

2011

fDate

18-21 Sept. 2011

Firstpage

1235

Lastpage

1239

Abstract

Extracting specific content from certain types of documents can be a very challenging task, especially when the solution is meant to be general rather than narrowly tailored and refrains from using explicit contextual information. In this paper, we address the problem of automatically extracting data from semi-structured documents through an unsupervised process based on an analysis of the document's own morphological composition. We also discuss how this approach can be applied to different types of documents, with special attention paid to college transcripts. The success of our method is supported by extensive tests, from which we have drawn some authentic examples.

Keywords

content management; document handling; authentic example; automatic content extraction; automatic data extraction; college transcripts; contextual information; morphological composition; semistructured document; unsupervised process; Accuracy; Conferences; Educational institutions; Feature extraction; Layout; Text analysis; automatic zoning; data extraction; document image understanding; geometric and logical layout analysis; invoices; page decomposition

fLanguage

English

Publisher

IEEE

Conference_Titel

Document Analysis and Recognition (ICDAR), 2011 International Conference on

Conference_Location

Beijing

ISSN

1520-5363

Print_ISBN

978-1-4577-1350-7

Electronic_ISBN

1520-5363

Type

conf

DOI

10.1109/ICDAR.2011.249

Filename

6065507