DocumentCode :
2148572
Title :
Automatic Content Extraction on Semi-structured Documents
Author :
Santos, José Eduardo Bastos dos
Author_Institution :
Perceptive Software, Shawnee, KS, USA
fYear :
2011
fDate :
18-21 Sept. 2011
Firstpage :
1235
Lastpage :
1239
Abstract :
Extracting specific content from certain types of documents can be a very challenging task, especially when developing a not so tailored solution and refraining from using explicit contextual information. In this paper, we address the problem of automatically extracting data from semi-structured documents through an unsupervised process based on an analysis of the document's own morphological composition. We also discuss how this approach can be applied to different types of documents, with special attention being paid to college transcripts. The success of our method is supported by extensive tests, from which we have drawn some authentic examples.
Keywords :
content management; document handling; authentic example; automatic content extraction; automatic data extraction; college transcripts; contextual information; morphological composition; semistructured document; unsupervised process; Accuracy; Conferences; Educational institutions; Feature extraction; Layout; Text analysis; automatic zoning; college transcripts; data extraction; document image understanding; geometric and logical layout analysis; invoices; page decomposition
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Document Analysis and Recognition (ICDAR), 2011 International Conference on
Conference_Location :
Beijing
ISSN :
1520-5363
Print_ISBN :
978-1-4577-1350-7
Electronic_ISBN :
1520-5363
Type :
conf
DOI :
10.1109/ICDAR.2011.249
Filename :
6065507