Title :
A corpus for comparative evaluation of OCR software and postcorrection techniques
Author :
Mihov, Stoyan ; Schulz, Klaus U. ; Ringlstetter, Christoph ; Dojchinova, Veselka ; Nakova, Vanja ; Kalpakchieva, Kristina ; Gerasimov, Ognjan ; Gotscharek, Annette ; Gercke, Claudia
Author_Institution :
Bulgarian Acad. of Sci., Sofia, Bulgaria
fDate :
29 Aug.-1 Sept. 2005
Abstract :
We describe a new corpus collected for the comparative evaluation of OCR software and postcorrection techniques. The corpus is freely available to academic groups and for academic use. The major part of the corpus (2306 files) consists of Bulgarian documents, many of which contain both Cyrillic and Latin symbols. A smaller corpus of German documents has been added. All original documents are real-life paper documents collected from enterprises and organizations. Most genres of written language and various document types are covered. The corpus contains the corresponding image files, rich meta-data textual files obtained via OCR, ground truth data for hundreds of example pages, and alignment software for experiments.
Keywords :
document handling; meta data; natural languages; optical character recognition; Bulgarian documents; Cyrillic symbol; German documents; Latin symbols; OCR software; postcorrection techniques; Application software; Character recognition; Computational Intelligence Society; Image recognition; Image reconstruction; Optical character recognition software; Testing; Text analysis; Cyrillic documents; comparative evaluation; ground truth data; mixed-alphabet documents; postcorrection of OCR results; public corpora
Conference_Title :
Document Analysis and Recognition, 2005. Proceedings. Eighth International Conference on
Print_ISBN :
0-7695-2420-6
DOI :
10.1109/ICDAR.2005.6