• DocumentCode
    2015505
  • Title
    The hOCR Microformat for OCR Workflow and Results
  • Author
    Breuel, Thomas M. (U. Kaiserslautern)
  • Volume
    2
  • fYear
    2007
  • fDate
    23-26 Sept. 2007
  • Firstpage
    1063
  • Lastpage
    1067
  • Abstract
    Large scale scanning and document conversion efforts have led to a renewed interest in OCR systems and workflows. This paper describes a new format for representing both intermediate and final OCR results, developed in response to the needs of a newly developed OCR system and ground truth data release. The format embeds OCR information invisibly inside the HTML and CSS standards and therefore can represent a wide range of linguistic and typographic phenomena with already well-defined, widely understood markup and can be processed using widely available and known tools. The format is based on a new, multi-level abstraction of OCR results based on logical markup, common typesetting models, and OCR engine-specific markup, making it suitable both for the support of existing workflows and the development of future model-based OCR engines.
  • Keywords
    document handling; search engines; CSS standards; HTML standards; OCR engine-specific markup; OCR workflow; document conversion efforts; ground truth data release; hOCR microformat; logical markup; typesetting models; typographic phenomena; Cascading style sheets; Databases; Engines; HTML; History; Large-scale systems; Optical character recognition software; Typesetting; Writing; XML
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on
  • Conference_Location
    Curitiba, Paraná, Brazil
  • ISSN
    1520-5363
  • Print_ISBN
    978-0-7695-2822-9
  • Type
    conf
  • DOI
    10.1109/ICDAR.2007.4377078
  • Filename
    4377078