• DocumentCode
    2147389
  • Title
    The Four and a Half Challenges of Humanities Data
  • Author
    Küster, Marc Wilhelm
  • Author_Institution
    Univ. of Appl. Sci. Worms, Worms, Germany
  • fYear
    2011
  • fDate
    18-21 Sept. 2011
  • Firstpage
    1017
  • Lastpage
    1023
  • Abstract
    The lead medium of the humanities is text, but text with special characteristics that can differ considerably from a normal monolingual article in most modern scripts. Such text can be derived from manuscripts, from the retrodigitization of earlier scholarly publications such as critical editions and dictionaries, or from books printed centuries ago that apply conventions no longer in force today. The keynote identifies four major challenges for recognizing humanities data: unusual characters, unusual layouts, unusual semantics, and unusual segmentations. Each challenge is illustrated with concrete examples taken from a variety of times and places, among them cuneiform tablets, an extract from a Greek manuscript, a page from a multilingual critical edition, a renaissance print, and a lemma from a scholarly dictionary. In addition, scholarly humanities data is typically marked up using rich, domain-specific XML formats based on the TEI P5 guidelines. Any format that an OCR program produces must be rich enough to permit a mapping onto TEI-compliant markup if it is to reproduce the full richness of the original. A closer look at the TextGrid virtual research environment for the humanities and its Text-Image Link Editor (TBLE) demonstrates how scholars currently tackle these tasks. It analyzes where automation can facilitate their work and enable new dimensions of research.
  • Keywords
    XML; dictionaries; humanities; natural languages; optical character recognition; publishing; text analysis; Greek manuscript; OCR program; TEI P5 guideline; TEI-compliant markup; TextGrid virtual research environment; XML-based format; cuneiform tablet; humanity data recognition; modern script; multilingual critical edition; normal monolingual article; renaissance print; scholarly dictionary; scholarly publication; text characteristics; text-image link editor; unusual character; unusual layout; unusual segmentation; unusual semantics; Character recognition; Dictionaries; Image segmentation; Layout; Optical character recognition software; Semantics; Shape; OCR; TEI; XML; eHumanities; humanities;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition (ICDAR), 2011 International Conference on
  • Conference_Location
    Beijing
  • ISSN
    1520-5363
  • Print_ISBN
    978-1-4577-1350-7
  • Electronic_ISBN
    1520-5363
  • Type
    conf
  • DOI
    10.1109/ICDAR.2011.206
  • Filename
    6065464