• DocumentCode
    2530052
  • Title
    Document digitization technology and its application for digital library in China
  • Author
    Ding, Xiaoqing ; Wen, Di ; Peng, Liangrui ; Liu, Changsong
  • Author_Institution
    Dept. of Electron. Eng., Tsinghua Univ., Beijing, China
  • fYear
    2004
  • fDate
    2004
  • Firstpage
    46
  • Lastpage
    53
  • Abstract
    We introduce research on document digitization technology and its application to constructing digital libraries in China. We focus on two major objectives of document digitization technologies: performance and efficiency. Taking the TH-OCR product as a representative example, we present up-to-date research achievements in China on both kernel OCR technologies and peripheral technologies. The kernel technologies include high-performance multilingual (Chinese, Japanese, Korean and English) text recognition, layout analysis, understanding and reconstruction; the peripheral technologies include the network document digitization workflow and intelligent proofreading, which greatly improve efficiency. TH-OCR applications produce two types of final digital document: a reconstructed electronic document carrying the full text and layout information of the original paper-based document, and a multilevel document with an OCR output text layer under the image layer. Numerous applications indicate that current technologies can greatly facilitate the mass-volume digitization labour involved in building digital library infrastructure.
  • Keywords
    digital libraries; document image processing; optical character recognition; text analysis; TH-OCR product; digital library; document digitization technology; electronic document; intelligent proofreading; kernel OCR technology; layout analysis; mass-volume digitization labour; multilingual character recognition; network document digitization workflow; paper-based document; peripheral technology; text recognition; Automation; Books; Character recognition; Humans; Image reconstruction; Intelligent networks; Kernel; Laboratories; Optical character recognition software; Software libraries;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Image Analysis for Libraries, 2004. Proceedings. First International Workshop on
  • Print_ISBN
    0-7695-2088-X
  • Type
    conf
  • DOI
    10.1109/DIAL.2004.1263236
  • Filename
    1263236