  • DocumentCode
    3141086
  • Title
    Duplicate detection for symbolically compressed documents
  • Author
    Lee, Dar-Shyang ; Hull, Jonathan J.
  • Author_Institution
    Ricoh Silicon Valley Inc., Menlo Park, CA, USA
  • fYear
    1999
  • fDate
    20-22 Sep 1999
  • Firstpage
    305
  • Lastpage
    308
  • Abstract
    A new family of symbolic compression algorithms has recently been developed that includes the ongoing JBIG2 standardization effort as well as related commercial products. These techniques are specifically designed for binary document images. They cluster individual blobs in a document and store the sequence of occurrence of blobs and representative blob templates, hence the name symbolic compression. This paper describes a method for duplicate detection on symbolically compressed document images. It recognizes the text in an image by deciphering the sequence of occurrence of blobs in the compressed representation. We propose a Hidden Markov Model (HMM) method for solving such deciphering problems and suggest applications in multilingual document duplicate detection.
  • Keywords
    hidden Markov models; image matching; standardisation; visual databases; JBIG2 standardization; binary document images; blob templates; deciphering problems; duplicate detection; hidden Markov model; multilingual document duplicate detection; symbolic compression algorithms; symbolically compressed document images; symbolically compressed documents; Clustering algorithms; Compression algorithms; Hidden Markov models; Image coding; Image recognition; Image storage; Pattern matching; Silicon; Standardization; Text recognition
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition, 1999. ICDAR '99. Proceedings of the Fifth International Conference on
  • Conference_Location
    Bangalore
  • Print_ISBN
    0-7695-0318-7
  • Type
    conf
  • DOI
    10.1109/ICDAR.1999.791785
  • Filename
    791785
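
The symbolic representation described in the abstract (a list of blob templates plus the sequence in which they occur) can be illustrated with a minimal Python sketch. It is not the paper's HMM deciphering method. It uses a much simpler stand-in, first-occurrence relabeling, to show why two compressed copies of the same text can be matched even when their template indices differ. The function names `symbolic_compress` and `canonical_pattern` are hypothetical, and characters stand in for blob bitmaps, assuming clustering is perfect.

```python
def symbolic_compress(blobs):
    """Toy symbolic compression: return (templates, sequence).

    `blobs` is any iterable of hashable items standing in for glyph bitmaps.
    Assumption: identical blobs always fall in the same cluster (no noise).
    """
    templates = []   # one representative per cluster
    index_of = {}    # blob -> template index
    sequence = []    # template index for each blob, in reading order
    for blob in blobs:
        if blob not in index_of:
            index_of[blob] = len(templates)
            templates.append(blob)
        sequence.append(index_of[blob])
    return templates, sequence


def canonical_pattern(sequence):
    """Relabel template indices by order of first occurrence.

    Two sequences that differ only by a renaming of template indices
    (a substitution cipher on the indices) map to the same pattern.
    """
    first_seen = {}
    return [first_seen.setdefault(s, len(first_seen)) for s in sequence]


# The same text compressed twice, with different (arbitrary) template numbering.
_, seq_a = symbolic_compress("hello world")
seq_b = [9, 4, 7, 7, 1, 0, 3, 1, 5, 7, 2]  # hypothetical second encoder's indices

is_duplicate = canonical_pattern(seq_a) == canonical_pattern(seq_b)
```

Exact pattern equality only works under the no-noise assumption above. Real scans produce clustering errors, which is the kind of uncertainty a probabilistic model such as the HMM named in the abstract is meant to handle.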