• DocumentCode
    2149369
  • Title
    A Fast Appearance-Based Full-Text Search Method for Historical Newspaper Images
  • Author
    Terasawa, Kengo ; Shima, Takahiro ; Kawashima, Toshio
  • Author_Institution
    Grad. Sch. of Syst. Inf. Sci., Future Univ. Hakodate, Hakodate, Japan
  • fYear
    2011
  • fDate
    18-21 Sept. 2011
  • Firstpage
    1379
  • Lastpage
    1383
  • Abstract
    This paper presents a fast appearance-based full-text search method for historical newspaper images. Because historical newspapers differ from recent ones in image quality, typefaces, and language usage, optical character recognition (OCR) does not achieve sufficient accuracy on them. Instead of an OCR approach, we adopt an appearance-based approach: characters are matched to one another by their shapes. Assuming proper character segmentation and feature description, the full-text search problem reduces to a sequence matching problem over feature vectors. To increase computational efficiency, we adopt a pseudo-code expression called LSPC (Locality-Sensitive Pseudo-Code), a compact sketch of a feature vector that retains a good deal of its information. Experimental results show that our method can retrieve a query string from a text of over eight million characters within a second. In addition, we expect that more sophisticated algorithms can be designed for LSPC. As an example, we developed the Extended Boyer-Moore-Horspool algorithm, which further reduces the computational cost, especially for longer query strings.
  • Keywords
    character sets; feature extraction; full-text databases; history; image matching; optical character recognition; publishing; query processing; vectors; LSPC; OCR; appearance-based approach; appearance-based full-text search method; character matching; character segmentation; computational cost; computational efficiency; extended Boyer-Moore-Horspool algorithm; feature description; feature vector; full-text search problem; historical newspaper images; image quality; language usages; optical character recognition; pseudo-code expression; query string; sequence matching problem; sufficient quality; type fonts; Feature extraction; Image segmentation; Text analysis; Vectors; Boyer-Moore-Horspool algorithm; Locality-Sensitive Pseudo-Code; historical document images; string matching; word spotting;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition (ICDAR), 2011 International Conference on
  • Conference_Location
    Beijing
  • ISSN
    1520-5363
  • Print_ISBN
    978-1-4577-1350-7
  • Electronic_ISBN
    1520-5363
  • Type
    conf
  • DOI
    10.1109/ICDAR.2011.277
  • Filename
    6065536
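
The abstract reduces full-text search to sequence matching over per-character codes. As a rough illustration only, the sketch below runs the classic Boyer-Moore-Horspool algorithm over a sequence of integer codes. It is not the authors' Extended Boyer-Moore-Horspool algorithm, whose details are not given in this record. The function name `horspool_search` and the integer codes are hypothetical stand-ins for LSPC values, and exact equality between codes is an assumption made for simplicity.

```python
from typing import Hashable, List, Sequence


def horspool_search(text: Sequence[Hashable], pattern: Sequence[Hashable]) -> List[int]:
    """Return every start index where `pattern` occurs in `text` (classic Horspool)."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []

    # Bad-character table: for each code in the pattern (except the last
    # position), the distance from its last occurrence to the pattern's end.
    shift = {code: m - 1 - i for i, code in enumerate(pattern[:-1])}

    hits = []
    pos = 0
    while pos <= n - m:
        # Compare the current window against the pattern, right to left.
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            hits.append(pos)
        # Shift according to the code under the window's last position;
        # codes absent from the table allow a full jump of m.
        pos += shift.get(text[pos + m - 1], m)
    return hits


# Hypothetical pseudo-codes: one small integer per segmented character image.
page_codes = [7, 3, 9, 4, 1, 3, 9, 4, 8, 2]
query_codes = [3, 9, 4]
print(horspool_search(page_codes, query_codes))
```

Because the shift grows with the pattern length, longer queries tend to skip more of the text per step, which is consistent with the abstract's remark that the cost reduction is greatest for longer query strings.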