• DocumentCode
    311091
  • Title
    Clustering OCR-ed texts for browsing document image database
  • Author
    Tsuda, Koji ; Senda, Shuji ; Minoh, Michihiko ; Ikeda, Katsuo
  • Author_Institution
    Dept. of Inf. Sci., Kyoto Univ., Japan
  • Volume
    1
  • fYear
    1995
  • fDate
    14-16 Aug 1995
  • Firstpage
    171
  • Abstract
    Document clustering is a powerful tool for browsing a document database. Similar documents are gathered into several clusters and a representative document of each cluster is shown to users. To let users infer the content of the database from several representatives, the documents must be separated into tight clusters, in which documents are connected with high similarities. At the same time, clustering must be fast for user interaction. We propose an O(n^2) time, O(n) space cluster extraction method. It is faster than the ordinal clustering methods, and its clusters compare favorably with those produced by Complete Link for tightness. When we deal with OCR-ed documents, term loss caused by recognition faults can change similarities between documents. We also examined the effect of recognition faults on the performance of document clustering.
  • Keywords
    document image processing; feature extraction; human factors; interactive systems; optical character recognition; visual databases; word processing; Complete Link; OCR text clustering; cluster extraction method; document clustering; document image database browsing; ordinal clustering methods; recognition faults; term loss; user interaction; Clustering algorithms; Clustering methods; Frequency; Image databases; Information retrieval; Information science; Merging; Object detection; Optical character recognition software
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition, 1995., Proceedings of the Third International Conference on
  • Conference_Location
    Montreal, Que.
  • Print_ISBN
    0-8186-7128-9
  • Type
    conf
  • DOI
    10.1109/ICDAR.1995.598969
  • Filename
    598969