Document image summarization without OCR

Author

Bloomberg, Dan S. ; Chen, Francine R.

Author_Institution

Xerox Palo Alto Res. Center, CA, USA

Volume

fYear

1996

fDate

16-19 Sep 1996

Firstpage

229

Abstract

A system for selecting excerpts directly from imaged text without performing optical character recognition is described. The images are segmented to find text regions, text lines and words, and sentence and paragraph boundaries are identified. A set of word equivalence classes is computed based on the rank blur hit-miss transform. This information is used to identify stop words and keywords. Sentences for presentation as part of a summary are then selected based on keywords and on the location of the sentences.
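
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the image-processing stages have already run, so each word is represented only by the integer ID of its equivalence class. The function name `summarize`, its parameters, the frequency-based choice of stop words and keywords, and the bonus for a paragraph's first sentence are all assumptions made for illustration.

```python
from collections import Counter


def summarize(paragraphs, n_stop=2, n_keywords=2, n_sentences=2, lead_bonus=1.0):
    """Pick summary sentences from a document of word-class IDs.

    paragraphs: list of paragraphs, each a list of sentences,
    each sentence a list of integer word-equivalence-class IDs.
    Returns (stop_classes, keyword_classes, selected), where selected
    is a list of (paragraph_index, sentence_index) in document order.
    """
    # Count how often each word equivalence class occurs in the document.
    counts = Counter(w for para in paragraphs for sent in para for w in sent)

    # Rank classes by descending frequency, breaking ties by class ID.
    ranked = [cls for cls, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]

    # Assumption: the most frequent classes behave like stop words,
    # and the next most frequent classes serve as keywords.
    stop_classes = set(ranked[:n_stop])
    keyword_classes = set(ranked[n_stop:n_stop + n_keywords])

    scored = []
    for p_idx, para in enumerate(paragraphs):
        for s_idx, sent in enumerate(para):
            # Keyword evidence: number of keyword occurrences in the sentence.
            score = sum(1 for w in sent if w in keyword_classes)
            # Location evidence (assumed form): favor a paragraph's first sentence.
            if s_idx == 0:
                score += lead_bonus
            scored.append((score, p_idx, s_idx))

    # Keep the highest-scoring sentences, then restore document order.
    best = sorted(scored, key=lambda t: (-t[0], t[1], t[2]))[:n_sentences]
    selected = sorted((p, s) for _, p, s in best)
    return stop_classes, keyword_classes, selected


if __name__ == "__main__":
    # Toy document: two paragraphs of sentences made of word-class IDs.
    doc = [
        [[0, 1, 2, 0, 3], [0, 4, 1]],
        [[1, 5, 0], [0, 2, 2, 1, 3], [6, 0, 1]],
    ]
    print(summarize(doc))
```

Because the sketch works on class IDs rather than characters, it never needs to know which word a class stands for, which is the property that lets this kind of summarization skip OCR.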

Keywords

document image processing; image segmentation; transforms; document image summarization; imaged text; keywords; paragraph boundaries; rank blur hit-miss transform; sentence; stop word identification; text lines; text regions; word equivalence classes; words; Character generation; Character recognition; Data mining; Graphics; Image analysis; Image processing; Natural languages; Optical character recognition software; Shape

fLanguage

English

Publisher

IEEE

Conference_Title

Image Processing, 1996. Proceedings., International Conference on

Conference_Location

Lausanne

Print_ISBN

0-7803-3259-8

Type

conf

DOI

10.1109/ICIP.1996.560744

Filename

560744

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=2639464