• DocumentCode
    2060789
  • Title
    Extraction of indicative summary sentences from imaged documents
  • Author
    Chen, Francine R. ; Bloomberg, Dan S.
  • Author_Institution
    Xerox Palo Alto Res. Center, CA, USA
  • Volume
    1
  • fYear
    1997
  • fDate
    18-20 Aug 1997
  • Firstpage
    227
  • Abstract
    A system for selecting sentences from an imaged document for presentation as part of a document summary is presented. The extracts are identified without the use of optical character recognition. The sentences are selected based on a set of discrete features characterizing the words within a sentence and the location of the sentence within the imaged document. Each sentence is scored based on the values of the discrete features using a statistically based classifier. The imaged document is processed to identify the word locations, the reading order of words, and the location of sentence and paragraph boundaries in the text. The words are grouped into equivalence classes to mimic the terms in a text document. A sample extract for a technical document is shown, and evaluation against a set of abstracts created by a professional abstracting company is given. These results are compared with text-based abstracts.
  • Keywords
    abstracting; document image processing; equivalence classes; discrete features; document summary; imaged documents; indicative summary sentence extraction; paragraph boundaries; reading order; sentence location; sentence scoring; sentence selection; statistically based classifier; technical document; text document terms; text-based abstracts; word location identification; Abstracts; Aging; Character recognition; Data mining; Facsimile; Natural languages; Optical character recognition software; Optical sensors
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Proceedings of the Fourth International Conference on Document Analysis and Recognition, 1997
  • Conference_Location
    Ulm
  • Print_ISBN
    0-8186-7898-4
  • Type
    conf
  • DOI
    10.1109/ICDAR.1997.619846
  • Filename
    619846
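
The abstract says each sentence is scored from discrete feature values by a statistically based classifier, but this record does not say which classifier. As a minimal sketch only, the following assumes a naive-Bayes-style log-odds scorer with add-k smoothing; the feature names (`position`, `has_frequent_word`) and the functions `train` and `score` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def train(samples, smoothing=1.0):
    """Count feature values per class.

    samples: list of (features, in_summary) pairs, where features maps a
    hypothetical feature name to a discrete value and in_summary is a bool.
    """
    class_counts = Counter()
    value_counts = {True: Counter(), False: Counter()}
    values_seen = {}
    for features, label in samples:
        class_counts[label] += 1
        for name, value in features.items():
            value_counts[label][(name, value)] += 1
            values_seen.setdefault(name, set()).add(value)
    return class_counts, value_counts, values_seen, smoothing


def score(model, features):
    """Return log P(summary | features) - log P(not summary | features).

    Assumes features are conditionally independent given the class
    (the naive Bayes assumption); this is an illustrative choice.
    """
    class_counts, value_counts, values_seen, k = model
    total = sum(class_counts.values())
    log_odds = 0.0
    for label, sign in ((True, 1.0), (False, -1.0)):
        # Smoothed class prior.
        log_p = math.log((class_counts[label] + k) / (total + 2 * k))
        for name, value in features.items():
            n_values = len(values_seen.get(name, ())) or 1
            numerator = value_counts[label][(name, value)] + k
            denominator = class_counts[label] + k * n_values
            log_p += math.log(numerator / denominator)
        log_odds += sign * log_p
    return log_odds


# Toy training data (invented for illustration, not from the paper).
samples = [
    ({"position": "first", "has_frequent_word": "yes"}, True),
    ({"position": "first", "has_frequent_word": "yes"}, True),
    ({"position": "middle", "has_frequent_word": "no"}, False),
    ({"position": "middle", "has_frequent_word": "no"}, False),
    ({"position": "middle", "has_frequent_word": "yes"}, False),
]
model = train(samples)
lead = score(model, {"position": "first", "has_frequent_word": "yes"})
body = score(model, {"position": "middle", "has_frequent_word": "no"})
```

Sentences would then be ranked by score and the top few presented as the extract. In the imaged-document setting described above, the feature values would come from word-image equivalence classes and layout positions rather than from recognized text.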