• DocumentCode
    3569890
  • Title

    Extracting anchorable information units from PDF files

  • Author

    Chakraborty, A.; Liu, P.; Hsu, L.

  • Author_Institution
    Siemens Corp. Res. Inc., Princeton, NJ, USA
  • Volume
    1
  • fYear
    2003
  • Abstract
    Document processing and understanding are important for a variety of applications, such as office automation, creation of electronic manuals, online documentation, and annotation. The first step in this process often involves extracting relevant keywords and phrases from the documents so that they can be automatically hyperlinked within and outside the document to create an electronic document. This paper describes a novel method for extracting anchorable information units (AIUs), also known as hotspots, from any type of Portable Document Format (PDF) file, whether created with an editor or by scanning in documents. The AIUs are used to make these documents more intelligent for content cross-referencing to and from related multimedia documents within an electronic document publishing environment. Domain-specific knowledge about the documents is used to aid the extraction process. Once the location and extent of the text are found, the content is extracted using optical character recognition (OCR) software if necessary. For object extraction for highlighting, the images are first extracted and then a variety of image processing algorithms are applied.
  • Keywords
    document image processing; feature extraction; multimedia systems; optical character recognition; OCR software; PDF files; anchorable information units extraction; document processing; domain specific knowledge; electronic document; electronic document publishing environment; image extraction; image processing algorithms; multimedia documents; object extraction; portable document format; Automation; Character recognition; Data mining; Documentation; Focusing; Image processing; Optical character recognition software; Publishing; Web sites; World Wide Web
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Multimedia and Expo, 2003. ICME '03. Proceedings. 2003 International Conference on
  • Print_ISBN
    0-7803-7965-9
  • Type

    conf

  • DOI
    10.1109/ICME.2003.1220882
  • Filename
    1220882