DocumentCode
3569890
Title
Extracting anchorable information units from PDF files
Author
Chakraborty, A. ; Liu, P. ; Hsu, L.
Author_Institution
Siemens Corp. Res. Inc., Princeton, NJ, USA
Volume
1
fYear
2003
Abstract
Document processing and understanding are important for a variety of applications such as office automation, creation of electronic manuals, online documentation, and annotation. The first step in this process often involves extracting relevant keywords and phrases from the documents so that they can be automatically hyperlinked within and outside the document, producing an electronic document. This paper describes a novel method for extracting anchorable information units (AIUs), also known as hotspots, from any type of portable document format (PDF) file, whether created using an editor or by scanning in documents. The AIUs are used to make these documents more intelligent for content cross-referencing to and from related multimedia documents within an electronic document publishing environment. Domain-specific knowledge about the documents is used to aid the extraction process. Once the location and extent of the text are found, the content is extracted using optical character recognition (OCR) software if necessary. For object extraction for highlighting, the images are first extracted and then a variety of image processing algorithms are applied.
Keywords
document image processing; feature extraction; multimedia systems; optical character recognition; OCR software; PDF files; anchorable information units extraction; document processing; domain specific knowledge; electronic document; electronic document publishing environment; image extraction; image processing algorithms; multimedia documents; object extraction; optical character recognition; portable document format; Automation; Character recognition; Data mining; Documentation; Focusing; Image processing; Optical character recognition software; Publishing; Web sites; World Wide Web;
fLanguage
English
Publisher
IEEE
Conference_Titel
Multimedia and Expo, 2003. ICME '03. Proceedings. 2003 International Conference on
Print_ISBN
0-7803-7965-9
Type
conf
DOI
10.1109/ICME.2003.1220882
Filename
1220882