DocumentCode
1942729
Title
Intelligent Text Extraction from PDF Documents
Author
Hassan, Tamir ; Baumgartner, Robert
Author_Institution
Database & Artificial Intelligence Group, Vienna Univ. of Technol., Wien
Volume
2
fYear
2005
fDate
28-30 Nov. 2005
Firstpage
2
Lastpage
6
Abstract
In recent years, PDF has become the de facto standard for the exchange of print-oriented documents on the Web. This includes many business documents such as financial reports, newsletters and patent applications, and there are many commercial applications that require data to be extracted from these documents and processed by computer systems. A number of products currently exist on the market that navigate, extract and transform data from HTML pages, a process known as wrapping. One such methodology is Lixto, a product of research at our institute. However, none of these products are currently able to work with PDF files. We are investigating this possibility as part of the NEXTWRAP project. This paper describes our work in progress, and details some of the low-level page segmentation techniques that we have investigated.
Keywords
Internet; XML; information retrieval; text analysis; HTML pages; NEXTWRAP project; PDF document; Web; intelligent text extraction; print-oriented document extraction; Application software; Artificial intelligence; Business; Data mining; Databases; HTML; Information systems; Packaging; Web pages; Wrapping;
fLanguage
English
Publisher
IEEE
Conference_Titel
Computational Intelligence for Modelling, Control and Automation, 2005 and International Conference on Intelligent Agents, Web Technologies and Internet Commerce, International Conference on
Conference_Location
Vienna
Print_ISBN
0-7695-2504-0
Type
conf
DOI
10.1109/CIMCA.2005.1631436
Filename
1631436