Regional Information Center for Science and Technology - Identification and extraction of different objects and its location from a Pdf file using efficient information retrieval tools

DocumentCode :

3668002

Title :

Identification and extraction of different objects and its location from a Pdf file using efficient information retrieval tools

Author :

M. Hanumanthappa;Deepa T. Nagalavi

Author_Institution :

Dept. of Computer Science and Applications, Bangalore University, 56, INDIA

Year :

2015

Firstpage :

Lastpage :

Abstract :

Over the past few years, readers' interest in e-newspapers has grown significantly. E-newspapers play an important role by providing useful, informative, and timely information to readers. Retrieving relevant information from different newspapers is a difficult task, because the layout of a newspaper is not geometrically simple. Using the e-newspaper PDF format to implement an Information Retrieval System for Online Newspapers therefore requires a robust information extraction system. The proposed system converts e-newspapers from PDF format to text format. PDF is a platform-independent file format: a file can be viewed on any information processing system that has a PDF viewer. In this work, the system analyses the tree structure of the PDF file. It first locates the trailer (the root node of the file) and reads the cross-reference information it holds, such as the root object and the location and size of the cross-reference table. From the cross-reference table, the system extracts information about all the objects stored in the PDF document. This work mainly focuses on identifying each object and extracting the location of the contents of PDF documents.
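
The trailer-first traversal described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `build_sample_pdf` and `read_xref` are hypothetical, the sample file is generated in memory, and the parser assumes a classic uncompressed cross-reference table with a single `startxref` marker (no cross-reference streams and no incremental updates).

```python
import re


def build_sample_pdf() -> bytes:
    """Assemble a tiny in-memory PDF with a classic cross-reference table."""
    bodies = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [] /Count 0 >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(bodies, start=1):
        offsets.append(len(out))  # byte offset where this object starts
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(bodies) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is always the free-list head
    for off in offsets:
        out += b"%010d 00000 n \n" % off  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(bodies) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


def read_xref(data: bytes) -> dict:
    """Locate the trailer, then read the cross-reference table it points to."""
    # Step 1: the startxref marker near the end of the file gives the table offset.
    match = re.search(rb"startxref\s+(\d+)\s+%%EOF", data[-1024:])
    if match is None:
        raise ValueError("startxref marker not found")
    xref_pos = int(match.group(1))
    if not data.startswith(b"xref", xref_pos):
        raise ValueError("only classic xref tables are handled in this sketch")

    # Step 2: the trailer dictionary names the root object and the table size.
    trailer_pos = data.index(b"trailer", xref_pos)
    trailer = data[trailer_pos:]
    size = int(re.search(rb"/Size\s+(\d+)", trailer).group(1))
    root = int(re.search(rb"/Root\s+(\d+)\s+\d+\s+R", trailer).group(1))

    # Step 3: walk the table to map object numbers to byte offsets.
    objects = {}
    next_num = 0
    for line in data[xref_pos + 4:trailer_pos].splitlines():
        parts = line.split()
        if len(parts) == 2:  # subsection header: first object number, count
            next_num = int(parts[0])
        elif len(parts) == 3:  # entry: offset, generation, n (in use) or f (free)
            if parts[2] == b"n":
                objects[next_num] = int(parts[0])
            next_num += 1
    return {"xref_offset": xref_pos, "size": size, "root": root, "objects": objects}


pdf = build_sample_pdf()
info = read_xref(pdf)
for num, offset in sorted(info["objects"].items()):
    print(num, offset)
```

Each offset in `info["objects"]` points at the `N 0 obj` header of the corresponding object, which is the location information the abstract refers to. A real e-newspaper PDF would additionally need handling for compressed object streams and chained `/Prev` tables.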

Keywords :

"Portable document format","Streaming media","Object recognition","Data mining","Layout","Information retrieval","Libraries"

Publisher :

IEEE

Conference_Title :

Soft-Computing and Networks Security (ICSNS), 2015 International Conference on

Print_ISBN :

978-1-4799-1752-5

Type :

conf

DOI :

10.1109/ICSNS.2015.7292375

Filename :

7292375

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3668002