Title :
Identification and extraction of different objects and its location from a Pdf file using efficient information retrieval tools
Author :
M. Hanumanthappa;Deepa T. Nagalavi
Author_Institution :
Dept. of Computer Science and Applications, Bangalore University, 56, INDIA
Abstract :
Over the past few years, reader interest in e-newspapers has grown significantly. E-newspapers play an important role by providing useful, informative and timely information to readers. Retrieving relevant information from different newspapers is difficult because a newspaper's layout is not geometrically simple. Using e-newspapers in PDF format in an information retrieval system for online newspapers therefore requires a robust information extraction system. The proposed system converts e-newspapers from PDF format to text format. PDF is a platform-independent file format: a file can be viewed on any information processing system that has a PDF viewer. In this work, the system analyses the tree structure of a PDF file. It first locates the trailer (the root node of the file) and reads the information it holds about the cross-reference table, such as the root object and the location and size of the cross-reference table. From the cross-reference table, the system then extracts information about all the objects stored in the PDF document. This work focuses mainly on identifying each object and extracting the location of the contents of PDF documents.
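The traversal described in the abstract (trailer first, then the cross-reference table, then each object) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes a classic, uncompressed cross-reference table with a single section, and it does not handle cross-reference streams or incremental updates. The names `build_minimal_pdf` and `read_xref` are hypothetical helpers introduced only for this illustration, and the tiny in-memory PDF stands in for a real e-newspaper file.

```python
import re


def build_minimal_pdf() -> bytes:
    """Assemble a tiny PDF with a classic xref table (illustrative input only)."""
    bodies = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [] /Count 0 >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(bodies, start=1):
        offsets.append(len(out))  # byte offset where this object starts
        out += b"%d 0 obj\n%s\nendobj\n" % (num, body)
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(bodies) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is always the free-list head
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(bodies) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


def read_xref(data: bytes) -> dict:
    """Locate the trailer, read the xref table, and describe each in-use object."""
    # 1. 'startxref' near the end of the file gives the table's byte offset.
    found = re.findall(rb"startxref\s+(\d+)", data[-1024:])
    if not found:
        raise ValueError("startxref keyword not found")
    xref_pos = int(found[-1])
    if not data.startswith(b"xref", xref_pos):
        raise ValueError("no classic xref table at the startxref offset")

    # 2. The trailer dictionary names the root object and the table size.
    trailer_pos = data.index(b"trailer", xref_pos)
    trailer = data[trailer_pos:]
    size = int(re.search(rb"/Size\s+(\d+)", trailer).group(1))
    root = int(re.search(rb"/Root\s+(\d+)\s+\d+\s+R", trailer).group(1))

    # 3. Walk the table: a 'first count' header, then 'offset generation flag' rows.
    tokens = data[xref_pos + 4:trailer_pos].split()
    objects = {}
    i = 0
    while i < len(tokens):
        first, count = int(tokens[i]), int(tokens[i + 1])
        i += 2
        for k in range(count):
            off, gen, flag = tokens[i:i + 3]
            i += 3
            if flag != b"n":  # skip free entries
                continue
            # 4. Identify the object by the /Type name in its body, if it has one.
            end = data.index(b"endobj", int(off))
            kind = re.search(rb"/Type\s*/(\w+)", data[int(off):end])
            objects[first + k] = {
                "offset": int(off),
                "generation": int(gen),
                "type": kind.group(1).decode() if kind else None,
            }
    return {"xref_offset": xref_pos, "size": size, "root": root, "objects": objects}


if __name__ == "__main__":
    info = read_xref(build_minimal_pdf())
    print("root object:", info["root"], "| table size:", info["size"])
    for num, obj in info["objects"].items():
        print(num, obj)
```

Each entry in the returned `objects` mapping pairs an object number with its byte offset and, where present, its `/Type` name, which corresponds to the object identification and location extraction the abstract describes. A production extractor would also need to follow `/Prev` links and decode compressed object streams, which this sketch leaves out.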
Keywords :
"Portable document format","Streaming media","Object recognition","Data mining","Layout","Information retrieval","Libraries"
Conference_Title :
Soft-Computing and Networks Security (ICSNS), 2015 International Conference on
Print_ISBN :
978-1-4799-1752-5
DOI :
10.1109/ICSNS.2015.7292375