Title :
Image Extraction from Online Text Streams: A Straightforward Template Independent Approach without Training
Author :
Adam, George ; Bouras, Christos ; Poulopoulos, Vassilis
Author_Institution :
Comput. Eng. & Inf. Dept., Univ. of Patras, Patras, Greece
Abstract :
In this paper we present an efficient system that processes HTML pages in order to extract the useful images from them. The proposed mechanism is template independent and focuses on HTML pages that contain news articles from major portals and blogs. We define useful images as the pictures that are relevant to the news report. To extract the image objects of an article, we decompose the HTML page into its DOM model and apply a set of algorithms that clean and correct the HTML code, locate and characterize each node of the DOM model, and finally keep the nodes characterized as useful. The proposed mechanism is applied as a subsystem of peRSSonal, a web tool that obtains news articles from all over the world, processes them, and presents them back to end users in a personalized manner. The role of the mechanism is to feed peRSSonal's database with digital images for browsing and searching purposes. We present the basic algorithms and experimental results on the efficiency of the proposed implementation.
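The pipeline outlined in the abstract (parse the page, characterize each image node, keep the useful ones) can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not give the characterization rules, so the size threshold MIN_SIDE, the excluded container tags, and the names ImageCollector, is_useful and extract_useful_images are all assumptions made for illustration. Python's standard html.parser also does not build a DOM tree, so a stack of open tags stands in for each node's ancestor chain.

```python
from html.parser import HTMLParser

# Assumed threshold: the abstract does not state one.
MIN_SIDE = 100
# Assumed page regions that rarely hold article pictures.
EXCLUDED_ANCESTORS = {"nav", "header", "footer", "aside"}


class ImageCollector(HTMLParser):
    """Collect <img> nodes together with the chain of enclosing tags."""

    # Void elements have no closing tag, so they never go on the stack.
    VOID = {"img", "br", "hr", "meta", "link", "input", "area", "base",
            "col", "embed", "source", "track", "wbr"}

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags, outermost first
        self.images = []   # (attribute dict, tuple of ancestor tags)

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images.append((dict(attrs), tuple(self.stack)))
        elif tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Tolerate unclosed tags: pop back to the most recent matching opener.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def _to_int(value):
    """Return value as an int, or None if it is missing or not a plain number."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def is_useful(attrs, ancestors):
    """Heuristic characterization of one image node (illustrative rules only)."""
    if not attrs.get("src"):
        return False
    if any(tag in EXCLUDED_ANCESTORS for tag in ancestors):
        return False
    for key in ("width", "height"):
        size = _to_int(attrs.get(key))
        if size is not None and size < MIN_SIDE:
            return False
    return True


def extract_useful_images(html):
    """Return the src of every image node that passes the heuristic."""
    parser = ImageCollector()
    parser.feed(html)
    parser.close()
    return [attrs["src"] for attrs, anc in parser.images if is_useful(attrs, anc)]
```

Because the rules rely only on tag structure and declared image attributes, the sketch needs neither a per-site template nor training data, which is the property the abstract emphasizes; a real system would presumably use richer signals than these.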
Keywords :
feature extraction; image retrieval; multimedia computing; object detection; text analysis; DOM model; HTML page; image object extraction; online text stream; peRSSonal Web tool; template independent approach; Blogs; Computer networks; Content based retrieval; Data mining; HTML; Informatics; Information retrieval; Portals; Streaming media; Web pages; image annotation; multimedia extraction; web information extraction; web mining
Conference_Title :
Advanced Information Networking and Applications Workshops (WAINA), 2010 IEEE 24th International Conference on
Conference_Location :
Perth, WA
Print_ISBN :
978-1-4244-6701-3
DOI :
10.1109/WAINA.2010.131