Title :
Harvesting Information from Heterogeneous Sources
Author :
Qureshi, Pir Abdul Rasool ; Memon, Nasrullah ; Wiil, Uffe Kock ; Karampelas, Panagiotis ; Sanchez, Jose Ignacio Nieto
Author_Institution :
Maersk McKinney Moller Inst., Univ. of Southern Denmark, Odense, Denmark
Abstract :
The abundance of information on almost any topic makes the Internet a valuable resource. Although searching the Internet is easy, automating information extraction from online content remains difficult because the content lacks structure and is shared in diverse ways. Most of the time, information is stored in different proprietary formats that comply with different standards and protocols, which makes tasks such as data mining and information harvesting very difficult. This paper presents an information harvesting tool (hetero Harvest) that aims to address these problems by filtering the useful information and then normalizing it into a single non-hypertext format. We also discuss state-of-the-art tools and their shortcomings, and present the results of an analysis carried out over different heterogeneous formats, along with the performance of our tool on each format. Finally, the potential applications of the proposed tool are discussed, with special emphasis on open source intelligence.
Keywords :
Internet; data mining; information filtering; hetero Harvest; heterogeneous sources; information extraction; information harvesting tool; sharing methods; Error analysis; HTML; Indexing; Portable document format; Search engines; heterogeneous information sources; information harvesting; web crawler
Conference_Title :
Intelligence and Security Informatics Conference (EISIC), 2011 European
Conference_Location :
Athens
Print_ISBN :
978-1-4577-1464-1
Electronic_ISBN :
978-0-7695-4406-9
DOI :
10.1109/EISIC.2011.76