Title :
heteroHarvest: Harvesting information from heterogeneous sources
Author :
Qureshi, Pir Abdul Rasool ; Memon, Nasir ; Wiil, Uffe Kock ; Karampelas, Panagiotis ; Sancheze, Jose Ignacio Nieto
Author_Institution :
Maersk Mc-Kinney Moller Inst., Univ. of Southern Denmark, Odense, Denmark
Abstract :
The abundance of information on almost any topic makes the Internet a valuable resource. Although searching the Internet is easy, automating information extraction from online sources remains difficult because the content lacks structure and is shared through diverse methods. Information is often stored in different proprietary formats that comply with different standards and protocols, which makes tasks such as data mining and information harvesting very difficult. This paper presents an information harvesting tool, heteroHarvest, that aims to address these problems by filtering the useful information and then normalizing it into a single non-hypertext format. Finally, we describe the results of an experimental evaluation. The results are promising, with an overall error rate of 6.5% across heterogeneous formats.
Keywords :
Internet; data mining; information filtering; protocols; heteroHarvest; heterogeneous source; information extraction; information harvesting; online information; proprietary format; protocol; singular nonhypertext format; heterogeneous information sources; web crawler
Conference_Title :
Intelligence and Security Informatics (ISI), 2011 IEEE International Conference on
Conference_Location :
Beijing
Print_ISBN :
978-1-4577-0082-8
DOI :
10.1109/ISI.2011.5984780