DocumentCode :
1638110
Title :
Filtering the open-source information
Author :
Rasool, Pir Abdul ; Memon, Nasrullah ; Wiil, Uffe Kock ; Karampelas, Panagiotis
Author_Institution :
Maersk Mc-Kinney Moller Inst., Univ. of Southern Denmark, Odense, Denmark
fYear :
2010
Firstpage :
217
Lastpage :
220
Abstract :
The abundance of information on most domains makes the Internet the best resource. Despite its usefulness, however, it is difficult to automate information extraction because online information lacks structure. The most commonly used information sharing protocol, the Hyper Text Transfer Protocol (HTTP), makes it possible to embed a lot of noise (such as advertisements, images, headers, and menus) in a document containing the useful information. Filtering this noise prior to information extraction is therefore necessary. Such filtering has many applications, including cell phone and Personal Digital Assistant (PDA) browsing, speech rendering for visually impaired or blind people, open source intelligence, and many others. In this paper, we describe a statistical model to filter such noise from a document containing useful information. Our model is based on strategies that analyse the text distribution and link densities across all nodes of the Document Object Model (DOM) tree of an HTML page in order to detect the useful nodes among them. We demonstrate the validity of the model with an experiment conducted during the implementation of an Early Warning System to facilitate open source intelligence. We also present the general workflow for converting unstructured online text about terrorists into an investigable data structure for social network analysis and discuss how our model fits into it.
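The abstract names two signals, text distribution and link density over DOM nodes, but this record does not give the model itself. The following is only a minimal sketch of that general idea, not the authors' method: the class and function names, the character-count measure, and the thresholds `min_chars` and `max_link_density` are all assumptions made for illustration. It uses Python's standard `html.parser` to build a small tree and keeps the deepest nodes that hold enough text with a low share of link text.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, and elements whose text is not content.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}


class Node:
    """One element of a simplified DOM tree (hypothetical helper)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text_chars = 0  # non-whitespace characters under this node
        self.link_chars = 0  # the part of text_chars that sits inside <a>

    @property
    def link_density(self):
        return self.link_chars / self.text_chars if self.text_chars else 0.0


class DensityParser(HTMLParser):
    """Builds the tree and accumulates text and link character counts."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Close up to the nearest matching open element; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        n = len("".join(data.split()))  # count non-whitespace characters
        if n == 0:
            return
        chain = []
        node = self.current
        while node is not None:
            chain.append(node)
            node = node.parent
        tags = {x.tag for x in chain}
        if tags & SKIP_TAGS:
            return
        in_link = "a" in tags
        for x in chain:  # credit the text to the node and all its ancestors
            x.text_chars += n
            if in_link:
                x.link_chars += n


def _collect(node, min_chars, max_link_density):
    # Prefer qualifying descendants; fall back to the node itself.
    found = []
    for child in node.children:
        found.extend(_collect(child, min_chars, max_link_density))
    if found:
        return found
    if node.text_chars >= min_chars and node.link_density <= max_link_density:
        return [node]
    return []


def useful_nodes(html, min_chars=40, max_link_density=0.3):
    """Return the deepest nodes that look like content (assumed thresholds)."""
    parser = DensityParser()
    parser.feed(html)
    parser.close()
    return _collect(parser.root, min_chars, max_link_density)
```

Calling `useful_nodes(page_source)` returns `Node` objects whose `tag`, `text_chars`, and `link_density` can be inspected. A navigation menu made up almost entirely of links scores a link density near 1 and is dropped, while a paragraph of running text is kept. The thresholds would need tuning on real pages.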
Keywords :
Internet; hypermedia; information filtering; public domain software; statistical analysis; text analysis; transport protocols; tree data structures; HTML; Internet; cell phone; document object model tree; hyper text transfer protocol; information extraction; information sharing protocol; open source information filtering; open source intelligence; personal digital assistant; speech rendering; statistical model; text distribution analysis; visually impaired; Analytical models; Data mining; HTML; Internet; Layout; Noise; Terrorism; Document Object Model; Information Filtering; Open Source Intelligence; Social Network Analysis; Terrorist Information;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Software Engineering and Service Sciences (ICSESS), 2010 IEEE International Conference on
Conference_Location :
Beijing
Print_ISBN :
978-1-4244-6054-0
Type :
conf
DOI :
10.1109/ICSESS.2010.5552408
Filename :
5552408