DocumentCode :
2051662
Title :
Statistical Model for Content Extraction
Author :
Qureshi, Pir Abdul Rasool ; Memon, Nasrullah
Author_Institution :
Maersk McKinney Moller Inst., Univ. of Southern Denmark, Odense, Denmark
fYear :
2011
fDate :
12-14 Sept. 2011
Firstpage :
129
Lastpage :
134
Abstract :
We present a statistical model for content extraction from HTML documents. The model operates on the Document Object Model (DOM) tree of the corresponding HTML document. It evaluates each tree node and its associated statistical features to predict the significance of the node to the overall content of the document. The model exploits a feature set that includes link densities and text distribution across the nodes of the DOM tree. We validate the model with experiments conducted on standard data sets. The results show that the proposed model outperforms other state-of-the-art models. We also describe the significance of the model in the domain of counterterrorism and open source intelligence.
Keywords :
content management; hypermedia markup languages; statistical analysis; text analysis; trees (mathematics); DOM tree; HTML documents; content extraction; counterterrorism; document content; document object model tree; link density; open source intelligence; standard data sets; statistical features; statistical model; text distribution; tree node; Data mining; Encyclopedias; Feature extraction; HTML; Internet; Web pages; information filtering
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Intelligence and Security Informatics Conference (EISIC), 2011 European
Conference_Location :
Athens
Print_ISBN :
978-1-4577-1464-1
Electronic_ISBN :
978-0-7695-4406-9
Type :
conf
DOI :
10.1109/EISIC.2011.75
Filename :
6061115