DocumentCode :
3079607
Title :
Generalized and lightweight algorithms for automated web forum content extraction
Author :
Lim, Wee-Yong ; Raja, V. ; Thing, Vrizlynn L. L.
Author_Institution :
Cybercrime & Security Intell. Dept., Inst. for Infocomm Res., Singapore, Singapore
fYear :
2013
fDate :
26-28 Dec. 2013
Firstpage :
1
Lastpage :
8
Abstract :
Online forums contain a vast amount of information that can aid in the early detection of fraud and extremist activities, so accurate and efficient information extraction from forum sites is important. In this paper, we discuss the limitations of existing work on extracting information from generic web sites and forum sites, and identify the need for better-suited, generalized and lightweight algorithms that extract information more accurately and efficiently while eliminating noisy data from forum sites. We propose three generalized and lightweight algorithms for accurate thread and post content extraction from web forums. We evaluate our algorithms against two strict criteria, at the granularity of individual DOM tree nodes: a thread or post counts as successfully extracted only if (i) all the contents of its text and anchor nodes are extracted correctly, and (ii) each content node is grouped correctly under its respective thread or post. Our experiments on ten different forum sites show that our proposed thread extraction algorithm achieves an average recall and precision of 100% and 98.66%, respectively, while our core post extraction algorithm achieves an average recall and precision of 99.74% and 99.79%, respectively.
Keywords :
Web sites; content management; content-based retrieval; fraud; DOM tree node level correctness; anchor nodes; automated Web forum content extraction; forum sites; generalized algorithms; generic Web sites; information extraction; information retrieval; lightweight algorithms; noisy data elimination; online forums; post content extraction; text nodes; thread extraction; Containers; Context; Data mining; Feature extraction; Message systems; Web pages; content extraction; web intelligence
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Computational Intelligence and Computing Research (ICCIC), 2013 IEEE International Conference on
Conference_Location :
Enathi
Print_ISBN :
978-1-4799-1594-1
Type :
conf
DOI :
10.1109/ICCIC.2013.6724259
Filename :
6724259