Title :
A Lightweight Algorithm for Automated Forum Information Processing
Author :
Lim, Wee Yong ; Sachan, Abhishek ; Thing, Vrizlynn L. L.
Author_Institution :
Cybercrime & Security Intell. (CSI) Dept., Inst. for Infocomm Res., Singapore, Singapore
Abstract :
The vast variety of information on Web forums makes them a valuable resource for purposes such as scam detection, national security protection, and sentiment analysis. However, it is challenging to extract useful information from Web forums accurately and efficiently. First, Web forums contain several page types, and each type presents content in a different format. Second, the content on forum pages is organized into data blocks, and for the information to be meaningful, the relevant data blocks must be extracted separately. The main problem with generic content extraction systems is that they can neither distinguish among the various page types nor extract information at the required granularity. Although several content extraction methods exist for Web forums, these methods either do not satisfy the above requirements or rely on heuristic-based approaches (such as assumptions about standard visual appearance), which limits their applicability to different varieties of forums. In this paper, we propose a general and efficient content extraction method that uses the properties of links present in forum pages. Experimental results demonstrate the effectiveness of the proposed method.
Keywords :
Web sites; information retrieval; Web forum pages; automated forum information processing; data block extraction; generic content extraction systems; information extraction; lightweight algorithm; link properties; Data mining; Feature extraction; HTML; Training; Visualization; Web pages; DOM tree; content extraction; forum; web;
Conference_Title :
Web Intelligence (WI) and Intelligent Agent Technologies (IAT), 2013 IEEE/WIC/ACM International Joint Conferences on
Conference_Location :
Atlanta, GA
Print_ISBN :
978-1-4799-2902-3
DOI :
10.1109/WI-IAT.2013.18