• DocumentCode
    658340
  • Title

    A Lightweight Algorithm for Automated Forum Information Processing

  • Author

    Lim, Wee Yong ; Sachan, Abhishek ; Thing, Vrizlynn L. L.

  • Author_Institution
    Cybercrime & Security Intell. (CSI) Dept., Inst. for Infocomm Res., Singapore, Singapore
  • Volume
    1
  • fYear
    2013
  • fDate
    17-20 Nov. 2013
  • Firstpage
    121
  • Lastpage
    126
  • Abstract
    The vast variety of information on Web forums makes them a valuable resource for purposes such as scam detection, national security protection and sentiment analysis. However, it is challenging to extract useful information from Web forums accurately and efficiently. First, Web forums contain several page types, and each type presents content in a different format. Second, the content on forum pages is stored in the form of data blocks. For the information to be meaningful, the relevant data blocks must be extracted separately. The main problem with generic content extraction systems is that they can neither distinguish among the various page types nor extract information at the required granularity. Although several content extraction methods exist for Web forums, these methods either do not satisfy the above requirements or use heuristics-based approaches (such as assumptions about standard visual appearance), resulting in limited applicability across different varieties of forum. In this paper, we propose a general and efficient content extraction method that uses the properties of links present in forum pages. The effectiveness of the proposed method is shown through our experimental results.
  • Keywords
    Web sites; information retrieval; Web forum pages; automated forum information processing; data block extraction; generic content extraction systems; information extraction; lightweight algorithm; link properties; Data mining; Feature extraction; HTML; Training; Visualization; Web pages; DOM tree; content extraction; forum; web;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Web Intelligence (WI) and Intelligent Agent Technologies (IAT), 2013 IEEE/WIC/ACM International Joint Conferences on
  • Conference_Location
    Atlanta, GA
  • Print_ISBN
    978-1-4799-2902-3
  • Type

    conf

  • DOI
    10.1109/WI-IAT.2013.18
  • Filename
    6690003