  • DocumentCode
    3209540
  • Title
    Automatic Data Extraction from Web Discussion Forums
  • Author
    Li, Suke ; Tang, Liyong ; Hu, Jianbin ; Chen, Zhong
  • Author_Institution
    Sch. of Electron. Eng. & Comput. Sci., Peking Univ., Beijing, China
  • fYear
    2009
  • fDate
    17-19 Dec. 2009
  • Firstpage
    219
  • Lastpage
    225
  • Abstract
    This paper presents an approach for automatically extracting information from Web discussion forums. HTML tag paths built from an HTML DOM tree are used to generate the post extraction template. Visual text features and HTML structure information from the same page are then combined to extract the author profile, posted date, and post content automatically. Experimental results show that the approach is effective.
  • Keywords
    Internet; data analysis; hypermedia markup languages; text analysis; HTML DOM tree; HTML structure information; HTML tag paths; automatic data extraction; information extraction; visual text features; web discussion forums; Computer science; Computer science education; Data engineering; Data mining; Discussion forums; Educational technology; HTML; Laboratories; Navigation; Web pages; Data Extraction; Data Mining; Web Forum Mining
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Frontier of Computer Science and Technology, 2009. FCST '09. Fourth International Conference on
  • Conference_Location
    Shanghai
  • Print_ISBN
    978-0-7695-3932-4
  • Electronic_ISBN
    978-1-4244-5467-9
  • Type
    conf
  • DOI
    10.1109/FCST.2009.20
  • Filename
    5392915
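
To illustrate the "HTML tag paths" mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes that a tag path is the sequence of tag names from the document root down to an element, and that paths repeating many times on a forum page are plausible candidates for post containers. The names `TagPathCollector` and `tag_path_counts` are hypothetical, and only the Python standard library is used.

```python
from collections import Counter
from html.parser import HTMLParser

# Void elements never receive a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Counts the root-to-element tag path of every start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tags, outermost first
        self.paths = Counter()   # tag path -> number of occurrences

    def handle_starttag(self, tag, attrs):
        self.paths["/".join(self.stack + [tag])] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; stray end tags are ignored.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_path_counts(html):
    """Return a Counter mapping each tag path in `html` to its frequency."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


if __name__ == "__main__":
    # Toy page, assumed for illustration only.
    sample = (
        "<html><body>"
        "<div><span>alice</span><p>First post</p></div>"
        "<div><span>bob</span><p>Second post</p></div>"
        "</body></html>"
    )
    for path, count in tag_path_counts(sample).most_common():
        print(count, path)
```

On a real forum page, the most frequent deep paths would only be a starting point; the abstract indicates that visual text features and structure information are also needed to separate author profile, date, and content.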