DocumentCode
3209540
Title
Automatic Data Extraction from Web Discussion Forums
Author
Li, Suke ; Tang, Liyong ; Hu, Jianbin ; Chen, Zhong
Author_Institution
Sch. of Electron. Eng. & Comput. Sci., Peking Univ., Beijing, China
fYear
2009
fDate
17-19 Dec. 2009
Firstpage
219
Lastpage
225
Abstract
This paper presents an approach for automatically extracting information from Web discussion forums. HTML tag paths built from an HTML DOM tree are used to generate the post extraction template. Visual text features and HTML structure information from the same page are combined to automatically extract the author profile, posted date, and post content. Experimental results show that our approach is effective.
Keywords
Internet; data analysis; hypermedia markup languages; text analysis; HTML DOM tree; HTML structure information; HTML tag paths; automatic data extraction; information extraction; visual text features; web discussion forums; Computer science; Computer science education; Data engineering; Data mining; Discussion forums; Educational technology; HTML; Laboratories; Navigation; Web pages; Data Extraction; Data Mining; Web Forum Mining;
fLanguage
English
Publisher
IEEE
Conference_Titel
Frontier of Computer Science and Technology, 2009. FCST '09. Fourth International Conference on
Conference_Location
Shanghai
Print_ISBN
978-0-7695-3932-4
Electronic_ISBN
978-1-4244-5467-9
Type
conf
DOI
10.1109/FCST.2009.20
Filename
5392915