Title :
Forum Data Extraction without Explicit Rules
Author :
Jingwei Zhang ; Cheqing Jin ; Yuming Lin ; Xueqing Gong
Author_Institution :
Inst. of Massive Comput., East China Normal Univ., Shanghai, China
Abstract :
Web forum data, contributed by millions of users, are a mixture of well-formed user information and free-format user-generated content. Though easy for users to read, forum data are difficult for computer systems to analyze because of the various surrounding HTML tags. Extracting forum data automatically from a large number of Web sites is challenging because these sites may have different styles. In this paper, we propose an approach that extracts user information and user-generated content from multiple forum sites by using both the structural and the textual characteristics of forums. A structural induction process and a term combination computation process are introduced to ensure extraction accuracy and automation. Extensive experiments on real-life data sets show the effectiveness of the proposed method.
Keywords :
Internet; data handling; HTML tag; Web forum data extraction; explicit rule; forum structural characteristics; forum textual characteristics; free-format user-generated content; structural induction process; term combination computation process; well-formed user information; Accuracy; Data mining; Feature extraction; HTML; Manuals; User-generated content; Web pages; forum data extraction; user-generated content;
Conference_Title :
Cloud and Green Computing (CGC), 2012 Second International Conference on
Conference_Location :
Xiangtan
Print_ISBN :
978-1-4673-3027-5
DOI :
10.1109/CGC.2012.72