Title :
Simultaneous Product Attribute Name and Value Extraction from Web Pages
Author :
Wu, Bo ; Cheng, Xueqi ; Wang, Yu ; Guo, Yan ; Song, Linhai
Abstract :
Much work has been done on template-independent web data extraction. However, existing approaches either handle attribute value extraction and annotation in separate phases or are constrained to a predefined set of attributes, which makes them highly ineffective. In this paper, we perform attribute extraction and annotation simultaneously by extracting each attribute name and value pair at the same time. Our approach uses a co-training algorithm with a naive Bayesian classifier to identify candidate attribute name and value pairs in unlabeled pages. These candidate pairs are then used to detect the specification block of the product in each web page. Finally, all attribute name and value pairs in the specification block are extracted. We conduct experiments on three types of products and obtain promising results.
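Illustrative sketch (not taken from the paper): the abstract names co-training with a naive Bayesian classifier but does not give the features, views, or confidence rule. The minimal Python example below assumes two views per candidate pair (tokens of the name text and tokens of the value text), the hypothetical labels "pair" and "other", and a simple rule in which each classifier promotes its single most confident unlabeled example per round. All names and toy data are assumptions made for illustration.

```python
import math
from collections import Counter


class NaiveBayes:
    """Multinomial naive Bayes over token lists, with add-one smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.total = len(labels)
        self.counts = {c: Counter() for c in self.classes}
        for tokens, label in zip(docs, labels):
            self.counts[label].update(tokens)
        self.vocab = {t for c in self.classes for t in self.counts[c]}
        return self

    def predict(self, tokens):
        """Return (best label, posterior probability of that label)."""
        v = len(self.vocab) + 1  # one extra slot for unseen tokens
        scores = {}
        for c in self.classes:
            n = sum(self.counts[c].values())
            s = math.log(self.prior[c] / self.total)
            for t in tokens:
                s += math.log((self.counts[c][t] + 1) / (n + v))
            scores[c] = s
        top = max(scores.values())
        exp = {c: math.exp(s - top) for c, s in scores.items()}
        best = max(exp, key=exp.get)
        return best, exp[best] / sum(exp.values())


def fit_views(labeled):
    """Train one classifier per view on the current labeled set."""
    labels = [y for _, y in labeled]
    clf_a = NaiveBayes().fit([x[0] for x, _ in labeled], labels)
    clf_b = NaiveBayes().fit([x[1] for x, _ in labeled], labels)
    return clf_a, clf_b


def co_train(labeled, unlabeled, rounds=5, per_round=1):
    """Grow the labeled set with each view's most confident predictions."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        picks = {}
        for view, clf in enumerate(fit_views(labeled)):
            ranked = sorted(
                ((clf.predict(x[view]), i) for i, x in enumerate(pool)),
                key=lambda r: r[0][1],
                reverse=True,
            )
            for (label, _prob), i in ranked[:per_round]:
                picks.setdefault(i, label)  # first nominating view wins
        labeled.extend((pool[i], label) for i, label in picks.items())
        pool = [x for i, x in enumerate(pool) if i not in picks]
    clf_a, clf_b = fit_views(labeled)
    return clf_a, clf_b, labeled


# Toy data (hypothetical): each example is (name tokens, value tokens).
seed = [
    ((["weight"], ["2", "kg"]), "pair"),
    ((["screen", "size"], ["15", "inch"]), "pair"),
    ((["home"], ["click", "here"]), "other"),
    ((["contact", "us"], ["read", "more"]), "other"),
]
candidates = [
    (["weight"], ["3", "kg"]),
    (["screen", "size"], ["13", "inch"]),
    (["home"], ["read", "more"]),
    (["login"], ["click", "here"]),
]

clf_a, clf_b, grown = co_train(seed, candidates)
print(clf_a.predict(["weight"])[0], clf_b.predict(["click", "here"])[0])
```

In this sketch the two classifiers teach each other: an example that one view labels confidently becomes training data for both views in the next round. A real system would likely use richer features (for example, layout or DOM cues) and a confidence threshold rather than a fixed count per round.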
Keywords :
Bayesian methods; Books; Computers; Conferences; Crawlers; Data mining; Humans; Intelligent agent; Web mining; Web pages; information extraction; semi-supervised; template-independent; web mining
Conference_Title :
Web Intelligence and Intelligent Agent Technologies, 2009. WI-IAT '09. IEEE/WIC/ACM International Joint Conferences on
Conference_Location :
Milan, Italy
Print_ISBN :
978-0-7695-3801-3
Electronic_ISBN :
978-1-4244-5331-3
DOI :
10.1109/WI-IAT.2009.286