Title :
Useful attributes identification for Unsupervised Information Extraction result set based on REAdaBoost Naïve Bayes
Author :
Yin, Wenke ; Zhu, Ming
Author_Institution :
Dept. of Autom., Univ. of Sci. & Technol. of China, Hefei, China
Abstract :
Unsupervised Information Extraction has attracted considerable attention in the literature. However, its result sets inevitably include useless noise. Moreover, the proportions of useful attributes and noise in the result set are greatly imbalanced, and the two types of data differ in importance. How to identify the useful attributes effectively therefore remains an open question. To address this problem, this paper proposes a revised AdaBoost algorithm, REAdaBoost. The weight coefficient in REAdaBoost is determined not only by the precision on useful attributes but also by the recall on rare attributes. We use Naïve Bayes as the base classifier and boost it with AdaBoost and REAdaBoost separately. The experimental results show that, without increasing the overall error rate, REAdaBoost outperforms both AdaBoost and Naïve Bayes in predicting the useful attributes and the rare attributes.
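The abstract does not give the REAdaBoost formula, so the following is only a minimal sketch of the general idea under stated assumptions: it shows the standard AdaBoost weight coefficient next to a hypothetical recall-adjusted variant. The function `recall_adjusted_alpha`, its parameter `beta`, and the scaling by `1 + beta * rare_recall` are illustrative assumptions, not the authors' definition. The base classifier's predictions (for example, from Naïve Bayes) are taken as given, with labels encoded as +1 and -1.

```python
import math


def weighted_error(weights, y_true, y_pred):
    """Weighted fraction of misclassified samples."""
    wrong = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    return wrong / sum(weights)


def recall(y_true, y_pred, label):
    """Recall for one class: correctly predicted members / all members."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    total = sum(1 for t in y_true if t == label)
    return hits / total if total else 0.0


def adaboost_alpha(error, eps=1e-10):
    """Standard AdaBoost coefficient: 0.5 * ln((1 - error) / error)."""
    error = min(max(error, eps), 1.0 - eps)  # clamp to avoid log(0)
    return 0.5 * math.log((1.0 - error) / error)


def recall_adjusted_alpha(error, rare_recall, beta=1.0):
    """HYPOTHETICAL variant (an assumption, not the paper's formula):
    scale the standard coefficient up when recall on the rare class is high."""
    return adaboost_alpha(error) * (1.0 + beta * rare_recall)


def update_weights(weights, y_true, y_pred, alpha):
    """Standard AdaBoost re-weighting: raise misclassified samples,
    lower correctly classified ones, then normalize to sum to 1."""
    raw = [w * math.exp(alpha if t != p else -alpha)
           for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(raw)
    return [w / total for w in raw]


if __name__ == "__main__":
    # Toy round: +1 is treated as the rare class (an assumption).
    y_true = [1, 1, -1, -1]
    y_pred = [1, -1, -1, -1]
    weights = [0.25] * 4

    err = weighted_error(weights, y_true, y_pred)
    alpha = recall_adjusted_alpha(err, recall(y_true, y_pred, 1))
    print(err, alpha, update_weights(weights, y_true, y_pred, alpha))
```

With `beta=0` the variant reduces to standard AdaBoost, which makes it easy to compare the two weighting schemes on the same base classifier output.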
Keywords :
Bayes methods; data mining; pattern classification; AdaBoost algorithm; REAdaBoost naive Bayes; attributes identification; unsupervised information extraction; weight coefficient; 1/f noise; Automation; Background noise; Data mining; Error analysis; Explosives; Internet; Large-scale systems; Web pages; Web sites; Classification; Imbalanced Class Distributions; Information Extraction; REAdaBoost;
Conference_Title :
Future Computer and Communication (ICFCC), 2010 2nd International Conference on
Conference_Location :
Wuhan
Print_ISBN :
978-1-4244-5821-9
DOI :
10.1109/ICFCC.2010.5497739