DocumentCode :
1791803
Title :
Knowledge based dimensionality reduction for technical text mining
Author :
Shalaby, Walid ; Zadrozny, Wlodek ; Gallagher, Sean
Author_Institution :
Comput. Sci. Dept., Univ. of North Carolina at Charlotte, Charlotte, NC, USA
fYear :
2014
fDate :
27-30 Oct. 2014
Firstpage :
39
Lastpage :
44
Abstract :
In this paper we propose a novel technique for dimensionality reduction using freely available online knowledge bases. The complexity of our method is linear in the size of the full feature set, so it can be applied efficiently to huge and complex datasets. We demonstrate this approach by investigating its effectiveness on patent data, the largest free source of technical text. We report empirical results on classification of the CLEF-IP 2010 dataset using bigram features supported by mentions in the Wikipedia, Wiktionary, and GoogleBooks knowledge bases. We achieve a 13-fold reduction in the number of bigram features and a 1.7% increase in classification accuracy over the unigram baseline. These results give concrete evidence that significant accuracy improvements and a massive reduction in dimensionality can be achieved using our approach, thereby helping to alleviate the tradeoff between task complexity and accuracy.
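The abstract does not spell out the selection procedure, so the following is only a minimal sketch of the general idea it describes: keep a bigram feature only if it is mentioned in a knowledge base. The function names, the toy sentence, and the `kb_terms` set (standing in for Wikipedia, Wiktionary, or GoogleBooks entries) are hypothetical and not taken from the paper. With a hash-set lookup, the filter makes one pass over the candidate features, which is consistent with the linear complexity claimed above.

```python
from typing import Iterable, List, Set


def extract_bigrams(tokens: List[str]) -> List[str]:
    """Return all adjacent word pairs as space-joined strings."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def filter_bigrams(bigrams: Iterable[str], kb_terms: Set[str]) -> List[str]:
    """Keep each distinct bigram that appears in the knowledge-base term set.

    One pass over the candidates with O(1) average-case set lookups,
    so the cost grows linearly with the number of candidate features.
    """
    kb = {term.lower() for term in kb_terms}  # case-insensitive matching (assumption)
    seen: Set[str] = set()
    kept: List[str] = []
    for bigram in bigrams:
        key = bigram.lower()
        if key in kb and key not in seen:
            seen.add(key)
            kept.append(key)
    return kept


# Hypothetical toy data; a real run would load terms from a knowledge-base dump.
tokens = "a heat exchanger transfers thermal energy between two fluids".split()
kb_terms = {"Heat exchanger", "Thermal energy"}

candidates = extract_bigrams(tokens)
selected = filter_bigrams(candidates, kb_terms)
print(len(candidates), selected)
```

In this toy case only the two bigrams that match knowledge-base entries survive, and the remaining candidates are dropped from the feature set.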
Keywords :
Web sites; data mining; feature selection; knowledge based systems; patents; pattern classification; text analysis; CLEF-IP 2010 dataset classification; GoogleBooks knowledge bases; Wikipedia knowledge bases; Wiktionary knowledge bases; bigram feature; classification accuracy; free technical text; knowledge based dimensionality reduction; online knowledge bases; patent data; task accuracy; task complexity; technical text mining; unigrams baseline; Accuracy; Electronic publishing; Encyclopedias; Internet; Patents; Training; Dimensionality Reduction; Feature Selection; Knowledge Bases; Patent Classification; Text Classification;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Big Data (Big Data), 2014 IEEE International Conference on
Conference_Location :
Washington, DC
Type :
conf
DOI :
10.1109/BigData.2014.7004466
Filename :
7004466