Title :
An unsupervised hierarchical approach to document categorization
Author :
Wetzker, Robert ; Alpcan, Tansu ; Bauckhage, Christian ; Umbrath, Winfried ; Albayrak, Sahin
Abstract :
We propose a hierarchical approach to document categorization that requires no pre-configuration and maps the semantic document space to a predefined taxonomy. Using search engines to train a hierarchical classifier makes our approach more flexible than existing solutions that rely on human-labeled data and are bound to a specific domain. We show that the structural information given by the taxonomy allows for a context-aware construction of search queries and leads to higher tagging accuracy. We test our approach on different benchmark datasets and evaluate its performance on the single- and multi-tag assignment tasks. The experimental results show that our solution is as accurate as supervised classifiers for web page classification and still performs well when categorizing domain-specific documents.
Keywords :
Benchmark testing; Context awareness; Humans; Internet; Laboratories; Search engines; Tagging; Taxonomy; Text categorization; Web pages
Conference_Title :
Web Intelligence, IEEE/WIC/ACM International Conference on
Conference_Location :
Fremont, CA
Print_ISBN :
978-0-7695-3026-0
DOI :
10.1109/WI.2007.144