DocumentCode :
583030
Title :
An Exploratory Study of Enhancing Text Clustering with Auto-Generated Semantic Tags
Author :
Tang, Xuning ; Dang, Jiangbo
Author_Institution :
Coll. of Inf. Sci. & Technol., Drexel Univ., Philadelphia, PA, USA
fYear :
2012
fDate :
22-24 Oct. 2012
Firstpage :
104
Lastpage :
111
Abstract :
With the exponentially growing volume of digital documents and Internet content, it has become very challenging to locate the right information when needed. We rely heavily on search engines, but most existing search tools are keyword-based and often return results with low precision and recall. Emerging semantic tagging technology provides an automatic way to generate semantic tags from text, and it has drawn increasing interest from the text mining research community. It is critical to study how semantic tags can be used to improve text mining tasks, including clustering, which helps users search and browse documents. Unfortunately, most previous work on text clustering is based merely on content information. A few recent studies take user-generated tags into account; however, user-generated tags are often noisy, inconsistent, and redundant, and they lack semantic information and hierarchical structure. In this work, we propose a Semantic Text Mining (STeM) framework that generates semantic tags for given documents and then uses those tags to improve text clustering. Unlike previous work, we represent a document by a combination of domains and high-quality noun phrases. We investigate the performance of our methods on two different datasets and evaluate the results by normalized mutual information. Experimental results demonstrate that our proposed method substantially outperforms traditional Term Frequency-Inverse Document Frequency (TF-IDF) term-vector-based clustering. We find that incorporating semantic information into document representation is critical to improving the performance of text clustering.
Keywords :
data mining; information retrieval; pattern clustering; semantic Web; text analysis; Internet content; STeM; TF-IDF term vector based clustering; autogenerated semantic tag; content information; digital documents; document browsing; document representation; document searching; hierarchical structure; high quality noun phrase; information location; keyword based search tool; normalized mutual information; search engine; semantic information; semantic tagging technology; semantic text mining; term frequency-inverse document frequency term vector; text clustering enhancement; user-generated tags; Clustering algorithms; Frequency domain analysis; Knowledge based systems; Ontologies; Semantics; Tagging; Vectors; SteM; clustering; document; semantic tags;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Semantics, Knowledge and Grids (SKG), 2012 Eighth International Conference on
Conference_Location :
Beijing
Print_ISBN :
978-1-4673-2561-5
Type :
conf
DOI :
10.1109/SKG.2012.17
Filename :
6391817