Title :
Document classification with distributions of word vectors
Author :
Chao Xing ; Dong Wang ; Xuewei Zhang ; Chao Liu
Author_Institution :
Center for Speaker & Language Technol. (CSLL), Tsinghua Univ., Beijing, China
Abstract :
The word-to-vector (W2V) technique represents words as low-dimensional continuous vectors such that semantically related words lie close to each other. This produces a semantic space in which a word or a collection of words (e.g., a document) can be well represented, which lends itself to many applications, including document classification. Our previous study demonstrated that representations derived from word vectors are highly promising for document classification and can outperform the conventional LDA model. This paper extends that research by modeling the distributions of word vectors in documents or document classes. Doing so generalizes the naive approach of deriving document representations by average pooling and explores the possibility of modeling documents in the semantic space. Experiments on the Sohu text database showed that the new approach may produce better document classification performance.
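The contrast the abstract draws between average pooling and modeling word-vector distributions can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the two-dimensional toy vectors, the class labels, the training documents, and the choice of a diagonal Gaussian per class are hypothetical, and the abstract does not state which distribution model the paper actually uses.

```python
import math

# Hypothetical 2-dimensional word vectors; real W2V vectors are learned from a corpus.
WORD_VECTORS = {
    "goal": [0.9, 0.1],
    "match": [0.8, 0.2],
    "team": [0.7, 0.3],
    "stock": [0.1, 0.9],
    "market": [0.2, 0.8],
    "bank": [0.3, 0.7],
}


def doc_vectors(doc):
    """Look up the vector of every known word in a tokenized document."""
    return [WORD_VECTORS[w] for w in doc if w in WORD_VECTORS]


def average_pool(doc):
    """Naive document representation: the mean of the document's word vectors."""
    vecs = doc_vectors(doc)
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def fit_diagonal_gaussian(docs, var_floor=1e-3):
    """Fit one diagonal Gaussian to all word vectors of a class (assumed model)."""
    vecs = [v for doc in docs for v in doc_vectors(doc)]
    dim = len(vecs[0])
    mean = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    # Floor the variance so that a degenerate dimension cannot cause division by zero.
    var = [
        max(sum((v[i] - mean[i]) ** 2 for v in vecs) / len(vecs), var_floor)
        for i in range(dim)
    ]
    return mean, var


def log_likelihood(doc, model):
    """Sum the Gaussian log-densities of the document's word vectors."""
    mean, var = model
    total = 0.0
    for v in doc_vectors(doc):
        for x, m, s2 in zip(v, mean, var):
            total += -0.5 * (math.log(2 * math.pi * s2) + (x - m) ** 2 / s2)
    return total


def classify(doc, models):
    """Pick the class whose word-vector distribution best explains the document."""
    return max(models, key=lambda label: log_likelihood(doc, models[label]))


# Hypothetical training documents, already tokenized.
TRAIN = {
    "sports": [["goal", "match"], ["team", "match", "goal"]],
    "finance": [["stock", "market"], ["bank", "market", "stock"]],
}
MODELS = {label: fit_diagonal_gaussian(docs) for label, docs in TRAIN.items()}
```

Here `average_pool` collapses a document to a single point in the semantic space, whereas `classify` scores every word vector of the document against a per-class distribution, so the spread of the words is taken into account as well as their mean. Equal class priors are assumed, and a document with no known words is not handled.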
Keywords :
Bayes methods; document handling; pattern classification; word processing; LDA model; W2V technique; document classes; document classification; document modeling; document representations; low-dimensional continuous vectors; naive approach; semantic related words; semantic space; sohu text database; word vector distributions; word-to-vector technique; Computational modeling; Educational institutions; Semantics; Support vector machine classification; Training; Vectors
Conference_Title :
Asia-Pacific Signal and Information Processing Association, 2014 Annual Summit and Conference (APSIPA)
Conference_Location :
Siem Reap
DOI :
10.1109/APSIPA.2014.7041633