DocumentCode
3288549
Title
Mixed Graph of Terms: Beyond the Bags of Words Representation of a Text
Author
De Santo, Massimo ; Napoletano, Paolo ; Pietrosanto, Antonio ; Liguori, Consolatina ; Paciello, Vincenzo ; Polese, Francesco
Author_Institution
DIEII, Univ. of Salerno, Salerno, Italy
fYear
2012
fDate
4-7 Jan. 2012
Firstpage
1070
Lastpage
1079
Abstract
The main purpose of text mining techniques is to identify common patterns by observing feature vectors and then to use those patterns to make predictions. Feature vectors are usually made up of weighted words, like those used in text retrieval, which follow from the assumption that a document is a "bag of words". In this paper, however, we demonstrate that more accurate analysis and discovery of common patterns can be obtained by employing (observing) features more complex than simple weighted words. The proposed feature vector is a hierarchical structure, named a mixed Graph of Terms, composed of a directed and an undirected sub-graph of words, which can be automatically constructed from a small set of documents through the probabilistic Topic Model. The graph has demonstrated its efficiency in a classic "ad-hoc" text retrieval problem. Here we consider expanding the initial query with this new structured feature vector.
Keywords
data mining; graph theory; pattern classification; probability; query processing; text analysis; ad hoc text retrieval problem; bags of words representation; common pattern analysis; common pattern identification; feature vectors; mixed graph of terms; probabilistic topic model; query processing; text mining; text representation; Data mining; Educational institutions; Feature extraction; Probabilistic logic; Resource management; Semantics; Vectors; probabilistic topic model; query expansion; text mining; text retrieval;
fLanguage
English
Publisher
IEEE
Conference_Titel
System Science (HICSS), 2012 45th Hawaii International Conference on
Conference_Location
Maui, HI
ISSN
1530-1605
Print_ISBN
978-1-4577-1925-7
Electronic_ISBN
1530-1605
Type
conf
DOI
10.1109/HICSS.2012.432
Filename
6149017