Title :
Agglomerative co-clustering for synonymous phrases based on common effects and influences
Author :
Kumanami, Koji ; Seki, Katsuyuki ; Uehara, Kazuhiro
Author_Institution :
Grad. Sch. of Syst. Inf., Kobe Univ., Kobe, Japan
Abstract :
This paper proposes an approach to clustering synonymous noun phrases that focuses on two types of predicate-argument relations extracted from potentially big textual data. One is associated with common effects, the other with common influences. Based on the context represented by those relations, a matrix is constructed whose rows are noun phrases and whose columns are pairs of a noun phrase and a verb phrase. Following the distributional hypothesis often adopted in the literature, it is assumed that rows (i.e., noun phrases) with similar distributions share similar meanings. Because the matrix is inherently sparse, however, two strategies are taken to group noun phrases having similar distributions. One strategy is simply to use a large-scale corpus, which, however, results in an even larger matrix. To handle the large matrix, a parallel distributed programming model, MapReduce, is employed. The other strategy is to adopt hierarchical agglomerative co-clustering and approximate its computation in a way suited to the MapReduce programming model. The proposed approach is evaluated through a series of experiments in terms of the validity of the underlying assumptions, processing time, quality of the resulting clusters, and effect of parallelization.
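The matrix construction and the distributional assumption described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy triples, the function names `build_rows` and `cosine`, and the choice of cosine similarity are all assumptions made for illustration, and the paper's co-clustering and MapReduce steps are not shown.

```python
from collections import Counter, defaultdict
from math import sqrt

# Hypothetical toy triples (noun phrase, verb phrase, noun phrase).
# They are illustrative only and are not data from the paper.
triples = [
    ("smoking", "causes", "lung cancer"),
    ("tobacco use", "causes", "lung cancer"),
    ("smoking", "raises", "blood pressure"),
    ("tobacco use", "raises", "blood pressure"),
    ("exercise", "lowers", "blood pressure"),
]

def build_rows(triples):
    """Map each row noun phrase to a sparse count vector whose
    columns are (verb phrase, noun phrase) pairs."""
    rows = defaultdict(Counter)
    for subj, verb, obj in triples:
        rows[subj][(verb, obj)] += 1
    return rows

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (assumed measure)."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

rows = build_rows(triples)
# Rows with identical context distributions score close to 1.
sim_syn = cosine(rows["smoking"], rows["tobacco use"])
# Rows with no shared (verb phrase, noun phrase) column score 0.
sim_other = cosine(rows["smoking"], rows["exercise"])
print(sim_syn, sim_other)
```

In this toy data, "exercise" shares no column with "smoking" even though both relate to "blood pressure", which hints at the sparsity problem the abstract mentions: exact (verb phrase, noun phrase) columns rarely coincide unless the corpus is large or the columns are themselves grouped.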
Keywords :
data handling; distributed programming; natural language processing; pattern clustering; MapReduce; agglomerative coclustering; big textual data; clustering synonymous noun phrases; distribution hypothesis; noun phrases; parallel distributed programming model; synonymous phrases; verb phrase; Approximation methods; Clustering algorithms; Context; Copper; Data mining; Guidelines; Programming; Distributional similarity; Hadoop/MapReduce; Parallel distributed processing;
Conference_Title :
Big Data, 2013 IEEE International Conference on
Conference_Location :
Silicon Valley, CA
DOI :
10.1109/BigData.2013.6691738