Title :
Mining significant associations in large scale text corpora
Author :
Raghavan, Prabhakar ; Tsaparas, Panayiotis
Abstract :
Mining large-scale text corpora is an essential step in extracting the key themes in a corpus. We motivate a quantitative measure for significant associations through the distributions of pairs and triplets of co-occurring words. We consider the algorithmic problem of efficiently enumerating such significant associations and present pruning algorithms for these problems, with theoretical as well as empirical analyses. Our algorithms make use of two novel mining methods: (1) matrix mining, and (2) shortened documents. We present evidence from a diverse set of documents that our measure does in fact elicit interesting co-occurrences.
Keywords :
data mining; algorithmic problem; co-occurring word pair distribution; co-occurring word triplet distribution; key theme extraction; large-scale text corpora mining; matrix mining; pruning algorithms; quantitative measure; shortened documents; significant association mining; Algorithm design and analysis; Association rules; Computer science; Data mining; Databases; Large-scale systems; Statistical distributions; Text analysis; Text categorization; Text mining
Conference_Title :
Data Mining, 2002. ICDM 2002. Proceedings. 2002 IEEE International Conference on
Print_ISBN :
0-7695-1754-4
DOI :
10.1109/ICDM.2002.1183933