DocumentCode :
3249616
Title :
Mining significant associations in large scale text corpora
Author :
Raghavan, Prabhakar ; Tsaparas, Panayiotis
fYear :
2002
fDate :
2002
Firstpage :
402
Lastpage :
409
Abstract :
Mining large-scale text corpora is an essential step in extracting the key themes in a corpus. We motivate a quantitative measure for significant associations through the distributions of pairs and triplets of co-occurring words. We consider the algorithmic problem of efficiently enumerating such significant associations and present pruning algorithms for these problems, with theoretical as well as empirical analyses. Our algorithms make use of two novel mining methods: (1) matrix mining, and (2) shortened documents. We present evidence from a diverse set of documents that our measure does in fact elicit interesting co-occurrences.
Keywords :
data mining; algorithmic problem; co-occurring word pair distribution; co-occurring word triplet distribution; key theme extraction; large-scale text corpora mining; matrix mining; pruning algorithms; quantitative measure; shortened documents; significant association mining; Algorithm design and analysis; Association rules; Computer science; Data mining; Databases; Large-scale systems; Statistical distributions; Text analysis; Text categorization; Text mining
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Data Mining, 2002. ICDM 2002. Proceedings. 2002 IEEE International Conference on
Print_ISBN :
0-7695-1754-4
Type :
conf
DOI :
10.1109/ICDM.2002.1183933
Filename :
1183933