DocumentCode
3249616
Title
Mining significant associations in large scale text corpora
Author
Raghavan, Prabhakar ; Tsaparas, Panayiotis
fYear
2002
fDate
2002
Firstpage
402
Lastpage
409
Abstract
Mining large-scale text corpora is an essential step in extracting the key themes in a corpus. We motivate a quantitative measure for significant associations through the distributions of pairs and triplets of co-occurring words. We consider the algorithmic problem of efficiently enumerating such significant associations and present pruning algorithms for these problems, with theoretical as well as empirical analyses. Our algorithms make use of two novel mining methods: (1) matrix mining, and (2) shortened documents. We present evidence from a diverse set of documents that our measure does in fact elicit interesting co-occurrences.
Keywords
data mining; algorithmic problem; co-occurring word pair distribution; co-occurring word triplet distribution; key theme extraction; large-scale text corpora mining; matrix mining; pruning algorithms; quantitative measure; shortened documents; significant association mining; Algorithm design and analysis; Association rules; Computer science; Data mining; Databases; Large-scale systems; Statistical distributions; Text analysis; Text categorization; Text mining
fLanguage
English
Publisher
IEEE
Conference_Titel
Data Mining, 2002. ICDM 2002. Proceedings. 2002 IEEE International Conference on
Print_ISBN
0-7695-1754-4
Type
conf
DOI
10.1109/ICDM.2002.1183933
Filename
1183933