Mining significant associations in large scale text corpora

Author

Raghavan, Prabhakar; Tsaparas, Panayiotis

fYear

2002

fDate

2002

Firstpage

402

Lastpage

409

Abstract

Mining large-scale text corpora is an essential step in extracting the key themes in a corpus. We motivate a quantitative measure for significant associations through the distributions of pairs and triplets of co-occurring words. We consider the algorithmic problem of efficiently enumerating such significant associations and present pruning algorithms for these problems, with theoretical as well as empirical analyses. Our algorithms make use of two novel mining methods: (1) matrix mining, and (2) shortened documents. We present evidence from a diverse set of documents that our measure does in fact elicit interesting co-occurrences.

Keywords

data mining; algorithmic problem; co-occurring word pair distribution; co-occurring word triplet distribution; key theme extraction; large-scale text corpora mining; matrix mining; pruning algorithms; quantitative measure; shortened documents; significant association mining; Algorithm design and analysis; Association rules; Computer science; Data mining; Databases; Large-scale systems; Statistical distributions; Text analysis; Text categorization; Text mining

fLanguage

English

Publisher

IEEE

Conference_Title

Data Mining, 2002. ICDM 2002. Proceedings. 2002 IEEE International Conference on

Print_ISBN

0-7695-1754-4

Type

conf

DOI

10.1109/ICDM.2002.1183933

Filename

1183933

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=3249616