• DocumentCode
    3249616
  • Title
    Mining significant associations in large scale text corpora
  • Author
    Raghavan, Prabhakar ; Tsaparas, Panayiotis
  • fYear
    2002
  • fDate
    2002
  • Firstpage
    402
  • Lastpage
    409
  • Abstract
    Mining large-scale text corpora is an essential step in extracting the key themes in a corpus. We motivate a quantitative measure for significant associations through the distributions of pairs and triplets of co-occurring words. We consider the algorithmic problem of efficiently enumerating such significant associations and present pruning algorithms for these problems, with theoretical as well as empirical analyses. Our algorithms make use of two novel mining methods: (1) matrix mining, and (2) shortened documents. We present evidence from a diverse set of documents that our measure does in fact elicit interesting co-occurrences.
  • Keywords
    data mining; algorithmic problem; co-occurring word pair distribution; co-occurring word triplet distribution; key theme extraction; large-scale text corpora mining; matrix mining; pruning algorithms; quantitative measure; shortened documents; significant association mining; Algorithm design and analysis; Association rules; Computer science; Data mining; Databases; Large-scale systems; Statistical distributions; Text analysis; Text categorization; Text mining
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining, 2002. ICDM 2002. Proceedings. 2002 IEEE International Conference on
  • Print_ISBN
    0-7695-1754-4
  • Type
    conf
  • DOI
    10.1109/ICDM.2002.1183933
  • Filename
    1183933