• DocumentCode
    589160
  • Title
    OCCAMS -- An Optimal Combinatorial Covering Algorithm for Multi-document Summarization
  • Author
    Davis, S.T. ; Conroy, J.M. ; Schlesinger, J.D.
  • Author_Institution
    Center for Comput. Sci., IDA, Bowie, MD, USA
  • fYear
    2012
  • fDate
    10-10 Dec. 2012
  • Firstpage
    454
  • Lastpage
    463
  • Abstract
    OCCAMS is a new algorithm for the Multi-Document Summarization (MDS) problem. We use Latent Semantic Analysis (LSA) to produce term weights that identify the main theme(s) of a set of documents. These weights are used by our heuristic for extractive sentence selection, which borrows techniques from combinatorial optimization to select a set of sentences such that the combined weight of the terms covered is maximized while redundancy is minimized. OCCAMS outperforms CLASSY11 on DUC/TAC data for nearly all years since 2005, where CLASSY11 is the best human-rated system of TAC 2011. OCCAMS also delivers higher ROUGE scores than all human-generated summaries for TAC 2011. We show that if the combinatorial component of OCCAMS, which computes the extractive summary, is given the true weights of terms, then the summaries it generates outperform all human-generated summaries for all years under ROUGE-2, ROUGE-SU4, and a coverage metric. We introduce this new metric, which is based on term coverage, and demonstrate that a simple bi-gram instantiation achieves a statistically significantly higher Pearson correlation with overall responsiveness than ROUGE on the TAC data.
  • Keywords
    combinatorial mathematics; document handling; natural language processing; optimisation; LSA; MDS problem; OCCAMS; ROUGE scores; ROUGE-2; ROUGE-SU4; combinatorial optimization; coverage metric; extractive sentence selection; human-generated summaries; latent semantic analysis; multidocument summarization; optimal combinatorial covering algorithm; simple bi-gram instantiation; Approximation algorithms; Approximation methods; Entropy; Humans; Optimization; Redundancy; Semantics; Combinatorial Optimization; Latent Semantic Analysis; Multi-document Summarization;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining Workshops (ICDMW), 2012 IEEE 12th International Conference on
  • Conference_Location
    Brussels
  • Print_ISBN
    978-1-4673-5164-5
  • Type
    conf
  • DOI
    10.1109/ICDMW.2012.50
  • Filename
    6406475