• DocumentCode
    2787698
  • Title

    Applying data mining techniques for descriptive phrase extraction in digital document collections

  • Author

    Ahonen, Hannu ; Heinonen, Oskari ; Klemettinen, Mika ; Verkamo, A.Inkeri

  • Author_Institution
    Wilhelm-Schickard-Inst. für Inf., Tübingen Univ., Germany
  • fYear
    1998
  • fDate
    22-24 Apr 1998
  • Firstpage
    2
  • Lastpage
    11
  • Abstract
    Traditionally, texts have been analysed using various information retrieval-related methods, such as full-text analysis and natural language processing. However, only a few examples of data mining in text, particularly in full text, are available. In this paper, we show that general data mining methods are applicable to text analysis tasks such as descriptive phrase extraction. Moreover, we present a general framework for text mining. The framework follows the general knowledge discovery process, thus containing steps from preprocessing to utilization of the results. The data mining method that we apply is based on generalized episodes and episode rules. We give concrete examples of how to preprocess texts based on the intended use of the discovered results, and we introduce a weighting scheme that helps in pruning out redundant or non-descriptive phrases. We also present results from real-life data experiments.
  • Keywords
    deductive databases; full-text databases; information analysis; knowledge acquisition; very large databases; data mining; descriptive phrase extraction; digital document collections; episode rules; full-text analysis; generalized episodes; information retrieval; knowledge discovery process; natural language processing; nondescriptive phrase pruning; preprocessing; redundant phrase pruning; results utilization; text mining; weighting scheme; Computer science; Concrete; Data mining; Information retrieval; Natural language processing; Sensor phenomena and characterization; Sensor systems; Text analysis; Text mining; Text processing
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Research and Technology Advances in Digital Libraries, 1998. ADL 98. Proceedings. IEEE International Forum on
  • Conference_Location
    Santa Barbara, CA
  • ISSN
    1092-9959
  • Print_ISBN
    0-8186-8464-X
  • Type

    conf

  • DOI
    10.1109/ADL.1998.670374
  • Filename
    670374