• DocumentCode
    2334175
  • Title
    Better rules, fewer features: a semantic approach to selecting features from text
  • Author
    Blake, Catherine; Pratt, Wanda
  • Author_Institution
    Dept. of Inf. & Comput. Sci., California Univ., Irvine, CA, USA
  • fYear
    2001
  • fDate
    2001
  • Firstpage
    59
  • Lastpage
    66
  • Abstract
    The choice of features used to represent a domain has a profound effect on the quality of the model produced; yet few researchers have investigated the relationship between the features used to represent text and the quality of the final model. We explored this relationship for medical texts by comparing association rules based on features with three different semantic levels: (1) words, (2) manually assigned keywords, and (3) automatically selected medical concepts. Our preliminary findings indicate that bi-directional association rules based on concepts or keywords are more plausible and more useful than those based on word features. The concept and keyword representations also required 90% fewer features than the word representation. This drastic dimensionality reduction suggests that this approach is well suited to large textual corpora of medical text, such as parts of the Web.
  • Keywords
    bibliographic systems; computational linguistics; data mining; medical information systems; text analysis; Web; association rules; automatically selected medical concepts; bi-directional association rules; dimensionality reduction; feature selection; keyword representations; large textual corpus; manually assigned keywords; medical texts; semantic approach; semantic levels; text representation; word features; word representation; words; Association rules; Bidirectional control; Breast cancer; Breast neoplasms; Computer science; Data mining; Diseases; Medical treatment; Natural languages; Predictive models;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on
  • Conference_Location
    San Jose, CA
  • Print_ISBN
    0-7695-1119-8
  • Type
    conf
  • DOI
    10.1109/ICDM.2001.989501
  • Filename
    989501