• DocumentCode
    3231421
  • Title

    The Role of Different Thesauri Terms and Captions in Automated Subject Classification

  • Author

    Golub, Koraljka

  • Author_Institution
    Dept. of Inf. Technol., Lund Univ.
  • fYear
    2006
  • fDate
    18-22 Dec. 2006
  • Firstpage
    961
  • Lastpage
    965
  • Abstract
    The paper explores to what degree different types of terms in the Engineering Information (Ei) thesaurus and classification scheme influence automated subject classification performance. Preferred terms, their synonyms, broader, narrower, and related terms, and captions are examined in combination with a stemmer and a stop-word list. The algorithm comprises string-to-string matching between words in the documents to be classified and words in term lists derived from the Ei thesaurus and classification scheme. The data collection for evaluation consists of some 35,000 scientific paper abstracts from the Compendex database. A subset of the Ei thesaurus and classification scheme is used, comprising 92 classes at up to five hierarchical levels from general engineering. The results show that preferred terms perform best, whereas captions perform worst. Stemming improves performance in most cases, whereas the stop-word list does not have a significant impact.
  • Keywords
    classification; string matching; text analysis; thesauri; automated subject classification; compendex database; data collection; document classification; engineering information; string-to-string matching; thesauri term; Abstracts; Automatic control; Colon; Databases; Gases; Information technology; Mechanical variables measurement; Solids; Thesauri; Vocabulary
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Web Intelligence, 2006. WI 2006. IEEE/WIC/ACM International Conference on
  • Conference_Location
    Hong Kong
  • Print_ISBN
    0-7695-2747-7
  • Type

    conf

  • DOI
    10.1109/WI.2006.169
  • Filename
    4061503
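
  • Illustration
    The abstract describes string-to-string matching between document words and thesaurus-derived term lists, optionally combined with a stemmer and a stop-word list. The following is a minimal sketch of that general idea, not the paper's implementation. The class labels, term lists, stop-word list, and the crude suffix stripper are all hypothetical placeholders; the actual Ei entries, stemmer, and scoring used in the paper are not given in this record.

```python
import re
from collections import Counter

# Hypothetical toy term lists keyed by made-up class labels.
# Real Ei thesaurus terms and class codes are NOT reproduced here.
TERM_LISTS = {
    "class_A": ["heat transfer", "thermal conductivity"],
    "class_B": ["fluid flow", "pumps"],
}

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "of", "in", "a", "and", "is", "to"}


def crude_stem(word):
    """Placeholder suffix stripper (an assumption, not the paper's stemmer)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text, use_stemming=True, use_stop_words=True):
    """Lowercase, tokenize, then optionally drop stop words and stem."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if use_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    if use_stemming:
        words = [crude_stem(w) for w in words]
    return words


def count_phrase(doc_words, term_words):
    """Count exact occurrences of a (possibly multi-word) term in the document."""
    n = len(term_words)
    if n == 0:
        return 0
    return sum(
        1
        for i in range(len(doc_words) - n + 1)
        if doc_words[i : i + n] == term_words
    )


def classify(text, term_lists=TERM_LISTS, **options):
    """Score each class by the number of term matches; best class first."""
    doc_words = normalize(text, **options)
    scores = Counter()
    for label, terms in term_lists.items():
        for term in terms:
            hits = count_phrase(doc_words, normalize(term, **options))
            if hits:
                scores[label] += hits
    return scores.most_common()


if __name__ == "__main__":
    sample = "Pumping of fluid flows in the pumps"
    print(classify(sample))                      # with stemming
    print(classify(sample, use_stemming=False))  # without stemming
```

    In this toy example, turning stemming off loses the match between "flows" and the term "fluid flow", which is the kind of effect the abstract reports on. Comparing term types (preferred terms, synonyms, captions) would amount to swapping in a different hypothetical TERM_LISTS for each run.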