• DocumentCode
    270122
  • Title

    Turkish labeled text corpus

  • Author

    Özturk, Seçil ; Sankur, B. ; Gungör, Tunga ; Yilmaz, Mustafa Berkay ; Köroğlu, Bilge ; Ağin, Onur ; İşbilen, Mustafa ; Ulaş, Çağdaş ; Ahat, Mehmet

  • Author_Institution
    Elektr. Elektron., Muhendisligi Bolumleri, Bogazici Univ., Istanbul, Turkey
  • fYear
    2014
  • fDate
    23-25 April 2014
  • Firstpage
    1395
  • Lastpage
    1398
  • Abstract
    A labeled text corpus consisting of the titles, abstracts, and keywords of Turkish papers has been collected. The corpus covers 35 disciplines, with 200 documents per discipline. This study describes the collection and content of the corpus. Classification performance on the corpus is compared for Term Frequency - Inverse Document Frequency (TF-IDF) features and for topic probabilities obtained from Latent Dirichlet Allocation (LDA). The corpus is shared as open source so that it can be used in natural language processing applications for academic purposes.
  • Keywords
    natural language processing; pattern classification; probability; text analysis; LDA features; TF-IDF; Turkish labeled text corpus; Turkish paper abstracts; Turkish paper keywords; Turkish paper titles; academic purposes; classification performance; latent Dirichlet allocation features; natural language processing applications; term frequency-inverse document frequency; text corpus collection; text corpus content; topic probabilities; Abstracts; Conferences; Natural language processing; Resource management; Signal processing; Support vector machines; XML; Classification; Corpus; Inverse Document Frequency; Latent Dirichlet Allocation; NLP; Natural Language Processing; Paper; TF-IDF; Term Frequency; Turkish;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Signal Processing and Communications Applications Conference (SIU), 2014 22nd
  • Conference_Location
    Trabzon
  • Type

    conf

  • DOI
    10.1109/SIU.2014.6830499
  • Filename
    6830499