• DocumentCode
    3739798
  • Title
    Discovering Aspectual Classes of Russian Verbs in Untagged Large Corpora
  • Author
    Aleksandr Drozd;Anna Gladkova;Satoshi Matsuoka
  • Author_Institution
    Global Sci. Inf. &
  • fYear
    2015
  • Firstpage
    61
  • Lastpage
    68
  • Abstract
    This paper presents a case study of discovering and classifying verbs in large web corpora. Many tasks in natural language processing require corpora containing billions of words, and at such volumes co-occurrence extraction becomes one of the performance bottlenecks in the Vector Space Models of computational linguistics. We propose a co-occurrence extraction kernel based on ternary trees as an alternative (or a complementary stage) to the conventional map-reduce-based approach; this kernel achieves an order-of-magnitude improvement in memory footprint and processing speed. Our classifier successfully and efficiently identified verbs in a 1.2-billion-word untagged corpus of Russian fiction and distinguished between their two aspectual classes. The model proved efficient even for low-frequency vocabulary, including nonce verbs and neologisms.
  • Keywords
    "Context","Pragmatics","Semantics","Syntactics","Internet","Electronic mail","Data models"
  • Publisher
    IEEE
  • Conference_Titel
    Data Science and Data Intensive Systems (DSDIS), 2015 IEEE International Conference on
  • Type
    conf
  • DOI
    10.1109/DSDIS.2015.30
  • Filename
    7396482