Discovering Aspectual Classes of Russian Verbs in Untagged Large Corpora

Author

Aleksandr Drozd;Anna Gladkova;Satoshi Matsuoka

Author_Institution

Global Sci. Inf. &

fYear

2015

Firstpage

Lastpage

Abstract

This paper presents a case study of discovering and classifying verbs in large web corpora. Many tasks in natural language processing require corpora containing billions of words, and at such volumes co-occurrence extraction becomes one of the performance bottlenecks in the Vector Space Models of computational linguistics. We propose a co-occurrence extraction kernel based on ternary trees as an alternative (or a complementary stage) to the conventional map-reduce-based approach; this kernel achieves an order-of-magnitude improvement in memory footprint and processing speed. Our classifier successfully and efficiently identified verbs in a 1.2-billion-word untagged corpus of Russian fiction and distinguished between their two aspectual classes. The model proved efficient even for low-frequency vocabulary, including nonce verbs and neologisms.

Keywords

"Context","Pragmatics","Semantics","Syntactics","Internet","Electronic mail","Data models"

Publisher

IEEE

Conference_Title

Data Science and Data Intensive Systems (DSDIS), 2015 IEEE International Conference on

Type

conf

DOI

10.1109/DSDIS.2015.30

Filename

7396482

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=3739798