مرکز منطقه ای اطلاع رساني علوم و فناوري - Classification of Homogeneous Data With Large Alphabets

DocumentCode :

62846

Title :

Classification of Homogeneous Data With Large Alphabets

Author :

Kelly, B.G. ; Wagner, Aaron B. ; Tularak, T. ; Viswanath, Pramod

Author_Institution :

Sch. of Electr. & Comput. Eng., Cornell Univ., Ithaca, NY, USA

Volume :

Issue :

fYear :

2013

fDate :

Feb. 2013

Firstpage :

782

Lastpage :

795

Abstract :

Given training sequences generated by two distinct, but unknown, distributions on a common alphabet, we study the problem of determining whether a third sequence was generated according to the first or second distribution. To model sources such as natural language, for which the underlying distributions are difficult to learn from realistic amounts of data, we allow the alphabet size to grow and therefore the probability distributions to change with the block length. Our primary focus is the situation in which the underlying probabilities are all of the same order, and in this regime, we show that consistent classification is possible if and only if the alphabet grows subquadratically with the block length. We also show that some commonly used statistical tests are suboptimal in that they are consistent only if the alphabet grows sublinearly.

Keywords :

natural language processing; pattern classification; statistical distributions; block length; homogeneous data classification; large alphabets; natural language; probability distribution; statistical test; Educational institutions; Natural languages; Probability distribution; Testing; Training; Training data; Zinc; Chi-squared; classification; hypothesis testing; large alphabets; natural language;

fLanguage :

English

Journal_Title :

Information Theory, IEEE Transactions on

Publisher :

ieee

ISSN :

0018-9448

Type :

jour

DOI :

10.1109/TIT.2012.2222343

Filename :

6340343

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=62846