Title of article :
The exact rank-frequency function and size-frequency function of N-grams and N-word phrases with applications
Author/Authors :
Egghe, L.
Issue Information :
Journal with serial number, year 2005
Abstract :
N-grams are generalized words consisting of N consecutive symbols (letters), as they are used in a text. N-word phrases are general concepts consisting of N consecutive words, also as used in a text. Given that the rank-frequency function of single letters (i.e., one-grams) or of single words (i.e., one-word phrases) is Zipfian, we determine in this paper the exact rank-frequency function (i.e., the occurrence of N-grams or N-word phrases on each rank) and size-frequency distribution (i.e., the density of N-grams or N-word phrases on each occurrence density) of these N-grams and N-word phrases. This paper distinguishes itself from others on this topic by allowing no approximations in the calculations. This leads to an intricate rank-frequency function for N-grams and N-word phrases (as we knew before from unpublished calculations) but leads, surprisingly, to a very simple size-frequency function $f_N$ for N-grams or N-word phrases of the form $f_N(j) = \frac{F}{j^{1+1/\beta}} \ln^{N-1}\left(\frac{G}{j}\right)$, where F and G are constants and the Zipfian distribution of single letters or words is proportional to $1/r^\beta$. The paper closes with the calculation of type/token averages $\mu_N$ and type/token-taken averages $\mu^*_N$ for N-grams and N-word phrases, where we also verify the theoretically proved result $\mu^*_N \geq \mu_N$ and give estimates for the differences $\mu^*_N - \mu_N$.
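As a rough illustration of the size-frequency function quoted in the abstract, the following Python sketch evaluates f_N(j) = F / j^(1 + 1/β) · ln^(N−1)(G / j) for given parameters. The function name `size_frequency` and the default values of F and G are hypothetical placeholders chosen for this example; the abstract does not give numerical values, and the restriction 0 < j ≤ G is an assumption made here so that the logarithm stays non-negative.

```python
import math


def size_frequency(j, N, beta, F=1.0, G=100.0):
    """Evaluate f_N(j) = F / j**(1 + 1/beta) * ln(G / j)**(N - 1).

    j    : occurrence density (assumed here to satisfy 0 < j <= G)
    N    : length of the N-gram or N-word phrase (N >= 1)
    beta : exponent of the Zipfian law 1 / r**beta for single letters or words
    F, G : positive constants; the defaults are placeholders, not values
           taken from the paper
    """
    if N < 1:
        raise ValueError("N must be at least 1")
    if beta <= 0 or F <= 0 or G <= 0:
        raise ValueError("beta, F and G must be positive")
    if not 0 < j <= G:
        raise ValueError("this sketch assumes 0 < j <= G")
    power_part = F / j ** (1.0 + 1.0 / beta)
    log_part = math.log(G / j) ** (N - 1)
    return power_part * log_part


if __name__ == "__main__":
    # Compare one-grams, bigrams and trigrams for a few occurrence densities.
    for N in (1, 2, 3):
        values = [size_frequency(j, N, beta=1.0) for j in (1.0, 2.0, 5.0, 10.0)]
        print(N, values)
```

For N = 1 the logarithmic factor equals 1, so the sketch reduces to the pure power law F / j^(1 + 1/β); for larger N the factor ln^(N−1)(G / j) modifies that power law.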
Keywords :
Rank-frequency distribution , Zipfian distribution , Size-frequency distribution , N-gram , N-word phrase
Journal title :
Mathematical and Computer Modelling