• DocumentCode
    3466841
  • Title

    N-gram Statistics in English and Chinese: Similarities and Differences

  • Author

    Yang, Stewart ; Zhu, Hongjun ; Apostoli, Ariel ; Cao, Pei

  • Author_Institution
    Google, Inc., Mountain View
  • fYear
    2007
  • fDate
    17-19 Sept. 2007
  • Firstpage
    454
  • Lastpage
    460
  • Abstract
    Chinese and English belong to two very different families of human languages. Yet, since the underlying human concepts are universal, one can expect that there are many statistical similarities between Chinese texts and English texts. In this paper, we present results of analyzing the quantity and frequency of N-grams in 200 million randomly sampled English and Chinese Web pages. The similarities and differences in N-gram frequency distributions yield important insights about the two languages. First, the distribution of the number of unique N-grams is similar between English and Chinese, yet the Chinese distribution is "shifted" to larger N. The distribution indicates that on average, 1.5 Chinese characters correspond to 1 English word. Second, while the frequency distributions of uni-grams and bi-grams are very different between Chinese and English, the frequency distributions of 3-grams and 4-grams are strikingly similar between Chinese and English. This leads to the conjecture that in both languages, frequent 3-grams and 4-grams represent the same set of concepts and patterns.
  • Keywords
    Web sites; natural languages; statistical analysis; Chinese texts; English texts; N-gram frequency distributions; N-gram statistics; Web pages; human languages; Frequency; Humans; Knee; Natural languages; Speech; Statistical distributions; Statistics
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Semantic Computing, 2007. ICSC 2007. International Conference on
  • Conference_Location
    Irvine, CA
  • Print_ISBN
    978-0-7695-2997-4
  • Type

    conf

  • DOI
    10.1109/ICSC.2007.46
  • Filename
    4338381