• DocumentCode
    2260300
  • Title
    Chinese coding type identification based on sub-sentence length observation
  • Author
    He, Gang ; Peng, Peidong ; Wu, Xiaochun ; Chen, Luming
  • Author_Institution
    Sch. of Inf. & Commun. Eng., Beijing Univ. of Posts & Telecommun., Beijing, China
  • fYear
    2009
  • fDate
    24-27 Sept. 2009
  • Firstpage
    1
  • Lastpage
    5
  • Abstract
    This paper studies an algorithm for identifying the coding type of Chinese characters by analyzing sub-sentence length. A definition of a sub-sentence is given, and the probability density function (pdf) of sub-sentence length is analyzed using sentence samples from the Lancaster corpus. We propose a new algorithm that recognizes the coding type of Chinese characters by splitting sentences into sub-sentences at Chinese punctuation characters and analyzing the probability of the observed sub-sentence lengths. The algorithm uses both Bayesian rules and iterated sub-sentence length calculation for trust-region comparison. Because the set of Chinese punctuation characters is very small, the algorithm has a clear advantage in space complexity. Time complexity and identification performance are also studied at the end of the paper.
  • Keywords
    belief networks; character recognition; computational complexity; encoding; identification; probability; sampling methods; Bayesian rule; Chinese coding type identification; Chinese punctuation character; Lancaster corpus; character coding recognition; probability analysis; sentence sample; space complexity; subsentence length observation; time complexity; trust-region comparison; Algorithm design and analysis; Bayesian methods; Character recognition; Decoding; Frequency; Helium; Information analysis; Natural languages; Probability; Space technology; BIG5; Bayesian rules; Chinese coding; Chinese decoding type identification; GB; UTF-8; Unicode; multi-octets coding;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Natural Language Processing and Knowledge Engineering, 2009. NLP-KE 2009. International Conference on
  • Conference_Location
    Dalian
  • Print_ISBN
    978-1-4244-4538-7
  • Electronic_ISBN
    978-1-4244-4540-0
  • Type
    conf
  • DOI
    10.1109/NLPKE.2009.5313785
  • Filename
    5313785
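
The abstract describes splitting text at Chinese punctuation and scoring the observed sub-sentence lengths under each candidate coding. The following is only a minimal sketch of that idea, not the authors' method: the punctuation set, the Python codec names standing in for GB, BIG5 and UTF-8, the Poisson length model with mean 8, and the mean log-likelihood score are all assumptions made for illustration. The paper instead estimates the length pdf from the Lancaster corpus and uses Bayesian rules with an iterated trust-region comparison, whose details are not given in this record.

```python
import math
import re

# Assumption: a small set of full-width Chinese punctuation marks used as
# sub-sentence delimiters; the paper's actual set is not listed in this record.
PUNCTUATION = "，。、；：？！"
SPLIT_RE = re.compile("[" + re.escape(PUNCTUATION) + "]")

# Assumption: Python codec names standing in for the GB, BIG5 and UTF-8 candidates.
CANDIDATES = ("gbk", "big5", "utf-8")

# Assumption: placeholder mean sub-sentence length (in characters) for a
# Poisson model; the paper derives its pdf from corpus samples instead.
MEAN_LENGTH = 8.0


def sub_sentence_lengths(text):
    """Split text at the punctuation marks and return the non-empty piece lengths."""
    return [len(piece) for piece in SPLIT_RE.split(text) if piece]


def log_pmf(k, mean=MEAN_LENGTH):
    """Log-probability of a sub-sentence of length k under the assumed Poisson model."""
    return k * math.log(mean) - mean - math.lgamma(k + 1)


def score(data, encoding):
    """Mean log-likelihood of the sub-sentence lengths seen under one coding hypothesis.

    With a uniform prior over the candidates, ranking by this score is a
    length-normalised stand-in for comparing posteriors.
    """
    try:
        text = data.decode(encoding)  # strict decoding rejects impossible hypotheses
    except UnicodeDecodeError:
        return float("-inf")
    lengths = sub_sentence_lengths(text)
    if not lengths:
        return float("-inf")
    return sum(log_pmf(k) for k in lengths) / len(lengths)


def identify(data):
    """Return the candidate coding whose sub-sentence lengths look most plausible."""
    return max(CANDIDATES, key=lambda enc: score(data, enc))


sample = "今天天气很好，我们去公园。"
print(sub_sentence_lengths(sample))
print(identify(sample.encode("utf-8")))
```

Under a wrong hypothesis the bytes either fail to decode or decode to text in which the punctuation marks rarely appear, so the sub-sentences become implausibly long and the score drops. Only the few punctuation marks need to be stored, which is the space advantage the abstract points to.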