Title :
Chinese coding type identification based on sub-sentence length observation
Author :
He, Gang ; Peng, Peidong ; Wu, Xiaochun ; Chen, Luming
Author_Institution :
Sch. of Inf. & Commun. Eng., Beijing Univ. of Posts & Telecommun., Beijing, China
Abstract :
This paper studies an algorithm for identifying the coding type of Chinese characters by analyzing sub-sentence length. A definition of sub-sentence is given, and the probability density function (pdf) of sub-sentence length is analyzed using sentence samples from the Lancaster corpus. We propose a new algorithm that recognizes the coding type of Chinese characters by splitting sentences into sub-sentences at Chinese punctuation characters and analyzing the probability of the observed sub-sentence lengths. The algorithm uses both Bayesian rules and iterated sub-sentence length calculation for trust-region comparison. Because the set of Chinese punctuation characters is very small, the algorithm has a clear advantage in space complexity. Time complexity and identification performance are also studied at the end of the paper.
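The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. For each candidate coding it splits the raw bytes at that coding's punctuation byte patterns, scores the resulting sub-sentence lengths under a length model, and applies Bayes' rule with a uniform prior. The candidate list, the punctuation set, the Poisson length model with mean `MEAN_LENGTH`, and all function names are assumptions made for illustration. The paper's pdf is estimated from the Lancaster corpus and is not reproduced here, and the iterated trust-region comparison is omitted.

```python
import math
import re

# Assumed for illustration; the paper's punctuation set and candidates may differ.
PUNCTUATION = "，。；：？！、"
# Candidate codec name -> nominal bytes per Chinese character (an assumption).
CANDIDATES = {"utf-8": 3, "gb2312": 2, "big5": 2}
# Placeholder mean sub-sentence length in characters, not a value from the paper.
MEAN_LENGTH = 8.0


def sub_sentence_lengths(data, encoding):
    """Split bytes at the punctuation byte patterns of `encoding`.

    Returns approximate sub-sentence lengths in characters. The byte-level
    search ignores character alignment, which is a simplification.
    """
    marks = [p.encode(encoding) for p in PUNCTUATION]
    pattern = b"|".join(re.escape(m) for m in marks)
    pieces = [piece for piece in re.split(pattern, data) if piece]
    width = CANDIDATES[encoding]
    return [len(piece) // width for piece in pieces]


def log_likelihood(lengths, mean=MEAN_LENGTH):
    """Log-probability of the lengths under a Poisson stand-in model."""
    return sum(k * math.log(mean) - mean - math.lgamma(k + 1) for k in lengths)


def posteriors(data):
    """Posterior over candidate codings via Bayes' rule with a uniform prior."""
    scores = {
        enc: log_likelihood(sub_sentence_lengths(data, enc)) for enc in CANDIDATES
    }
    top = max(scores.values())  # subtract the maximum for numerical stability
    weights = {enc: math.exp(s - top) for enc, s in scores.items()}
    total = sum(weights.values())
    return {enc: w / total for enc, w in weights.items()}


def identify(data):
    """Return the candidate coding with the highest posterior."""
    post = posteriors(data)
    return max(post, key=post.get)
```

Under the wrong coding hypothesis the punctuation byte patterns rarely match, so the text collapses into a few implausibly long sub-sentences and that hypothesis receives a low posterior. Only the punctuation table has to be stored per coding, which is consistent with the space-complexity argument in the abstract.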
Keywords :
belief networks; character recognition; computational complexity; encoding; identification; probability; sampling methods; Bayesian rule; Chinese coding type identification; Chinese punctuation character; Lancaster corpus; character coding recognition; probability analysis; sentence sample; space complexity; subsentence length observation; time complexity; trust-region comparison; Algorithm design and analysis; Bayesian methods; Character recognition; Decoding; Frequency; Helium; Information analysis; Natural languages; Probability; Space technology; BIG5; Bayesian rules; Chinese coding; Chinese decoding type identification; GB; UTF-8; Unicode; multi-octets coding;
Conference_Title :
Natural Language Processing and Knowledge Engineering, 2009. NLP-KE 2009. International Conference on
Conference_Location :
Dalian
Print_ISBN :
978-1-4244-4538-7
Electronic_ISBN :
978-1-4244-4540-0
DOI :
10.1109/NLPKE.2009.5313785