DocumentCode :
2260300
Title :
Chinese coding type identification based on sub-sentence length observation
Author :
He, Gang ; Peng, Peidong ; Wu, Xiaochun ; Chen, Luming
Author_Institution :
Sch. of Inf. & Commun. Eng., Beijing Univ. of Posts & Telecommun., Beijing, China
fYear :
2009
fDate :
24-27 Sept. 2009
Firstpage :
1
Lastpage :
5
Abstract :
This paper studies an algorithm for identifying the coding type of Chinese characters by analyzing sub-sentence length. A definition of the sub-sentence is given, and the probability density function (pdf) of sub-sentence length is analyzed using sentence samples from the Lancaster corpus. We propose a new algorithm that recognizes the coding type of Chinese characters by splitting sentences into sub-sentences at Chinese punctuation characters and analyzing the probability of the observed sub-sentence lengths. The algorithm uses both Bayesian rules and iterated sub-sentence length calculation for trust-region comparison. Because the set of Chinese punctuation characters is very small, the algorithm has a clear advantage in space complexity. Time complexity and identification performance are also studied at the end of the paper.
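Illustrative sketch (not taken from the paper): the abstract's core idea, decoding under each candidate coding, splitting at Chinese punctuation, and scoring the resulting sub-sentence lengths, might look roughly like the Python below. The candidate list, the punctuation set, the Poisson length model, its mean of 8 characters, and the names `length_log_prob`, `score` and `identify` are all assumptions made for illustration. The paper's pdf is estimated from the Lancaster corpus, and its iterated calculation and trust-region comparison are not reproduced here. With equal priors on the candidates, Bayes' rule reduces to picking the coding with the highest likelihood.

```python
import math
import re

# Assumed candidate codings; the record's keywords mention GB, BIG5 and UTF-8.
CANDIDATES = ("utf-8", "gb18030", "big5")

# Small, assumed set of full-width Chinese punctuation used as delimiters.
PUNCT = "，。！？；：、"
SPLIT_RE = re.compile("[" + re.escape(PUNCT) + "]")

# Hypothetical mean sub-sentence length in characters (not from the paper).
MEAN_LEN = 8.0


def length_log_prob(n, mean=MEAN_LEN):
    """Log-probability of a sub-sentence of n characters under a Poisson
    placeholder; the paper uses a corpus-derived pdf instead."""
    return n * math.log(mean) - mean - math.lgamma(n + 1)


def score(data, encoding):
    """Mean log-likelihood of the sub-sentence lengths seen when `data`
    is decoded with `encoding`; -inf if the bytes do not decode."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return float("-inf")
    lengths = [len(part) for part in SPLIT_RE.split(text) if part]
    if not lengths:
        return float("-inf")
    # Averaging per sub-sentence is a normalisation chosen for this sketch.
    return sum(length_log_prob(n) for n in lengths) / len(lengths)


def identify(data):
    """Return the candidate coding with the best score (equal priors assumed)."""
    scores = {enc: score(data, enc) for enc in CANDIDATES}
    return max(scores, key=scores.get)
```

A wrong coding either fails to decode or yields text with no recognisable punctuation, so the input collapses into one implausibly long sub-sentence and scores poorly. Only the punctuation set and a length model need to be stored, which is consistent with the space-complexity advantage the abstract claims.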
Keywords :
belief networks; character recognition; computational complexity; encoding; identification; probability; sampling methods; Bayesian rule; Chinese coding type identification; Chinese punctuation character; Lancaster corpus; character coding recognition; probability analysis; sentence sample; space complexity; subsentence length observation; time complexity; trust-region comparison; Algorithm design and analysis; Bayesian methods; Character recognition; Decoding; Frequency; Helium; Information analysis; Natural languages; Probability; Space technology; BIG5; Bayesian rules; Chinese coding; Chinese decoding type identification; GB; UTF-8; Unicode; multi-octets coding;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Natural Language Processing and Knowledge Engineering, 2009. NLP-KE 2009. International Conference on
Conference_Location :
Dalian
Print_ISBN :
978-1-4244-4538-7
Electronic_ISBN :
978-1-4244-4540-0
Type :
conf
DOI :
10.1109/NLPKE.2009.5313785
Filename :
5313785