DocumentCode
2260300
Title
Chinese coding type identification based on sub-sentence length observation
Author
He, Gang ; Peng, Peidong ; Wu, Xiaochun ; Chen, Luming
Author_Institution
Sch. of Inf. & Commun. Eng., Beijing Univ. of Posts & Telecommun., Beijing, China
fYear
2009
fDate
24-27 Sept. 2009
Firstpage
1
Lastpage
5
Abstract
This paper studies an algorithm that identifies the coding type of Chinese characters by analyzing sub-sentence length. A definition of sub-sentence is given, and the probability density function (pdf) of sub-sentence length is analyzed using sentence samples from the Lancaster corpus. We propose a new algorithm that recognizes the coding type of Chinese characters by splitting sentences into sub-sentences at Chinese punctuation characters and analyzing the probability of the observed sub-sentence lengths. The algorithm uses both Bayesian rules and iterated sub-sentence length calculation for trust-region comparison. Because the set of Chinese punctuation characters is very small, the algorithm has a clear advantage in space complexity. Time complexity and identification performance are also studied at the end of the paper.
Keywords
belief networks; character recognition; computational complexity; encoding; identification; probability; sampling methods; Bayesian rule; Chinese coding type identification; Chinese punctuation character; Lancaster corpus; character coding recognition; probability analysis; sentence sample; space complexity; subsentence length observation; time complexity; trust-region comparison; Algorithm design and analysis; Bayesian methods; Character recognition; Decoding; Frequency; Helium; Information analysis; Natural languages; Probability; Space technology; BIG5; Bayesian rules; Chinese coding; Chinese decoding type identification; GB; UTF-8; Unicode; multi-octets coding;
fLanguage
English
Publisher
IEEE
Conference_Titel
Natural Language Processing and Knowledge Engineering, 2009. NLP-KE 2009. International Conference on
Conference_Location
Dalian
Print_ISBN
978-1-4244-4538-7
Electronic_ISBN
978-1-4244-4540-0
Type
conf
DOI
10.1109/NLPKE.2009.5313785
Filename
5313785