• DocumentCode
    3434127
  • Title
    Chinese coding type identification based on Kolmogorov complexity theory
  • Author
    He, Gang ; Zhu, Ning ; Wu, Xiaochun ; Xu, Qiuchen
  • Author_Institution
    Sch. of Inf. & Commun. Eng., Beijing Univ. of Posts & Telecommun., Beijing, China
  • fYear
    2010
  • fDate
    24-26 Sept. 2010
  • Firstpage
    293
  • Lastpage
    297
  • Abstract
    Identifying the coding type of Chinese text is a major and challenging issue in Chinese web content audit and analysis. In this paper we develop a novel algorithm based on Kolmogorov complexity theory to identify the coding type of the Chinese characters in a given text segment. An array of text compressors is used as filters to evaluate the information distance between the text under examination and training corpora encoded in different coding types. According to Kolmogorov complexity theory, this information distance can be used to decide the coding type. A particular compression algorithm is used to minimize computational complexity by separating the codebook training stage from the compression stage. Finally, we present experimental results that confirm the accuracy and performance of the algorithm. The results also show that, compared with n-gram algorithms, this algorithm is especially efficient when the text segment under examination is short.
  • Keywords
    data compression; encoding; text analysis; Chinese characters; Chinese coding type identification; Chinese web content audit; Kolmogorov complexity theory; information distance; n-gram algorithms; text compressors; text segment; Accuracy; Algorithm design and analysis; Books; Complexity theory; Encoding; Grippers; Training; Chinese encoding identification; Kolmogorov complexity; information distance; text compression;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Network Infrastructure and Digital Content, 2010 2nd IEEE International Conference on
  • Conference_Location
    Beijing
  • Print_ISBN
    978-1-4244-6851-5
  • Type
    conf
  • DOI
    10.1109/ICNIDC.2010.5657789
  • Filename
    5657789
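
The idea summarized in the abstract (compress the text under examination against a codebook trained on a corpus in each candidate coding type, then pick the coding type that gives the smallest result) can be illustrated with a minimal sketch. This is not the authors' compressor: it assumes zlib's preset-dictionary feature as a stand-in for the separate codebook training stage, and the sample corpus, the candidate list, and all function names are hypothetical.

```python
import zlib

# Hypothetical training sample; a real system would use a large corpus
# per coding type. The characters are chosen to be encodable in all
# three candidate encodings.
SAMPLE = "我今天在北京分析中文文本的字符我明天在上海分析英文文本的字符"
CANDIDATES = ("gbk", "big5", "utf-8")


def train_codebooks(corpus=SAMPLE, encodings=CANDIDATES):
    """Training stage: build one byte-level codebook per candidate encoding."""
    return {enc: corpus.encode(enc) for enc in encodings}


def conditional_size(data: bytes, codebook: bytes) -> int:
    """Approximate K(data | codebook) by the deflate output size when the
    codebook is supplied as a preset dictionary (an assumption of this sketch)."""
    comp = zlib.compressobj(level=9, zdict=codebook)
    return len(comp.compress(data) + comp.flush())


def identify_encoding(data: bytes, codebooks) -> str:
    """Compression stage: return the encoding whose codebook compresses
    the data to the fewest bytes."""
    return min(codebooks, key=lambda enc: conditional_size(data, codebooks[enc]))
```

Calling `identify_encoding(raw_bytes, train_codebooks())` returns the name of the candidate encoding whose codebook best predicts the byte sequence. Because the codebooks are built once and reused, each query costs only one short compression pass per candidate, which mirrors the separation of training and compression stages described in the abstract.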