Title :
New word identification in social network text based on time series information
Author :
Meng Wang ; Lanfen Lin ; Feng Wang
Author_Institution :
Coll. of Comput. Sci. & Technol., Zhejiang Univ., Hangzhou, China
Abstract :
Unlike languages widely used in Western countries, such as English and French, Chinese has no spaces between words, so text must be segmented before further processing. New word identification is an important problem in segmentation, especially when the targets are social network texts, which contain more abbreviations and other non-standard expressions. Several methods have been proposed to detect Chinese new words, but most of them treat the corpus as a static set and do not consider time-domain information. In contrast, we regard our social network corpus as a text series spread along the timeline and design a new kind of feature, named dynamic features, that reflects the temporal variation of a string's statistical features. Experimental results on a dataset crawled from the largest microblogging application in China show that this method can significantly improve the performance of Chinese new word identification.
Keywords :
Internet; natural language processing; social networking (online); text analysis; time series; Chinese language; Chinese new words; English; French; Western countries; microblogging application; new word identification; segmentation process; segmentation targets; social network corpus; social network text; string statistical features; text series; time domain information; time series information; Blogs; Entropy; Feature extraction; Social network services; Tagging; Time-domain analysis; Vectors; new word identification; social network; time domain;
Conference_Title :
Computer Supported Cooperative Work in Design (CSCWD), Proceedings of the 2014 IEEE 18th International Conference on
Conference_Location :
Hsinchu
DOI :
10.1109/CSCWD.2014.6846904