Regional Information Center for Science and Technology - Understanding speaking styles of internet speech data with LSTM and low-resource training

DocumentCode :

3703409

Title :

Understanding speaking styles of internet speech data with LSTM and low-resource training

Author :

Xixin Wu;Zhiyong Wu;Yishuang Ning;Jia Jia;Lianhong Cai;Helen Meng

Author_Institution :

Tsinghua-CUHK Joint Research Center for Media Sciences, Technologies and Systems, Shenzhen Key Laboratory of Information Science and Technology, Graduate School at Shenzhen, Tsinghua University, Shenzhen, China

fYear :

2015

Firstpage :

815

Lastpage :

820

Abstract :

Speech is widely used to express one's emotions, intentions, desires, etc. in social network communication, producing abundant internet speech data with different speaking styles. Such data provide a good resource for social multimedia research. However, because different styles are mixed together in internet speech data, classifying such data remains a challenging problem. Previous work used utterance-level statistics of acoustic features to classify speaking styles, ignoring local context information. Long short-term memory (LSTM) recurrent neural networks (RNNs) have achieved notable success in many research areas, such as speech recognition. An LSTM can retain context information over long time spans, which is important for characterizing speaking styles. Training an LSTM, however, requires a large amount of labeled training data, and such large-scale labeled data are difficult to obtain for internet speech data classification. On the other hand, publicly available data exist for other tasks (such as speech emotion recognition), which offers a new possibility for exploiting LSTM in this low-resource task. We adopt a retraining strategy to train an LSTM to recognize speaking styles in speech data: the network is trained on emotion and speaking style datasets sequentially, without resetting its weights between the two. Experimental results demonstrate that retraining improves both the training speed and the accuracy of the network in speaking style classification.
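
The retraining idea in the abstract (train on one dataset, then continue on a second without resetting the weights) can be illustrated with a minimal sketch. This is not the authors' LSTM or their data: a tiny logistic-regression model stands in for the network so the example runs with the Python standard library alone, and the datasets are synthetic. The names `make_data`, `train`, `predict`, and `accuracy`, the dataset sizes, and the learning rate are all assumptions made for illustration.

```python
import math
import random


def make_data(n, shift, seed):
    """Synthetic 2-D points with binary labels (illustrative only)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = shift if label else -shift
        data.append(([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label))
    return data


def predict(weights, x):
    """Probability of class 1 under a logistic model [bias, w1, w2]."""
    z = weights[0] + weights[1] * x[0] + weights[2] * x[1]
    z = max(-30.0, min(30.0, z))  # clamp to keep math.exp well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train(weights, data, epochs, lr=0.1):
    """Plain SGD on the logistic loss; updates `weights` in place."""
    for _ in range(epochs):
        for x, y in data:
            err = predict(weights, x) - y
            weights[0] -= lr * err
            weights[1] -= lr * err * x[0]
            weights[2] -= lr * err * x[1]
    return weights


def accuracy(weights, data):
    """Fraction of examples classified correctly at a 0.5 threshold."""
    hits = sum((predict(weights, x) >= 0.5) == bool(y) for x, y in data)
    return hits / len(data)


# Assumption: a large "source" set stands in for the emotion corpus and a
# small "target" set stands in for the low-resource speaking-style corpus.
source = make_data(500, 1.0, seed=0)
target = make_data(20, 1.0, seed=1)
held_out = make_data(200, 1.0, seed=2)

weights = [0.0, 0.0, 0.0]
train(weights, source, epochs=5)      # stage 1: train on the source task
pretrained = list(weights)            # snapshot, kept only for comparison
train(weights, target, epochs=1)      # stage 2: continue on target, no reset

print("held-out accuracy after retraining:", accuracy(weights, held_out))
```

The point of the sketch is only the control flow: the second `train` call starts from the weights left by the first instead of a fresh initialization. In this toy setup both tasks share the same synthetic distribution, which is a simplification; it says nothing about the speed or accuracy gains reported in the paper.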

Keywords :

"Speech","Training","Speech recognition","Training data","Context","Internet","Recurrent neural networks"

Publisher :

IEEE

Conference_Title :

Affective Computing and Intelligent Interaction (ACII), 2015 International Conference on

Electronic_ISBN :

2156-8111

Type :

conf

DOI :

10.1109/ACII.2015.7344667

Filename :

7344667

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3703409