DocumentCode :
177477
Title :
On parallelizability of stochastic gradient descent for speech DNNs
Author :
Seide, Frank; Fu, Hao; Droppo, Jasha; Li, Gang; Yu, Dong
Author_Institution :
Microsoft Res. Asia, Beijing, China
fYear :
2014
fDate :
4-9 May 2014
Firstpage :
235
Lastpage :
239
Abstract :
This paper compares the theoretical efficiency of model-parallel and data-parallel distributed stochastic gradient descent training of DNNs. For a typical Switchboard DNN with 46M parameters, the results are not pretty: With modern GPUs and interconnects, model parallelism is optimal with only 3 GPUs in a single server, while data parallelism with a minibatch size of 1024 does not even scale to 2 GPUs. We further show that data-parallel training efficiency can be improved by increasing the minibatch size (through a combination of AdaGrad and automatic adjustments of learning rate and minibatch size) and data compression. We arrive at an estimated possible end-to-end speed-up of 5 times or more. We do not address robustness to process failure or other issues that might occur during training, nor differences in convergence speed between ASGD and SGD parameter-update patterns.
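For orientation, the sketch below illustrates the data-parallel minibatch SGD pattern whose scalability the abstract analyzes: each of K workers computes a gradient on its shard of the minibatch, the gradients are averaged (the communication step whose cost dominates), and a single shared update is applied. This is a minimal illustration, not the paper's implementation; the toy linear model, worker count, and dimensions are assumptions for illustration only.

```python
# Minimal sketch of synchronous data-parallel minibatch SGD (not the paper's code).
# K hypothetical workers each compute the gradient on a shard of one minibatch;
# the gradients are averaged, then one shared parameter update is applied.
import numpy as np

rng = np.random.default_rng(0)
K = 4                   # assumed number of workers (GPUs)
minibatch_size = 1024   # minibatch size discussed in the abstract
dim = 512               # assumed feature dimension of a toy linear model

w = np.zeros(dim)                                   # shared model parameters
X = rng.normal(size=(minibatch_size, dim))          # synthetic minibatch
y = X @ rng.normal(size=dim) + 0.1 * rng.normal(size=minibatch_size)

def shard_gradient(w, Xs, ys):
    """Mean-squared-error gradient computed locally on one worker's shard."""
    residual = Xs @ w - ys
    return Xs.T @ residual / len(ys)

# Each worker handles minibatch_size / K frames; averaging the K gradients is
# the all-reduce-style communication step whose cost limits data parallelism.
shards = np.array_split(np.arange(minibatch_size), K)
grads = [shard_gradient(w, X[idx], y[idx]) for idx in shards]
g = np.mean(grads, axis=0)

learning_rate = 0.01
w -= learning_rate * g                              # single synchronized SGD update
```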
Keywords :
gradient methods; neural nets; parallel processing; data compression; data-parallel training efficiency; distributed stochastic gradient descent training; speech DNNs; Computational modeling; Data models; Hidden Markov models; Parallel processing; Peer-to-peer computing; Speech; Training
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on
Conference_Location :
Florence, Italy
Type :
conf
DOI :
10.1109/ICASSP.2014.6853593
Filename :
6853593