Massively Distributed SGD: ImageNet/ResNet-50 Training in a Flash

2019-03-05

Hiroaki Mikami, Hisahiro Suganuma, Pongsakorn U-chupala, Yoshiki Tanaka, Yuichi Kageyama

arXiv_CV Deep_Learning

Abstract

Scaling distributed deep learning to a massive GPU cluster is challenging because large mini-batch training is unstable and gradient synchronization adds overhead. We address the instability of large mini-batch training with batch-size control and label smoothing. We address the overhead of gradient synchronization with 2D-Torus all-reduce. Specifically, 2D-Torus all-reduce arranges GPUs in a logical 2D grid and performs a series of collective operations in different orientations. These two techniques are implemented with Neural Network Libraries (NNL). We have successfully trained ImageNet/ResNet-50 in 122 seconds without significant accuracy loss on the ABCI cluster.
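
To make the grid-based all-reduce idea concrete, the following is a minimal single-process sketch in plain Python that simulates workers arranged in a rows x cols grid. The abstract only states that collectives run "in different orientations"; the three-step order used here (reduce-scatter along rows, all-reduce along columns, all-gather along rows) is an assumption made for illustration, and the function name `torus_all_reduce` and its parameters are hypothetical. It is not the NNL implementation and does no real communication.

```python
def torus_all_reduce(grads, rows, cols):
    """Simulate an all-reduce over a logical rows x cols grid of workers.

    grads[r][c] is the gradient vector (a list of floats) held by worker (r, c).
    Returns a grid of the same shape in which every worker holds the
    element-wise sum of all input vectors.
    """
    n = len(grads[0][0])
    assert n % cols == 0, "vector length must split evenly into `cols` chunks"
    chunk = n // cols

    # Step 1 (assumed): reduce-scatter along each row.
    # Afterwards worker (r, c) owns chunk c, summed over the workers in row r.
    partial = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            lo, hi = c * chunk, (c + 1) * chunk
            partial[r][c] = [
                sum(grads[r][k][i] for k in range(cols)) for i in range(lo, hi)
            ]

    # Step 2 (assumed): all-reduce along each column.
    # Chunk c is now summed over every worker in the grid.
    full_chunk = [
        [sum(partial[r][c][i] for r in range(rows)) for i in range(chunk)]
        for c in range(cols)
    ]

    # Step 3 (assumed): all-gather along each row.
    # Every worker concatenates the fully reduced chunks back into one vector.
    gathered = [x for c in range(cols) for x in full_chunk[c]]
    return [[list(gathered) for _ in range(cols)] for _ in range(rows)]


# Example: a 2 x 2 grid where worker (r, c) holds a vector filled with r*cols + c.
rows, cols = 2, 2
grads = [[[float(r * cols + c)] * 4 for c in range(cols)] for r in range(rows)]
out = torus_all_reduce(grads, rows, cols)
assert out[0][0] == [6.0, 6.0, 6.0, 6.0]  # 0 + 1 + 2 + 3 on every element
```

In this sketch, the column step only moves one chunk of size n / cols per worker instead of the full vector, which illustrates why splitting the work across two orientations can reduce the amount of data each step has to exchange.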

URL

http://arxiv.org/abs/1811.05233

PDF

http://arxiv.org/pdf/1811.05233
