
IBM Reduces Neural Network Training Times by Efficiently Scaling Training


In August 2017 IBM announced that it had broken the training record for image recognition. IBM Research reduced the training time for the neural network layout called ResNet-50 to only 50 minutes. On another network layout, ResNet-101, they obtained a new accuracy record of 33.8 percent. Using 256 GPUs, they trained their neural network on the 7.5 million images in the ImageNet-22K dataset. For comparison: in June 2017 Facebook announced it was able to train its model in an hour, but it used a smaller dataset and a smaller neural network. IBM published the results as a paper on arXiv.
InfoQ reached out to Hillery Hunter, director of accelerated cognitive infrastructure at IBM Research, and asked several questions.
InfoQ: Could you start by telling us what problem you faced when attempting to break this record? How big was your data set, and what problems do others normally face with these datasets?
We used 7.5 million images for our ResNet-101 training run, and when you’re dealing with that many pieces of data, compute time becomes a major challenge. If you had conducted this training run on a single server, it would have taken about 16 days to complete. There are few domains today in which people will tolerate that kind of computing turnaround time. We wanted to tackle this time-scale problem and bring training on such a large dataset down to well under a day.
InfoQ: The communication between the 256 GPUs you used is very important in this achievement. Could you tell us what you did, and how this helps in training your network?
We developed a custom communication library which helps all the learners in the system (each of the GPUs) communicate with each other at very close to optimal speeds and bandwidths. Our library can be integrated into any deep learning framework (TensorFlow, Caffe, etc.) — it isn’t hard-coded into just one deep learning software package. When the learners can communicate with each other very quickly, then you can productively add more learners to your system and complete the training run faster. If you don’t have fast communication time, you hit scalability bottlenecks and can’t apply more servers/GPUs to solve your training problem.
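IBM has not published the internals of its communication library, but the pattern it accelerates is synchronous gradient exchange between learners. The sketch below is a minimal illustration of that pattern using PyTorch's torch.distributed package rather than IBM's library; the model and training loop are placeholders, and the code is only meant to show which step a faster communication layer speeds up.

    # Minimal sketch of synchronous gradient exchange between learners.
    # Uses PyTorch's torch.distributed (not IBM's DDL library) purely to
    # illustrate the communication step a fast library accelerates.
    import torch
    import torch.distributed as dist

    def average_gradients(model):
        """All-reduce each gradient so every learner sees the same average."""
        world_size = dist.get_world_size()
        for param in model.parameters():
            if param.grad is not None:
                dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
                param.grad /= world_size

    # Inside the training loop (one process per GPU):
    #   loss.backward()            # compute local gradients on this learner
    #   average_gradients(model)   # communication step the library speeds up
    #   optimizer.step()           # identical update applied on every learner

The faster and more bandwidth-efficient that all-reduce step is, the more learners can be added before communication becomes the bottleneck.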
InfoQ: Something you mention is scaling efficiency. The previous record was at 89%, and you managed to reach 95%. What exactly is scaling efficiency, and how is this relevant to your training time?
Scaling efficiency is a measure of how effectively many servers can work together to solve your compute problem. The more efficient your scaling is, the more servers you can add and speed up your solution time. 95% scaling efficiency means that if instead of using 1 server to tackle your problem, you instead used 100 servers, they’ll complete the problem 95 times faster.
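As a rough illustration of that arithmetic, speedup is approximately the number of servers multiplied by the scaling efficiency. The numbers below are hypothetical, reusing the 16-day single-server estimate and the 100-server example from the answers above rather than IBM's actual measurements.

    # Hypothetical illustration of scaling efficiency (example numbers,
    # not IBM's measurements).
    def speedup(num_servers, efficiency):
        """Effective speedup over a single server."""
        return num_servers * efficiency

    single_server_days = 16   # ~16-day single-server estimate from the interview
    servers = 100             # hypothetical server count from the answer above
    eff = 0.95                # 95% scaling efficiency

    effective_speedup = speedup(servers, eff)              # 95x
    hours = single_server_days * 24 / effective_speedup    # ~4 hours
    print(f"{effective_speedup:.0f}x faster, ~{hours:.1f} hours instead of {single_server_days} days")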
InfoQ: In this case you used 256 GPUs to achieve the 95% scaling efficiency. If I were to use 10,000 GPUs, would my network still train 9,500 times faster? In other words: is this a linear scale? And what are the limiting factors?
We believe our new communication library is quite close to optimal and we would expect to continue to see great speedups with many more GPUs. Right now, the deep learning research community is working on tackling a limiting factor called "batch size". This factor would currently make 10,000 GPU runs difficult, but as it is overcome, we expect to see scaling to many more GPUs become possible.
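The batch-size constraint mentioned here comes from the fact that in synchronous data-parallel training the effective (global) batch grows with the number of learners. A rough illustration, with a made-up per-GPU batch size that is not taken from IBM's runs:

    # Why batch size limits scaling: in synchronous data-parallel training,
    # the global batch grows with the number of learners. The per-GPU batch
    # below is illustrative only.
    per_gpu_batch = 32

    for gpus in (256, 10_000):
        global_batch = per_gpu_batch * gpus
        print(f"{gpus:>6} GPUs -> global batch of {global_batch:,} images per update")

    # 256 GPUs    -> 8,192 images per update (already large)
    # 10,000 GPUs -> 320,000 images per update; training accurately with
    # batches that large is still an open research problem, which is the
    # limiting factor described above.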
InfoQ: In addition to breaking the record you also managed to improve the accuracy from 29.8% to 33.8%. Is this purely because of more "training power", or did you change the network layout?
We didn’t do new neural network design for this work; we leveraged fully synchronous training (made possible by our low-latency communication library), and our training-time advantages made it feasible to train on a very large number of images.
InfoQ: What framework did you develop your model in?
In our announcement, we describe work done in both Torch (ResNet-50) and Caffe (ResNet-101). Through the PowerAI technology preview program, IBM Server Group is also making our Distributed Deep Learning technology available for people to try using TensorFlow.
InfoQ: Could you explain what the PowerAI platform is, and what it can do for developers?
PowerAI is a set of deep learning capabilities, including frameworks (like Caffe, TensorFlow, Torch, Chainer, etc.), multi-server support, and user tools which are pre-compiled and pre-optimized for GPU-accelerated IBM Servers. PowerAI spares users the hassle of getting started with open-source deep learning tools and provides capabilities that help speed up training and simplify getting good deep learning performance on custom datasets. Anyone can try the PowerAI capabilities either on their own server or through the Nimbix cloud.
InfoQ: Are there plans to increase the training speed even more? What do you think is the limit in terms of computation time and accuracy?
We believe our distributed deep learning library is quite close to optimal in terms of scaling efficiency, but overall we definitely think deep learning training times and accuracies will improve further. We want to see deep learning move out of the ivory tower, where large-scale training runs currently take weeks to a month, and into the hands of customers who need business results in minutes and seconds.
Hillery Hunter is an IBM Fellow and Director of the Accelerated Cognitive Infrastructure group at IBM’s T. J. Watson Research Center in Yorktown Heights, NY. She is interested in cross-disciplinary technology topics, spanning silicon to system architecture to achieve new solutions to traditional problems. Her team pursues hardware-software co-optimization to take the wait time out of machine and deep learning problems. Her prior work was in the areas of DRAM main memory systems and embedded DRAM, and she gained development experience serving as IBM’s server and mainframe DDR3-generation end-to-end memory power lead.
