Keras is undoubtedly my favorite deep learning + Python framework, especially for image classification.
I use Keras in production applications, in my personal deep learning projects, and here on the PyImageSearch blog.
I’ve even based over two-thirds of my new book, Deep Learning for Computer Vision with Python, on Keras.
However, one of my biggest hangups with Keras is that it can be a pain to perform multi-GPU training. Between the boilerplate code and configuring TensorFlow it can be a bit of a process…
…but not anymore.
With the latest commit and release of Keras (v2.0.9) it’s now extremely easy to train deep neural networks using multiple GPUs.
In fact, it’s as easy as a single function call!
To learn more about training deep neural networks using Keras, Python, and multiple GPUs, just keep reading.
How-To: Multi-GPU training with Keras, Python, and deep learning
2020-06-16 Update: This blog post is now TensorFlow 2+ compatible! Keras is now built into TensorFlow 2 and serves as TensorFlow’s high-level API. Sentences in this introduction section may be misleading given the update to TensorFlow/Keras; they are left “as-is” for historical purposes.
When I first started using Keras I fell in love with the API. It’s simple and elegant, similar to scikit-learn. Yet it’s extremely powerful, capable of implementing and training state-of-the-art deep neural networks.
However, one of my biggest frustrations with Keras is that it could be a bit non-trivial to use in multi-GPU environments.
If you were using Theano, forget about it — multi-GPU training wasn’t going to happen.
TensorFlow was a possibility, but it could take a lot of boilerplate code and tweaking to get your network to train using multiple GPUs.
I preferred using Keras’ mxnet backend (or even the mxnet library outright) when performing multi-GPU training, but that introduced even more configurations to handle.
All of that changed with François Chollet’s announcement that multi-GPU support using the TensorFlow backend is now baked into Keras v2.0.9. Much of this credit goes to @kuza55 and their keras-extras repo.
I’ve been using and testing this multi-GPU function for almost a year now and I’m incredibly excited to see it as part of the official Keras distribution.
In the remainder of today’s blog post I’ll be demonstrating how to train a Convolutional Neural Network for image classification using Keras, Python, and deep learning.
The MiniGoogLeNet deep learning architecture
In Figure 1 above we can see the individual convolution (left), inception (middle), and downsample (right) modules, followed by the overall MiniGoogLeNet architecture (bottom), constructed from these building blocks. We will be using the MiniGoogLeNet architecture in our multi-GPU experiments later in this post.
The Inception module in MiniGoogLeNet is a variation of the original Inception module designed by Szegedy et al.
I first became aware of this “Miniception” module from a tweet by @ericjang11 and @pluskid where they beautifully visualized the modules and associated MiniGoogLeNet architecture.
After doing a bit of research, I found that this graphic was from Zhang et al.’s 2017 publication, Understanding Deep Learning Requires Rethinking Generalization.
I then proceeded to implement the MiniGoogLeNet architecture in Keras + Python — I even included it as part of Deep Learning for Computer Vision with Python.
A full review of the MiniGoogLeNet Keras implementation is outside the scope of this blog post, so if you’re interested in how the network works (and how to code it), please refer to my book.
Otherwise, you can use the “Downloads” section at the bottom of this blog post to download the source code.
Configuring your development environment
To configure your system for this tutorial, I recommend following my How to install TensorFlow 2.0 on Ubuntu guide which has instructions on how to set up your system with your GPU driver, CUDA, and cuDNN.
Additionally, you’ll learn to set up a convenient Python virtual environment to house your Python packages, including TensorFlow 2+.
I don’t recommend using macOS for working with GPUs; please follow this link if you need a non-GPU macOS TensorFlow installation guide (not for today’s tutorial).
Additionally, please note that PyImageSearch does not recommend or support Windows for CV/DL projects.
Training a deep neural network with Keras and multiple GPUs
Let’s go ahead and get started training a deep learning network using Keras and multiple GPUs.
Open up a new file, name it train.py
, and insert the following code:
# set the matplotlib backend so figures can be saved in the background
# (uncomment the lines below if you are using a headless server)
# import matplotlib
# matplotlib.use("Agg")

# import the necessary packages
from pyimagesearch.minigooglenet import MiniGoogLeNet
from sklearn.preprocessing import LabelBinarizer
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.callbacks import LearningRateScheduler
from tensorflow.compat.v2.keras.utils import multi_gpu_model
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.datasets import cifar10
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
import argparse
If you’re using a headless server, you’ll want to configure the matplotlib backend on Lines 3 and 4 by uncommenting the lines. This will enable your matplotlib plots to be saved to disk. If you are not using a headless server (i.e., your keyboard + mouse + monitor are plugged into your system), you can keep the lines commented out.
From there we import our required packages for this script.
Line 7 imports MiniGoogLeNet from my pyimagesearch module (included with the download available in the “Downloads” section).
Another notable import is on Line 13 where we import the CIFAR10 dataset. This helper function will enable us to load the CIFAR-10 dataset from disk with just a single line of code.
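If you have not used this helper before, it downloads (on first use) and returns the 50,000 training and 10,000 testing CIFAR-10 images as NumPy arrays. The quick check below is only an illustration and is not part of train.py:

from tensorflow.keras.datasets import cifar10

# CIFAR-10 ships as 32x32 RGB images with integer class labels
((trainX, trainY), (testX, testY)) = cifar10.load_data()
print(trainX.shape, trainY.shape)   # (50000, 32, 32, 3) (50000, 1)
print(testX.shape, testY.shape)     # (10000, 32, 32, 3) (10000, 1)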
Now let’s parse our command line arguments:
# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-o", "--output", required=True,
	help="path to output plot")
ap.add_argument("-g", "--gpus", type=int, default=1,
	help="# of GPUs to use for training")
args = vars(ap.parse_args())

# grab the number of GPUs and store it in a convenience variable
G = args["gpus"]
We use argparse to parse one required and one optional argument on Lines 20-25:
--output: The path to the output plot after training is complete.
--gpus: The number of GPUs to use for training.
After loading the command line arguments, we store the number of GPUs as G for convenience (Line 28).
From there, we initialize two important variables used to configure our training process, followed by defining a polynomial-based learning rate schedule function:
# define the total number of epochs to train for along with the
# initial learning rate
NUM_EPOCHS = 70
INIT_LR = 5e-3

def poly_decay(epoch):
	# initialize the maximum number of epochs, base learning rate,
	# and power of the polynomial
	maxEpochs = NUM_EPOCHS
	baseLR = INIT_LR
	power = 1.0

	# compute the new learning rate based on polynomial decay
	alpha = baseLR * (1 - (epoch / float(maxEpochs))) ** power

	# return the new learning rate
	return alpha
We set NUM_EPOCHS = 70 — this is the number of times (epochs) our training data will pass through the network (Line 32).
We also initialize the learning rate INIT_LR = 5e-3, a value that was found experimentally in previous trials (Line 33).
From there, we define the poly_decay function, which is the equivalent of Caffe’s polynomial learning rate decay (Lines 35-46). Essentially this function updates the learning rate during training, effectively reducing it after each epoch. Setting power = 1.0 changes the decay from polynomial to linear.
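To build intuition for the schedule, here is a quick standalone check of the values it produces with our settings. This snippet is only an illustration and is not part of train.py:

NUM_EPOCHS = 70
INIT_LR = 5e-3

def poly_decay(epoch, maxEpochs=NUM_EPOCHS, baseLR=INIT_LR, power=1.0):
	# linear decay from the initial learning rate down toward zero
	return baseLR * (1 - (epoch / float(maxEpochs))) ** power

for epoch in (0, 35, 69):
	print(epoch, poly_decay(epoch))
# 0  -> 0.005      (the full initial learning rate)
# 35 -> 0.0025     (halfway through training, half the rate)
# 69 -> ~0.00007   (nearly zero by the final epoch)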
Next we’ll load our training + testing data and convert the image data from integer to float:
# load the training and testing data, converting the images from
# integers to floats
print("[INFO] loading CIFAR-10 data...")
((trainX, trainY), (testX, testY)) = cifar10.load_data()
trainX = trainX.astype("float")
testX = testX.astype("float")
From there we apply mean subtraction to the data:
# apply mean subtraction to the data
mean = np.mean(trainX, axis=0)
trainX -= mean
testX -= mean
On Line 56, we calculate the mean of all training images followed by Lines 57 and 58 where we subtract the mean from each image in the training and testing sets.
Then, we perform “one-hot encoding”, an encoding scheme I discuss in more detail in my book:
# convert the labels from integers to vectors
lb = LabelBinarizer()
trainY = lb.fit_transform(trainY)
testY = lb.transform(testY)
One-hot encoding transforms categorical labels from a single integer to a vector so we can apply the categorical cross-entropy loss function. We’ve taken care of this on Lines 61-63.
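As a quick illustration (again, not part of train.py), LabelBinarizer maps each integer class label to a 10-element vector containing a single one at the index of that class:

from sklearn.preprocessing import LabelBinarizer
import numpy as np

# fit on the ten CIFAR-10 classes (0 through 9), then one-hot encode a few labels
lb = LabelBinarizer()
lb.fit(np.arange(10))
print(lb.transform([0, 3, 9]))
# [[1 0 0 0 0 0 0 0 0 0]
#  [0 0 0 1 0 0 0 0 0 0]
#  [0 0 0 0 0 0 0 0 0 1]]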
Next, we create a data augmenter and set of callbacks:
# construct the image generator for data augmentation and construct
# the set of callbacks
aug = ImageDataGenerator(width_shift_range=0.1,
	height_shift_range=0.1, horizontal_flip=True,
	fill_mode="nearest")
callbacks = [LearningRateScheduler(poly_decay)]
On Lines 67-69 we construct the image generator for data augmentation.
Data augmentation is covered in detail inside the Practitioner Bundle of Deep Learning for Computer Vision with Python; however, for the time being understand that it’s a method used during the training process where we randomly alter the training images by applying random transformations to them.
Because of these alterations, the network is constantly seeing augmented examples — this enables the network to generalize better to the validation data while perhaps performing worse on the training set. In most situations this trade-off is a worthwhile one.
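If you want to see the augmenter in action, you can pull a single batch from it. The snippet below is only an illustration and assumes the aug, trainX, and trainY variables defined in the script above:

# grab one augmented batch of nine images and inspect its shape
(imageBatch, labelBatch) = next(aug.flow(trainX, trainY, batch_size=9))
print(imageBatch.shape, labelBatch.shape)   # (9, 32, 32, 3) (9, 10)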
We create a callback function on Line 70 which will allow our learning rate to decay after each epoch — notice our function name, poly_decay.
Let’s check that GPU variable next:
# check to see if we are compiling using just a single GPU
if G <= 1:
	print("[INFO] training with 1 GPU...")
	model = MiniGoogLeNet.build(width=32, height=32, depth=3,
		classes=10)
If the GPU count is less than or equal to one, we initialize the model via the .build function (Lines 73-76); otherwise, we’ll parallelize the model during training:
# otherwise, we are compiling using multiple GPUs
else:
	# disable eager execution
	tf.compat.v1.disable_eager_execution()
	print("[INFO] training with {} GPUs...".format(G))

	# we'll store a copy of the model on *every* GPU and then combine
	# the results from the gradient updates on the CPU
	with tf.device("/cpu:0"):
		# initialize the model
		model = MiniGoogLeNet.build(width=32, height=32, depth=3,
			classes=10)

	# make the model parallel
	model = multi_gpu_model(model, gpus=G)
Creating a multi-GPU model in Keras requires a bit of extra code, but not much!
To start, you’ll notice on Line 86 that we’ve specified to use the CPU (rather than the GPU) as the network context.
Why do we need the CPU?
Well, the CPU is responsible for handling any overhead (such as moving training images on and off GPU memory) while the GPU itself does the heavy lifting.
In this case, the CPU instantiates the base model.
We can then call multi_gpu_model on Line 92. This function replicates the model from the CPU to all of our GPUs, thereby obtaining single-machine, multi-GPU data parallelism.
When training our network, each batch of images is split across the GPUs. The CPU will obtain the gradients from each GPU and then perform the gradient update step.
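Note that multi_gpu_model was deprecated and eventually removed in later TensorFlow 2 releases. If you are on a recent TensorFlow version, the same data-parallel idea is expressed with tf.distribute.MirroredStrategy; the sketch below is only an illustration of that approach and is not part of the downloadable code for this post:

import tensorflow as tf
from pyimagesearch.minigooglenet import MiniGoogLeNet

# MirroredStrategy replicates the model on every visible GPU and
# averages the gradients across replicas for you
strategy = tf.distribute.MirroredStrategy()
print("[INFO] number of devices: {}".format(strategy.num_replicas_in_sync))

# build and compile the model *inside* the strategy scope
with strategy.scope():
	model = MiniGoogLeNet.build(width=32, height=32, depth=3, classes=10)
	model.compile(loss="categorical_crossentropy",
		optimizer=tf.keras.optimizers.SGD(learning_rate=5e-3, momentum=0.9),
		metrics=["accuracy"])

# model.fit(...) is then called exactly as in the rest of this tutorial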
We can then compile our model and kick off the training process:
# initialize the optimizer and model
print("[INFO] compiling model...")
opt = SGD(lr=INIT_LR, momentum=0.9)
model.compile(loss="categorical_crossentropy", optimizer=opt,
	metrics=["accuracy"])

# train the network
print("[INFO] training network...")
H = model.fit(
	x=aug.flow(trainX, trainY, batch_size=64 * G),
	validation_data=(testX, testY),
	steps_per_epoch=len(trainX) // (64 * G),
	epochs=NUM_EPOCHS,
	callbacks=callbacks, verbose=2)
2020-06-16 Update: Formerly, TensorFlow/Keras required use of a method called .fit_generator in order to accomplish data augmentation. Now, the .fit method can handle data augmentation as well, making for more consistent code. This also applies to the migration from .predict_generator to .predict. Be sure to check out my articles about fit and fit_generator as well as data augmentation.
On Line 96 we build a Stochastic Gradient Descent (SGD) optimizer with our initial learning rate.
Subsequently, we compile the model with the SGD optimizer and a categorical crossentropy loss function.
We’re now ready to train the network!
To initiate the training process, we make a call to model.fit and provide the necessary arguments. We’d like a batch size of 64 on each GPU, so that is specified by batch_size=64 * G.
Our training will continue for 70 epochs (which we specified previously).
The results of the gradient update will be combined on the CPU and then applied to each GPU throughout the training process.
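To make the arithmetic concrete, here is how the effective batch size and the resulting steps per epoch work out with four GPUs (illustration only):

# CIFAR-10 has 50,000 training images
G = 4                                   # number of GPUs
batch_size = 64 * G                     # 64 images per GPU -> 256 images per training step
steps_per_epoch = 50000 // batch_size   # 195 full steps per epoch
print(batch_size, steps_per_epoch)      # 256 195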
Now that training and testing is complete, let’s plot the loss/accuracy so we can visualize the training process:
# grab the history object dictionary
H = H.history

# plot the training loss and accuracy
N = np.arange(0, len(H["loss"]))
plt.style.use("ggplot")
plt.figure()
plt.plot(N, H["loss"], label="train_loss")
plt.plot(N, H["val_loss"], label="test_loss")
plt.plot(N, H["accuracy"], label="train_acc")
plt.plot(N, H["val_accuracy"], label="test_acc")
plt.title("MiniGoogLeNet on CIFAR-10")
plt.xlabel("Epoch #")
plt.ylabel("Loss/Accuracy")
plt.legend()

# save the figure
plt.savefig(args["output"])
plt.close()
2020-06-16 Update: In order for this plotting snippet to be TensorFlow 2+ compatible, the H.history dictionary keys are updated to fully spell out “accuracy” rather than the abbreviated “acc” (i.e., H["val_accuracy"] and H["accuracy"]). It is semi-confusing that “val” is not spelled out as “validation”; we have to learn to love and live with the API and always remember that it is a work in progress that many developers around the world contribute to.
This last block simply uses matplotlib to plot training/testing loss and accuracy (Lines 110-123), and then saves the figure to disk (Line 126).
If you would like to learn more about the training process (and how it works internally), please refer to Deep Learning for Computer Vision with Python.
Keras multi-GPU results
Let’s check the results of our hard work.
To start, grab the code from this lesson using the “Downloads” section at the bottom of this post. You’ll then be able to follow along with the results.
Let’s train on a single GPU to obtain a baseline:
$ python train.py --output single_gpu.png
[INFO] loading CIFAR-10 data...
[INFO] training with 1 GPU...
[INFO] compiling model...
[INFO] training network...
Epoch 1/70
 - 64s - loss: 1.4323 - accuracy: 0.4787 - val_loss: 1.1319 - val_accuracy: 0.5983
Epoch 2/70
 - 63s - loss: 1.0279 - accuracy: 0.6361 - val_loss: 0.9844 - val_accuracy: 0.6472
Epoch 3/70
 - 63s - loss: 0.8554 - accuracy: 0.6997 - val_loss: 1.5473 - val_accuracy: 0.5592
...
Epoch 68/70
 - 63s - loss: 0.0343 - accuracy: 0.9898 - val_loss: 0.3637 - val_accuracy: 0.9069
Epoch 69/70
 - 63s - loss: 0.0348 - accuracy: 0.9898 - val_loss: 0.3593 - val_accuracy: 0.9080
Epoch 70/70
 - 63s - loss: 0.0340 - accuracy: 0.9900 - val_loss: 0.3583 - val_accuracy: 0.9065
Using TensorFlow backend.

real    74m10.603s
user    131m24.035s
sys     11m52.143s
For this experiment, I trained on a single Titan X GPU on my NVIDIA DevBox. Each epoch took ~63 seconds with a total training time of 74m10s.
I then executed the following command to train with all four of my Titan X GPUs:
$ python train.py --output multi_gpu.png --gpus 4
[INFO] loading CIFAR-10 data...
[INFO] training with 4 GPUs...
[INFO] compiling model...
[INFO] training network...
Epoch 1/70
 - 21s - loss: 1.6793 - accuracy: 0.3793 - val_loss: 1.3692 - val_accuracy: 0.5026
Epoch 2/70
 - 16s - loss: 1.2814 - accuracy: 0.5356 - val_loss: 1.1252 - val_accuracy: 0.5998
Epoch 3/70
 - 16s - loss: 1.1109 - accuracy: 0.6019 - val_loss: 1.0074 - val_accuracy: 0.6465
...
Epoch 68/70
 - 16s - loss: 0.1615 - accuracy: 0.9469 - val_loss: 0.3654 - val_accuracy: 0.8852
Epoch 69/70
 - 16s - loss: 0.1605 - accuracy: 0.9466 - val_loss: 0.3604 - val_accuracy: 0.8863
Epoch 70/70
 - 16s - loss: 0.1569 - accuracy: 0.9487 - val_loss: 0.3603 - val_accuracy: 0.8877
Using TensorFlow backend.

real    19m3.318s
user    104m3.270s
sys     7m48.890s
Here you can see the quasi-linear speed up in training: Using four GPUs, I was able to decrease each epoch to only 16 seconds. The entire network finished training in 19m3s.
As you can see, not only is training deep neural networks with Keras and multiple GPUs easy, it’s also efficient as well!
Note: In this case, the single GPU experiment obtained slightly higher accuracy than the multi-GPU experiment. When training any stochastic machine learning model, there will be some variance. If you were to average these results out across hundreds of runs they would be (approximately) the same.
What's next? We recommend PyImageSearch University.
86 total classes • 115+ hours of on-demand code walkthrough videos • Last updated: October 2024
★★★★★ 4.84 (128 Ratings) • 16,000+ Students Enrolled
I strongly believe that if you had the right teacher you could master computer vision and deep learning.
Do you think learning computer vision and deep learning has to be time-consuming, overwhelming, and complicated? Or has to involve complex mathematics and equations? Or requires a degree in computer science?
That’s not the case.
All you need to master computer vision and deep learning is for someone to explain things to you in simple, intuitive terms. And that’s exactly what I do. My mission is to change education and how complex Artificial Intelligence topics are taught.
If you're serious about learning computer vision, your next stop should be PyImageSearch University, the most comprehensive computer vision, deep learning, and OpenCV course online today. Here you’ll learn how to successfully and confidently apply computer vision to your work, research, and projects. Join me in computer vision mastery.
Inside PyImageSearch University you'll find:
- ✓ 86 courses on essential computer vision, deep learning, and OpenCV topics
- ✓ 86 Certificates of Completion
- ✓ 115+ hours of on-demand video
- ✓ Brand new courses released regularly, ensuring you can keep up with state-of-the-art techniques
- ✓ Pre-configured Jupyter Notebooks in Google Colab
- ✓ Run all code examples in your web browser — works on Windows, macOS, and Linux (no dev environment configuration required!)
- ✓ Access to centralized code repos for all 540+ tutorials on PyImageSearch
- ✓ Easy one-click downloads for code, datasets, pre-trained models, etc.
- ✓ Access on mobile, laptop, desktop, etc.
Summary
In today’s blog post we learned how to use multiple GPUs to train Keras-based deep neural networks.
Using multiple GPUs enables us to obtain quasi-linear speedups.
To validate this, we trained MiniGoogLeNet on the CIFAR-10 dataset.
Using a single GPU we were able to obtain 63 second epochs with a total training time of 74m10s.
However, by using multi-GPU training with Keras and Python we decreased training time to 16 second epochs with a total training time of 19m3s.
Enabling multi-GPU training with Keras is as easy as a single function call — I recommend you utilize multi-GPU training whenever possible. In the future I imagine that multi_gpu_model will evolve and allow us to further customize which GPUs should be used for training, eventually enabling multi-system training as well.
Ready to take a deep dive into deep learning? Follow my lead.
If you’re interested in learning more about deep learning (and training state-of-the-art neural networks on multiple GPUs), be sure to take a look at my new book, Deep Learning for Computer Vision with Python.
Whether you’re just getting started with deep learning or you’re already a seasoned deep learning practitioner, my new book is guaranteed to help you reach expert status.
To learn more about Deep Learning for Computer Vision with Python (and grab your copy), click here.
Download the Source Code and FREE 17-page Resource Guide
Enter your email address below to get a .zip of the code and a FREE 17-page Resource Guide on Computer Vision, OpenCV, and Deep Learning. Inside you'll find my hand-picked tutorials, books, courses, and libraries to help you master CV and DL!
Anthony The Koala
Dear Dr Adrian,
Thank you for your tutorial. I have a few questions which point to the same thing – how to construct & wire up multiple processors and how to input and output to and from multiple processors for the purposes of experimenting with deep learning and keras.
Other questions related to the main question on multiple processors and how to input and output to and from multiple processors.
1. What kind of processors are used if you had to rig up an array of GPUs – for example, do you have to purchase a number of NVIDIA GPUs, each one connected to a PCI bus – apologies for sounding ‘naive’ here.
2. Can multiple RPi’s be rigged up in an array – how?
3. Are multi-core CPUs used in PCs the same as multiple processors being ‘rigged’ up for multiprocessors?
Thank you
Anthony of Sydney Australia
Adrian Rosebrock
Hi Anthony — Thanks for your comment.
(1) If you take a look at my NVIDIA DIGITS DevBox, you can see the specs which include 4 Titan X GPUs connected via PCI Express.
(2) I wouldn’t advise spending money on a bunch of Raspberry Pis and connecting them up in an array, but people have done it as in this 2013 article.
(3) I think you’re asking if there’s a difference between multi-core CPUs and having a multiprocessor system. There is a difference — here’s a brief explanation.
Anthony The Koala
Dear Dr Adrian,
Thank you for the reply in regards the different kinds of processors and pointing me to different kinds of processors. My summary is:
(1) NVIDIA DIGITS DevBox – it is a self-contained i7 PC with 4 x GPUs connected on a bus, costing $15,000 (US). I understand that graphics processors, which were originally designed for graphics display calculations, also lend themselves to deep learning computations.
(2) Using an array of RPis as a supercomputer. In the example article, the computers were interconnected via network cards and were used for distributed supercomputing. For the latter, you want your array arranged for parallel computing, which is what GPUs (see issue (1)) are built for, and parallel computing is ideal for deep learning.
However, this clever person in England (2017) designed a supercomputer using RPis http://thundaxsoftware.blogspot.com.au/2016/07/creating-raspberry-pi-3-cluster.html. This still required inter-connecting the RPis via network cables and a switch. It also requires each RPi having its unique IP address and communication is via the Message Passing Interface (‘MPI’) communication protocol which has a python implementation. This article (2015) also explains the principles of MPI applied to the RPi, http://www.meccanismocomplesso.org/en/cluster-e-programmazione-in-parallelo-con-mpi-e-raspberry-pi/ .
I have yet to find comparison data between parallel computers built from RPis and commercially available machines (NVIDIA) and services (Amazon).
(3) Difference between multicore CPUs and a multiprocessor system. The former (multicore) means two or more processors on a single die (chip) while the latter (multiprocessor) is two or more separate CPUs. Multicore allows for faster caching speeds, whereas multiprocessors operate independently on the same motherboard.
Even though multicore and multiprocessor allow concurrent execution, it is not the same as parallel execution. I stand corrected on this.
Thank you,
Anthony of Sydney Australia
Stephen Borstelmann MD
Anthony et. al. –
I’ve seen the raspberry pi supercomputer video & it’s cute and neat as a parallel computing proof of concept, but unless you are a hardware and coding master, I’d forget it.
Here’s my attempt to make something similar to a DIGITS box. It’s far less expensive – but it also has some limitations in capabilities, particularly along the bus and in ease of use/setup. You get what you pay for, but if you’re looking just to experiment and expand, it might fit your needs:
http://ai-imaging.org/building-a-high-performance-gpu-computing-workstation-for-deep-learning-part-i/
It currently uses one 1080Ti GPU for running Tensorflow, Keras, and pytorch under Ubuntu 16.04LTS but can easily be expanded to 3, possibly 4 GPU’s.
Puget Systems also builds similar & installs software for those not inclined to do-it-yourself.
Bohumír Zámečník
Hi Adrian,
thanks for a beautiful and very practical tutorial. Just a correction – the multi_gpu_model() function is yet to be released (it will be part of 2.0.9); it was added on 11 Oct, whereas 2.0.8 was released on 25 Aug. It means that until 2.0.9 we need Keras from master.
I’d be really interested in how you achieved such a nearly perfect speedup (more than 95% efficiency). I’ve been experimenting with multi-GPU training in Keras with TensorFlow since summer and in Keras got efficiency around 75-85% with ResNet50/imagenet-synth and much better with the optimized tf_cnn_benchmark. Coincidentally, today we also published an article on our experience (https://medium.com/rossum/towards-efficient-multi-gpu-training-in-keras-with-tensorflow-8a0091074fb2). We tried replication code from kuza55, avolkov1 and fchollet. Our conclusion was that Keras compared to tf_cnn_benchmark is lacking asynchronous prefetching of inputs (to TF memory on CPU and from CPU to GPU). It seems your model was on CIFAR10 with a not too big batch size. What was roughly the number of parameters? Maybe in your case the inputs were small compared to the computation and the machine was benefitting from 4 16-channel PCIe slots. Could you try comparing our benchmark on your machine (https://github.com/rossumai/keras-multi-gpu/blob/master/experiments/keras_tensorflow/benchmark_inception3_resnet50_az-4x-m60.sh)?
We’d also like to try new packages like Horovod and Tensorpack, but anyway I’m working on async prefetch using StagingArea.
Thanks!
Kind regards,
Bohumir
Adrian Rosebrock
Hi Bohumír! Thank you for the clarification on the Keras version numbers. I have updated the blog post to report the correct Keras v2.0.9 version number.
I’m using my NVIDIA DevBox which has been built specifically for deep learning, optimizing across the PCIe bus, processor, GPUs, etc. NVIDIA did a really great job building the machine. I’m personally not a hardware person and I don’t particularly enjoy working with hardware. I tend to focus more on the software side of things. With the DevBox things “just work”.
I’m very busy with other projects/blog posts right now, but please send me a message and we can chat more about the benchmark.
Segovia
I think when tensorflow is used as backend, all GPUs will be used by default, right? Thanks for your great tutorials!
Adrian Rosebrock
TensorFlow will allocate all GPUs but the network will not be trained on all GPUs.
Kenny
Cool post yet again, Adrian 🙂
Adrian Rosebrock
Thanks Kenny!
CT
Dear Dr Adrain,
I downloaded your code and get the following error while running it.
Using TensorFlow backend.
Traceback (most recent call last):
…
ModuleNotFoundError: No module named ‘keras.utils.training_utils’
I checked that my keras version is 2.0.8
Thanks.
CT
Adrian Rosebrock
Please see my reply to “GKS”. The correct Keras version number is their development branch, 2.0.9. Install the 2.0.9 branch and it will work 🙂
CT
Hi Dr Adrian,
Thanks. After looking around on the internet, I found some discussion on this and used the latest Keras source from Git. Yes, it works now; however, when I check the version using pip freeze it still shows 2.0.8.
Another question: I have two different NVIDIA cards, a 1080 Ti and a GTX 780. When I use both GPUs for training, it is slower than using a single one, the GTX 1080 Ti. Is this expected?
Thanks again.
Adrian Rosebrock
The 2.0.9 branch hasn’t been officially released yet (it’s the development branch for the next release, which will be 2.0.9).
As for your second question, it’s entirely possible that the training process would be slower. Your 1080 Ti is significantly faster than your 780 so your Ti quickly processes the batch while your 780 is far behind. The CPU is then stuck waiting for the 780 to catch up before the weights can be updated and then the next batch sent to the two GPUs.
Lei Cao
Dear Adrian,
Thanks for the great tutorial.
I have the same issue as CT, as I have a Titan XP and a 1060. Is there a way to assign different portions of batches to the GPUs, rather than dividing them equally?
Thank you
Adrian Rosebrock
Great question but I’m honestly not sure. I would ask on the official Keras GitHub page. If you find anything out please do come back and let us know!
GKS
I believe that Keras 2.0.8 doesn’t contain multi_gpu_model, as the release was on August 25th while the function was added sometime in October. A PyPI installation should cause an ImportError.
Adrian Rosebrock
I misspoke, the actual Keras version number is the current development branch 2.0.9. I have updated the blog post to reflect this. A big thank you to Bohumir for clarifying this.
Davide
Keras 2.0.8 does not include training_utils.py in utils, sadly.
Adrian Rosebrock
Please see my reply to GKS.
Davide
Yeah, didn’t refresh the post when I finished reading it 🙂
Thanks!
Sunny
Hi, thanks for your posts. I also used this technique to speed up training.
However, the parallelized model cannot be saved like the original model. There is no way I can save it and I’m not able to perform reinforced training in future. Do you have any solution for this bug?
Adrian Rosebrock
Hi Sunny — I’m not sure what you mean by “the parallelized model cannot be saved like the original model”. Can you please elaborate?
Guangzhe Cui
I can’t use model.save(model_path) either
Adrian Rosebrock
There should be an internal model representation. I would suggest doing:
print(dir(model))
and examining the output. You should be able to find the internal model object that can be serialized using model.save there. I’ll test this out myself the next time I’m at my workstation.
Sunny
Hi, thanks for the suggestion. Seems like I’ve found the solution. Just compile the base model, then transfer the trained weights of GPU model back to base model itself, then it was able to be saved like usual, walla!
>>> autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy') # since the GPU model is compiled, now only compile the base model
>>> output = autoencoder.predict(img) # the output will be a mess since only the GPU model is trained, not the base model
>>> output = parallel_autoencoder.predict(img) # the output is a clear image from the well-trained GPU model
>>> autoencoder.set_weights(parallel_autoencoder.get_weights()) # transfer the trained weights from the GPU model to the base model
>>> output = autoencoder.predict(img) # perform the prediction again and the result is similar to the GPU model
>>> autoencoder.save('CAE.h5') # now the model can be saved with the transferred weights from the GPU model
Hope it helps.
Adrian Rosebrock
Thank you for sharing, Sunny!
Adam
Dear Dr Adrian
Great post again!
I have another problem in training parallelized model.
The weight trained in the GPUs can not be loaded in the CPU environment.
(The model’s summary is different from the CPU)
Do you have any solutions for this problem?
Adrian Rosebrock
See my reply to “Sunny”. I believe this can be resolved by finding the internal model representation via dir(model) and extracting this object.
Adam
Thank you for your reply!I will try it!
I am looking forward to reading your another post about this problem.
yunxiao
Hi Adrian
Thank you for your excellent post. Training and validation using the multi_gpu_model was smooth for me. But when I try to do test (parallel_model.predict()) it throws an error:
InternalError: CUB segmented reduce errorinvalid configuration argument [……]
My understanding is that test is only forward pass for the network so maybe I should use model.predict() instead? But since I’m on a quite big dataset by doing model.predict() would take ~300 hours to complete…I think there must be something wrong with my doing but I can’t seem to figure it out… Do you have any suggestions?
Adrian Rosebrock
Please see my reply to “Sunny” — I think there is an internal model object that exposes the .predict function. I’m not sure if you can directly use multi-GPUs for prediction in this specific instance.
Simon Walsh
Has anyone encountered this error TypeError: can’t pickle module objects when using model.save following this multi-gpu implementation? This is a real pain – it also occurs when trying to save best weights during training.
Adrian Rosebrock
Hey Simon, please see the conversation between “Sunny” and myself. The parallel model is a wrapper around the model instance that is transferred to the GPUs, so the object can’t be directly called when using model.save or the callbacks used to save the best weights. If you wanted to save the best weights you would need to code your own custom callback to handle this use case (since you need to transfer the weights and/or access the internal model). I’m sure this will become easier to use in future versions of Keras; keep in mind this is the first time multi-GPU training is included in an official release of Keras. I’ll also try to do a blog post on how to access the internal model object as well.
Simon Walsh
Thanks Adrian – if you can have a look at it and provide some guidance that would be great – I am not fully sure I understand (but thats me, not your explanation!)
Adrian Rosebrock
I’ll be taking a look and likely doing an entirely separate blog post since this seems to be a common issue.
Jeff
Hi Adrian, great post, thanks for sharing!
I’ve been playing around with multi_gpu_model for a couple weeks now and finally got it to work recently (had to patch a bug in tensorflow for this to happen – quirk of my model).
I’ve been using the save_weights() method on the parallel model after training and then later creating the base model and using load_weights() with the ‘by_name’ parameter equal to True. It loads fine, but the predictions I get are all roughly between 0.45 and 0.55 even though the training results would suggest pretty high accuracy. Also, each time I do this, the metrics change slightly even though it’s on the same test data. This seems to suggest to me that maybe weights aren’t being loaded but are perhaps randomly initialized?
I tried using Sunny’s suggestion:
base_model.set_weights(parallel_model.get_weights())
But I always get a shape mismatch error. After inspecting the results of the get_weights() methods on both models, I noticed they more or less have the same internal shapes but the ordering is different. I’m somewhat new to all this stuff so I was wondering if you might have any suggestions on what might be going on and how to investigate this further?
Adrian Rosebrock
Hi Jeff — thanks for the comment. I’m not sure what the exact issue is off the top of my head, I’ll need to play with the multi-GPU training function further, in particular serializing the weights.
Parul Singh
Hi Jeff
Just wondering if you were able to save the model?
michael reynolds
Hi Adrian – I am looking at a project with multi V100 GPUs. Are there any compatibility issues or special setup required that you are aware of with that hardware configuration and your ImageNet Bundle?
Adrian Rosebrock
Wow, that’s awesome that you’ll be using multiple V100 GPUs! There are no compatibly issues at all. The ImageNet Bundle of Deep Learning for Computer Vision with Python will work just fine on your GPUs.
bharath grandhi
Hi Adrian, how do I install Keras v2.0.9? Also, I am getting an error while running the test.py code. The error is: ImportError: No module named pyimagesearch.minigooglenet. How do I get around this? I want to execute this code and am waiting for your solution. Thank you
Adrian Rosebrock
It sounds like you may have your system configured correctly, you’re just missing the source code. Make sure you use the “Downloads” section of this blog post to download the MiniGoogLeNet implementation.
Xu Zhang
Thank you so much for your great tutorial.
If I use multiple GPUs, could I get more memory? I mean, if I have one GPU with 8GB of memory, after I add another 8GB GPU, do the two pools of memory add together to become 16GB? I don’t think so, but I am not sure. Thank you.
Adrian Rosebrock
There are ways to share the memory across multiple GPUs to train larger networks in larger batch sizes, but I don’t recommend it as it really slows down the training process. Realistically you don’t need 16GB of GPU memory to train a network. You’ll be fine with 8-12GB per card.
Alexander Lazarev
strange thing. my jupyter notebook kernel dies when I start training a model using 2 GPUs right after the first epoch is finished
Adrian Rosebrock
That is strange. Perhaps try executing via the command line and see if you have the same issue?
Jay Abrams
hi Adrian, thanks for the awesome post.
I have a problem where, after training, when i am actually using my model to do classification (not sure what this step is called, “Predicting” i guess??) for a real task, I am getting a “Out of memory” error when I try to run the script multiple times (at the same time).
I thought that decreasing batch size would help but my batch size is already 1.
My question is: Is this a problem that can be remedied by using more than 1 GPU at “prediction” time?
Thank you
Adrian Rosebrock
That’s quite odd. You’re able to train with multiple GPUs but you cannot classify an image? I’m not aware of any issue related to classifying images at prediction time.
Jay Abrams
sorry should have been more clear. I’m using a pretrained model and running the classifier many times (each time in the background by doing “python scriptname.py &”. The script is classifying in each frame from videos continually as its pulling from a db. This works ok for the first two scripts but when I run the third it throws the memory error. Wondering if I need to add more gpus to increase memory or if I’m missing something fundamental about gpu and ram etc. The model i am using is YOLO by Joseph Redmon to do object detection. Thank you for your time
Adrian Rosebrock
Is there a particular reason you are running the script multiple times? Keep in mind that each time you execute “python scriptname.py” your are loading YOLO and storing it in memory. You eventually run out of memory and that is why you receive those error messages. You should refactor your code such that only ONE copy of YOLO needs to be kept in memory and images/frames are sequentially passed through it.
Jay Abrams
I’m thinking of a hypothetical situation in which I have to run this on hundreds of thousands of videos per day and since only around 1 can be processed per second (~86k per day) i’m trying to figure out a way to process more than 86k per day without adding more GPUs. though at this point I don’t think it’s possible to overcome this bottleneck without doing so. Is it not possible to store ONE YOLO in memory and use it over several processes in parallel or something like that? I don’t know much about hardware but I’d think that something like this would require more cores or more GPUs. Anyway, thank you for your time and your very valuable blog. you are a hero to the CV community.
Adrian Rosebrock
You can store one YOLO model in memory but keep in mind that images pass through the GPUs in batches. You are bounded by the amount of memory on the GPU. You need to balance both model size and batch size. If you want to run YOLO on hundreds of thousands of videos per day you simply need more GPUs to achieve higher throughput.
Diego
What about loading the weights to a non-multi-gpu model and running it locally?
Jerry
Hm the comparison here is with fixed batch size per device. So it’s not really a fair comparison is it? As far as I understand this is a data parallelism with synchronous update, so ideally you want to keep the same batch size for both experiments (i.e. 256). If you use batch size 256 and compare single GPU vs 4 GPU, do you still see the speedup?
Rias
Hi,Adrian
It seems it’s still a problem for us to save best model when using multi_gpu_model(), I am wondering if you have any good solution for it ?
Adrian Rosebrock
I thought that was fixed in the latest version of Keras, you might want to double-check the release notes (perhaps I’m wrong). But in either case you can still serialize the model, you just need to find the right attribute. If you run print(dir(model)) you’ll find all the attributes of the model. There should be an internal model representation (ex. model.model_) that can be saved directly to disk.
Khelina Fedorchuk
Hi Adrian,
Thanks a lot for the great tutorial! However, one thing is unclear to me: how should I make Slurm resources request on a supercomputer if I want to train my neural net with more than 2 GPUs (there are 2 GPUs only on each node)? There are things like the number of tasks and the number of nodes on the supercomputer, but I still have quite a vague idea as to how to use them if I want to do data parallelism? thanks in advance
Adrian Rosebrock
Sorry Khelina, I do not have any experience working with the Slurm manager. I hope another reader can help with your question.
Peter Harrison
Adrian, I’ve been experimenting with multi_gpu_model and have hit a wall for the past 2 weeks. My code works correctly with a single GPU, but when I add GPUs and switch to multi_gpu_model the range of predictions is noticeably reduced and cluster around the low of the actual values.
I’ve noticed that most of the examples on line don’t use the Sequential model like I have and also don’t apply multi_gpu_model to time series data.
At first I wasn’t sure whether my input data was flawed, so I decided to create a test set of sinuosoidal data and a simplified python script which really illustrates the issue.
My GitHub issue explains the situation fairly clearly. It includes working code.
https://github.com/keras-team/keras/issues/11941
Could you take a quick look? I’d be very grateful.
Adrian Rosebrock
Thanks for sharing the link to the Keras Issue thread, Peter. Unfortunately I’m not sure what the issue is. It seems like it may be a problem with the Keras library. I would suggest readers keep an eye on the thread to see how the library is updated.
Srikar
Hi Adrian,
Thanks for your great posts.
Is it possible to use multi_gpu_model() with train_on_batch() where each batch has different number of training data.
My use case required me to use only train_on_batch(). My hunch is that the entire training batch runs on a single gpu and multi_gpu_model() is of no help here.
Thanks for reading.
Ujjawal Singh
Hi Adrian,
Thanks for posting such a cool post again. Since this post is all about training a model on multiple GPUs, may I know how to train a deep learning model on a cluster of machines in which each machine may or may not have a GPU enabled?
Please write a similar post considering my view Sir.
Thanks again.
Adrian Rosebrock
That’s a great idea for a tutorial! I don’t have a tutorial on that at the moment but I’ll consider it for the future.