In this tutorial, you’ll learn how to use OpenCV’s “dnn” module with an NVIDIA GPU for up to 1,549% faster object detection (YOLO and SSD) and instance segmentation (Mask R-CNN).
Last week, we discovered how to configure and install OpenCV and its “deep neural network” (dnn) module for inference using an NVIDIA GPU.
Using OpenCV’s GPU-optimized dnn module, we were able to push a given network’s computation from the CPU to the GPU in only three lines of code:
# load the model from disk and set the backend target to a
# CUDA-enabled GPU
net = cv2.dnn.readNetFromCaffe(args["prototxt"], args["model"])
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)
Today we’re going to discuss complete code examples in more detail — and by the end of the tutorial, you’ll be able to apply:
- Single Shot Detectors (SSDs) at 65.90 FPS
- YOLO object detection at 11.87 FPS
- Mask R-CNN instance segmentation at 11.05 FPS
To learn how to use OpenCV’s dnn module and an NVIDIA GPU for faster object detection and instance segmentation, just keep reading!
OpenCV ‘dnn’ with NVIDIA GPUs: 1,549% faster YOLO, SSD, and Mask R-CNN
Inside this tutorial you’ll learn how to implement Single Shot Detectors, YOLO, and Mask R-CNN using OpenCV’s “deep neural network” (dnn) module and an NVIDIA/CUDA-enabled GPU.
Compile OpenCV’s ‘dnn’ module with NVIDIA GPU support
If you haven’t yet, make sure you carefully read last week’s tutorial on configuring and installing OpenCV with NVIDIA GPU support for the “dnn” module — following that tutorial is an absolute prerequisite for this tutorial.
If you do not install OpenCV with NVIDIA GPU support enabled, OpenCV will still use your CPU for inference; however, if you try to pass the computation to the GPU, OpenCV will error out.
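If you are unsure whether your build will actually use the GPU, a quick runtime check is possible. The snippet below is a small sketch (not part of the downloadable code) that uses OpenCV’s cv2.cuda.getCudaEnabledDeviceCount() helper, which should report 0 when the build has no usable CUDA device:

# a quick sanity check (not part of the official scripts): count CUDA devices
# visible to this OpenCV build; 0 means inference will stay on the CPU
import cv2
count = cv2.cuda.getCudaEnabledDeviceCount()
print("[INFO] CUDA-enabled devices found: {}".format(count))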
Project Structure
Before we review the structure of today’s project, grab the code and model files from the “Downloads” section of this blog post.
From there, unzip the files and use the tree command in your terminal to inspect the project hierarchy:
$ tree --dirsfirst
.
├── example_videos
│   ├── dog_park.mp4
│   ├── guitar.mp4
│   └── janie.mp4
├── opencv-ssd-cuda
│   ├── MobileNetSSD_deploy.caffemodel
│   ├── MobileNetSSD_deploy.prototxt
│   └── ssd_object_detection.py
├── opencv-yolo-cuda
│   ├── yolo-coco
│   │   ├── coco.names
│   │   ├── yolov3.cfg
│   │   └── yolov3.weights
│   └── yolo_object_detection.py
├── opencv-mask-rcnn-cuda
│   ├── mask-rcnn-coco
│   │   ├── colors.txt
│   │   ├── frozen_inference_graph.pb
│   │   ├── mask_rcnn_inception_v2_coco_2018_01_28.pbtxt
│   │   └── object_detection_classes_coco.txt
│   └── mask_rcnn_segmentation.py
└── output_videos

7 directories, 15 files
In today’s tutorial, we will review three Python scripts:
- ssd_object_detection.py: Performs Caffe-based MobileNet SSD object detection on 20 PASCAL VOC classes with CUDA.
- yolo_object_detection.py: Performs YOLO V3 object detection on 80 COCO classes with CUDA.
- mask_rcnn_segmentation.py: Performs TensorFlow-based Inception V2 Mask R-CNN segmentation on 90 COCO classes with CUDA.
Each of the model files and class name files is included in its respective folder, with the exception of our MobileNet SSD (the class names are hardcoded in a Python list directly in the script). Let’s review the folder names in the order in which we’ll work with them today:
opencv-ssd-cuda/
opencv-yolo-cuda/
opencv-mask-rcnn-cuda/
As is evident by all three directory names, we will use OpenCV’s DNN module compiled with CUDA support. If your OpenCV is not compiled with CUDA support for your NVIDIA GPU, then you need to configure your system using the instructions in last week’s tutorial.
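If you want to double-check your build before running any of the scripts below, one option (a sketch, not part of the downloads) is to inspect OpenCV’s build summary and look for the CUDA and cuDNN entries:

# print OpenCV's compile-time configuration; search the output for the
# CUDA and cuDNN lines to confirm GPU support was actually built in
import cv2
print(cv2.getBuildInformation())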
Implementing Single Shot Detectors (SSDs) using OpenCV’s NVIDIA GPU-Enabled ‘dnn’ module
The first object detectors we’ll be looking at are Single Shot Detectors (SSDs), which we originally covered back in 2017:
- Object detection with deep learning and OpenCV
- Real-time object detection with deep learning and OpenCV
Back then we could only run those SSDs on a CPU; however, today I’ll be showing you how to use your NVIDIA GPU to improve inference speed by up to 211%.
Open up the ssd_object_detection.py file in your project directory structure, and insert the following code:
# import the necessary packages
from imutils.video import FPS
import numpy as np
import argparse
import imutils
import cv2

# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-p", "--prototxt", required=True,
    help="path to Caffe 'deploy' prototxt file")
ap.add_argument("-m", "--model", required=True,
    help="path to Caffe pre-trained model")
ap.add_argument("-i", "--input", type=str, default="",
    help="path to (optional) input video file")
ap.add_argument("-o", "--output", type=str, default="",
    help="path to (optional) output video file")
ap.add_argument("-d", "--display", type=int, default=1,
    help="whether or not output frame should be displayed")
ap.add_argument("-c", "--confidence", type=float, default=0.2,
    help="minimum probability to filter weak detections")
ap.add_argument("-u", "--use-gpu", type=bool, default=False,
    help="boolean indicating if CUDA GPU should be used")
args = vars(ap.parse_args())
Here we’ve imported our packages. Notice that we do not require any special imports for CUDA. The CUDA capability is built in (via our compilation last week) to our cv2 import on Line 6.
Next let’s parse our command line arguments:
- --prototxt: Our pretrained Caffe MobileNet SSD “deploy” prototxt file path.
- --model: The path to our pretrained Caffe MobileNet SSD model.
- --input: The optional path to our input video file. If it is not supplied, your first camera will be used by default.
- --output: The optional path to our output video file.
- --display: The optional boolean flag indicating whether we will display output frames to an OpenCV GUI window. Displaying frames costs CPU cycles, so for a true benchmark, you may wish to turn display off (by default it is on).
- --confidence: The minimum probability threshold to filter weak detections. By default the value is set to 20%; however, you may override it if you wish.
- --use-gpu: A boolean indicating whether the CUDA GPU should be used. By default this value is False (i.e., off). If you desire for your NVIDIA CUDA-capable GPU to be used for object detection with OpenCV, you need to pass a 1 value to this argument.
Next we’ll specify our classes and associated random colors:
# initialize the list of class labels MobileNet SSD was trained to
# detect, then generate a set of bounding box colors for each class
CLASSES = ["background", "aeroplane", "bicycle", "bird", "boat",
    "bottle", "bus", "car", "cat", "chair", "cow", "diningtable",
    "dog", "horse", "motorbike", "person", "pottedplant", "sheep",
    "sofa", "train", "tvmonitor"]
COLORS = np.random.uniform(0, 255, size=(len(CLASSES), 3))
And then we’ll load our Caffe-based model:
# load our serialized model from disk
net = cv2.dnn.readNetFromCaffe(args["prototxt"], args["model"])

# check if we are going to use GPU
if args["use_gpu"]:
    # set CUDA as the preferable backend and target
    print("[INFO] setting preferable backend and target to CUDA...")
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)
As Line 35 indicates, we use OpenCV’s dnn module to load our Caffe object detection model.
A check is made to see if an NVIDIA CUDA-enabled GPU should be used. From there, we set the backend and target accordingly (Lines 38-42).
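As an aside, OpenCV 4.2 also exposes a half-precision CUDA target. On GPUs with fast FP16 arithmetic you can experiment with swapping the target line for the one below; this is an optional variation I did not benchmark in this post, so treat it as something to try on your own hardware:

# optional variation: request the FP16 CUDA target instead of full precision
# (only worthwhile on GPUs with efficient half-precision support)
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)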
Let’s go ahead and start processing frames and performing object detection with our GPU (provided the --use-gpu command line argument is turned on, of course):
# initialize the video stream and pointer to output video file, then
# start the FPS timer
print("[INFO] accessing video stream...")
vs = cv2.VideoCapture(args["input"] if args["input"] else 0)
writer = None
fps = FPS().start()

# loop over the frames from the video stream
while True:
    # read the next frame from the file
    (grabbed, frame) = vs.read()

    # if the frame was not grabbed, then we have reached the end
    # of the stream
    if not grabbed:
        break

    # resize the frame, grab the frame dimensions, and convert it to
    # a blob
    frame = imutils.resize(frame, width=400)
    (h, w) = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, 0.007843, (300, 300), 127.5)

    # pass the blob through the network and obtain the detections and
    # predictions
    net.setInput(blob)
    detections = net.forward()

    # loop over the detections
    for i in np.arange(0, detections.shape[2]):
        # extract the confidence (i.e., probability) associated with
        # the prediction
        confidence = detections[0, 0, i, 2]

        # filter out weak detections by ensuring the `confidence` is
        # greater than the minimum confidence
        if confidence > args["confidence"]:
            # extract the index of the class label from the
            # `detections`, then compute the (x, y)-coordinates of
            # the bounding box for the object
            idx = int(detections[0, 0, i, 1])
            box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
            (startX, startY, endX, endY) = box.astype("int")

            # draw the prediction on the frame
            label = "{}: {:.2f}%".format(CLASSES[idx],
                confidence * 100)
            cv2.rectangle(frame, (startX, startY), (endX, endY),
                COLORS[idx], 2)
            y = startY - 15 if startY - 15 > 15 else startY + 15
            cv2.putText(frame, label, (startX, y),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, COLORS[idx], 2)
Here we access our video stream. Note that the code is meant to be compatible with both video files and live video streams, which is why I elected not to use my threaded VideoStream class.
Looping over frames, we:
- Read and preprocess incoming frames.
- Construct a blob from the frame.
- Detect objects using the Single Shot Detector and our GPU (if the --use-gpu flag was set).
- Filter objects, allowing only high --confidence objects to pass.
- Annotate bounding boxes, class labels, and probabilities. If you need a refresher on OpenCV drawing basics, be sure to refer to my OpenCV Tutorial: A Guide to Learn OpenCV.
Finally, we’ll wrap up:
    # check to see if the output frame should be displayed to our
    # screen
    if args["display"] > 0:
        # show the output frame
        cv2.imshow("Frame", frame)
        key = cv2.waitKey(1) & 0xFF

        # if the `q` key was pressed, break from the loop
        if key == ord("q"):
            break

    # if an output video file path has been supplied and the video
    # writer has not been initialized, do so now
    if args["output"] != "" and writer is None:
        # initialize our video writer
        fourcc = cv2.VideoWriter_fourcc(*"MJPG")
        writer = cv2.VideoWriter(args["output"], fourcc, 30,
            (frame.shape[1], frame.shape[0]), True)

    # if the video writer is not None, write the frame to the output
    # video file
    if writer is not None:
        writer.write(frame)

    # update the FPS counter
    fps.update()

# stop the timer and display FPS information
fps.stop()
print("[INFO] elapsed time: {:.2f}".format(fps.elapsed()))
print("[INFO] approx. FPS: {:.2f}".format(fps.fps()))
In the remaining lines, we:
- Display the annotated video frames if required.
- Capture key presses if we are displaying.
- Write annotated output frames to a video file on disk.
- Update, calculate, and print out FPS statistics.
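If you would rather not depend on imutils for the FPS bookkeeping, a minimal stand-in is easy to write. The sketch below is my own approximation (not part of the official download) and assumes the same start()/update()/stop() usage as the script above:

# a minimal, dependency-free stand-in for imutils' FPS counter
import time

class SimpleFPS:
    def start(self):
        # record the start time and reset the frame count
        self._start = time.time()
        self._frames = 0
        return self

    def update(self):
        # increment the number of frames processed
        self._frames += 1

    def stop(self):
        # record the end time
        self._end = time.time()

    def elapsed(self):
        # total seconds between start() and stop()
        return self._end - self._start

    def fps(self):
        # frames processed per second over the timed interval
        return self._frames / self.elapsed()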
Great job developing your SSD + OpenCV + CUDA script. In the next sections, we’ll analyze results using both our GPU and CPU.
Single Shot Detectors: 211% faster object detection with OpenCV’s ‘dnn’ module and an NVIDIA GPU
To see our Single Shot Detector in action, make sure you use the “Downloads” section of this tutorial to download (1) the source code and (2) pretrained models compatible with OpenCV’s dnn module.
From there, execute the following command to obtain a baseline for our SSD by running it on our CPU:
$ python ssd_object_detection.py \
    --prototxt MobileNetSSD_deploy.prototxt \
    --model MobileNetSSD_deploy.caffemodel \
    --input ../example_videos/guitar.mp4 \
    --output ../output_videos/ssd_guitar.avi \
    --display 0
[INFO] accessing video stream...
[INFO] elapsed time: 11.69
[INFO] approx. FPS: 21.13
Here we are obtaining ~21 FPS on our CPU, which is quite good for an object detector!
To see the detector really fly, let’s supply the --use-gpu 1 command line argument, instructing OpenCV to push the dnn computation to our NVIDIA Tesla V100 GPU:
$ python ssd_object_detection.py \
    --prototxt MobileNetSSD_deploy.prototxt \
    --model MobileNetSSD_deploy.caffemodel \
    --input ../example_videos/guitar.mp4 \
    --output ../output_videos/ssd_guitar.avi \
    --display 0 \
    --use-gpu 1
[INFO] setting preferable backend and target to CUDA...
[INFO] accessing video stream...
[INFO] elapsed time: 3.75
[INFO] approx. FPS: 65.90
Using our NVIDIA GPU, we’re now reaching ~66 FPS which improves our frames-per-second throughput rate by over 211%! And as the video demonstration shows, our SSD is quite accurate.
Note: As discussed in this comment by Yashas, the MobileNet SSD could perform poorly because cuDNN does not have optimized kernels for depthwise convolutions on all NVIDIA GPUs. If your GPU results are similar to your CPU results, this is likely the problem.
Implementing YOLO object detection for OpenCV’s NVIDIA GPU/CUDA-enabled ‘dnn’ module
While YOLO is certainly one of the fastest deep learning-based object detectors, the YOLO model included with OpenCV is anything but — on a CPU, YOLO struggled to break 3 FPS.
Therefore, if you intend on using YOLO with OpenCV’s dnn module, you better be using a GPU.
Let’s take a look at how to use the YOLO object detector (yolo_object_detection.py) with OpenCV’s CUDA-enabled dnn module:
# import the necessary packages
from imutils.video import FPS
import numpy as np
import argparse
import cv2
import os

# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-y", "--yolo", required=True,
    help="base path to YOLO directory")
ap.add_argument("-i", "--input", type=str, default="",
    help="path to (optional) input video file")
ap.add_argument("-o", "--output", type=str, default="",
    help="path to (optional) output video file")
ap.add_argument("-d", "--display", type=int, default=1,
    help="whether or not output frame should be displayed")
ap.add_argument("-c", "--confidence", type=float, default=0.5,
    help="minimum probability to filter weak detections")
ap.add_argument("-t", "--threshold", type=float, default=0.3,
    help="threshold when applying non-maxima suppression")
ap.add_argument("-u", "--use-gpu", type=bool, default=0,
    help="boolean indicating if CUDA GPU should be used")
args = vars(ap.parse_args())
Our imports are nearly the same as our previous script with one swap. In this script we don’t need imutils, but we do need Python’s os module for file I/O. Again, the CUDA capability is baked into our custom-compiled OpenCV installation.
Let’s review our command line arguments:
- --yolo: The base path to your pretrained YOLO model directory.
- --input: The optional path to our input video file. If it is not supplied, your first camera will be used by default.
- --output: The optional path to our output video file.
- --display: The optional boolean flag indicating whether we will display output frames to an OpenCV GUI window. Displaying frames costs CPU cycles, so for a true benchmark, you may wish to turn display off (by default it is on).
- --confidence: The minimum probability threshold to filter weak detections. By default the value is set to 50%; however, you may override it if you wish.
- --threshold: The Non-Maxima Suppression (NMS) threshold, set to 30% by default.
- --use-gpu: A boolean indicating whether the CUDA GPU should be used. By default this value is False (i.e., off). If you desire for your NVIDIA CUDA-capable GPU to be used for object detection with OpenCV, you need to pass a 1 value to this argument.
Next we’ll load our class labels and assign random colors:
# load the COCO class labels our YOLO model was trained on
labelsPath = os.path.sep.join([args["yolo"], "coco.names"])
LABELS = open(labelsPath).read().strip().split("\n")

# initialize a list of colors to represent each possible class label
np.random.seed(42)
COLORS = np.random.randint(0, 255, size=(len(LABELS), 3),
    dtype="uint8")
We load the class labels from the coco.names file and assign random COLORS.
Now we’re ready to load our YOLO model from disk, including setting the GPU backend/target if required:
# derive the paths to the YOLO weights and model configuration
weightsPath = os.path.sep.join([args["yolo"], "yolov3.weights"])
configPath = os.path.sep.join([args["yolo"], "yolov3.cfg"])

# load our YOLO object detector trained on COCO dataset (80 classes)
print("[INFO] loading YOLO from disk...")
net = cv2.dnn.readNetFromDarknet(configPath, weightsPath)

# check if we are going to use GPU
if args["use_gpu"]:
    # set CUDA as the preferable backend and target
    print("[INFO] setting preferable backend and target to CUDA...")
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)
Lines 36 and 37 grab our pretrained YOLO detector model and weights paths.
From there, Lines 41-48 load the model and set the GPU as the backend if the --use-gpu command line flag is set.
Moving on, we’ll begin performing object detection with YOLO:
# determine only the *output* layer names that we need from YOLO
ln = net.getLayerNames()
ln = [ln[i[0] - 1] for i in net.getUnconnectedOutLayers()]

# initialize the width and height of the frames in the video file
W = None
H = None

# initialize the video stream and pointer to output video file, then
# start the FPS timer
print("[INFO] accessing video stream...")
vs = cv2.VideoCapture(args["input"] if args["input"] else 0)
writer = None
fps = FPS().start()

# loop over frames from the video file stream
while True:
    # read the next frame from the file
    (grabbed, frame) = vs.read()

    # if the frame was not grabbed, then we have reached the end
    # of the stream
    if not grabbed:
        break

    # if the frame dimensions are empty, grab them
    if W is None or H is None:
        (H, W) = frame.shape[:2]

    # construct a blob from the input frame and then perform a forward
    # pass of the YOLO object detector, giving us our bounding boxes
    # and associated probabilities
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
        swapRB=True, crop=False)
    net.setInput(blob)
    layerOutputs = net.forward(ln)
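A quick compatibility note before unpacking this block: the list comprehension that builds ln assumes getUnconnectedOutLayers() returns an array of one-element arrays, which is the OpenCV 4.2 behavior used for this post. My understanding is that newer OpenCV releases return a flat array instead, so if you run this script on a different version, a more defensive lookup (a sketch, not the original code) looks like this:

# version-tolerant lookup of YOLO's output layer names (an assumption about
# newer OpenCV releases; OpenCV 4.2 works fine with the original two lines)
ln = net.getLayerNames()
out_idxs = np.array(net.getUnconnectedOutLayers()).flatten()
ln = [ln[int(i) - 1] for i in out_idxs]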
Lines 51 and 52 grab only the *output* layer names from the YOLO model. We need these in order to perform inference with YOLO using OpenCV.
We then grab the frame dimensions and initialize our video stream + FPS counter.
From there, we’ll loop over frames and begin YOLO object detection. Inside the loop, we:
- Grab a frame.
- Construct a blob from the frame.
- Compute predictions (i.e., perform YOLO inference on the blob).
Continuing on, we’ll process the results:
    # initialize our lists of detected bounding boxes, confidences,
    # and class IDs, respectively
    boxes = []
    confidences = []
    classIDs = []

    # loop over each of the layer outputs
    for output in layerOutputs:
        # loop over each of the detections
        for detection in output:
            # extract the class ID and confidence (i.e., probability)
            # of the current object detection
            scores = detection[5:]
            classID = np.argmax(scores)
            confidence = scores[classID]

            # filter out weak predictions by ensuring the detected
            # probability is greater than the minimum probability
            if confidence > args["confidence"]:
                # scale the bounding box coordinates back relative to
                # the size of the image, keeping in mind that YOLO
                # actually returns the center (x, y)-coordinates of
                # the bounding box followed by the boxes' width and
                # height
                box = detection[0:4] * np.array([W, H, W, H])
                (centerX, centerY, width, height) = box.astype("int")

                # use the center (x, y)-coordinates to derive the top
                # and left corner of the bounding box
                x = int(centerX - (width / 2))
                y = int(centerY - (height / 2))

                # update our list of bounding box coordinates,
                # confidences, and class IDs
                boxes.append([x, y, int(width), int(height)])
                confidences.append(float(confidence))
                classIDs.append(classID)

    # apply non-maxima suppression to suppress weak, overlapping
    # bounding boxes
    idxs = cv2.dnn.NMSBoxes(boxes, confidences, args["confidence"],
        args["threshold"])

    # ensure at least one detection exists
    if len(idxs) > 0:
        # loop over the indexes we are keeping
        for i in idxs.flatten():
            # extract the bounding box coordinates
            (x, y) = (boxes[i][0], boxes[i][1])
            (w, h) = (boxes[i][2], boxes[i][3])

            # draw a bounding box rectangle and label on the frame
            color = [int(c) for c in COLORS[classIDs[i]]]
            cv2.rectangle(frame, (x, y), (x + w, y + h), color, 2)
            text = "{}: {:.4f}".format(LABELS[classIDs[i]],
                confidences[i])
            cv2.putText(frame, text, (x, y - 5),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)
Still in our loop, now we will:
- Initialize results lists.
- Loop over detections and accumulate outputs while filtering low confidence detections.
- Apply Non-Maxima Suppression (NMS).
- Annotate the output frame with the object’s bounding box, class label, and confidence value.
We’ll wrap up our frame processing loop and perform cleanup next:
    # check to see if the output frame should be displayed to our
    # screen
    if args["display"] > 0:
        # show the output frame
        cv2.imshow("Frame", frame)
        key = cv2.waitKey(1) & 0xFF

        # if the `q` key was pressed, break from the loop
        if key == ord("q"):
            break

    # if an output video file path has been supplied and the video
    # writer has not been initialized, do so now
    if args["output"] != "" and writer is None:
        # initialize our video writer
        fourcc = cv2.VideoWriter_fourcc(*"MJPG")
        writer = cv2.VideoWriter(args["output"], fourcc, 30,
            (frame.shape[1], frame.shape[0]), True)

    # if the video writer is not None, write the frame to the output
    # video file
    if writer is not None:
        writer.write(frame)

    # update the FPS counter
    fps.update()

# stop the timer and display FPS information
fps.stop()
print("[INFO] elapsed time: {:.2f}".format(fps.elapsed()))
print("[INFO] approx. FPS: {:.2f}".format(fps.fps()))
The remaining lines handle display, keypresses, printing FPS statistics, and cleanup.
While our YOLO + OpenCV + CUDA script was more challenging to implement than the SSD script, you did a great job hanging in there. In the next section, we will analyze results.
YOLO: 380% faster object detection with OpenCV’s NVIDIA GPU-enabled ‘dnn’ module
We are now ready to test our YOLO object detector.
Make sure you have used the “Downloads” section of this tutorial to download the source code and pretrained models compatible with OpenCV’s dnn module.
From there, execute the following command to obtain a baseline for YOLO on our CPU:
$ python yolo_object_detection.py --yolo yolo-coco \
    --input ../example_videos/janie.mp4 \
    --output ../output_videos/yolo_janie.avi \
    --display 0
[INFO] loading YOLO from disk...
[INFO] accessing video stream...
[INFO] elapsed time: 51.11
[INFO] approx. FPS: 2.47
On our CPU, YOLO is obtaining a quite pitiful 2.47 FPS.
But by pushing the computation to our NVIDIA V100 GPU, we now reach 11.87 FPS, a 380% improvement:
$ python yolo_object_detection.py --yolo yolo-coco \
    --input ../example_videos/janie.mp4 \
    --output ../output_videos/yolo_janie.avi \
    --display 0 \
    --use-gpu 1
[INFO] loading YOLO from disk...
[INFO] setting preferable backend and target to CUDA...
[INFO] accessing video stream...
[INFO] elapsed time: 10.61
[INFO] approx. FPS: 11.87
As I discuss in my original YOLO + OpenCV blog post, I’m not really sure why YOLO obtains such a low frames-per-second throughput rate. YOLO is consistently cited as one of the fastest object detectors.
That said, it appears there is something amiss either with the converted model or how OpenCV is handling inference — unfortunately I don’t know what the exact problem is, but I welcome feedback in the comments section.
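If you want to dig into this yourself, one useful experiment is to time only the forward pass, separating the network inference from blob construction, NMS, and drawing. Here is a rough sketch (not part of the downloadable script) that assumes net, blob, and ln are set up exactly as above:

# measure just the DNN forward pass, excluding pre/post-processing and I/O
import time
net.setInput(blob)
start = time.time()
layerOutputs = net.forward(ln)
print("[INFO] forward pass took {:.4f} seconds".format(time.time() - start))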
Implementing Mask R-CNN Instance Segmentation for OpenCV’s CUDA-Enabled ‘dnn’ module
At this point we’ve looked at SSDs and YOLO, two different types of deep learning-based object detectors — but what about instance segmentation networks such as Mask R-CNN? Can we utilize our NVIDIA GPUs with OpenCV’s CUDA-enabled dnn module to improve our frames-per-second processing rate for Mask R-CNNs?
You bet we can!
Open up mask_rcnn_segmentation.py in your directory structure to find out how:

# import the necessary packages
from imutils.video import FPS
import numpy as np
import argparse
import cv2
import os

# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-m", "--mask-rcnn", required=True,
    help="base path to mask-rcnn directory")
ap.add_argument("-i", "--input", type=str, default="",
    help="path to (optional) input video file")
ap.add_argument("-o", "--output", type=str, default="",
    help="path to (optional) output video file")
ap.add_argument("-d", "--display", type=int, default=1,
    help="whether or not output frame should be displayed")
ap.add_argument("-c", "--confidence", type=float, default=0.5,
    help="minimum probability to filter weak detections")
ap.add_argument("-t", "--threshold", type=float, default=0.3,
    help="minimum threshold for pixel-wise mask segmentation")
ap.add_argument("-u", "--use-gpu", type=bool, default=0,
    help="boolean indicating if CUDA GPU should be used")
args = vars(ap.parse_args())
First we handle our imports. They are identical to our previous YOLO script.
From there we’ll parse command line arguments:
- --mask-rcnn: The base path to your pretrained Mask R-CNN model directory.
- --input: The optional path to our input video file. If it is not supplied, your first camera will be used by default.
- --output: The optional path to our output video file.
- --display: The optional boolean flag indicating whether we will display output frames to an OpenCV GUI window. Displaying frames costs CPU cycles, so for a true benchmark, you may wish to turn display off (by default it is on).
- --confidence: The minimum probability threshold to filter weak detections. By default the value is set to 50%; however, you may override it if you wish.
- --threshold: The minimum threshold for pixel-wise segmentation. By default this value is set to 30%.
- --use-gpu: A boolean indicating whether the CUDA GPU should be used. By default this value is False (i.e., off). If you desire for your NVIDIA CUDA-capable GPU to be used for instance segmentation with OpenCV, you need to pass a 1 value to this argument.
With our imports and command line arguments in hand, now we’ll load our class labels and assign random colors:
# load the COCO class labels our Mask R-CNN was trained on
labelsPath = os.path.sep.join([args["mask_rcnn"],
    "object_detection_classes_coco.txt"])
LABELS = open(labelsPath).read().strip().split("\n")

# initialize a list of colors to represent each possible class label
np.random.seed(42)
COLORS = np.random.randint(0, 255, size=(len(LABELS), 3),
    dtype="uint8")
From there we’ll load our model.
# derive the paths to the Mask R-CNN weights and model configuration
weightsPath = os.path.sep.join([args["mask_rcnn"],
    "frozen_inference_graph.pb"])
configPath = os.path.sep.join([args["mask_rcnn"],
    "mask_rcnn_inception_v2_coco_2018_01_28.pbtxt"])

# load our Mask R-CNN trained on the COCO dataset (90 classes)
# from disk
print("[INFO] loading Mask R-CNN from disk...")
net = cv2.dnn.readNetFromTensorflow(weightsPath, configPath)

# check if we are going to use GPU
if args["use_gpu"]:
    # set CUDA as the preferable backend and target
    print("[INFO] setting preferable backend and target to CUDA...")
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)
Here we grab the paths to our pretrained Mask R-CNN weights and model.
We then load the model from disk and set the backend and target to the GPU if the --use-gpu command line flag is set. When using only your CPU, segmentation will be slow as molasses. If you set the --use-gpu flag, you’ll process your input video or camera stream at warp speed.
Let’s begin processing frames:
# initialize the video stream and pointer to output video file, then
# start the FPS timer
print("[INFO] accessing video stream...")
vs = cv2.VideoCapture(args["input"] if args["input"] else 0)
writer = None
fps = FPS().start()

# loop over frames from the video file stream
while True:
    # read the next frame from the file
    (grabbed, frame) = vs.read()

    # if the frame was not grabbed, then we have reached the end
    # of the stream
    if not grabbed:
        break

    # construct a blob from the input frame and then perform a
    # forward pass of the Mask R-CNN, giving us (1) the bounding box
    # coordinates of the objects in the image along with (2) the
    # pixel-wise segmentation for each specific object
    blob = cv2.dnn.blobFromImage(frame, swapRB=True, crop=False)
    net.setInput(blob)
    (boxes, masks) = net.forward(["detection_out_final",
        "detection_masks"])
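Before unpacking this block, it can help to sanity-check what the forward pass just returned. The shapes below are what I would expect from this particular Inception V2 Mask R-CNN (other models may differ), so treat this as a quick debugging sketch rather than part of the script:

# quick shape check for the Mask R-CNN outputs (expected values are for the
# Inception V2 COCO model used here and may differ for other models)
print(boxes.shape)  # roughly (1, 1, N, 7): N detections, 7 values each
print(masks.shape)  # roughly (N, 90, 15, 15): one small mask per class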
After grabbing a frame, we convert it to a blob and perform a forward pass through our network to predict object boxes and masks.
And now we’re ready to process our results:
    # loop over the number of detected objects
    for i in range(0, boxes.shape[2]):
        # extract the class ID of the detection along with the
        # confidence (i.e., probability) associated with the
        # prediction
        classID = int(boxes[0, 0, i, 1])
        confidence = boxes[0, 0, i, 2]

        # filter out weak predictions by ensuring the detected
        # probability is greater than the minimum probability
        if confidence > args["confidence"]:
            # scale the bounding box coordinates back relative to the
            # size of the frame and then compute the width and the
            # height of the bounding box
            (H, W) = frame.shape[:2]
            box = boxes[0, 0, i, 3:7] * np.array([W, H, W, H])
            (startX, startY, endX, endY) = box.astype("int")
            boxW = endX - startX
            boxH = endY - startY

            # extract the pixel-wise segmentation for the object,
            # resize the mask such that it's the same dimensions of
            # the bounding box, and then finally threshold to create
            # a *binary* mask
            mask = masks[i, classID]
            mask = cv2.resize(mask, (boxW, boxH),
                interpolation=cv2.INTER_CUBIC)
            mask = (mask > args["threshold"])

            # extract the ROI of the image but *only* extract the
            # masked region of the ROI
            roi = frame[startY:endY, startX:endX][mask]

            # grab the color used to visualize this particular class,
            # then create a transparent overlay by blending the color
            # with the ROI
            color = COLORS[classID]
            blended = ((0.4 * color) + (0.6 * roi)).astype("uint8")

            # store the blended ROI in the original frame
            frame[startY:endY, startX:endX][mask] = blended

            # draw the bounding box of the instance on the frame
            color = [int(c) for c in color]
            cv2.rectangle(frame, (startX, startY), (endX, endY),
                color, 2)

            # draw the predicted label and associated probability of
            # the instance segmentation on the frame
            text = "{}: {:.4f}".format(LABELS[classID], confidence)
            cv2.putText(frame, text, (startX, startY - 5),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)
Looping over the results, we:
- Filter them based on confidence (a sketch for additionally filtering by class follows this list).
- Resize and draw/annotate transparent colored object masks.
- Annotate bounding boxes, labels, and probabilities on the output frame.
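If you only need a subset of the 90 COCO classes, one option is to add one more filter inside the same loop. The class set below is purely hypothetical and not part of the original script; drop the check in right after the confidence test:

            # a sketch of restricting results to a few classes of interest
            # (hypothetical set); place this right after the confidence check
            INTERESTED = {"person", "dog"}
            if LABELS[classID] not in INTERESTED:
                continue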
From there we’ll go ahead and wrap up our loop, calculate FPS stats, and clean up:
    # check to see if the output frame should be displayed to our
    # screen
    if args["display"] > 0:
        # show the output frame
        cv2.imshow("Frame", frame)
        key = cv2.waitKey(1) & 0xFF

        # if the `q` key was pressed, break from the loop
        if key == ord("q"):
            break

    # if an output video file path has been supplied and the video
    # writer has not been initialized, do so now
    if args["output"] != "" and writer is None:
        # initialize our video writer
        fourcc = cv2.VideoWriter_fourcc(*"MJPG")
        writer = cv2.VideoWriter(args["output"], fourcc, 30,
            (frame.shape[1], frame.shape[0]), True)

    # if the video writer is not None, write the frame to the output
    # video file
    if writer is not None:
        writer.write(frame)

    # update the FPS counter
    fps.update()

# stop the timer and display FPS information
fps.stop()
print("[INFO] elapsed time: {:.2f}".format(fps.elapsed()))
print("[INFO] approx. FPS: {:.2f}".format(fps.fps()))
Great job developing your Mask R-CNN + OpenCV + CUDA script! In the next section, we’ll compare CPU versus GPU results.
For more details on the implementation, refer to this blog post on Mask R-CNN with OpenCV.
Mask R-CNN: 1,549% faster Instance Segmentation with OpenCV’s ‘dnn’ NVIDIA GPU module
Our final test will be to compare Mask R-CNN performance using both a CPU and an NVIDIA GPU.
Make sure you have used the “Downloads” section of this tutorial to download the source code and pretrained OpenCV model files.
You can then open up a command line and benchmark the Mask R-CNN model on the CPU:
$ python mask_rcnn_segmentation.py \
    --mask-rcnn mask-rcnn-coco \
    --input ../example_videos/dog_park.mp4 \
    --output ../output_videos/mask_rcnn_dog_park.avi \
    --display 0
[INFO] loading Mask R-CNN from disk...
[INFO] accessing video stream...
[INFO] elapsed time: 830.65
[INFO] approx. FPS: 0.67
The Mask R-CNN architecture is incredibly computationally expensive, so seeing a result of 0.67 FPS on a CPU is to be expected.
But what about a GPU?
Will a GPU be able to push our Mask R-CNN to near real-time performance?
To answer that question, just supply the --use-gpu 1 command line argument to the mask_rcnn_segmentation.py script:

$ python mask_rcnn_segmentation.py \
    --mask-rcnn mask-rcnn-coco \
    --input ../example_videos/dog_park.mp4 \
    --output ../output_videos/mask_rcnn_dog_park.avi \
    --display 0 \
    --use-gpu 1
[INFO] loading Mask R-CNN from disk...
[INFO] setting preferable backend and target to CUDA...
[INFO] accessing video stream...
[INFO] elapsed time: 50.21
[INFO] approx. FPS: 11.05
On my NVIDIA Tesla V100, our Mask R-CNN model is now reaching 11.05 FPS, a massive 1,549% improvement!
Making nearly any model compatible with OpenCV’s ‘dnn’ module run on an NVIDIA GPU
If you’ve been paying attention to each of the source code examples in today’s post, you’ll note that each of them follows a particular pattern to push the computation to an NVIDIA CUDA-enabled GPU:
- Load the trained model from disk.
- Set OpenCV backend to be CUDA.
- Push the computation to the CUDA-enabled device.
These three points neatly translate into only three lines of code:
net = cv2.dnn.readNetFromCaffe(args["prototxt"], args["model"])
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)
In general, you can follow the same recipe when working with OpenCV’s dnn module — if you have a model that is compatible with OpenCV and dnn, then it likely can be used for GPU inference simply by setting CUDA as the backend and target.
All you really need to do is swap out the cv2.dnn.readNetFromCaffe function with whatever method you’re using to load the network from disk, including:
- cv2.dnn.readNet
- cv2.dnn.readNetFromDarknet
- cv2.dnn.readNetFromModelOptimizer
- cv2.dnn.readNetFromONNX
- cv2.dnn.readNetFromTensorflow
- cv2.dnn.readNetFromTorch
- cv2.dnn.readTensorFromONNX
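For example, a model exported to ONNX could, in principle, follow the exact same three-line recipe. The file name below is a placeholder, and whether a given ONNX graph loads cleanly depends on the layers it uses:

# the same GPU recipe with a different loader; "model.onnx" is a placeholder
net = cv2.dnn.readNetFromONNX("model.onnx")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)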
You’ll need to refer to the exact framework your model was trained with to confirm whether or not it will be compatible with OpenCV’s dnn library — I hope to cover such a tutorial in the future as well.
Summary
In this tutorial you learned how to apply OpenCV’s “deep neural network” (dnn) module for GPU-optimized inference.
Up until the release of OpenCV 4.2, OpenCV’s dnn module had extremely limited compute capability — most readers were left running inference on their CPU, which is certainly less than ideal.
However, thanks to Davis King of dlib, Yashas Samaga (who implemented OpenCV’s “dnn” NVIDIA GPU support), and the Google Summer of Code 2019 initiative, OpenCV can now enjoy NVIDIA GPU and CUDA support, making it easier than ever to apply state-of-the-art networks to your own projects.
To download the source code to this post, including the pre-trained SSD, YOLO, and Mask R-CNN models, just enter your email address in the form below!
Morgan S.
Great guide as always–PyImageSearch has been an invaluable resource every step of the way on my computer vision journey.
I do have a question, however: I have been working on a Macbook Air, and generally have just had to accept slow processing speeds for computer vision projects. Normally this is fine, as my pet project involves only a few thousand images. However, there are certain processes (MTCNN, VggFace2, …) that are just entirely off-limits to me, even with multiprocessing.
Is there any way to use the built-in GPU of a Macbook (in this case an Intel UHD Graphics 617, if that makes a difference) to increase processing speed for computer vision projects above and beyond what the CPU can provide?
Thank you!
Adrian Rosebrock
Technically, yes, you could, but the real question is whether or not it’s worth the effort.
The answer to that question would be a simple “No”.
Your built-in graphics accelerator on your MacBook will not sufficiently speedup your models.
If you want to really speed them up I would suggest you use cloud instances or configure a local deep learning rig.
Amir
I think opencv’s dldt, or intel openvino could help to run models faster on intel gpus.
Adrian Rosebrock
Correct, but that’s different than what this blog post covers. If you want to go the OpenVINO route this tutorial will get you started.
John Doe
1549% faster compared to what? Tensorflow GPU or OpenCV CPU?
Adrian Rosebrock
Compared to running OpenCV’s “dnn” module on your CPU.
Kamal Chhirang
Thanks for the information. Do you know, how faster/slower it is compared to Tensorflow Inference? Comparing Apples to Apples.
Adrian Rosebrock
Sorry, I don’t have those numbers on hand.
amar
Hi Adrian, I usually read all of your blogs. The published FPS for YOLO (approx. 11) seems low compared to SSD, which is 65 on the V100. Could you check the YOLOv3 FPS again? I have doubts about the result because when I tested the same dnn module in C++ I was getting 33 FPS on a 1070 8GB GPU.
Adrian Rosebrock
I was using YOLO v3 with OpenCV’s “dnn” module on a NVIDIA V100 when gathering the results for this post. I would suggest using my Python code here and comparing it your C++ code. YOLO historically seems to be very slow with OpenCV’s “dnn” module — based on your result it may be because of the Python bindings (perhaps there’s a bug somewhere?)
Yashas
I too was totally confused and surprised while looking at the FPS numbers reported in this post. My own benchmarks on GTX 1050 appeared to outperform a V100.
19FPS and 9FPS for YOLOv3 and Mask RCNN on GTX 1050 respectively. V100 should considerably outperform GTX 1050.
I get 150FPS (90FPS) in FP32 (FP16) target in RTX 2080 Ti for YOLOv3. I haven’t seen the benchmark code but I suspect something is not right.
https://github.com/opencv/opencv/pull/14827#issuecomment-568156546 is the benchmark I published for OpenCV 4.2.0. Also, note that the CUDA backend in the master branch is much faster.
There is also a trick for YOLOv3 (which is described in the above link) which can make YOLOv3 inference 10%-25% faster depending on your device.
Adrian Rosebrock
Thanks Yashas. I was definitely testing with the official v4.2 release, not the master branch. The YOLO code is in the “Implementing YOLO object detection for OpenCV’s NVIDIA GPU/CUDA-enabled ‘dnn’ module” section of this tutorial if you wanted to take a look. I’m pretty stumped as to what the issue is.
Yashas
OpenCV DNN uses lazy initialization. The first forward pass is very slow. On my device, the first forward pass for YOLOv3 takes around a second (allocating memory, compiling kernels, etc) but subsequent forward passes take around 10ms. If you include the first forward pass and calculate the FPS for 100 frames, you’ll get the average to be 20ms per frame instead of 10ms per frame! This affects almost all backends in OpenCV DNN. The first forward pass must be ignored while measuring FPS.
Adrian Rosebrock
Great pointer, thank you for that Yashas 🙂
Yashas
I just ran your code on my PC. It turns out that the OpenCV DNN on GPU is so fast that the non-DNN part of the code takes up 85% of the time. Hence, the benchmark code isn’t really measuring the DNN performance; rather, it is measuring the pre/postprocessing and the IO part.
Here is some estimate of how much impact the non-DNN part is having on your device:
This post reports an FPS of 12 => ~83ms per frame.
On RTX 2080 Ti, it takes 10ms for the inference. It should be even faster on V100. Let’s be conservative and take it to be 10ms.
Approximately 73ms of the time is spent in IO and non-DNN stuff.
For a fair comparison of DNN performance (the non-DNN time is significant for CPU inference too), I’d recommend measuring the time taken to execute `net.forward(outLayerNames)`.
Adrian Rosebrock
Thanks for taking the time to look into this Yashas, I really appreciate it.
My bet is that YOLO requires NMS to be executed after the forward pass, and combined with Python’s slower “for” loops, we run into a bottleneck. We’ll run Python’s debugger (“pdb”) and see what else we can find.
Thanks again!
Christoph
“YOLO object detection at 11.87 FPS”: not even 12 FPS. Is that a typo or just a poor old version of YOLO?
Adrian Rosebrock
No, that was with the latest version of YOLO (v3). For whatever reason YOLO is just very slow using OpenCV.
Vadim Voynov
Hi! I’ve got the following results on my rented server with NVIDIA GeForce GTX 1080 Ti
Single Shot Detectors
no GPU:
elapsed time: 12.22
approx. FPS: 20.21
use GPU:
elapsed time: 19.74
approx. FPS: 12.52
YOLO object detection
elapsed time: 88.30
approx. FPS: 1.43
use GPU:
elapsed time: 14.96
approx. FPS: 8.42
Mask R-CNN Instance Segmentation
elapsed time: 1996.47
approx. FPS: 0.28
use GPU:
elapsed time: 43.47
approx. FPS: 12.77
Why is Single Shot Detectors performance worse using this GPU card?
Adrian Rosebrock
The SSD object detector was designed to run in near real-time on a CPU. It’s likely that your GPU spends all its time waiting for the CPU to bring it new frames and the I/O overhead of loading/offloading the frame and object detection results is slower than simply running the model on your CPU.
You can use the “nvidia-smi” command to confirm if your GPU is sitting idle (which I likely believe it is).
Yashas
MobileNet SSD performs poorly (or below expectation) because cuDNN does not have optimized kernels for depthwise convolutions on all devices.
Adrian Rosebrock
Thanks for the clarification there!
Vincent bellet
Awesome tutorial, as always.
How do that compare to TensorRT and do you think it would run a jetson nano ?
Adrian Rosebrock
Take a look at the comments thread from my OpenCV “dnn” NVIDIA GPU install post. Most readers are unfortunately reporting very little performance gains on the Jetson Nano.
Narayan
Is it possible to make detected object as audio (speech) output…?
If yes then how can it be made?
Adrian Rosebrock
Yes, take a look at text-to-speech libraries. Google’s “gTTS” is a good one. For an example of using gTTS for text-to-speech, refer to Raspberry Pi for Computer Vision.
Winston
Hi Adrian,
Thanks for sharing this great article.
Just wondering, is it possible to have this on Raspberry Pi 4 as well?
I know it doesn’t have a CUDA GPU, but does the OpenCV.dnn work on ARM CPU as well?
Cheers,
Winston
Adrian Rosebrock
No, unfortunately it will not. OpenCL isn’t compatible with the RPi4 which would be a requirement in order to use the RPi’s graphics unit. I would suggest you instead use a Movidius NCS or Google Coral USB Accelerator to improve inference speed.
Andrew Baker
Here is a data point using the Titan XP. GPU performance was much better, however not for the Single Shot Detectors. The same held true for last week’s tutorial. The CPU vastly outperformed the GPU.
Versions:
Ubuntu 18.04.4 LTS
Python 3.6.9
CUDA v10.0.130
cuDNN v7.6.4
GPU Titan XP compute capability 6.1
NVIDIA Driver 440.48.02
CPU Intel i7-7800X 3.5GHz x 12
openCV v4.2.0
Output:
Single Shot Detectors
no GPU:
elapsed time: 7.40
approx. FPS: 33.37
use GPU:
elapsed time: 15.27
approx. FPS: 16.17
down 51.5%
YOLO object detection
elapsed time: 30.40
approx. FPS: 4.14
use GPU:
elapsed time: 7.58
approx. FPS: 16.62
up 301%
Mask R-CNN Instance Segmentation
elapsed time: 540.97
approx. FPS: 1.03
use GPU:
elapsed time: 36.47
approx. FPS: 15.22
up 1378%
Adrian Rosebrock
Thanks for sharing, Andrew. Refer to Yashas’ replies on this thread and they discuss why the SSD may perform poorly on some GPUs.
Parthasarathy
Hi Adrian Rosebrock, I really enjoy reading your blog post and I am a big fan of yours. I just wanted to know if opencv’s dnn module provides functionalities for extracting facial landmarks just like dlib does?
Adrian Rosebrock
It does not. I would recommend you use dlib’s facial landmark functions if you need to detect and extract facial landmarks.
Ankit
Hello Adrian, Awesome tutorial, but i got the below warning and hence i am unable to use GPU for this code
Version:
Cuda: 10
CuDnn: 7.4
python = 3.6.10
open-cv =4.2
opencv-python\opencv\modules\dnn\src\dnn.cpp (1363) cv::dnn::dnn4_v20191202::Net::Impl::setUpNet DNN module was not built with CUDA backend; switching to CPU
Adrian Rosebrock
What is your GPU model?
anami
I have the same issue. My GPU model is a GTX 1650; I set CUDA_ARCH_BIN = 7.5 and rebuilt, but it never solved the problem.
Xue Wen
Hi Adrian, wonderful tutorial. OpenCV DNN is powerful, according to your tutorial, I used the module successfully. Thanks very much!
YOLO GPU:
[INFO] loading YOLO from disk…
[INFO] setting preferable backend and target to CUDA…
[INFO] accessing video stream…
[INFO] elasped time: 9.86
[INFO] approx. FPS: 12.78
YOLO CPU
[INFO] loading YOLO from disk…
[INFO] accessing video stream…
[INFO] elasped time: 45.93
[INFO] approx. FPS: 2.74
Mask R-CNN GPU:
[INFO] loading Mask R-CNN from disk…
[INFO] setting preferable backend and target to CUDA…
[INFO] accessing video stream…
[INFO] elasped time: 54.78
[INFO] approx. FPS: 10.13
Mask R-CNN CPU:
[INFO] loading Mask R-CNN from disk…
[INFO] accessing video stream…
[INFO] elasped time: 826.92
[INFO] approx. FPS: 0.67
Adrian Rosebrock
Thanks Xue Wen.
What model GPU were you using for your experiments?
Xue Wen
NVIDIA Geforce GTX1070
Adrian Rosebrock
Thanks for sharing 🙂
Saurabh
Hello Adrian,
Thanks for sharing an interesting blog!
Could you please share your views on “How to label overlapping objects?” What is the best practice with reference to overlapping objects? The problem is most of the labeling tools don’t support oriented bounding boxes.
How can I inform my object detector that it should look at only certain part of images without cropping images? Can I edit images and put white/black (constant) color so that object detector will ignore such areas?
Kindly share your views.
Thanking you!
Adrian Rosebrock
Your question is outside the context of this tutorial. To keep topics on track, I typically request readers make a best effort to keep comments related to what is actually covered in the tutorial. Otherwise, I would suggest you look at Deep Learning for Computer Vision with Python which covers image annotation tools.
Saurabh
Thanks for the pointer!
Adrian Rosebrock
You are welcome 🙂
Jacop
Awesome tutorial, I wonder is it possible to use mask R-CNN to detect only specific objects (I’m not interested in all the 90 categories)? Also, I wonder how can I try Mask RCNN with other pretrained models than inception_v2?
Thanks!
Adrian Rosebrock
You can learn how to filter your object detection or instance segmentation results using this tutorial.
Patrick
Hello Adrian,
Here are my results on a Dell AlienWare I7 + GTX960
[INFO] SSD
CPU approx. FPS: 25.92
GPU approx. FPS: 10.80
[INFO] loading YOLO from disk…
CPU approx. FPS: 2.80
GPU approx. FPS: 9.08
[INFO] loading Mask R-CNN from disk…
CPU FPS: 0.67
GPU approx. FPS: 5.52
As I wote last week, I can compile OpenCV for a CC of 5.2 using Master as they have patched for CUDA_FP16 ( half precision kernels.) , look here: https://github.com/opencv/opencv/pull/16218
This table is showing features that impact performance indexes: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications
Adrian Rosebrock
Thank you for sharing, Patrick!
Mahmud Hasan
Hi Adrian
Hope you are doing great. Just one question. Why arent these awesome tutorials part of DL4CV? I will be finishing DL4CV in a few months, still in the second book. What do you suggest doing after finishing Dl4CV? Any books or courses you suggest so that I get a good experience with well explained chapters like in DL4CV, if I wish to move onto things like object detection, YOLO etc.
Adrian Rosebrock
Thanks for the comment, Mahmud. I’m also so happy to hear you are enjoying DL4CV.
To address your question:
These tutorials are brand new. They couldn’t possibly be in DL4CV as OpenCV 4.2 was only released back in December. It takes time to properly vet the tutorials and make sure they run out-of-the-box.
That said, I will be incorporating variations of these guides as new chapters into DL4CV, including chapters that go into more detail.
If you enjoyed DL4CV I would recommend you next go through the PyImageSearch Gurus course. That course covers the computer vision field as a whole (i.e., not just the intersection of computer vision and deep learning).
koksang
Looking good. Also, instead of loading it on disk, is that possible to load our model from a remote disk. Let’s say from google cloud storage?
Adrian Rosebrock
Presumably yes, but you would need to check out the Google Cloud Storage APIs on how to download the model.
ibrahim
hi adrian,
could you tell us the hardware specs you ran this tutorial on?
thanks
Adrian Rosebrock
I used an NVIDIA V100 as my GPU.
questioner
Adrian, did you do any experiments to see if 4.2 offers any means to do face recognition that is comparable in terms of accuracy to dlib’s CNN mode but possibly faster due to OpenCV’s new CUDA dnn support?
Adrian Rosebrock
No, I haven’t run any experiments to compare face recognition models. Accuracy won’t change, only the speed of the inference model.
Dornelas
One of the best pieces of code that I’ve seen, because this is faster than converting the YOLO model to Keras/TF and you don’t have the loss generated by the model conversion.
Congratulations Adrian and Team!
Adrian Rosebrock
Thanks so much, Dornelas!
Aref
Hi Adrian,
Thanks for the tutorial. I tried this on my desktop and jetson nano both.
Jetson nano in 10W mode
SSD object detection: CPU: 3.33 FPS, CUDA Enabled: 1.76 FPS
YOLO object detection: CPU: 0.17 FPS, CUDA Enabled: 0.95 FPS
Mask rcnn segmentation:: CPU: 0.07 FPS, CUDA Enabled: 0.54 FPS
Geforce GTX 1080, intel core i7-8700, 16 GB Ram
SSD object detection: CPU: 34.66 FPS, CUDA Enabled: 18 FPS
YOLO object detection: CPU: 4.66 FPS, CUDA Enabled: 16.49 FPS
Mask rcnn segmentation:: CPU: 1.15 FPS, CUDA Enabled: 12.43 FPS
On both tests, the CUDA enabled version of SSD is slower than CPU. On jetson nano 5.5x speed up in YOLO and for mask rcnn 7.7x speed up.
On my desktop 3.5x for YOLO and 10.8x speedup for mask rcnn.
Adrian Rosebrock
Thanks for sharing, Aref!
Allaye
Hi Adrian, I am actually trying to use FCOS (Fully Convolutional One-Stage Object Detection) for my senior year thesis in college, but it seems too slow to be actually useful, I am trying to follow your tutorial and see if it might be able to make it work, thanks for this great piece, any advice will be appreciated.
Mehrdad
Hi everyone,
Which semantic segmentation models can we run on the dnn GPU backend? I’ve already run ENet, but I need something better and faster than ENet, like FC‑HarDNet‑70.
farejoe
May I know the list of development boards on which this can be achieved? I mean, can this be achieved on boards without an NVIDIA GPU that still have a different GPU unit on board?
Gleisson Bezerra
Adrian,
I would like to share with you a Docker Image that I created with OpenCV 4.2.0 compiled with CUDNN 7.6.3 enabled to be used on NVidia Jetson Nano Arm64v8 architecture. GStreamer and Python 3.6 also installed:
https://hub.docker.com/r/gleissonbezerra/jetson-nano-l4t-cuda-cudnn-opencv-4.2.0
More details about dockerfile: https://github.com/gleissonbezerra
Adrian Rosebrock
Thank you for sharing!
Akshar Awari
Hey Adrian, can you suggest GPUs that can support a conventional increase in SSD performance compared to the CPU? As you and Yashas discussed above, can anything be done to tackle the problem?
Kindly share your views!
Thank you for sharing, Akshar.
Adrian Rosebrock
Sorry, I don’t know that information off the top of my head. I would suggest contacting Yashas directly.
Rahul
What is the best way to extract facial landmarks on a Jetson Nano with better FPS?
Adrian Rosebrock
Facial landmarks are actually very fast to compute. It’s likely your face detector that is slowing down your pipeline. What face detector are you using?
rahul
I got around 23 FPS with my trained model, but I need to customize my model again to get around 27 FPS in real time. Is that possible? A face detector + facial landmark detector got 20 FPS on the Jetson Nano. Which is the best pretrained model to get more FPS on the Jetson Nano for face detection + landmarks?
Adrian Rosebrock
What have you tried thus far to improve the speed? Additionally, have you read Raspberry Pi for Computer Vision? That book covers my optimization suggestions, including ones you can apply to the Jetson Nano.
Akram Abu Owaimer
Hi,
I have a custom model based on YOLOv3 trained in Keras.
I want to run it in my application using the dnn module in OpenCV.
Can you please help me? ^_^
Adrian Rosebrock
That may or may not be possible depending on the custom layers used in your YOLO implementation. OpenCV only supports “standard” layers and I’m willing to bet that your implementation of YOLO uses custom layers which don’t have a corresponding implementation in OpenCV.
Dave0
Hello everyone,
I have just noticed that OpenCV 4.3 released yesterday with many improvements to the DNN module. I don’t currently have access to a machine with a GPU but I am curious if these improvements will impact the code from this article. I am especially interested in if they improve the YOLOv3 FPS.
It would be a great help to a project I am currently working on, (when out of lockdown) if someone could report if there is any improvement. Also, has anyone tried with targeting FP16 backend?
Thanks in advance!