In this tutorial, you will learn to improve text detection speed with OpenCV and GPUs.
This tutorial is the last in a 4-part series on OCR with Python:
- Multi-Column Table OCR
- OpenCV Fast Fourier Transform (FFT) for Blur Detection in Images and Video Streams
- OCR’ing Video Streams
- Improving Text Detection Speed with OpenCV and GPUs (this tutorial)
To learn how to improve text detection speed with OpenCV and GPUs, just keep reading.
Improving Text Detection Speed with OpenCV and GPUs
Up to this point, everything except the EasyOCR material has focused on performing OCR on our CPU. But what if we could instead apply OCR on our GPU? Since many state-of-the-art text detection and OCR models are deep learning-based, couldn’t these models run faster and more efficiently on a GPU?
The answer is yes; they absolutely can.
This tutorial will show you how to take the efficient and accurate scene text detector (EAST) model and run it on OpenCV's dnn (deep neural network) module using an NVIDIA GPU. As we'll see, our text detection throughput rate more than quadruples, improving from ~23 frames per second (FPS) to an astounding ~97 FPS!
In this tutorial, you will:
- Learn how to use OpenCV's dnn module to run deep neural networks on an NVIDIA CUDA-based GPU
- Implement a Python script to benchmark text detection speed on both a CPU and GPU
- Implement a second Python script that performs text detection in real-time video streams
- Compare the results of running text detection on a CPU versus a GPU
Using Your GPU for OCR with OpenCV
The first part of this tutorial covers reviewing our directory structure for the project.
We’ll then implement a Python script to benchmark running text detection on a CPU versus a GPU. We’ll run this script and measure just how much of a difference running text detection on a GPU improves our FPS throughput rate.
Once we’ve measured our FPS increase, we’ll implement a second Python script that performs text detection in real-time video streams.
We’ll wrap up the tutorial with a discussion of our results.
Configuring Your Development Environment
To follow this guide, you need to have the OpenCV library installed on your system.
Luckily, OpenCV is pip-installable:
$ pip install opencv-contrib-python
If you need help configuring your development environment for OpenCV, we highly recommend that you read our pip install OpenCV guide — it will have you up and running in a matter of minutes.
Having Problems Configuring Your Development Environment?
All that said, are you:
- Short on time?
- Learning on your employer’s administratively locked system?
- Wanting to skip the hassle of fighting with the command line, package managers, and virtual environments?
- Ready to run the code right now on your Windows, macOS, or Linux system?
Then join PyImageSearch University today!
Gain access to Jupyter Notebooks for this tutorial and other PyImageSearch guides that are pre-configured to run on Google Colab’s ecosystem right in your web browser! No installation required.
And best of all, these Jupyter Notebooks will run on Windows, macOS, and Linux!
Project Structure
Before we can apply text detection with our GPU, we first need to review our project directory structure.
Start by accessing the “Downloads” section of this tutorial to retrieve the source code and example images.
From there, take a look at the directory structure:
|-- pyimagesearch
|   |-- __init__.py
|   |-- east
|   |   |-- __init__.py
|   |   |-- east.py
|-- ../models
|   |-- east
|   |   |-- frozen_east_text_detection.pb
|-- images
|   |-- car_wash.png
|-- text_detection_speed.py
|-- text_detection_video.py
We’ll be reviewing two Python scripts in this tutorial:
- text_detection_speed.py: Benchmarks text detection speed on a CPU versus a GPU using the car_wash.png image in our images directory.
- text_detection_video.py: Demonstrates how to perform real-time text detection on your GPU.
Implementing Our OCR GPU Benchmark Script
Before implementing text detection in real-time video streams with our GPU, let’s first benchmark how much of a speedup we get by running the EAST detection model on our CPU versus our GPU.
To find out, open the text_detection_speed.py file in our project directory, and let’s get started:
# import the necessary packages
from pyimagesearch.east import EAST_OUTPUT_LAYERS
import numpy as np
import argparse
import time
import cv2
Lines 2-6 handle importing our required Python packages. We need the EAST model’s output layers (Line 2) to grab the text detection outputs. If you need a refresher on these output values, be sure to refer to the OCR with OpenCV, Tesseract, and Python: Intro to OCR book.
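As a point of reference, EAST_OUTPUT_LAYERS is just a list of the two layer names we need to pull outputs from. Here is a minimal sketch of what that constant looks like (these are the layer names used in our OpenCV Text Detection tutorial; the exact layout of the pyimagesearch.east module may differ):

# the first layer outputs the probability of a region containing text,
# while the second outputs the bounding box geometry of the text
EAST_OUTPUT_LAYERS = [
	"feature_fusion/Conv_7/Sigmoid",
	"feature_fusion/concat_3"]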
Next, we have our command line arguments:
# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
	help="path to input image")
ap.add_argument("-e", "--east", required=True,
	help="path to input EAST text detector")
ap.add_argument("-w", "--width", type=int, default=320,
	help="resized image width (should be multiple of 32)")
ap.add_argument("-t", "--height", type=int, default=320,
	help="resized image height (should be multiple of 32)")
ap.add_argument("-c", "--min-conf", type=float, default=0.5,
	help="minimum probability required to inspect a text region")
ap.add_argument("-n", "--nms-thresh", type=float, default=0.4,
	help="non-maximum suppression threshold")
ap.add_argument("-g", "--use-gpu", type=bool, default=False,
	help="boolean indicating if CUDA GPU should be used")
args = vars(ap.parse_args())
The --image command line argument specifies the path to the input image where we’ll perform text detection.
Lines 12-21 then specify command line arguments for the EAST text detection model.
Finally, we have our --use-gpu command line argument. By default, we’ll use our CPU. But by supplying this argument (and provided we have a CUDA-capable GPU and OpenCV’s dnn module compiled with NVIDIA GPU support), we can use our GPU for text detection inference.
With our command line arguments taken care of, we can now load the EAST text detection model and set whether we are using the CPU or GPU:
# load the pre-trained EAST text detector
print("[INFO] loading EAST text detector...")
net = cv2.dnn.readNet(args["east"])

# check if we are going to use GPU
if args["use_gpu"]:
	# set CUDA as the preferable backend and target
	print("[INFO] setting preferable backend and target to CUDA...")
	net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
	net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

# otherwise we are using our CPU
else:
	print("[INFO] using CPU for inference...")
Line 28 loads our EAST text detection model from disk.
Lines 31-35 check whether the --use-gpu command line argument was supplied, and if so, indicate that we want to use our NVIDIA CUDA-capable GPU.
Note: To use your GPU for neural network inference, you need to have OpenCV’s dnn module compiled with NVIDIA CUDA support. OpenCV’s dnn module does not ship with NVIDIA GPU support via a pip install; instead, you need to compile OpenCV from source with GPU support explicitly enabled. We cover how to do that in a dedicated tutorial on PyImageSearch.
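If you’re not sure whether your OpenCV build actually includes CUDA support, a quick sanity check from a Python shell can save you some debugging. A minimal sketch (on a CPU-only pip install, the device count below reports 0):

import cv2

# a CUDA-enabled build reports at least one usable NVIDIA device;
# the standard pip wheels report zero here
print("[INFO] CUDA devices:", cv2.cuda.getCudaEnabledDeviceCount())

# the build summary also states whether CUDA/cuDNN were compiled in
print("CUDA" in cv2.getBuildInformation())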
Next, let’s load our sample image from disk:
# load the input image and then set the new width and height values
# based on our command line arguments
image = cv2.imread(args["image"])
(newW, newH) = (args["width"], args["height"])

# construct a blob from the image, set the blob as input to the
# network, and initialize a list that records the amount of time
# each forward pass takes
print("[INFO] running timing trials...")
blob = cv2.dnn.blobFromImage(image, 1.0, (newW, newH),
	(123.68, 116.78, 103.94), swapRB=True, crop=False)
net.setInput(blob)
timings = []
Line 43 loads our input --image from disk, while Lines 50 and 51 construct a blob object that we can pass through the EAST text detection model. Line 52 sets our blob as input to the EAST network, while Line 53 initializes a timings list to measure how long each forward pass takes.
When using a GPU for inference, your first prediction tends to be very slow compared to the rest because your GPU hasn’t “warmed up” yet. Therefore, when taking measurements on your GPU, you typically want to average over several predictions.
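One way to exclude that warm-up cost from the measurement entirely is to run a few untimed forward passes before the timed trials begin. A minimal sketch (the choice of 10 warm-up iterations is arbitrary):

# run a few untimed forward passes first so CUDA kernel compilation,
# memory allocation, and cuDNN autotuning don't pollute the timings
for _ in range(10):
	net.forward(EAST_OUTPUT_LAYERS)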
In the following code block, we perform text detection for 500 trials, recording how long each prediction takes:
# loop over 500 trials to obtain a good approximation to how long
# each forward pass will take
for i in range(0, 500):
	# time the forward pass
	start = time.time()
	(scores, geometry) = net.forward(EAST_OUTPUT_LAYERS)
	end = time.time()
	timings.append(end - start)

# show average timing information on text prediction
avg = np.mean(timings)
print("[INFO] avg. text detection took {:.6f} seconds".format(avg))
After all trials are complete, we compute the average of the timings list and then display the average text detection time in our terminal.
Speed Test: OCR With and Without GPU
Let’s now measure our EAST text detection FPS throughput rate without a GPU (i.e., running on a CPU):
$ python text_detection_speed.py --image images/car_wash.png --east ../models/east/frozen_east_text_detection.pb
[INFO] loading EAST text detector...
[INFO] using CPU for inference...
[INFO] running timing trials...
[INFO] avg. text detection took 0.108568 seconds
Our average text detection takes ~0.1 seconds per frame. Since throughput is the reciprocal of latency (1 / 0.108568 ≈ 9.2), that equates to ~9-10 FPS. A deep learning model running on a CPU at that rate is fast enough for many applications.
However, as Tim Taylor (played by Tim Allen, the voice of Buzz Lightyear in Toy Story) of the 1990s TV show Home Improvement would say, “More power!”
Let’s now break out the GPUs:
$ python text_detection_speed.py --image images/car_wash.png --east ../models/east/frozen_east_text_detection.pb --use-gpu 1
[INFO] loading EAST text detector...
[INFO] setting preferable backend and target to CUDA...
[INFO] running timing trials...
[INFO] avg. text detection took 0.004763 seconds
Using an NVIDIA V100 GPU, our average per-frame inference time drops to ~0.0048 seconds, meaning we can now process roughly 1 / 0.004763 ≈ 210 FPS! As you can see, using your GPU makes a substantial difference!
OCR on GPU for Real-Time Video Streams
Ready to implement our script to perform text detection in real-time video streams using your GPU?
Open the text_detection_video.py file in your project directory, and let’s get started:
# import the necessary packages
from pyimagesearch.east import EAST_OUTPUT_LAYERS
from pyimagesearch.east import decode_predictions
from imutils.video import VideoStream
from imutils.video import FPS
import numpy as np
import argparse
import imutils
import time
import cv2
Lines 2-10 import our required Python packages. The EAST_OUTPUT_LAYERS constant and decode_predictions function come from our implementation of the EAST text detector in our tutorial, OpenCV Text Detection. Be sure to review that lesson if you need a refresher on the EAST detection model.
Line 4 imports our VideoStream class to access our webcam, while Line 5 provides our FPS class to measure the FPS throughput rate of our pipeline.
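If you haven’t used the FPS class before, its API is tiny and follows a start/update/stop pattern. A quick illustrative sketch (the sleep call simply stands in for per-frame work):

from imutils.video import FPS
import time

fps = FPS().start()       # record the start timestamp
for _ in range(100):      # stand-in for our frame-processing loop
	time.sleep(0.01)      # pretend to do some per-frame work
	fps.update()          # increment the processed-frame counter
fps.stop()                # record the end timestamp

print(fps.elapsed())      # seconds between start() and stop()
print(fps.fps())          # frames processed / elapsed seconds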
Let’s now proceed to our command line arguments:
# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--input", type=str,
	help="path to optional input video file")
ap.add_argument("-e", "--east", required=True,
	help="path to input EAST text detector")
ap.add_argument("-w", "--width", type=int, default=320,
	help="resized image width (should be multiple of 32)")
ap.add_argument("-t", "--height", type=int, default=320,
	help="resized image height (should be multiple of 32)")
ap.add_argument("-c", "--min-conf", type=float, default=0.5,
	help="minimum probability required to inspect a text region")
ap.add_argument("-n", "--nms-thresh", type=float, default=0.4,
	help="non-maximum suppression threshold")
ap.add_argument("-g", "--use-gpu", type=bool, default=False,
	help="boolean indicating if CUDA GPU should be used")
args = vars(ap.parse_args())
These command line arguments are nearly identical to the previous script’s. The only exception is that we’ve swapped the --image command line argument for an --input argument, which specifies the path to an optional video file on disk (in case we want to process a video file rather than our webcam).
Next, we have a few initializations:
# initialize the original frame dimensions, new frame dimensions,
# and ratio between the dimensions
(W, H) = (None, None)
(newW, newH) = (args["width"], args["height"])
(rW, rH) = (None, None)
Here we initialize our original frame’s width and height, the new frame dimensions for the EAST model, followed by the ratio between the original and the new dimensions.
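For example, suppose we resize a frame to a width of 1000 pixels with a proportional height of, say, 562 pixels (these numbers are purely illustrative), while the EAST model consumes a 320×320 input. The scaling ratios would then be rW = 1000 / 320 ≈ 3.13 and rH = 562 / 320 ≈ 1.76, and any (x, y)-coordinate the model predicts gets multiplied by these ratios to map it back onto the displayed frame.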
This next code block handles loading the EAST text detection model from disk and then setting whether we are using our CPU or GPU for inference:
# load the pre-trained EAST text detector
print("[INFO] loading EAST text detector...")
net = cv2.dnn.readNet(args["east"])

# check if we are going to use GPU
if args["use_gpu"]:
	# set CUDA as the preferable backend and target
	print("[INFO] setting preferable backend and target to CUDA...")
	net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
	net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

# otherwise we are using our CPU
else:
	print("[INFO] using CPU for inference...")
Our text detection model needs frames to operate on, so the next code block accesses either our webcam or a video file residing on disk, depending on whether the --input command line argument was supplied:
# if a video path was not supplied, grab the reference to the webcam
if not args.get("input", False):
	print("[INFO] starting video stream...")
	vs = VideoStream(src=0).start()
	time.sleep(1.0)

# otherwise, grab a reference to the video file
else:
	vs = cv2.VideoCapture(args["input"])

# start the FPS throughput estimator
fps = FPS().start()
Line 62 starts measuring our FPS throughput rates to get a good idea of the number of frames our text detection pipeline can process in a single second.
Let’s start looping over frames from the video stream now:
# loop over frames from the video stream
while True:
	# grab the current frame, then handle if we are using a
	# VideoStream or VideoCapture object
	frame = vs.read()
	frame = frame[1] if args.get("input", False) else frame

	# check to see if we have reached the end of the stream
	if frame is None:
		break

	# resize the frame, maintaining the aspect ratio
	frame = imutils.resize(frame, width=1000)
	orig = frame.copy()

	# if our frame dimensions are None, we still need to compute the
	# ratio of old frame dimensions to new frame dimensions
	if W is None or H is None:
		(H, W) = frame.shape[:2]
		rW = W / float(newW)
		rH = H / float(newH)
Lines 68 and 69 read the next frame from either our webcam or video file.

If we are processing a video file, Line 72 checks whether we have reached the end of the video, and if so, we break from the loop.

Lines 81-84 grab the spatial dimensions of the input frame and then compute the ratio of the original frame dimensions to the dimensions required by the EAST model.
Now that we have these dimensions, we can construct our input to the EAST text detector:
	# construct a blob from the image and then perform a forward pass
	# of the model to obtain the two output layer sets
	blob = cv2.dnn.blobFromImage(frame, 1.0, (newW, newH),
		(123.68, 116.78, 103.94), swapRB=True, crop=False)
	net.setInput(blob)
	(scores, geometry) = net.forward(EAST_OUTPUT_LAYERS)

	# decode the predictions from OpenCV's EAST text detector and
	# then apply non-maximum suppression (NMS) to the rotated
	# bounding boxes
	(rects, confidences) = decode_predictions(scores, geometry,
		minConf=args["min_conf"])
	idxs = cv2.dnn.NMSBoxesRotated(rects, confidences,
		args["min_conf"], args["nms_thresh"])
Lines 88-91 build a blob from the input frame. We then set this blob as input to our EAST text detection net and perform a forward pass of the network, resulting in our raw text detections.
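If you’re curious what the forward pass actually returns, printing the array shapes is illuminating. For the 320×320 input used here, EAST produces its two output maps at 1/4 of the input resolution:

print(scores.shape)    # (1, 1, 80, 80) -> per-cell text confidence scores
print(geometry.shape)  # (1, 5, 80, 80) -> 4 edge distances + 1 angle per cell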
However, those raw text detections are unusable in their current state, so we call decode_predictions on them, yielding a 2-tuple of the bounding box coordinates of the text detections along with the associated probabilities (Lines 96 and 97).
We then apply non-maximum suppression to suppress weak, overlapping bounding boxes (otherwise, there would be multiple bounding boxes for each detection).
If you need more details on this code block, including how the decode_predictions function is implemented, be sure to review OpenCV Text Detection, where I cover the EAST text detector in far more detail.
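For convenience, here is a minimal sketch of what such a decode_predictions function can look like, adapted from OpenCV’s official EAST text detection sample. Our pyimagesearch.east implementation may differ in its details, so treat this as illustrative rather than the exact code:

import numpy as np

def decode_predictions(scores, geometry, minConf=0.5):
	# grab the dimensions of the score map, then initialize our lists
	# of rotated bounding boxes and corresponding confidences
	(numRows, numCols) = scores.shape[2:4]
	(rects, confidences) = ([], [])

	# loop over each cell of the 1/4-resolution output maps
	for y in range(0, numRows):
		scoresData = scores[0, 0, y]
		(d0, d1, d2, d3) = (geometry[0, 0, y], geometry[0, 1, y],
			geometry[0, 2, y], geometry[0, 3, y])
		anglesData = geometry[0, 4, y]

		for x in range(0, numCols):
			# ignore cells with insufficient text confidence
			if scoresData[x] < minConf:
				continue

			# map the cell back to image coordinates (the output maps
			# are 4x smaller than the input), and grab the angle
			(offsetX, offsetY) = (x * 4.0, y * 4.0)
			angle = anglesData[x]
			(cos, sin) = (np.cos(angle), np.sin(angle))

			# the geometry map stores the distances from the cell to
			# the top, right, bottom, and left edges of the text box
			h = d0[x] + d2[x]
			w = d1[x] + d3[x]

			# compute the center of the rotated rectangle
			offset = (offsetX + (cos * d1[x]) + (sin * d2[x]),
				offsetY - (sin * d1[x]) + (cos * d2[x]))
			p1 = (-sin * h + offset[0], -cos * h + offset[1])
			p3 = (-cos * w + offset[0], sin * w + offset[1])
			center = (0.5 * (p1[0] + p3[0]), 0.5 * (p1[1] + p3[1]))

			# store the box as ((cx, cy), (w, h), angle) -- the format
			# cv2.dnn.NMSBoxesRotated and cv2.boxPoints expect
			rects.append((center, (w, h), -angle * 180.0 / np.pi))
			confidences.append(float(scoresData[x]))

	return (rects, confidences)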
After non-maximum suppression (NMS), we can now loop over each of the bounding boxes:
	# ensure that at least one text bounding box was found
	if len(idxs) > 0:
		# loop over the valid bounding box indexes after applying NMS
		for i in idxs.flatten():
			# compute the four corners of the bounding box, scale the
			# coordinates based on the respective ratios, and then
			# convert the box to an integer NumPy array
			box = cv2.boxPoints(rects[i])
			box[:, 0] *= rW
			box[:, 1] *= rH
			box = np.int0(box)

			# draw a rotated bounding box around the text
			cv2.polylines(orig, [box], True, (0, 255, 0), 2)

	# update the FPS counter
	fps.update()

	# show the output frame
	cv2.imshow("Text Detection", orig)
	key = cv2.waitKey(1) & 0xFF

	# if the 'q' key was pressed, break from the loop
	if key == ord("q"):
		break
Line 102 verifies that at least one text bounding box was found, and if so, we loop over the indexes of the kept bounding boxes after applying NMS.
For each resulting index, we compute the bounding box of the text ROI, scale the bounding box (x, y)-coordinates back to the orig input frame dimensions, and then draw the bounding box on the orig frame (Lines 108-114).
Line 117 updates our FPS throughput estimator while Lines 120-125 display the output text detection on our screen.
The final step here is to stop our FPS timer, approximate the throughput rate, and release any video file pointers:
# stop the timer and display FPS information
fps.stop()
print("[INFO] elapsed time: {:.2f}".format(fps.elapsed()))
print("[INFO] approx. FPS: {:.2f}".format(fps.fps()))

# if we are using a webcam, release the pointer
if not args.get("input", False):
	vs.stop()

# otherwise, release the file pointer
else:
	vs.release()

# close all windows
cv2.destroyAllWindows()
Lines 128-130 stop our FPS timer and approximate the FPS of our text detection pipeline. We then release any video file pointers and close any windows opened by OpenCV.
GPU and OCR Results
Note that this section needs to be executed locally on a machine with a GPU. After running the text_detection_video.py script on an NVIDIA RTX 2070 SUPER GPU (paired with an i9 9900K processor), I obtained ~97 FPS:
$ python text_detection_video.py --east ../models/east/frozen_east_text_detection.pb --use-gpu 1
[INFO] loading EAST text detector...
[INFO] setting preferable backend and target to CUDA...
[INFO] starting video stream...
[INFO] elapsed time: 74.71
[INFO] approx. FPS: 96.80
When I ran the same script without the GPU, I reached ~23 FPS, which is ~77% slower than the result above:
$ python text_detection_video.py --east ../models/east/frozen_east_text_detection.pb
[INFO] loading EAST text detector...
[INFO] using CPU for inference...
[INFO] starting video stream...
[INFO] elapsed time: 68.59
[INFO] approx. FPS: 22.70
As you can see, using your GPU can dramatically improve the throughput speed of your text detection pipeline!
Summary
In this tutorial, you learned how to perform text detection in real-time video streams using your GPU. Since many text detection and OCR models are deep learning-based, using your GPU (vs. your CPU) can tremendously increase your frame processing throughput rate.
Using our CPU, we were able to process only ~22-23 FPS. However, by running the EAST model on our GPU via OpenCV’s dnn module, we reached ~97 FPS!
If you have a GPU available to you, definitely consider utilizing it — you’ll be able to run text detection models in real-time!
Citation Information
Rosebrock, A. “Improving Text Detection Speed with OpenCV and GPUs,” PyImageSearch, D. Chakraborty, P. Chugh, A. R. Gosthipaty, S. Huot, K. Kidriavsteva, R. Raha, and A. Thanki, eds., 2022, https://pyimg.co/9wde6
@incollection{Rosebrock_2022_Improving_Text,
	author = {Adrian Rosebrock},
	title = {Improving Text Detection Speed with {OpenCV} and {GPUs}},
	booktitle = {PyImageSearch},
	editor = {Devjyoti Chakraborty and Puneet Chugh and Aritra Roy Gosthipaty and Susan Huot and Kseniia Kidriavsteva and Ritwik Raha and Abhishek Thanki},
	year = {2022},
	note = {https://pyimg.co/9wde6},
}