In this tutorial, you will learn to improve text detection speed with OpenCV and GPUs.
This tutorial is the last in a 4-part series on OCR with Python:
- Multi-Column Table OCR
- OpenCV Fast Fourier Transform (FFT) for Blur Detection in Images and Video Streams
- OCR’ing Video Streams
- Improving Text Detection Speed with OpenCV and GPUs (this tutorial)
To learn how to improve text detection speed with OpenCV and GPUs, just keep reading.
Improving Text Detection Speed with OpenCV and GPUs
Up to this point, everything except the EasyOCR material has focused on performing OCR on our CPU. But what if we could instead apply OCR on our GPU? Since many state-of-the-art text detection and OCR models are deep learning-based, couldn’t these models run faster and more efficiently on a GPU?
The answer is yes; they absolutely can.
This tutorial will show you how to take the efficient and accurate scene text detector (EAST) model and run it on OpenCV's dnn (deep neural network) module using an NVIDIA GPU. As we'll see, our text detection throughput rate more than quadruples, improving from ~23 frames per second (FPS) to an astounding ~97 FPS!
In this tutorial, you will:
- Learn how to use OpenCV's dnn module to run deep neural networks on an NVIDIA CUDA-based GPU
- Implement a Python script to benchmark text detection speed on both a CPU and GPU
- Implement a second Python script that performs text detection in real-time video streams
- Compare the results of running text detection on a CPU versus a GPU
Using Your GPU for OCR with OpenCV
The first part of this tutorial covers reviewing our directory structure for the project.
We’ll then implement a Python script to benchmark running text detection on a CPU versus a GPU. We’ll run this script and measure just how much of a difference running text detection on a GPU improves our FPS throughput rate.
Once we’ve measured our FPS increase, we’ll implement a second Python script that performs text detection in real-time video streams.
We’ll wrap up the tutorial with a discussion of our results.
Configuring Your Development Environment
To follow this guide, you need to have the OpenCV library installed on your system.
Luckily, OpenCV is pip-installable:
$ pip install opencv-contrib-python
If you need help configuring your development environment for OpenCV, we highly recommend that you read our pip install OpenCV guide — it will have you up and running in a matter of minutes.
Having Problems Configuring Your Development Environment?
All that said, are you:
- Short on time?
- Learning on your employer’s administratively locked system?
- Wanting to skip the hassle of fighting with the command line, package managers, and virtual environments?
- Ready to run the code right now on your Windows, macOS, or Linux system?
Then join PyImageSearch University today!
Gain access to Jupyter Notebooks for this tutorial and other PyImageSearch guides that are pre-configured to run on Google Colab’s ecosystem right in your web browser! No installation required.
And best of all, these Jupyter Notebooks will run on Windows, macOS, and Linux!
Project Structure
Before we can apply text detection with our GPU, we first need to review our project directory structure.
Start by accessing the “Downloads” section of this tutorial to retrieve the source code and example images.
From there, take a look at the directory structure:
|-- pyimagesearch
|   |-- __init__.py
|   |-- east
|   |   |-- __init__.py
|   |   |-- east.py
|-- ../models
|   |-- east
|   |   |-- frozen_east_text_detection.pb
|-- images
|   |-- car_wash.png
|-- text_detection_speed.py
|-- text_detection_video.py
We’ll be reviewing two Python scripts in this tutorial:
- text_detection_speed.py: Benchmarks text detection speed on a CPU versus a GPU using the car_wash.png image in our images directory.
- text_detection_video.py: Demonstrates how to perform real-time text detection on your GPU.
Implementing Our OCR GPU Benchmark Script
Before implementing text detection in real-time video streams with our GPU, let’s first benchmark how much of a speedup we get by running the EAST detection model on our CPU versus our GPU.
To find out, open the text_detection_speed.py file in our project directory, and let’s get started:
# import the necessary packages
from pyimagesearch.east import EAST_OUTPUT_LAYERS
import numpy as np
import argparse
import time
import cv2
Lines 2-6 handle importing our required Python packages. We need the EAST model’s output layers (Line 2) to grab the text detection outputs. If you need a refresher on these output values, be sure to refer to the OCR with OpenCV, Tesseract, and Python: Intro to OCR book.
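As a point of reference, EAST_OUTPUT_LAYERS is just a list of the two layer names we need to pull outputs from. Here is a minimal sketch of what that constant looks like (these are the layer names used in our OpenCV Text Detection tutorial; the exact layout of the pyimagesearch.east module may differ):

# the first layer outputs the probability of a region containing text,
# while the second outputs the bounding box geometry of the text
EAST_OUTPUT_LAYERS = [
	"feature_fusion/Conv_7/Sigmoid",
	"feature_fusion/concat_3"]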
Next, we have our command line arguments:
# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
	help="path to input image")
ap.add_argument("-e", "--east", required=True,
	help="path to input EAST text detector")
ap.add_argument("-w", "--width", type=int, default=320,
	help="resized image width (should be multiple of 32)")
ap.add_argument("-t", "--height", type=int, default=320,
	help="resized image height (should be multiple of 32)")
ap.add_argument("-c", "--min-conf", type=float, default=0.5,
	help="minimum probability required to inspect a text region")
ap.add_argument("-n", "--nms-thresh", type=float, default=0.4,
	help="non-maximum suppression threshold")
ap.add_argument("-g", "--use-gpu", type=bool, default=False,
	help="boolean indicating if CUDA GPU should be used")
args = vars(ap.parse_args())
The --image command line argument specifies the path to the input image where we’ll perform text detection.
Lines 12-21 then specify command line arguments for the EAST text detection model.
Finally, we have our --use-gpu command line argument. By default, we’ll use our CPU. But by supplying this argument (and provided we have a CUDA-capable GPU and OpenCV’s dnn module compiled with NVIDIA GPU support), we can use our GPU for text detection inference.
With our command line arguments taken care of, we can now load the EAST text detection model and set whether we are using the CPU or GPU:
# load the pre-trained EAST text detector
print("[INFO] loading EAST text detector...")
net = cv2.dnn.readNet(args["east"])

# check if we are going to use GPU
if args["use_gpu"]:
	# set CUDA as the preferable backend and target
	print("[INFO] setting preferable backend and target to CUDA...")
	net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
	net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

# otherwise we are using our CPU
else:
	print("[INFO] using CPU for inference...")
Line 28 loads our EAST text detection model from disk.
Lines 31-35 check whether the --use-gpu command line argument was supplied, and if so, indicate that we want to use our NVIDIA CUDA-capable GPU.
Note: To use your GPU for neural network inference, you need to have OpenCV’s dnn module compiled with NVIDIA CUDA support. OpenCV’s dnn module does not ship with NVIDIA GPU support via a pip install; instead, you need to compile OpenCV from source with GPU support explicitly enabled. We cover how to do that in a dedicated tutorial on PyImageSearch.
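If you’re not sure whether your OpenCV build actually includes CUDA support, a quick sanity check from a Python shell can save you some debugging. A minimal sketch (on a CPU-only pip install, the device count below reports 0):

import cv2

# a CUDA-enabled build reports at least one usable NVIDIA device;
# the standard pip wheels report zero here
print("[INFO] CUDA devices:", cv2.cuda.getCudaEnabledDeviceCount())

# the build summary also states whether CUDA/cuDNN were compiled in
print("CUDA" in cv2.getBuildInformation())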
Next, let’s load our sample image from disk:
# load the input image and then set the new width and height values
# based on our command line arguments
image = cv2.imread(args["image"])
(newW, newH) = (args["width"], args["height"])

# construct a blob from the image, set the blob as input to the
# network, and initialize a list that records the amount of time
# each forward pass takes
print("[INFO] running timing trials...")
blob = cv2.dnn.blobFromImage(image, 1.0, (newW, newH),
	(123.68, 116.78, 103.94), swapRB=True, crop=False)
net.setInput(blob)
timings = []
Line 43 loads our input --image from disk, while Lines 50 and 51 construct a blob object that we can pass through the EAST text detection model. Line 52 sets our blob as input to the EAST network, while Line 53 initializes a timings list to measure how long each forward pass takes.
When using a GPU for inference, your first prediction tends to be very slow compared to the rest because your GPU hasn’t “warmed up” yet. Therefore, when taking measurements on your GPU, you typically want to average over several predictions.
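One way to exclude that warm-up cost from the measurement entirely is to run a few untimed forward passes before the timed trials begin. A minimal sketch (the choice of 10 warm-up iterations is arbitrary):

# run a few untimed forward passes first so CUDA kernel compilation,
# memory allocation, and cuDNN autotuning don't pollute the timings
for _ in range(10):
	net.forward(EAST_OUTPUT_LAYERS)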
In the following code block, we perform text detection for 500 trials, recording how long each prediction takes:
# loop over 500 trials to obtain a good approximation to how long
# each forward pass will take
for i in range(0, 500):
	# time the forward pass
	start = time.time()
	(scores, geometry) = net.forward(EAST_OUTPUT_LAYERS)
	end = time.time()
	timings.append(end - start)

# show average timing information on text prediction
avg = np.mean(timings)
print("[INFO] avg. text detection took {:.6f} seconds".format(avg))
After all trials are complete, we compute the average of the timings list and then display the average text detection time in our terminal.
Speed Test: OCR With and Without GPU
Let’s now measure our EAST text detection FPS throughput rate without a GPU (i.e., running on a CPU):
$ python text_detection_speed.py --image images/car_wash.png --east ../models/east/frozen_east_text_detection.pb
[INFO] loading EAST text detector...
[INFO] using CPU for inference...
[INFO] running timing trials...
[INFO] avg. text detection took 0.108568 seconds
Our average text detection takes ~0.1 seconds per frame. Since throughput is the reciprocal of latency (1 / 0.108568 ≈ 9.2), that equates to ~9-10 FPS. A deep learning model running on a CPU at that rate is fast enough for many applications.
However, as Tim Taylor (played by Tim Allen, the voice of Buzz Lightyear in Toy Story) of the 1990s TV show Home Improvement would say, “More power!”
Let’s now break out the GPUs:
$ python text_detection_speed.py --image images/car_wash.png --east ../models/east/frozen_east_text_detection.pb --use-gpu 1
[INFO] loading EAST text detector...
[INFO] setting preferable backend and target to CUDA...
[INFO] running timing trials...
[INFO] avg. text detection took 0.004763 seconds
Using an NVIDIA V100 GPU, our average per-frame inference time drops to ~0.0048 seconds, meaning we can now process roughly 1 / 0.004763 ≈ 210 FPS! As you can see, using your GPU makes a substantial difference!
OCR on GPU for Real-Time Video Streams
Ready to implement our script to perform text detection in real-time video streams using your GPU?
Open the text_detection_video.py file in your project directory, and let’s get started:
# import the necessary packages
from pyimagesearch.east import EAST_OUTPUT_LAYERS
from pyimagesearch.east import decode_predictions
from imutils.video import VideoStream
from imutils.video import FPS
import numpy as np
import argparse
import imutils
import time
import cv2
Lines 2-10 import our required Python packages. The EAST_OUTPUT_LAYERS constant and decode_predictions function come from our implementation of the EAST text detector in our tutorial, OpenCV Text Detection. Be sure to review that lesson if you need a refresher on the EAST detection model.
Line 4 imports our VideoStream class to access our webcam, while Line 5 provides our FPS class to measure the FPS throughput rate of our pipeline.
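If you haven’t used the FPS class before, its API is tiny and follows a start/update/stop pattern. A quick illustrative sketch (the sleep call simply stands in for per-frame work):

from imutils.video import FPS
import time

fps = FPS().start()       # record the start timestamp
for _ in range(100):      # stand-in for our frame-processing loop
	time.sleep(0.01)      # pretend to do some per-frame work
	fps.update()          # increment the processed-frame counter
fps.stop()                # record the end timestamp

print(fps.elapsed())      # seconds between start() and stop()
print(fps.fps())          # frames processed / elapsed seconds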
Let’s now proceed to our command line arguments:
# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--input", type=str,
	help="path to optional input video file")
ap.add_argument("-e", "--east", required=True,
	help="path to input EAST text detector")
ap.add_argument("-w", "--width", type=int, default=320,
	help="resized image width (should be multiple of 32)")
ap.add_argument("-t", "--height", type=int, default=320,
	help="resized image height (should be multiple of 32)")
ap.add_argument("-c", "--min-conf", type=float, default=0.5,
	help="minimum probability required to inspect a text region")
ap.add_argument("-n", "--nms-thresh", type=float, default=0.4,
	help="non-maximum suppression threshold")
ap.add_argument("-g", "--use-gpu", type=bool, default=False,
	help="boolean indicating if CUDA GPU should be used")
args = vars(ap.parse_args())
These command line arguments are nearly identical to the previous script’s. The only exception is that we’ve swapped the --image command line argument for an --input argument, which specifies the path to an optional video file on disk (in case we want to process a video file rather than our webcam).
Next, we have a few initializations:
# initialize the original frame dimensions, new frame dimensions,
# and ratio between the dimensions
(W, H) = (None, None)
(newW, newH) = (args["width"], args["height"])
(rW, rH) = (None, None)
Here we initialize our original frame’s width and height, the new frame dimensions for the EAST model, followed by the ratio between the original and the new dimensions.
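For example, suppose we resize a frame to a width of 1000 pixels with a proportional height of, say, 562 pixels (these numbers are purely illustrative), while the EAST model consumes a 320×320 input. The scaling ratios would then be rW = 1000 / 320 ≈ 3.13 and rH = 562 / 320 ≈ 1.76, and any (x, y)-coordinate the model predicts gets multiplied by these ratios to map it back onto the displayed frame.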
This next code block handles loading the EAST text detection model from disk and then setting whether we are using our CPU or GPU for inference:
# load the pre-trained EAST text detector
print("[INFO] loading EAST text detector...")
net = cv2.dnn.readNet(args["east"])

# check if we are going to use GPU
if args["use_gpu"]:
	# set CUDA as the preferable backend and target
	print("[INFO] setting preferable backend and target to CUDA...")
	net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
	net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

# otherwise we are using our CPU
else:
	print("[INFO] using CPU for inference...")
Our text detection model needs frames to operate on, so the next code block accesses either our webcam or a video file residing on disk, depending on whether the --input command line argument was supplied:
# if a video path was not supplied, grab the reference to the webcam
if not args.get("input", False):
	print("[INFO] starting video stream...")
	vs = VideoStream(src=0).start()
	time.sleep(1.0)

# otherwise, grab a reference to the video file
else:
	vs = cv2.VideoCapture(args["input"])

# start the FPS throughput estimator
fps = FPS().start()
Line 62 starts measuring our FPS throughput rates to get a good idea of the number of frames our text detection pipeline can process in a single second.
Let’s start looping over frames from the video stream now:
# loop over frames from the video stream
while True:
	# grab the current frame, then handle if we are using a
	# VideoStream or VideoCapture object
	frame = vs.read()
	frame = frame[1] if args.get("input", False) else frame

	# check to see if we have reached the end of the stream
	if frame is None:
		break

	# resize the frame, maintaining the aspect ratio
	frame = imutils.resize(frame, width=1000)
	orig = frame.copy()

	# if our frame dimensions are None, we still need to compute the
	# ratio of old frame dimensions to new frame dimensions
	if W is None or H is None:
		(H, W) = frame.shape[:2]
		rW = W / float(newW)
		rH = H / float(newH)
Lines 68 and 69 read the next frame from either our webcam or video file.

If we are processing a video file, Line 72 checks whether we have reached the end of the video, and if so, we break from the loop.

Lines 81-84 grab the spatial dimensions of the input frame and then compute the ratio of the original frame dimensions to the dimensions required by the EAST model.
Now that we have these dimensions, we can construct our input to the EAST text detector:
	# construct a blob from the image and then perform a forward pass
	# of the model to obtain the two output layer sets
	blob = cv2.dnn.blobFromImage(frame, 1.0, (newW, newH),
		(123.68, 116.78, 103.94), swapRB=True, crop=False)
	net.setInput(blob)
	(scores, geometry) = net.forward(EAST_OUTPUT_LAYERS)

	# decode the predictions from OpenCV's EAST text detector and
	# then apply non-maximum suppression (NMS) to the rotated
	# bounding boxes
	(rects, confidences) = decode_predictions(scores, geometry,
		minConf=args["min_conf"])
	idxs = cv2.dnn.NMSBoxesRotated(rects, confidences,
		args["min_conf"], args["nms_thresh"])
Lines 88-91 build a blob from the input frame. We then set this blob as input to our EAST text detection net and perform a forward pass of the network, resulting in our raw text detections.
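If you’re curious what the forward pass actually returns, printing the array shapes is illuminating. For the 320×320 input used here, EAST produces its two output maps at 1/4 of the input resolution:

print(scores.shape)    # (1, 1, 80, 80) -> per-cell text confidence scores
print(geometry.shape)  # (1, 5, 80, 80) -> 4 edge distances + 1 angle per cell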
However, those raw text detections are unusable in their current state, so we call decode_predictions on them, yielding a 2-tuple of the bounding box coordinates of the text detections along with the associated probabilities (Lines 96 and 97).
We then apply non-maximum suppression to suppress weak, overlapping bounding boxes (otherwise, there would be multiple bounding boxes for each detection).
If you need more details on this code block, including how the decode_predictions function is implemented, be sure to review OpenCV Text Detection, where I cover the EAST text detector in far more detail.
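For convenience, here is a minimal sketch of what such a decode_predictions function can look like, adapted from OpenCV’s official EAST text detection sample. Our pyimagesearch.east implementation may differ in its details, so treat this as illustrative rather than the exact code:

import numpy as np

def decode_predictions(scores, geometry, minConf=0.5):
	# grab the dimensions of the score map, then initialize our lists
	# of rotated bounding boxes and corresponding confidences
	(numRows, numCols) = scores.shape[2:4]
	(rects, confidences) = ([], [])

	# loop over each cell of the 1/4-resolution output maps
	for y in range(0, numRows):
		scoresData = scores[0, 0, y]
		(d0, d1, d2, d3) = (geometry[0, 0, y], geometry[0, 1, y],
			geometry[0, 2, y], geometry[0, 3, y])
		anglesData = geometry[0, 4, y]

		for x in range(0, numCols):
			# ignore cells with insufficient text confidence
			if scoresData[x] < minConf:
				continue

			# map the cell back to image coordinates (the output maps
			# are 4x smaller than the input), and grab the angle
			(offsetX, offsetY) = (x * 4.0, y * 4.0)
			angle = anglesData[x]
			(cos, sin) = (np.cos(angle), np.sin(angle))

			# the geometry map stores the distances from the cell to
			# the top, right, bottom, and left edges of the text box
			h = d0[x] + d2[x]
			w = d1[x] + d3[x]

			# compute the center of the rotated rectangle
			offset = (offsetX + (cos * d1[x]) + (sin * d2[x]),
				offsetY - (sin * d1[x]) + (cos * d2[x]))
			p1 = (-sin * h + offset[0], -cos * h + offset[1])
			p3 = (-cos * w + offset[0], sin * w + offset[1])
			center = (0.5 * (p1[0] + p3[0]), 0.5 * (p1[1] + p3[1]))

			# store the box as ((cx, cy), (w, h), angle) -- the format
			# cv2.dnn.NMSBoxesRotated and cv2.boxPoints expect
			rects.append((center, (w, h), -angle * 180.0 / np.pi))
			confidences.append(float(scoresData[x]))

	return (rects, confidences)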
After non-maximum suppression (NMS), we can now loop over each of the bounding boxes:
	# ensure that at least one text bounding box was found
	if len(idxs) > 0:
		# loop over the valid bounding box indexes after applying NMS
		for i in idxs.flatten():
			# compute the four corners of the bounding box, scale the
			# coordinates based on the respective ratios, and then
			# convert the box to an integer NumPy array
			box = cv2.boxPoints(rects[i])
			box[:, 0] *= rW
			box[:, 1] *= rH
			box = np.int0(box)

			# draw a rotated bounding box around the text
			cv2.polylines(orig, [box], True, (0, 255, 0), 2)

	# update the FPS counter
	fps.update()

	# show the output frame
	cv2.imshow("Text Detection", orig)
	key = cv2.waitKey(1) & 0xFF

	# if the 'q' key was pressed, break from the loop
	if key == ord("q"):
		break
Line 102 verifies that at least one text bounding box was found, and if so, we loop over the indexes of the kept bounding boxes after applying NMS.
For each resulting index, we compute the bounding box of the text ROI, scale the bounding box (x, y)-coordinates back to the orig input frame dimensions, and then draw the bounding box on the orig frame (Lines 108-114).
Line 117 updates our FPS throughput estimator while Lines 120-125 display the output text detection on our screen.
The final step here is to stop our FPS timer, approximate the throughput rate, and release any video file pointers:
# stop the timer and display FPS information
fps.stop()
print("[INFO] elapsed time: {:.2f}".format(fps.elapsed()))
print("[INFO] approx. FPS: {:.2f}".format(fps.fps()))

# if we are using a webcam, release the pointer
if not args.get("input", False):
	vs.stop()

# otherwise, release the file pointer
else:
	vs.release()

# close all windows
cv2.destroyAllWindows()
Lines 128-130 stop our FPS timer and approximate the FPS of our text detection pipeline. We then release any video file pointers and close any windows opened by OpenCV.
GPU and OCR Results
Note that this section needs to be executed locally on a machine with a GPU. After running the text_detection_video.py script on an NVIDIA RTX 2070 SUPER GPU (paired with an i9 9900K processor), I obtained ~97 FPS:
$ python text_detection_video.py --east ../models/east/frozen_east_text_detection.pb --use-gpu 1
[INFO] loading EAST text detector...
[INFO] setting preferable backend and target to CUDA...
[INFO] starting video stream...
[INFO] elapsed time: 74.71
[INFO] approx. FPS: 96.80
When I ran the same script without the GPU, I reached ~23 FPS, which is ~77% slower than the result above:
$ python text_detection_video.py --east ../models/east/frozen_east_text_detection.pb
[INFO] loading EAST text detector...
[INFO] using CPU for inference...
[INFO] starting video stream...
[INFO] elapsed time: 68.59
[INFO] approx. FPS: 22.70
As you can see, using your GPU can dramatically improve the throughput speed of your text detection pipeline!
Summary
In this tutorial, you learned how to perform text detection in real-time video streams using your GPU. Since many text detection and OCR models are deep learning-based, using your GPU (vs. your CPU) can tremendously increase your frame processing throughput rate.
Using our CPU, we were able to process only ~22-23 FPS. However, by running the EAST model on our GPU via OpenCV’s dnn module, we reached ~97 FPS!
If you have a GPU available to you, definitely consider utilizing it — you’ll be able to run text detection models in real-time!
Citation Information
Rosebrock, A. “Improving Text Detection Speed with OpenCV and GPUs,” PyImageSearch, D. Chakraborty, P. Chugh, A. R. Gosthipaty, S. Huot, K. Kidriavsteva, R. Raha, and A. Thanki, eds., 2022, https://pyimg.co/9wde6
@incollection{Rosebrock_2022_Improving_Text,
	author = {Adrian Rosebrock},
	title = {Improving Text Detection Speed with {OpenCV} and {GPUs}},
	booktitle = {PyImageSearch},
	editor = {Devjyoti Chakraborty and Puneet Chugh and Aritra Roy Gosthipaty and Susan Huot and Kseniia Kidriavsteva and Ritwik Raha and Abhishek Thanki},
	year = {2022},
	note = {https://pyimg.co/9wde6},
}