One of the most challenging aspects of applying optical character recognition (OCR) isn’t the OCR itself. Instead, it’s the process of pre-processing, denoising, and cleaning up images such that they can be OCR’d.
To learn how to denoise your images for better OCR, just keep reading.
Using Machine Learning to Denoise Images for Better OCR Accuracy
When working with documents generated by a computer, screenshots, or essentially any piece of text that has never been printed and scanned, OCR becomes far easier. The text is clean and crisp. There is sufficient contrast between the background and foreground. And most of the time, the text doesn’t exist on a complex background.
That all changes once a piece of text is printed and scanned. From there, OCR becomes much more challenging.
- The printer could be low on toner or ink, resulting in the text appearing faded and hard to read.
- An old scanner could have been used when scanning the document, resulting in low image resolution and poor text contrast.
- A mobile phone scanner app may have been used under poor lighting conditions, making it incredibly challenging for human eyes to read the text, let alone a computer.
- And all too common are the clear signs that an actual human has handled the paper, including coffee mug stains on the corners, paper crinkling, rips, tears, etc.
For all the amazing things the human mind can do, it seems like we’re all just walking accidents waiting to happen when it comes to printed materials. Give us a piece of paper and enough time, and I guarantee that even the most organized of us will take that document from its pristine condition and eventually introduce stains, rips, folds, and crinkles to it.
Inevitably, these problems will occur — and when they do, we need to utilize our computer vision, image processing, and OCR skills to pre-process and improve the quality of these damaged documents. From there, we’ll be able to obtain higher OCR accuracy.
In the remainder of this tutorial, you’ll learn how even simple machine learning algorithms constructed in a novel way can help you denoise images before applying OCR.
Learning Objectives
In this tutorial, you will:
- Gain experience working with a dataset of noisy, damaged documents
- Discover how machine learning is used to denoise these damaged documents
- Work with Kaggle’s Denoising Dirty Documents dataset
- Extract features from this dataset
- Train a random forest regressor (RFR) on the features we extracted
- Take the model and use it to denoise images in our test set (and then be able to denoise your datasets as well)
Image Denoising with Machine Learning
In the first part of this tutorial, we will review the dataset we will be using to denoise documents. From there, we’ll review our project structure, including the five separate Python files we’ll be utilizing:
- A configuration file to store variables used across multiple Python scripts
- A helper function used to blur and threshold our documents
- A script used to extract features and target values from our dataset
- Another script used to train an RFR
- And a final script used to apply our trained model to images in our test set
This is one of my longer tutorials, and while it’s straightforward and follows a linear progression, there are also many nuanced details here. Therefore, I suggest you review this tutorial twice, once at a high level to understand what we’re doing and then again at a low level to understand the implementation.
With that said, let’s get started!
Our Noisy Document Dataset
We’ll use Kaggle’s Denoising Dirty Documents dataset in this tutorial. The dataset is part of the UCI Machine Learning Repository but converted to a Kaggle competition. We will use three files for this tutorial. Those files are a part of the Kaggle competition data and are named test.zip, train.zip, and train_cleaned.zip.
The dataset is relatively small, with only 144 training samples, making it easy to work with and use as an educational tool. However, don’t let the small dataset size fool you! What we’re going to do with this dataset is far from basic or introductory.
Figure 1 shows a sample of the dirty documents dataset. For the sample document, the top shows the document’s noisy version, including stains, crinkles, folds, etc. The bottom then shows the target, pristine version of the document that we wish to generate.
Our goal is to input the image on the top and train a machine learning model to produce a cleaned output on the bottom. It may seem impossible now, but once you see some of the tricks and techniques we’ll be using, it will be a lot more straightforward than you think.
The Denoising Document Algorithm
Our denoising algorithm hinges on training an RFR to accept a noisy image and automatically predict the output pixel values. This algorithm is inspired by a denoising technique introduced by Colin Priest.
These algorithms work by applying a 5 x 5 window that slides from left-to-right and top-to-bottom, one pixel at a time (Figure 2), across both the noisy image (i.e., the image we want to automatically pre-process and clean up) and the target output image (i.e., the “gold standard” of how the image should appear after cleaning).
At each sliding window stop, we extract:
- The 5 x 5 region of the noisy input image. We then flatten the 5 x 5 region into a 25-d list and treat it like a feature vector.
- The same 5 x 5 region of the cleaned image, but this time we only take the center (x, y)-coordinate, denoted by the location (2, 2).
Given the 25-d (dimensional) feature vector from the noisy input image, this single pixel value is what we want our RFR to predict.
To make this example more concrete, again consider Figure 2, where we have the following 5 x 5 grid of pixel values from the noisy image:
[[247 227 242 253 237]
 [244 228 225 212 219]
 [223 218 252 222 221]
 [242 244 228 240 230]
 [217 233 237 243 252]]
We then flatten that into a single list of 5 x 5 = 25 values:
[247 227 242 253 237 244 228 225 212 219 223 218 252 222 221 242 244 228 240 230 217 233 237 243 252]
This 25-d vector is our feature vector upon which our RFR will be trained. However, we still need to define the target output value of the RFR. Our regression model should accept the input 25-d vector and output the cleaned, denoised pixel.
Now, let’s assume that we have the following 5 x 5 window from our gold standard/target image:
[[0 0 0 0 0]
 [0 0 0 0 1]
 [0 0 1 1 1]
 [0 0 1 1 1]
 [0 0 0 1 1]]
We are only interested in the center of this 5 x 5 region, denoted as the location x = 2, y = 2. So, we extract this value of 1 (foreground, versus 0, which is background) and treat it as the target value that our RFR should predict.
Putting this entire example together, we can think of the following as a sample training data point:
trainX = [[247 227 242 253 237 244 228 225 212 219 223 218 252 222 221 242 244 228 240 230 217 233 237 243 252]]
trainY = [[1]]
Given our trainX variable (our raw pixel intensities), we want to predict the corresponding cleaned/denoised pixel value in trainY.
We will train our RFR in this manner, ultimately leading to a model that can accept a noisy document input and automatically denoise it by examining local 5 x 5 regions and then predicting the center (cleaned) pixel value.
Configuring your development environment
To follow this guide, you need to have the OpenCV library installed on your system.
Luckily, OpenCV is pip-installable:
$ pip install opencv-contrib-python
If you need help configuring your development environment for OpenCV, I highly recommend that you read my pip install OpenCV guide — it will have you up and running in a matter of minutes.
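The scripts in this tutorial also rely on scikit-learn, imutils, and progressbar. If you don’t already have them installed, they are pip-installable as well (the package names below are my assumption of the standard distributions; the progressbar2 package provides the progressbar module):

$ pip install scikit-learn imutils progressbar2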
Project Structure
This tutorial’s project directory structure is a bit more complex than other tutorials as there are five Python files to review (three scripts, a helper function, and a configuration file).
Start by accessing the “Downloads” section of this tutorial to retrieve the source code and example images.
Before we get any further, let’s familiarize ourselves with the files:
|-- pyimagesearch
|   |-- __init__.py
|   |-- denoising
|   |   |-- __init__.py
|   |   |-- helpers.py
|-- config
|   |-- __init__.py
|   |-- denoise_config.py
|-- build_features.py
|-- denoise_document.py
|-- denoiser.pickle
|-- denoising-dirty-documents
|   |-- test
|   |   |-- 1.png
|   |   |-- 10.png
|   |   |-- ...
|   |   |-- 94.png
|   |   |-- 97.png
|   |-- train
|   |   |-- 101.png
|   |   |-- 102.png
|   |   |-- ...
|   |   |-- 98.png
|   |   |-- 99.png
|   |-- train_cleaned
|   |   |-- 101.png
|   |   |-- 102.png
|   |   |-- ...
|   |   |-- 98.png
|   |   |-- 99.png
|-- train_denoiser.py
The denoising-dirty-documents directory contains all images from the Kaggle Denoising Dirty Documents dataset.
Inside the denoising submodule of pyimagesearch, there is a helpers.py file. This file contains a single function, blur_and_threshold, which, as the name suggests, is used to apply a combination of smoothing and thresholding as a pre-processing step for our documents.
We then have the denoise_config.py file, which stores a few configurations specifying training data file paths, the output features CSV file, and the final serialized RFR model.
There are three Python scripts that we’ll review in their entirety:

- build_features.py: Accepts our input dataset and creates a CSV file that we’ll use to train our RFR.
- train_denoiser.py: Trains the actual RFR model and serializes it to disk as denoiser.pickle.
- denoise_document.py: Accepts an input image from disk, loads the trained RFR, and then denoises the input image.
Implementing Our Configuration File
The first step in our denoising documents implementation is to create our configuration file. Open the denoise_config.py file in the config subdirectory of the project directory structure and insert the following code:
# import the necessary packages
import os

# initialize the base path to the input documents dataset
BASE_PATH = "denoising-dirty-documents"

# define the path to the training directories
TRAIN_PATH = os.path.sep.join([BASE_PATH, "train"])
CLEANED_PATH = os.path.sep.join([BASE_PATH, "train_cleaned"])
Line 5 defines the base path to our denoising-dirty-documents dataset. If you download this dataset from Kaggle, be sure to unzip all .zip files within this directory to have all images in the dataset uncompressed and residing on disk.
We then define the paths to both the original noisy image directory and the corresponding cleaned image directory, respectively (Lines 8 and 9).
The TRAIN_PATH images contain the noisy documents, while the CLEANED_PATH images contain our “gold standard” of what, ideally, our output images should look like after applying document denoising via our trained model. We’ll construct our testing set inside our train_denoiser.py script.
Let’s continue defining our configuration file:
# define the path to our output features CSV file then initialize
# the sampling probability for a given row
FEATURES_PATH = "features.csv"
SAMPLE_PROB = 0.02

# define the path to our document denoiser model
MODEL_PATH = "denoiser.pickle"
Line 13 defines the path to our output features.csv file. Our features here will consist of:

- A local 5 x 5 region sampled via sliding window from the noisy input image
- The center of the 5 x 5 region, denoted as (x, y)-coordinate (2, 2), for the corresponding cleaned image
However, if we wrote every feature/target combination to disk, we would end up with millions of rows and a CSV many gigabytes in size. So, instead of exhaustively computing all sliding window and target combinations, we’ll only write them to disk with SAMPLE_PROB probability.
Finally, Line 17 specifies the path to MODEL_PATH, our output serialized model.
Creating Our Blur and Threshold Helper Function
To help our RFR predict background (i.e., noisy) from foreground (i.e., text) pixels, we need to define a helper function that will pre-process our images before we train the model and make predictions with it.
The flow of our image processing operations can be seen in Figure 4. First, we take our input image, blur it (top-left), and then subtract the blurred image from the input image (top-right). We do this step to approximate the foreground of the image since, by nature, blurring smooths out detailed, focused features and reveals more of the “structural” components of the image.
Next, we threshold the approximate foreground region by setting any pixel values greater than zero to zero (Figure 4, bottom-left).
The final step is to perform min-max scaling (bottom-right), which brings the pixel intensities back to the range [0, 1] (or [0, 255], depending on your data type). This final image will serve as noisy input when we perform our sliding window sampling.
Now that we understand the general pre-processing steps, let’s implement them in Python code.
Open the helpers.py file in the denoising submodule of pyimagesearch, and let’s get to work defining our blur_and_threshold function:
# import the necessary packages
import numpy as np
import cv2

def blur_and_threshold(image, eps=1e-7):
    # apply a median blur to the image and then subtract the blurred
    # image from the original image to approximate the foreground
    blur = cv2.medianBlur(image, 5)
    foreground = image.astype("float") - blur

    # threshold the foreground image by setting any pixels with a
    # value greater than zero to zero
    foreground[foreground > 0] = 0
The blur_and_threshold function accepts two parameters:

- image: The input image that we’ll pre-process
- eps: An epsilon value used to prevent division by zero
We then apply a median blur to the image to reduce noise and subtract the blur from the original image, resulting in a foreground approximation (Lines 8 and 9).
From there, we threshold the foreground image by setting any pixel intensities greater than zero to zero (Line 13).
The final step here is to perform min-max scaling:
    # apply min/max scaling to bring the pixel intensities to the
    # range [0, 1]
    minVal = np.min(foreground)
    maxVal = np.max(foreground)
    foreground = (foreground - minVal) / (maxVal - minVal + eps)

    # return the foreground-approximated image
    return foreground
Here, we find the minimum and maximum values in the foreground image. We use these values to scale the pixel intensities in the foreground image to the range [0, 1].
This foreground-approximated image is then returned to the calling function.
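As a quick sanity check, here is a minimal usage sketch (my own example, not part of the project scripts) that runs blur_and_threshold on one of the noisy training images and verifies the output range:

# import the necessary packages
from pyimagesearch.denoising import blur_and_threshold
import cv2

# load a noisy document from the dataset and convert it to grayscale
image = cv2.imread("denoising-dirty-documents/train/101.png")
image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# pre-process it -- the result is a float image min-max scaled to
# the range [0, 1]
foreground = blur_and_threshold(image)
print(foreground.min(), foreground.max())  # ~0.0 ~1.0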
Implementing the Feature Extraction Script
With our blur_and_threshold function defined, we can move on to our build_features.py script.

As the name suggests, this script is responsible for creating our flattened 5 x 5 = 25-d feature vectors from the noisy image and then extracting the target (i.e., cleaned) pixel value from the corresponding gold standard image.
We’ll save these features to disk in CSV format and then train a Random Forest Regression model on them in the section on “Implementing Our Denoising Training Script.”
Let’s get started with our implementation now:
# import the necessary packages
from config import denoise_config as config
from pyimagesearch.denoising import blur_and_threshold
from imutils import paths
import progressbar
import random
import cv2
Line 2 imports our config to access our dataset file paths and output CSV file path. Notice that we’re using the blur_and_threshold function here.
The following code block grabs the paths to all images in our TRAIN_PATH (noisy images) and CLEANED_PATH (cleaned images that our RFR will learn to predict):
# grab the paths to our training images
trainPaths = sorted(list(paths.list_images(config.TRAIN_PATH)))
cleanedPaths = sorted(list(paths.list_images(config.CLEANED_PATH)))

# initialize the progress bar
widgets = ["Creating Features: ", progressbar.Percentage(), " ",
    progressbar.Bar(), " ", progressbar.ETA()]
pbar = progressbar.ProgressBar(maxval=len(trainPaths),
    widgets=widgets).start()
Note that trainPaths contains all our noisy images, while cleanedPaths contains the corresponding cleaned images.
Figure 5 shows an example. On the top is our input training image. On the bottom, we have the corresponding cleaned version of the image. We’ll take 5 x 5 regions from both the trainPaths and the cleanedPaths — the goal is to use the noisy 5 x 5 regions to predict the cleaned versions.
Let’s start looping over these image combinations now:
# zip our training paths together, then open the output CSV file for
# writing
imagePaths = zip(trainPaths, cleanedPaths)
csv = open(config.FEATURES_PATH, "w")

# loop over the training images together
for (i, (trainPath, cleanedPath)) in enumerate(imagePaths):
    # load the noisy and corresponding gold-standard cleaned images
    # and convert them to grayscale
    trainImage = cv2.imread(trainPath)
    cleanImage = cv2.imread(cleanedPath)
    trainImage = cv2.cvtColor(trainImage, cv2.COLOR_BGR2GRAY)
    cleanImage = cv2.cvtColor(cleanImage, cv2.COLOR_BGR2GRAY)
On Line 21, we use Python’s zip function to combine the trainPaths and cleanedPaths. We then open our output csv file for writing on Line 22.
Line 25 starts a loop over our combinations of imagePaths. For each trainPath, we also have the corresponding cleanedPath.
We load our trainImage and cleanImage from disk and convert them to grayscale (Lines 28-31).
Next, we need to pad both trainImage and cleanImage with a 2-pixel border in every direction:
    # apply 2x2 padding to both images, replicating the pixels along
    # the border/boundary
    trainImage = cv2.copyMakeBorder(trainImage, 2, 2, 2, 2,
        cv2.BORDER_REPLICATE)
    cleanImage = cv2.copyMakeBorder(cleanImage, 2, 2, 2, 2,
        cv2.BORDER_REPLICATE)

    # blur and threshold the noisy image
    trainImage = blur_and_threshold(trainImage)

    # scale the pixel intensities in the cleaned image from the range
    # [0, 255] to [0, 1] (the noisy image is already in the range
    # [0, 1])
    cleanImage = cleanImage.astype("float") / 255.0
Why do we need to bother with the padding? We’re sliding a window from left-to-right and top-to-bottom of the input image and using the pixels inside the window to predict the output center pixel located at x = 2, y = 2, not unlike a convolution operation (only with convolution, our filters are fixed and defined).
Like convolution, you need to pad your input images such that the output image is not smaller in size. Please refer to my guide on Convolutions with OpenCV and Python if you are unfamiliar with the concept.
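A quick shape check (a standalone sketch, not part of build_features.py) confirms the reasoning: padding by two pixels on every side yields exactly one valid 5 x 5 window per original pixel, so the denoised output will have the same dimensions as the input:

# import the necessary packages
import numpy as np
import cv2

# a hypothetical 100x200 grayscale image
image = np.zeros((100, 200), dtype="uint8")

# replicate-pad the image by 2 pixels in every direction
padded = cv2.copyMakeBorder(image, 2, 2, 2, 2, cv2.BORDER_REPLICATE)
print(padded.shape)  # (104, 204)

# number of valid 5x5 windows = (104 - 4) x (204 - 4) = 100 x 200,
# i.e., one predicted (cleaned) pixel per original pixel
print((padded.shape[0] - 4, padded.shape[1] - 4))  # (100, 200)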
After padding is complete, we blur and threshold the trainImage and manually scale the cleanImage to the range [0, 1]. The trainImage is already scaled to the range [0, 1] due to the min-max scaling inside blur_and_threshold.
With our images pre-processed, we can now slide a 5 x 5 window across them:
    # slide a 5x5 window across the images
    for y in range(0, trainImage.shape[0]):
        for x in range(0, trainImage.shape[1]):
            # extract the window ROIs for both the train image and
            # clean image, then grab the spatial dimensions of the
            # ROI
            trainROI = trainImage[y:y + 5, x:x + 5]
            cleanROI = cleanImage[y:y + 5, x:x + 5]
            (rH, rW) = trainROI.shape[:2]

            # if the ROI is not 5x5, throw it out
            if rW != 5 or rH != 5:
                continue
Lines 49 and 50 slide a 5 x 5 window from left-to-right and top-to-bottom across the trainImage and cleanImage. At each sliding window stop, we extract the 5 x 5 ROI of the training image and clean image (Lines 54 and 55).
We grab the width and height of the trainROI on Line 56, and if either the width or height is not five pixels (due to us being on the borders of the image), we throw out the ROI (because we are only concerned with 5 x 5 regions).
Next, we construct our feature vectors and save the row to our CSV file:
            # our features will be the flattened 5x5=25 raw pixels
            # from the noisy ROI while the target prediction will
            # be the center pixel in the 5x5 window
            features = trainROI.flatten()
            target = cleanROI[2, 2]

            # if we wrote *every* feature/target combination to disk
            # we would end up with millions of rows -- let's only
            # write rows to disk with probability N, thereby reducing
            # the total number of rows in the file
            if random.random() <= config.SAMPLE_PROB:
                # write the target and features to our CSV file
                features = [str(x) for x in features]
                row = [str(target)] + features
                row = ",".join(row)
                csv.write("{}\n".format(row))

    # update the progress bar
    pbar.update(i)

# close the CSV file
pbar.finish()
csv.close()
Line 65 takes the 5 x 5 pixel region from the trainROI and flattens it into a 5 x 5 = 25-d list — this list serves as our feature vector.

Line 66 then extracts the cleaned/gold-standard pixel value from the center of the cleanROI. This pixel value serves as what we want our RFR to predict.
At this point, we could write our combination of a feature vector and target value to disk; however, if we were to write every feature/target combination to the CSV file, we would end up with a file many gigabytes in size.
To avoid a massive CSV file that we would then have to process in the next step, we instead only allow SAMPLE_PROB (in this case, 2%) of the rows to be written to disk (Line 72). Doing this sampling reduces the resulting CSV file size and makes it easier to manage.
Line 74 constructs our row of features and prepends the target pixel value. We then write the row to our CSV file. We repeat this process for all imagePaths.
Running the Feature Extraction Script
We are now ready to run our feature extractor. First, open a terminal and then execute the build_features.py script:
$ python build_features.py
Creating Features: 100% |#########################| Time: 0:01:05
The entire feature extraction process took just over one minute on my 3 GHz Intel Xeon W processor.
Inspecting my project directory structure, you can now see the resulting CSV file of features:
$ ls -l *.csv
adrianrosebrock  staff  273968497 Oct 23 06:21 features.csv
If you were to open the features.csv file on your system, you would see that each row contains 26 entries. The first entry in the row is the target output pixel. We will try to predict that output pixel value based on the contents of the remainder of the row, which are the 5 x 5 = 25 input ROI pixels.
The next section covers how to train an RFR model to do exactly that.
Implementing Our Denoising Training Script
Now that our features.csv file has been generated, we can move on to the training script. This script is responsible for loading our features.csv file and training an RFR to accept a 5 x 5 region of a noisy image and then predict the cleaned center pixel value.
Let’s get started reviewing the code:
# import the necessary packages
from config import denoise_config as config
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import numpy as np
import pickle
Lines 2-7 handle our required Python packages, including:
- config: Our project configuration holding our output file paths and training variables
- RandomForestRegressor: The scikit-learn implementation of the regression model we’ll use to predict pixel values
- mean_squared_error: Our error/loss function — the lower this value, the better job we are doing at denoising our images
- train_test_split: Used to create a training/testing split from our features.csv file
- pickle: Used to serialize our trained RFR to disk
Let’s move on to loading our CSV file from disk:
# initialize lists to hold our features and target predicted values
print("[INFO] loading dataset...")
features = []
targets = []

# loop over the rows in our features CSV file
for row in open(config.FEATURES_PATH):
    # parse the row and extract (1) the target pixel value to predict
    # along with (2) the 5x5=25 pixels which will serve as our feature
    # vector
    row = row.strip().split(",")
    row = [float(x) for x in row]
    target = row[0]
    pixels = row[1:]

    # update our features and targets lists, respectively
    features.append(pixels)
    targets.append(target)
Lines 11 and 12 initialize our features (the 5 x 5 pixel regions) and targets (the target output pixel values we want to predict).
We start looping over all lines of our CSV file on Line 15. For each row, we extract both the target and pixel values (Lines 19-22). We then update both our features and targets lists, respectively.
With the CSV file loaded into memory, we can construct our training and testing split:
# convert the features and targets to NumPy arrays
features = np.array(features, dtype="float")
target = np.array(targets, dtype="float")

# construct our training and testing split, using 75% of the data for
# training and the remaining 25% for testing
(trainX, testX, trainY, testY) = train_test_split(features,
    target, test_size=0.25, random_state=42)
Here, we use 75% of our data for training and mark the remaining 25% for testing. This type of split is fairly standard in the machine learning field.
Finally, we can train our RFR:
# train a random forest regressor on our data
print("[INFO] training model...")
model = RandomForestRegressor(n_estimators=10)
model.fit(trainX, trainY)

# compute the root mean squared error on the testing set
print("[INFO] evaluating model...")
preds = model.predict(testX)
rmse = np.sqrt(mean_squared_error(testY, preds))
print("[INFO] rmse: {}".format(rmse))

# serialize our random forest regressor to disk
f = open(config.MODEL_PATH, "wb")
f.write(pickle.dumps(model))
f.close()
Line 39 initializes our RandomForestRegressor, instructing it to train 10 separate regression trees. The model is then trained on Line 40.
After training is complete, we compute the root-mean-square error (RMSE) to measure how good a job we’ve done at predicting cleaned, denoised images. The lower the error value, the better the job we’ve done.
Finally, we serialize our trained RFR model to disk such that we can use it to make predictions on our noisy images.
Training Our Document Denoising Model
With our train_denoiser.py script implemented, we are now ready to train our automatic image denoiser! First, open a shell and then execute the train_denoiser.py script:
$ time python train_denoiser.py
[INFO] loading dataset...
[INFO] training model...
[INFO] evaluating model...
[INFO] rmse: 0.04990744293857625

real	1m18.708s
user	1m19.361s
sys	0m0.894s
Training our script takes just over one minute, resulting in an RMSE of ≈0.05. This is a very low loss value, indicating that our model successfully accepts noisy input pixel ROIs and correctly predicts the target output value.
Inspecting our project directory structure, you’ll see that the RFR model has been serialized to disk as denoiser.pickle:
$ ls -l *.pickle
adrianrosebrock  staff  77733392 Oct 23 denoiser.pickle
We’ll load our trained denoiser.pickle model from disk in the next section and then use it to automatically clean and pre-process our input documents.
Creating the Document Denoiser Script
This project’s final step is to use our trained denoiser model to automatically clean our input images.
Open denoise_document.py now, and we’ll see how this process is done:
# import the necessary packages
from config import denoise_config as config
from pyimagesearch.denoising import blur_and_threshold
from imutils import paths
import argparse
import pickle
import random
import cv2
Lines 2-8 handle importing our required Python packages. We then move on to parsing our command line arguments:
# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-t", "--testing", required=True,
    help="path to directory of testing images")
ap.add_argument("-s", "--sample", type=int, default=10,
    help="sample size for testing images")
args = vars(ap.parse_args())
Our denoise_document.py script accepts two command line arguments:

- --testing: The path to the directory containing the testing images for Kaggle’s Denoising Dirty Documents dataset
- --sample: The number of testing images we’ll sample when applying our denoising model
Speaking of our denoising model, let’s load the serialized model from disk:
# load our document denoiser from disk
model = pickle.loads(open(config.MODEL_PATH, "rb").read())

# grab the paths to all images in the testing directory and then
# randomly sample them
imagePaths = list(paths.list_images(args["testing"]))
random.shuffle(imagePaths)
imagePaths = imagePaths[:args["sample"]]
We also grab the paths to all images in the testing set, randomly shuffle them, and then select a total of --sample images on which we’ll apply our automatic denoiser model.
Let’s loop over our sample of imagePaths:
# loop over the sampled image paths
for imagePath in imagePaths:
    # load the image, convert it to grayscale, and clone it
    print("[INFO] processing {}".format(imagePath))
    image = cv2.imread(imagePath)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    orig = image.copy()

    # pad the image followed by blurring/thresholding it
    image = cv2.copyMakeBorder(image, 2, 2, 2, 2,
        cv2.BORDER_REPLICATE)
    image = blur_and_threshold(image)
Here, we are performing the same pre-processing steps that we utilized during the training phase:

- We load the input image from disk
- Convert it to grayscale
- Pad the image with two pixels in every direction
- Apply the blur_and_threshold function
Now we need to loop over the processed image and extract every 5 x 5 pixel neighborhood:
    # initialize a list to store our ROI features (i.e., 5x5 pixel
    # neighborhoods)
    roiFeatures = []

    # slide a 5x5 window across the image
    for y in range(0, image.shape[0]):
        for x in range(0, image.shape[1]):
            # extract the window ROI and grab the spatial dimensions
            roi = image[y:y + 5, x:x + 5]
            (rH, rW) = roi.shape[:2]

            # if the ROI is not 5x5, throw it out
            if rW != 5 or rH != 5:
                continue

            # our features will be the flattened 5x5=25 pixels from
            # the training ROI
            features = roi.flatten()
            roiFeatures.append(features)
Line 42 initializes a list, roiFeatures, to store every 5 x 5 neighborhood.
We then slide a 5 x 5 window from left-to-right and top-to-bottom across the image. At every step of the window, we extract the roi (Line 48), grab its spatial dimensions (Line 49), and throw it out if the ROI size is not 5 x 5 (Lines 52 and 53).
We then take our 5 x 5 pixel neighborhood, flatten it into a list of features, and update our roiFeatures list (Lines 57 and 58).
Outside of our sliding window for loops now, we have our roiFeatures populated with every possible 5 x 5 pixel neighborhood.
We can then make predictions on these roiFeatures, resulting in the final cleaned image:
    # use the ROI features to predict the pixels of our new denoised
    # image
    pixels = model.predict(roiFeatures)

    # the pixels list is currently a 1D array so we need to reshape
    # it to a 2D array (based on the original input image dimensions)
    # and then scale the pixels from the range [0, 1] to [0, 255]
    pixels = pixels.reshape(orig.shape)
    output = (pixels * 255).astype("uint8")

    # show the original and output images
    cv2.imshow("Original", orig)
    cv2.imshow("Output", output)
    cv2.waitKey(0)
Line 62 calls the .predict method of our RFR, resulting in pixels, our foreground versus background predictions.
However, our pixels list is currently a 1D array, so we must take care to reshape the array into a 2D image and then scale the pixel intensities back to the range [0, 255] (Lines 67 and 68).
Finally, we can show on our screen both the original (noisy image) and the output (cleaned image).
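If you’re working on a headless machine where cv2.imshow isn’t available, one simple alternative (my own suggestion, not part of the original script) is to write the results to disk instead, inside the same loop:

    # alternative to cv2.imshow for headless environments: stack the
    # original and denoised images side by side and write them to disk
    # (requires adding "import numpy as np" and "import os" to the
    # script's imports)
    montage = np.hstack([orig, output])
    cv2.imwrite("denoised_{}".format(os.path.basename(imagePath)),
        montage)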
Running Our Document Denoiser
You made it! This has been a long tutorial, but we’re finally ready to apply our document denoiser to our test data.
To see our denoise_document.py script in action, open a terminal and execute the following command:
$ python denoise_document.py --testing denoising-dirty-documents/test
[INFO] processing denoising-dirty-documents/test/133.png
[INFO] processing denoising-dirty-documents/test/160.png
[INFO] processing denoising-dirty-documents/test/40.png
[INFO] processing denoising-dirty-documents/test/28.png
[INFO] processing denoising-dirty-documents/test/157.png
[INFO] processing denoising-dirty-documents/test/190.png
[INFO] processing denoising-dirty-documents/test/100.png
[INFO] processing denoising-dirty-documents/test/49.png
[INFO] processing denoising-dirty-documents/test/58.png
[INFO] processing denoising-dirty-documents/test/10.png
Our results can be seen in Figure 6. The left image for each sample shows the noisy input document, including stains, crinkles, folds, etc. The right then shows the output cleaned image as generated by our RFR.
As you can see, our RFR is doing a great job cleaning these images for us automatically!
Summary
In this tutorial, you learned how to denoise dirty documents using computer vision and machine learning.
Using this method, we could accept images of documents that had been “damaged,” including rips, tears, stains, crinkles, folds, etc. Then, by applying machine learning in a novel way, we could clean up these images to near pristine conditions, making it easier for OCR engines to detect the text, extract it, and OCR it correctly.
When you find yourself applying OCR to real-world images, especially scanned documents, you’ll inevitably run into documents that are of poor quality. Unfortunately, when that happens, your OCR accuracy will likely suffer.
Instead of throwing in the towel, consider how the techniques used in this tutorial may help. Is it possible to manually pre-process a subset of these images and then use them as training data? From there, you can train a model that can accept a noisy pixel ROI and then produce a pristine, cleaned output.
Typically, we don’t use raw pixels as inputs to machine learning models (the exception being a convolutional neural network, of course). Usually, we’ll quantify an input image using some feature detector or descriptor extractor. From there, the resulting feature vector is handed off to a machine learning model.
Rarely does one see standard machine learning models operating on raw pixel intensities. It’s a neat trick that doesn’t feel like it should work in practice. However, as you saw here, this method works!
I hope you can use this tutorial as a starting point when implementing your document denoising pipelines.
To go deeper, you could use denoising autoencoders to improve denoising quality. In this tutorial, we used a random forest regressor, an ensemble of different decision trees. Another ensemble you may want to explore is extreme gradient boosting, or XGBoost for short.
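For example, assuming you have the xgboost package installed (pip install xgboost), swapping the regressor inside train_denoiser.py is nearly a one-line change. A sketch (the hyperparameters here are illustrative, not tuned):

# import the necessary packages
from xgboost import XGBRegressor

# drop-in replacement for RandomForestRegressor in train_denoiser.py;
# XGBRegressor follows the same scikit-learn fit/predict API
model = XGBRegressor(n_estimators=100, max_depth=6)
model.fit(trainX, trainY)
preds = model.predict(testX)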
Citation Information
A. Rosebrock, “Using Machine Learning to Denoise Images for Better OCR Accuracy,” PyImageSearch, 2021, https://pyimagesearch.com/2021/10/20/using-machine-learning-to-denoise-images-for-better-ocr-accuracy/
@article{Rosebrock_2021_Denoise,
author = {Adrian Rosebrock},
title = {Using Machine Learning to Denoise Images for Better {OCR} Accuracy},
journal = {PyImageSearch},
year = {2021},
note = {https://pyimagesearch.com/2021/10/20/using-machine-learning-to-denoise-images-for-better-ocr-accuracy/},
}