In this tutorial, you will learn how to build a scalable image hashing search engine using OpenCV, Python, and VP-Trees.
Image hashing algorithms are used to:
- Uniquely quantify the contents of an image using only a single integer.
- Find duplicate or near-duplicate images in a dataset of images based on their computed hashes.
Back in 2017, I wrote a tutorial on image hashing with OpenCV and Python (which is required reading for this tutorial). That guide showed you how to find identical/duplicate images in a given dataset.
However, there was a scalability problem with that original tutorial — namely that it did not scale!
To find near-duplicate images, our original image hashing method would require us to perform a linear search, comparing the query hash to each individual image hash in our dataset.
In a practical, real-world application that’s far too slow — we need to find a way to reduce that search to sub-linear time complexity.
But how can we reduce search time so dramatically?
The answer is a specialized data structure called a VP-Tree.
Using a VP-Tree we can reduce our search complexity from O(n) to O(log n), enabling us to obtain our sub-linear goal!
In the remainder of this tutorial you will learn how to:
- Build an image hashing search engine to find both identical and near-identical images in a dataset.
- Utilize a specialized data structure, called a VP-Tree, that can be used to scale image hashing search engines to millions of images.
To learn how to build your first image hashing search engine with OpenCV, just keep reading!
Building an Image Hashing Search Engine with VP-Trees and OpenCV
In the first part of this tutorial, I’ll review what exactly an image search engine is for newcomers to PyImageSearch.
Then, we’ll discuss the concept of image hashing and perceptual hashing, including how they can be used to build an image search engine.
We’ll also take a look at problems associated with image hashing search engines, including algorithmic complexity.
Note: If you haven’t read my tutorial on Image Hashing with OpenCV and Python, make sure you do so now. That guide is required reading before you continue here.
From there, we’ll briefly review Vantage-point Trees (VP-Trees) which can be used to dramatically improve the efficiency and performance of image hashing search engines.
Armed with our knowledge we’ll implement our own custom image hashing search engine using VP-Trees and then examine the results.
What is an image search engine?
In this section, we’ll review the concept of an image search engine and direct you to some additional resources.
PyImageSearch has roots in image search engines — that was my main interest when I started the blog back in 2014. This tutorial is a fun one for me to share as I have a soft spot for image search engines as a computer vision topic.
Image search engines are a lot like textual search engines, only instead of using text as a query, we instead use an image.
When you use a text search engine such as Google, Bing, or DuckDuckGo, you enter your search query — a word or phrase. Indexed websites of interest are returned to you as results, and ideally, you’ll find what you are looking for.
Similarly, for an image search engine, you present a query image (not a textual word/phrase). The image search engine then returns similar image results based solely on the contents of the image.
Of course, there is a lot that goes on under the hood in any type of search engine — just keep this key concept of query/results in mind going forward as we build an image search engine today.
To learn more about image search engines, I suggest you refer to the following resources:
- The complete guide to building an image search engine with Python and OpenCV
- A great guide for those who want to get started with enough knowledge to be dangerous.
- Image Search Engines Blog Category
- This category link returns all of my image search engine content on the PyImageSearch blog.
- PyImageSearch Gurus
- My flagship computer vision course has 13 modules, one of which is dedicated to Content-Based Image Retrieval (a fancy name for image search engines).
Read those guides to obtain a basic understanding of what an image search engine is, then come back to this post to learn about image hash search engines.
What is image hashing/perceptual hashing?
Image hashing, also called perceptual hashing, is the process of:
- Examining the contents of an image.
- Constructing a hash value (i.e., an integer) that uniquely quantifies an input image based on the contents of the image alone.
One of the benefits of using image hashing is that the resulting storage used to quantify the image is super small.
For example, let’s suppose we have an 800x600px image with 3 channels. If we were to store that entire image in memory using an 8-bit unsigned integer data type, the image would require 1.44MB of RAM.
Of course, we would rarely, if ever, store the raw image pixels when quantifying an image.
Instead, we would use algorithms such as keypoint detectors and local invariant descriptors (e.g., SIFT, SURF, etc.).
Applying these methods can typically lead to 100s to 1000s of features per image.
If we assume a modest 500 keypoints detected, each resulting in a feature vector of 128-d with a 32-bit floating point data type, we would require a total of 0.256MB to store the quantification of each individual image in our dataset.
Image hashing, on the other hand, allows us to quantify an image using a single 64-bit integer, requiring only 8 bytes of memory!
Furthermore, image hashes should also be comparable.
Let’s suppose we compute image hashes for three input images, two of which are near-identical:
To compare our image hashes we will use the Hamming distance. The Hamming distance, in this context, is used to compare the number of different bits between two integers.
In practice, this means that we count the number of 1s when taking the XOR between two integers.
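As a quick worked example (using two made-up 8-bit values rather than real image hashes), here is how that XOR-and-count computation plays out in Python:

# two made-up 8-bit "hashes", purely for illustration
a = 0b11010100
b = 0b11011001

# XOR the integers: bits that differ become 1
xor = a ^ b   # 0b00001101

# count the 1s to get the Hamming distance
print(bin(xor).count("1"))   # 3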
Therefore, going back to our three input images above, the Hamming distance between our two similar images should be smaller (indicating more similarity) than the Hamming distance between the third less similar image:
Again, note how the Hamming distance between the two similar images is smaller than the distance to the third, less similar image:
- The smaller the Hamming distance is between two hashes, the more similar the images are.
- And conversely, the larger the Hamming distance is between two hashes, the less similar the images are.
Also note how the distances between identical images (i.e., along the diagonal of Figure 5) are all zero — the Hamming distance between two hashes will be zero if the two input images are identical, otherwise the distance will be > 0, with larger values indicating less similarity.
There are a number of image hashing algorithms, but one of the most popular ones is called the difference hash, which includes four steps:
- Step #1: Convert the input image to grayscale.
- Step #2: Resize the image to fixed dimensions, N + 1 x N, ignoring aspect ratio. Typically we set N=8 or N=16. We use N + 1 for the number of columns so that we can compute the difference (hence “difference hash”) between adjacent column pixels in each row.
- Step #3: Compute the difference. If we set N=8 then we have 9 pixels per row and 8 pixels per column. We can then compute the difference between adjacent column pixels, yielding 8 differences. 8 rows of 8 differences (i.e., 8×8) results in 64 values.
- Step #4: Finally, we can build the hash. In practice all we actually need to perform is a “greater than” operation comparing the columns, yielding binary values. These 64 binary values are compacted into an integer, forming our final hash.
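To make Steps #3 and #4 concrete, here is a tiny toy example (a made-up 2x3 "image", i.e., N=2, rather than the 8x9 case used in practice):

import numpy as np

# a made-up 2x3 grayscale "image" (N=2 rows, N+1=3 columns)
resized = np.array([
    [34, 67, 21],
    [90, 12, 56]], dtype="uint8")

# Step #3: compare adjacent column pixels (True where the right pixel
# is brighter than the left one), yielding an N x N boolean array
diff = resized[:, 1:] > resized[:, :-1]

# Step #4: pack the booleans into a single integer (one bit per value)
h = sum([2 ** i for (i, v) in enumerate(diff.flatten()) if v])
print(h)   # 9 (binary 1001)

The same idea scales directly to the 8x9 case, producing the 64-bit hashes used throughout the rest of this tutorial.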
Typically, image hashing algorithms are used to find near-duplicate images in a large dataset.
I’ve covered image hashing in detail inside this tutorial so if the concept is new to you, I would suggest reading that guide before continuing here.
What is an image hashing search engine?
An image hashing search engine consists of two components:
- Indexing: Taking an input dataset of images, computing the hashes, and storing them in a data structure to facilitate fast, efficient search.
- Searching/Querying: Accepting an input query image from the user, computing the hash, and finding all near-identical images in our indexed dataset.
A great example of an image hashing search engine is TinEye, which is actually a reverse image search engine.
A reverse image search engine:
- Accepts an input image.
- Finds all near-duplicates of that image on the web, telling you the website/URL of where the near duplicate can be found.
Using this tutorial you will learn how to build your own TinEye!
What makes scaling image hashing search engines problematic?
One of the biggest issues with building an image hashing search engine is scalability — the more images you have, the longer it can take to perform the search.
For example, let’s suppose we have the following scenario:
- We have a dataset of 1,000,000 images.
- We have already computed image hashes for each of these 1,000,000 images.
- A user comes along, presents us with an image, and then asks us to find all near-identical images in that dataset.
How might you go about performing that search?
Would you loop over all 1,000,000 image hashes, one by one, and compare them to the hash of the query image?
Unfortunately, that’s not going to work. Even if you assume that each Hamming distance comparison takes 0.00001 seconds, with a total of 1,000,000 images, it would take you 10 seconds to complete the search — far too slow for any type of search engine.
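To make the problem concrete, here is a sketch of what that brute-force approach would look like in Python (the hashes argument is assumed to be a dictionary whose keys are the image hashes, similar to the one we build later in this tutorial; the function name itself is just for illustration):

# a naive O(n) search: compare the query hash against every hash in the
# dataset, keeping any hash within maxDistance bits of the query
def linear_search(queryHash, hashes, maxDistance=10):
    results = []

    # loop over *every* hash in the dataset -- this is the bottleneck
    for h in hashes.keys():
        # compute the Hamming distance via XOR + bit count
        d = bin(int(queryHash) ^ int(h)).count("1")

        # keep the hash if it is close enough to the query
        if d <= maxDistance:
            results.append((d, h))

    # sort so the most similar hashes come first
    return sorted(results)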
Instead, to build an image hashing search engine that scales, you need to utilize specialized data structures.
What are VP-Trees and how can they help scale image hashing search engines?
In order to scale our image hashing search engine, we need to use a specialized data structure that:
- Reduces our search from linear complexity, O(n)…
- …down to sub-linear complexity, ideally O(log n).
To accomplish that task we can use Vantage-point Trees (VP-Trees). A VP-Tree is a metric tree that operates in a metric space by selecting a given position in space (i.e., the “vantage point”) and then partitioning the data points into two sets:
- Points that are near the vantage point
- Points that are far from the vantage point
We then recursively apply this process, partitioning the points into smaller and smaller sets, thus creating a tree where neighbors in the tree have smaller distances.
To visualize the process of constructing a VP-Tree, consider the following figure:
First, we select a point in space (denoted as the v in the center of the circle) — we call this point the vantage point. The vantage point is the point furthest from the parent vantage point in the tree.
We then compute the median, μ, of the distances between the vantage point and all other points, X.
Once we have μ, we then divide X into two sets, S1 and S2:
- All points whose distance from the vantage point is <= μ belong to S1.
- All points whose distance from the vantage point is > μ belong to S2.
We then recursively apply this process, building a tree as we go, until we are left with a child node.
A child node contains only a single data point (in this case, one individual hash). Child nodes that are closer together in the tree thus have:
- Smaller distances between them.
- And are therefore assumed to be more similar to each other than the rest of the data points in the tree.
After recursively applying the VP-Tree construction method, we end up with a data structure that, as the name suggests, is a tree:
Notice how we recursively split subsets of our dataset into smaller and smaller subsets, until we eventually reach the child nodes.
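If you prefer to see that construction process in code, the following is a simplified sketch of the recursive build. It is not the implementation inside the vptree package we use later in this tutorial; the Node class and build_vptree function are invented here purely to illustrate the vantage point/median partitioning idea:

import random

class Node:
    def __init__(self, point, mu, near, far):
        self.point = point   # the vantage point for this node
        self.mu = mu         # median distance to the remaining points
        self.near = near     # subtree of points with distance <= mu
        self.far = far       # subtree of points with distance > mu

def build_vptree(points, distance):
    # base cases: nothing left to partition, or a single leaf point
    if not points:
        return None
    if len(points) == 1:
        return Node(points[0], 0, None, None)

    # choose a vantage point (chosen randomly here for simplicity)
    points = list(points)
    vp = points.pop(random.randrange(len(points)))

    # compute distances from the vantage point to all remaining points,
    # then take the median distance, mu
    dists = [distance(vp, p) for p in points]
    mu = sorted(dists)[len(dists) // 2]

    # partition into "near" (<= mu) and "far" (> mu) sets and recurse
    near = [p for (p, d) in zip(points, dists) if d <= mu]
    far = [p for (p, d) in zip(points, dists) if d > mu]
    return Node(vp, mu, build_vptree(near, distance),
        build_vptree(far, distance))

At search time we descend this tree, using μ and the distance to each vantage point to decide which branches cannot contain a match and can therefore be skipped. That pruning is where the sub-linear search behavior comes from.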
VP-Trees take O(n log n) to build, but once constructed, a search takes only O(log n), thus reducing our search time to sub-linear complexity!
Later in this tutorial, you’ll learn to utilize VP-Trees with Python to build and scale our image hashing search engine.
Note: This section is meant to be a gentle introduction to VP-Trees. If you are interested in learning more about them, I would recommend (1) consulting a data structures textbook, (2) following this guide from Steve Hanov’s blog, or (3) reading this writeup from Ivan Chen.
The CALTECH-101 dataset
The dataset we’ll be working with today is the CALTECH-101 dataset, which consists of 9,144 total images across 101 categories (with 40 to 800 images per category).
The dataset is large enough to be interesting to explore from an introductory to image hashing perspective but still small enough that you can run the example Python scripts in this guide without having to wait hours and hours for your system to finish chewing on the images.
You can download the CALTECH-101 dataset from their official webpage or you can use the following wget command:
$ wget http://www.vision.caltech.edu/Image_Datasets/Caltech101/101_ObjectCategories.tar.gz
$ tar xvzf 101_ObjectCategories.tar.gz
Project structure
Let’s inspect our project structure:
$ tree --dirsfirst
.
├── pyimagesearch
│   ├── __init__.py
│   └── hashing.py
├── 101_ObjectCategories [9,144 images]
├── queries
│   ├── accordion.jpg
│   ├── accordion_modified1.jpg
│   ├── accordion_modified2.jpg
│   ├── buddha.jpg
│   └── dalmation.jpg
├── index_images.py
└── search.py
The pyimagesearch module contains hashing.py, which includes three hashing functions. We will review these functions in the “Implementing our image hashing utilities” section below.

Our dataset lives in the 101_ObjectCategories/ folder (CALTECH-101), which contains 101 sub-directories of images. Be sure to read the previous section to learn how to download the dataset.
There are five query images in the queries/ directory. We will search for images with hashes similar to these query images. The accordion_modified1.jpg and accordion_modified2.jpg images will present unique challenges to our VP-Tree image hashing search engine.
The core of today’s project lies in two Python scripts: index_images.py and search.py:
- Our indexer will calculate hashes for all 9,144 images and organize the hashes in a VP-Tree. This index will reside in two .pickle files: (1) a dictionary of all computed hashes, and (2) the VP-Tree.
- The searcher will calculate the hash for a query image and search the VP-Tree for the closest images via Hamming distance. The results will be returned to the user.
If that sounds like a lot, don’t worry! This tutorial will break everything down step-by-step.
Configuring your development environment
For this blog post, your development environment needs the following packages installed: OpenCV, NumPy, imutils, and vptree.
Luckily for us, everything is pip-installable. My recommendation for you is to follow the first OpenCV link to pip-install OpenCV in a virtual environment on your system. From there you’ll just pip-install everything else in the same environment.
It will look something like this:
# setup pip, virtualenv, and virtualenvwrapper (using the "pip install OpenCV" instructions)
$ workon <env_name>
$ pip install numpy
$ pip install opencv-contrib-python
$ pip install imutils
$ pip install vptree
Replace <env_name> with the name of your virtual environment. The workon command will only be available once you set up virtualenv and virtualenvwrapper following these instructions.
Implementing our image hashing utilities
Before we can build our image hashing search engine, we first need to implement a few helper utilities.
Open up the hashing.py file in the project structure and insert the following code:
# import the necessary packages
import numpy as np
import cv2

def dhash(image, hashSize=8):
    # convert the image to grayscale
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # resize the grayscale image, adding a single column (width) so we
    # can compute the horizontal gradient
    resized = cv2.resize(gray, (hashSize + 1, hashSize))

    # compute the (relative) horizontal gradient between adjacent
    # column pixels
    diff = resized[:, 1:] > resized[:, :-1]

    # convert the difference image to a hash
    return sum([2 ** i for (i, v) in enumerate(diff.flatten()) if v])
We begin by importing OpenCV and NumPy (Lines 2 and 3).
The first function we’ll look at, dhash, is used to compute the difference hash for a given input image. Recall from above that our dhash requires four steps: (1) convert to grayscale, (2) resize, (3) compute the difference, and (4) build the hash. Let’s break it down a little further:
- Line 7 converts the image to grayscale.
- Line 11 resizes the image to N rows by N + 1 columns, ignoring the aspect ratio. This ensures that the resulting image hash will match similar photos regardless of their initial spatial dimensions.
- Line 15 computes the horizontal gradient difference between adjacent column pixels. Assuming hashSize=8, there will be 8 rows of 8 differences (the 9 columns allow for 8 comparisons per row). We will thus have a 64-bit hash as 8×8=64.
- Line 18 converts the difference image to a hash.
For more details, refer to this blog post.
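As a quick sanity check, you can call dhash on any BGR image loaded with OpenCV. Here is a minimal sketch that assumes you run it from the project directory described above and uses a synthetic NumPy image so no dataset files are needed:

import numpy as np
from pyimagesearch.hashing import dhash

# a synthetic 600x800 BGR "image" (random pixels), just to exercise dhash
image = np.random.randint(0, 256, size=(600, 800, 3), dtype="uint8")

# the result is a single (potentially very large) Python integer
print(dhash(image))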
Next, let’s look at the convert_hash function:
def convert_hash(h):
    # convert the hash to NumPy's 64-bit float and then back to
    # Python's built in int
    return int(np.array(h, dtype="float64"))
When I first wrote the code for this tutorial, I found that the VP-Tree implementation we’re using internally converts points to a NumPy 64-bit float. That would be okay; however, hashes need to be integers and if we convert them to 64-bit floats, they become an unhashable data type. To overcome the limitation of the VP-Tree implementation, I came up with the convert_hash hack:
- We accept an input hash, h.
- That hash is then converted to a NumPy 64-bit float.
- And that NumPy float is then converted back to Python’s built-in integer data type.
This hack ensures that hashes are represented consistently throughout the hashing, indexing, and searching process.
We then have one final helper method, hamming, which is used to compute the Hamming distance between two integers:
def hamming(a, b):
    # compute and return the Hamming distance between the integers
    return bin(int(a) ^ int(b)).count("1")
The Hamming distance is simply a count of the number of 1s when taking the XOR (^) between two integers (Line 27).
Implementing our image hash indexer
Before we can perform a search, we first need to:
- Loop over our input dataset of images.
- Compute difference hash for each image.
- Build a VP-Tree using the hashes.
Let’s start that process now.
Open up the index_images.py file and insert the following code:
# import the necessary packages
from pyimagesearch.hashing import convert_hash
from pyimagesearch.hashing import hamming
from pyimagesearch.hashing import dhash
from imutils import paths
import argparse
import pickle
import vptree
import cv2

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--images", required=True, type=str,
    help="path to input directory of images")
ap.add_argument("-t", "--tree", required=True, type=str,
    help="path to output VP-Tree")
ap.add_argument("-a", "--hashes", required=True, type=str,
    help="path to output hashes dictionary")
args = vars(ap.parse_args())
Lines 2-9 import the packages, functions, and modules necessary for this script. In particular, Lines 2-4 import our three hashing-related functions: convert_hash, hamming, and dhash. Line 8 imports the vptree implementation that we will be using.
Next, Lines 12-19 parse our command line arguments:
- --images: The path to our images which we will be indexing.
- --tree: The path to the output VP-Tree .pickle file which will be serialized to disk.
- --hashes: The path to the output hashes dictionary which will be stored in .pickle format.
Now let’s compute hashes for all images:
# grab the paths to the input images and initialize the dictionary
# of hashes
imagePaths = list(paths.list_images(args["images"]))
hashes = {}

# loop over the image paths
for (i, imagePath) in enumerate(imagePaths):
    # load the input image
    print("[INFO] processing image {}/{}".format(i + 1,
        len(imagePaths)))
    image = cv2.imread(imagePath)

    # compute the hash for the image and convert it
    h = dhash(image)
    h = convert_hash(h)

    # update the hashes dictionary
    l = hashes.get(h, [])
    l.append(imagePath)
    hashes[h] = l
Lines 23 and 24 grab image paths and initialize our hashes dictionary.

Line 27 then begins a loop over all the imagePaths. Inside the loop, we:

- Load the image (Line 31).
- Compute and convert the hash, h (Lines 34 and 35).
- Grab a list of all image paths, l, with the same hash (Line 38).
- Add this imagePath to the list, l (Line 39).
- Update our dictionary with the hash as the key and our list of image paths with the same corresponding hash as the value (Line 40).
From here, we build our VP-Tree:
# build the VP-Tree
print("[INFO] building VP-Tree...")
points = list(hashes.keys())
tree = vptree.VPTree(points, hamming)
To construct the VP-Tree, Lines 44 and 45 pass in (1) a list of data points (i.e., the hash integer values themselves), and (2) our distance function (the Hamming distance method) to the VPTree constructor.
Internally, the VP-Tree computes the Hamming distances between all input points and then constructs the VP-Tree such that data points with smaller distances (i.e., more similar images) lie closer together in the tree space. Be sure to refer to the “What are VP-Trees and how can they help scale image hashing search engines?” section and Figures 7, 8, and 9.
With our hashes dictionary populated and VP-Tree constructed, we’ll now serialize them both to disk as .pickle files:
# serialize the VP-Tree to disk
print("[INFO] serializing VP-Tree...")
f = open(args["tree"], "wb")
f.write(pickle.dumps(tree))
f.close()

# serialize the hashes dictionary to disk
print("[INFO] serializing hashes...")
f = open(args["hashes"], "wb")
f.write(pickle.dumps(hashes))
f.close()
Extracting image hashes and building the VP-Tree
Now that we’ve implemented our indexing script, let’s put it to work. Make sure you’ve:
- Downloaded the CALTECH-101 dataset using the instructions above.
- Used the “Downloads” section of this tutorial to download the source code and example query images.
- Extracted the .zip of the source code and changed directory to the project.
From there, open up a terminal and issue the following command:
$ time python index_images.py --images 101_ObjectCategories \
    --tree vptree.pickle --hashes hashes.pickle
[INFO] processing image 1/9144
[INFO] processing image 2/9144
[INFO] processing image 3/9144
[INFO] processing image 4/9144
[INFO] processing image 5/9144
...
[INFO] processing image 9140/9144
[INFO] processing image 9141/9144
[INFO] processing image 9142/9144
[INFO] processing image 9143/9144
[INFO] processing image 9144/9144
[INFO] building VP-Tree...
[INFO] serializing VP-Tree...
[INFO] serializing hashes...

real    0m10.947s
user    0m9.096s
sys     0m1.386s
As our output indicates, we were able to hash all 9,144 images in just over 10 seconds.
Checking the project directory after running the script, we’ll find two .pickle files:
$ ls -l *.pickle
-rw-r--r-- 1 adrianrosebrock 796620 Aug 22 07:53 hashes.pickle
-rw-r--r-- 1 adrianrosebrock 707926 Aug 22 07:53 vptree.pickle
The hashes.pickle (796.62KB) file contains our computed hashes, mapping the hash integer value to file paths with the same hash. The vptree.pickle (707.926KB) file is our constructed VP-Tree.
We’ll be using this VP-Tree to perform queries and searches in the following section.
Implementing our image hash searching script
The second component of an image hashing search engine is the search script. The search script will:
- Accept an input query image.
- Compute the hash for the query image.
- Search the VP-Tree using the query hash to find all duplicate/near-duplicate images.
Let’s implement our image hash searcher now — open up the search.py file and insert the following code:
# import the necessary packages
from pyimagesearch.hashing import convert_hash
from pyimagesearch.hashing import dhash
import argparse
import pickle
import time
import cv2

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-t", "--tree", required=True, type=str,
    help="path to pre-constructed VP-Tree")
ap.add_argument("-a", "--hashes", required=True, type=str,
    help="path to hashes dictionary")
ap.add_argument("-q", "--query", required=True, type=str,
    help="path to input query image")
ap.add_argument("-d", "--distance", type=int, default=10,
    help="maximum hamming distance")
args = vars(ap.parse_args())
Lines 2-9 import the necessary components for our searching script. Notice that we need the dhash and convert_hash functions once again as we’ll have to compute the hash for our --query image.
Lines 10-19 parse our command line arguments (the first three are required):
- --tree: The path to our pre-constructed VP-Tree on disk.
- --hashes: The path to our pre-computed hashes dictionary on disk.
- --query: Our query image’s path.
- --distance: The maximum Hamming distance between hashes is set with a default of 10. You may override it if you so choose.
It’s important to note that the larger the --distance is, the more hashes the VP-Tree will compare, and thus the searcher will be slower. Try to keep your --distance as small as possible without compromising the quality of your results.
Next, we’ll (1) load our VP-Tree + hashes dictionary, and (2) compute the hash for our --query image:
# load the VP-Tree and hashes dictionary
print("[INFO] loading VP-Tree and hashes...")
tree = pickle.loads(open(args["tree"], "rb").read())
hashes = pickle.loads(open(args["hashes"], "rb").read())

# load the input query image
image = cv2.imread(args["query"])
cv2.imshow("Query", image)

# compute the hash for the query image, then convert it
queryHash = dhash(image)
queryHash = convert_hash(queryHash)
Lines 23 and 24 load the pre-computed index including the VP-Tree and hashes dictionary.

From there, we load and display the --query image (Lines 27 and 28).

We then take the query image and compute the queryHash (Lines 31 and 32).
At this point, it is time to perform a search using our VP-Tree:
# perform the search
print("[INFO] performing search...")
start = time.time()
results = tree.get_all_in_range(queryHash, args["distance"])
results = sorted(results)
end = time.time()
print("[INFO] search took {} seconds".format(end - start))
Lines 37 and 38 perform a search by querying the VP-Tree for hashes with the smallest Hamming distance relative to the queryHash. The results are sorted so that “more similar” hashes are at the front of the results list.
Both of these lines are sandwiched with timestamps for benchmarking purposes, the results of which are printed via Line 40.
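One design note: get_all_in_range returns every hash within a fixed Hamming radius, which is exactly what we want for duplicate/near-duplicate detection. If you only care about the k closest hashes regardless of their absolute distance, the vptree package also exposes nearest-neighbor style queries. A sketch is below; the exact method names may vary between package versions, so check the documentation of your installed release:

# query the tree for the 5 hashes closest to the query hash
# (assumes the same tree and queryHash variables from search.py above)
neighbors = tree.get_n_nearest_neighbors(queryHash, 5)

# each entry should be a (distance, hash) pair
for (d, h) in neighbors:
    print("d: {}, h: {}".format(d, h))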
Finally, we will loop over results and display each of them:
# loop over the results
for (d, h) in results:
    # grab all image paths in our dataset with the same hash
    resultPaths = hashes.get(h, [])
    print("[INFO] {} total image(s) with d: {}, h: {}".format(
        len(resultPaths), d, h))

    # loop over the result paths
    for resultPath in resultPaths:
        # load the result image and display it to our screen
        result = cv2.imread(resultPath)
        cv2.imshow("Result", result)
        cv2.waitKey(0)
Line 43 begins a loop over the results:

- The resultPaths for the current hash, h, are grabbed from the hashes dictionary (Line 45).
- Each result image is displayed on screen, and a keypress advances to the next result (Lines 50-54).
Image hashing search engine results
We are now ready to test our image search engine!
But before we do that, make sure you have:
- Downloaded the CALTECH-101 dataset using the instructions above.
- Used the “Downloads” section of this tutorial to download the source code and example query images.
- Extracted the .zip of the source code and changed directory to the project.
- Ran the index_images.py file to generate the hashes.pickle and vptree.pickle files.
After all the above steps are complete, open up a terminal and execute the following command:
$ python search.py --tree vptree.pickle --hashes hashes.pickle \
    --query queries/buddha.jpg
[INFO] loading VP-Tree and hashes...
[INFO] performing search...
[INFO] search took 0.015203237533569336 seconds
[INFO] 1 total image(s) with d: 0, h: 8.162938100012111e+18
On the left, you can see our input query image of our Buddha. On the right, you can see that we have found the duplicate image in our indexed dataset.
The search itself took only 0.015 seconds.
Additionally, note that the distance between the input query image and the hashed image in the dataset is zero, indicating that the two images are identical.
Let’s try again, this time with an image of a Dalmatian:
$ python search.py --tree vptree.pickle --hashes hashes.pickle \
    --query queries/dalmation.jpg
[INFO] loading VP-Tree and hashes...
[INFO] performing search...
[INFO] search took 0.014827728271484375 seconds
[INFO] 1 total image(s) with d: 0, h: 6.445556196029652e+18
Again, we see that our image hashing search engine has found the identical Dalmatian in our indexed dataset (we know the images are identical due to the Hamming distance of zero).
The next example is of an accordion:
$ python search.py --tree vptree.pickle --hashes hashes.pickle \
    --query queries/accordion.jpg
[INFO] loading VP-Tree and hashes...
[INFO] performing search...
[INFO] search took 0.014187097549438477 seconds
[INFO] 1 total image(s) with d: 0, h: 3.380309217342405e+18
We once again find our identical matched image in the indexed dataset.
We know our image hashing search engine is working great for identical images…
…but what about images that are slightly modified?
Will our hashing search engine still perform well?
Let’s give it a try:
$ python search.py --tree vptree.pickle --hashes hashes.pickle \
    --query queries/accordion_modified1.jpg
[INFO] loading VP-Tree and hashes...
[INFO] performing search...
[INFO] search took 0.014217138290405273 seconds
[INFO] 1 total image(s) with d: 4, h: 3.380309217342405e+18
Here I’ve added a small red square in the bottom left corner of the accordion query image. This addition will change the difference hash value!
However, if you take a look at the output result, you’ll see that we were still able to detect the near-duplicate image.
We were able to find the near-duplicate image by comparing the Hamming distance between the hashes. The difference in hash values is 4, indicating that 4 bits differ between the two hashes.
Next, let’s try a second query, this one much more modified than the first:
$ python search.py --tree vptree.pickle --hashes hashes.pickle \
    --query queries/accordion_modified2.jpg
[INFO] loading VP-Tree and hashes...
[INFO] performing search...
[INFO] search took 0.013727903366088867 seconds
[INFO] 1 total image(s) with d: 9, h: 3.380309217342405e+18
Despite dramatically altering the query by adding in a large blue rectangle, a yellow circle, and text, we’re still able to find the near-duplicate image in our dataset in under 0.014 seconds!
Whenever you need to find duplicate or near-duplicate images in a dataset, definitely consider using image hashing and image searching algorithms — when used correctly, they can be extremely powerful!
What's next? We recommend PyImageSearch University.
86 total classes • 115+ hours of on-demand code walkthrough videos • Last updated: October 2024
★★★★★ 4.84 (128 Ratings) • 16,000+ Students Enrolled
I strongly believe that if you had the right teacher you could master computer vision and deep learning.
Do you think learning computer vision and deep learning has to be time-consuming, overwhelming, and complicated? Or has to involve complex mathematics and equations? Or requires a degree in computer science?
That’s not the case.
All you need to master computer vision and deep learning is for someone to explain things to you in simple, intuitive terms. And that’s exactly what I do. My mission is to change education and how complex Artificial Intelligence topics are taught.
If you're serious about learning computer vision, your next stop should be PyImageSearch University, the most comprehensive computer vision, deep learning, and OpenCV course online today. Here you’ll learn how to successfully and confidently apply computer vision to your work, research, and projects. Join me in computer vision mastery.
Inside PyImageSearch University you'll find:
- ✓ 86 courses on essential computer vision, deep learning, and OpenCV topics
- ✓ 86 Certificates of Completion
- ✓ 115+ hours of on-demand video
- ✓ Brand new courses released regularly, ensuring you can keep up with state-of-the-art techniques
- ✓ Pre-configured Jupyter Notebooks in Google Colab
- ✓ Run all code examples in your web browser — works on Windows, macOS, and Linux (no dev environment configuration required!)
- ✓ Access to centralized code repos for all 540+ tutorials on PyImageSearch
- ✓ Easy one-click downloads for code, datasets, pre-trained models, etc.
- ✓ Access on mobile, laptop, desktop, etc.
Summary
In this tutorial, you learned how to build a basic image hashing search engine using OpenCV and Python.
To build an image hashing search engine that scaled, we needed to utilize VP-Trees, a specialized metric tree data structure that recursively partitions a dataset of points such that nodes of the tree that are closer together are more similar than nodes that are farther away.
By using VP-Trees we were able to build an image hashing search engine capable of finding duplicate and near-duplicate images in a dataset in roughly 0.015 seconds.
Furthermore, we demonstrated that our combination of hashing algorithm and VP-Tree search was capable of finding matches in our dataset, even if our query image was modified, damaged, or altered!
If you are ever building a computer vision application that requires quickly finding duplicate or near-duplicate images in a large dataset, definitely give this method a try.
To download the source code to this post, and be notified when future posts are published here on PyImageSearch, just enter your email address in the form below!
Walid
Your weekly post makes me like Mondays 🙂
I guess this can be used to remove redundant images in training or validation datasets…
Thanks a lot
Walid
Adrian Rosebrock
You can use it for that, certainly. Another great application would be to build a “reverse image search engine” to find/index all websites that a given image appears on.
Suleiman jae
Am new in computer science and I love python and opencv and want to learn them a lot. How can I implement the reverse image search engine?
Adrian Rosebrock
This tutorial covers how to build the CV component of a reverse image search engine. The next steps for you would be to:
1. Look into crawlers/indexers. I like the Scrapy library.
2. Crawl a set of images and look for all images in the HTML source.
3. Grab the image, hash it, and store the web page it was returned on.
I would also recommend having some basic knowledge of databases.
Lambert Wolterbeek Muller
Great article, many thanks.
I really appreciate your effort in publishing working code *and* taking the time to explain the code in great detail.
I wanted to ask: if we search the VP-tree for an image similar to image “A” and the search returns “B”, do we then know for sure that “B” is the image that is closest to “A” in our entire image library? or is there no such guarantee, and could there be an image “C” in our library that is in fact closer to “A”?
in other words: does the search through the VP-tree yield an “acceptable” answer or “best” answer?
Adrian Rosebrock
The VP-Tree will give you the “best” answers but make sure you sort them based on the distances returned by the tree, as depending on how the tree is traversed, the distance values themselves might be out of order.
David Bonn
Great and timely post for me, Adrian. This changes my whole approach to finding similar images in a dataset.
Adrian Rosebrock
Thanks David, I’m glad you found it helpful! 😀
Jeff
Adrian, Thank you so much for all your effort and details that you have done in this post, I loved it how you explain, it is extremely easy to understand and I love what you are doing, please don’t stop posting and thank you again for sharing your knowledge. I am saving to purchase the Guru’s course.
Adrian Rosebrock
Thanks Jeff. I don’t have any plans on stopping my sharing of content 😉
See you in the course!
Denis Brion
What would happen if I made two pictures
original one
same with a 15 % margin (with arbitrary pixels), say, at the top?
(if the original picture is copyright protected, I bet second one, is , too)
Another easy and cheap way of making two very near (from my point of view) pictures would be to distort by affine transforms (the first case is a translation, a particular case of affine transform).
If one is interested in plagiarism detection, this would be very easy things which should be detected.
Shivam
Adrian is it possible to update “tree” with new images. i.e. if I want to add an image in the search space that I didn’t consider while initial creation.
Adrian Rosebrock
See my reply to zhi zhou.
Anthony The Koala
Dear Dr Adrian,
Thank you for this post.
I have two questions please:
(1) This is inspired by the today’s blog, the 2017 blogpost on “image hashing and opencv and python” with a picture of a beagle at the top page and the “hotdog/no hotdog” as in the show “Silicon Valley”.
You have a database ‘hashes’ of dogs.
Suppose one takes a picture of a beagle. The picture you take is different from the one stored in the database with factors such as lighting, contrast, gamma, orientation of beagle.
Assuming you scale the picture to a particular size, turn it into a greyscale picture and perform a hash function on the just-taken photo, will the search engine in today’s example retrieve the photo of the beagle stored in the database?
(2) I was reading about hashing used in cryptography. In cryptography applications, to ensure that the file you received is what was sent by the server, you compare the published hash, MD5 (say) you compute on the received file with the published hash. It should be exact.
If there is a slight variation in the file sent, the computed hash (MD5) is supposed to be completely different than the published/expected hash value.
NOW, in today’s tutorial, particularly the 2017 tutorial, if you take a photo, compute its hash and compare it to the hash of a similar-but-not-exact photo on the hash database, the hash values should be close.
The question is what is the difference between hashing used in cryptography and hashing a photo used in your blog?
Thank you,
Anthony of Sydney
Adrian Rosebrock
1. No, this image search engine is only to be used for duplicate/near-duplicate detection. It will not work if you start rotating an image or changing viewpoint. For that you would need a more advanced method using keypoint detectors, local invariant descriptors, and the BOVW model. All of that is covered in the PyImageSearch Gurus course.
2. A cryptography hash doesn’t care about the contents of an image. You change a bit in the file and the hash will change. Cryptography hashes have no idea what is actually in the image. Perceptual hashes, like this one, do. If you read the 2017 post you’ll find that I discuss cryptography hashes and compare them to image hashes.
Matteo
Hi Adrian, thanks a lot for your job! A question: how many images can be managed using this code (and getting results in a few seconds)? If I have a dataset with 10 million images, will this work?
Thanks a Lot for your amazing job !
Adrian Rosebrock
Hey Matteo — I would recommend you try it and see 🙂 You write a Python script that copies a single image 10,000,000 times. Then hash your images and build your dataset. Building mini-projects like that is a great way to learn.
Denis Rothman
Dear Adrian,
First of all, thanks for your great courses!
I’ve been around in AI for quite some time and can safely say you are a MASTER of AI and unique!!!
I was moved by your initial Hash post about your childhood.
Now for my question. While swimming this afternoon, I found an algorithm for hash search engine which is a variant of your analysis but without a VP-Tree. What do you think of this, please?
I.Creating the data:
Step 1: looping over the images in the dataset
Step 2: generating a hash code per image as you mentioned: load->grayscale->resize->flatten->enumerate->generate hash code
Step 3: rename each image to hash_code.jpg, e.g., and the initial label can be stored in hash_code.txt, e.g.
Of course, there can be repositories or whatever. Let’s skip this part.
I. Search engine without a VP-tree
Step 4: Load the image to search and calculate its hash code (see Step 2).
Step 5: run the os.path.exists(hash_code.jpg) command and if it’s true, open the file and its label
Step 6: if Step 5 is false, then let’s say = Euclidean threshold distance= etd. Let’s say that it can be squared, or square rooted or whatever. In any case, it’s a distance.
Step 7A: loop 5, e.g., times in one direction dir in range(0,4). Each time etd += etd + dir*some parameter. Each time, run os.path.exists(hash_code+etd.jpg)
Step 7B: The same as 7A but in the other direction. os.path.exists(hash_code-etd.jpg).
It terms of big O notation:
– if the 7A+7B= threshold steps=k
then O(k) no matter what the size of the “database” is. That’s even less than O(log n) so it would be, I think, sub-sublinear.
Note: I’ve spent the last 35+ year optimizing mainstream equations and algorithms, often rewriting them from scratch and merging them into meta-algorithms.
What do you think, please?
I’m looking forward to see your great mind churn a few minutes on this one!
Thanks in advance.
Very warm regards,
Denis
Adrian Rosebrock
Hey Denis — that could work but you should consider the I/O overhead of asking the OS to lookup a filename and check for existence. While it’s conceptually simple, from an engineering perspective it’s a huge bottleneck. When working with search engines you should try to fit the index into RAM and remove as many I/O calls as possible.
John F
Wonderful post, thanks so much Adrian!
Adrian Rosebrock
Thanks John, I’m glad you enjoyed it!
zhi zhou
Hello Adrian
Thanks for your great articles 🙂 I have a question about Image Search Engine after reading this post. Since the search domain in this post is local images, how should I let the domain to extend to internet? Should I add image-fetch bot such as spider or anything else to get images from internet and creating index-tree(VP-Tree) dynamically?
I have implemented an image search engine demo service according to your post, including restful API returned with json result (near-duplicated image urls, dhash values and distances etc.). If I add a spider in my app, Should I save the image locally after grabbing them from internet or just calculate the dhash and save it with the image-url(ignore the image file)? Any advice on this? And how often to update the index-tree? any time or just daily or weekly?New to search engine,Thanks!
Adrian Rosebrock
Yes. Basically you would use a spider (give Scrapy a try) to continually fetch new images. Hash those images and build smaller VP-Trees every few days or after every N images. Query every tree you have when searching. Then, after X days, where X is larger than N, consolidate the trees into a single tree. This process is called “index folding”.
If you want to learn more about image search engines you should refer to the PyImageSearch Gurus course.
Anchal
Great post, thanks a lot.
I wanted to ask if i should implement the indexing in batches, given that I want to make it scalable for millions of images and might end up doing the indexing multiple times.
Thanks again
Adrian Rosebrock
Great question. Stay tuned for next week’s post where I’ll show you how to distribute the hashing load and dramatically increase the indexing speed 🙂
zhzhi
Hi Adrian
Thanks for your great post! I have tried it following your code and found that I cannot get the right result displayed in your Figure 1. I used ahash, phash and dhash (https://github.com/JohannesBuchner/imagehash/blob/master/imagehash/__init__.py), with the hamming distance threshold set to 10.
Is your result using the color histogram algorithm in Figure 1 instead of image hash? It seems the similar images have the same color distribution in Figure 1. Thanks!
zhzhi
In my opinion, image hash is not used to search similar images, we can use it to find duplicated images or near-duplicated images, right ?
Adrian Rosebrock
Yes, you are correct.
Adrian Rosebrock
That’s odd, I’m not sure why you would have different results. What version of Python are you using?
JISHNU PRAKASH K P
Hello Adrian,
Very inspiring content.
How can I improve the accuracy of image search?
What about using Locality sensitive hashing.. how can I use it?
In fact, I have to find similar looking real estate property pictures. Can you please guide me to implement LSH.
Adrian Rosebrock
I don’t have any LSH tutorials at the moment. I may consider doing it in the future but I cannot guarantee if/when that may be.
Parasa V.N.S Hanumantha rao
Is it possible to use Elasticsearch for reducing the searching time if yes how?
Adrian Rosebrock
Elasticsearch is based on Lucene (which is used for traditional text retrieval, not CBIR). You could use Elasticsearch to store metadata on an image but it wouldn’t actually help with reducing the amount of time a search takes.
Parasa V.N.S Hanumantha rao
Can we use the same to face search? because when the data set increases it really takes a lot of time
Adrian Rosebrock
You should consider parallelizing across the system bus.
franclin
Hi Adrian,
Thanks a million for these tutorials, they’re really helpful and easy to follow. I just wanted to share some thoughts on how to solve this problem. What about using a CNN like ResNet or VGG to train all the images in the dataset and then use the query image to make predictions on the newly trained CNN? Or maybe using a pre-trained CNN on imagenet in Keras and perform predictions on that model using the query image?
Another idea that I would like to suggest would be to extract the query image features using a headless pre-trained model then perform a KNN search on the GPU?
Once more, thank you and keep up the good work.
Adrian Rosebrock
Hey Franclin — it sounds like you’re interested in using a CNN as an arbitrary feature extractor and then using those features to build an image search engine. Is that correct?
franclin
Yes indeed Adrian that’s it exactly what I meant. Could you please share any thoughts on that idea?
Thanks a million once more.
Adrian Rosebrock
You can do that but you would need to use a bag-of-visual-words model to scale it. I cover that inside the PyImageSearch Gurus course.
Bill Buchanan
Hi Adrian, fascinating stuff 🙂
I’m working on a project to use a drone to check warehouse inventory. I need to assess if today’s photo of each shelf is the same as yesterday’s.
Problem is, the drone won’t be in exactly the same position on each run. Would image hashing/Hamming be your chosen approach or could you suggest any techniques that might work better?
Adrian Rosebrock
I would suggest keypoint detection, local invariant descriptors, and feature matching. All of that is covered inside Practical Python and OpenCV.
Yingbo
Wonderful! Adrian. Have you integrated this example into your book?
Adrian Rosebrock
No, primarily because I don’t cover image hashing in my books. Perhaps in a future edition of the PyImageSearch Gurus course I will include it!
Bishal Neupane
Hi Adrain, thanks for the great article as always. I was wondering how could I come up with similar hashes when the content in the image is the same except for the color of the object. The problem is converting RGB to grayscale produces different grayscale images and the hashes are far away from each other. I could not find any transformation where I could completely discard the color information in the images. Is it even feasible with some techniques? Waiting for your feedback. Thanks.
Adrian Rosebrock
Are the images in fact identical except for color? I get the impression that the viewpoint of the object may be different or some other objects may be present. Keep in mind that this method will not work for that level of similarity comparison. I would suggest using keypoint detectors, local invariant descriptors, and keypoint matching, all of which are covered in the PyImageSearch Gurus course.
Bishal Neupane
Yes the images are in fact almost identical, just with different color. Would keypoint detectors approach tackle different colors but same object case too?
Adrian Rosebrock
Yes, keypoint detectors would work well in that situation as they are not typically dependent on object color.
Ujjwal Khandelwal
Awesome work Adrian! Saved a lot of browsing time. Thanks a ton 🙂
Adrian Rosebrock
Thanks Ujjwal 🙂
Jonatron
I think it’s worth mentioning that hamming distance in pure python is very slow. Cython can make it hundreds of times faster. I contributed cython hamming distance to imagededup at https://github.com/idealo/imagededup/pull/56
Adrian Rosebrock
Thanks for sharing Jonatron!
Vijay Zharotia
Hi Adrian,
I have used the code for searching similar images on a chest x-ray database. Since x-rays are grayscale, do you think we need some better method? I have 10000 images and results are good despite the pictures being in grayscale. It is faster. What are the limitations of this method? Results are slightly different but have overlaps. Any suggestion?
Adrian Rosebrock
What is the end goal of the project? Are you looking for duplicate/near-duplicate images?
Secondly, how are you quantifying how a chest X-ray is similar/different?
aravind
what type image hashing used here?
Adrian Rosebrock
The blog post details the hashing algorithm used.
jack
Hi Adrian, please help me to add something to it: I want to know the name of the image that is matched. I know that the following “imshow” is showing the result and input query image, but I want the image name printed along with the result.
Ben Huth
Is there any way to simply append the hashes to the lists? I don’t want to have to keep the images around and only need to be able to add new images every time I run the script
Adrian Rosebrock
Yes, I would recommend you read up on Python basics, including appending items to lists as well as serializing and deserializing files.