Last updated on Feb 13, 2024.
In last week’s blog post we learned how to install the Tesseract binary for Optical Character Recognition (OCR).
We then applied the Tesseract program to test and evaluate the performance of the OCR engine on a very small set of example images.
A dataset is instrumental for Optical Character Recognition (OCR) tasks because it enables the model to learn and understand various fonts, sizes, and orientations of text. This in turn leads to improved OCR accuracy in real-world applications.
Roboflow has free tools for each stage of the computer vision pipeline that will streamline your workflows and supercharge your productivity.
Sign up or log in to your Roboflow account to access state-of-the-art dataset libraries and revolutionize your computer vision pipeline.
You can start by choosing your own datasets or using PyImageSearch's assorted library of useful datasets.
Bring data in any of 40+ formats to Roboflow, train using any state-of-the-art model architectures, deploy across multiple platforms (API, NVIDIA, browser, iOS, etc), and connect to applications or 3rd party tools.
As our results demonstrated, Tesseract works best when there is a (very) clean segmentation of the foreground text from the background. In practice, it can be extremely challenging to guarantee these types of segmentations. Hence, we tend to train domain-specific image classifiers and detectors.
Nevertheless, it’s important that we understand how to access Tesseract OCR via the Python programming language in the case that we need to apply OCR to our own projects (provided we can obtain the nice, clean segmentations required by Tesseract).
Use Cases and Applications of Tesseract
Tesseract, a powerful OCR tool, finds applications across diverse industries by automating data extraction and streamlining workflows. In finance and accounting, it aids in digitizing invoices, receipts, and bank statements, reducing manual data entry and enhancing accuracy. The healthcare sector benefits from its ability to convert medical records, prescriptions, and lab reports into electronic formats, improving compliance with regulations like HIPAA and facilitating better patient care.
In e-commerce and retail, Tesseract enhances inventory management, automates order processing from shipping labels, and digitizes customer feedback for actionable insights. Additionally, in education and research, it simplifies tasks like digitizing textbooks, automating exam evaluations, and processing large datasets for analysis. These applications showcase Tesseract’s versatility in improving efficiency, accuracy, and accessibility in real-world scenarios.
Example projects involving OCR may include building a mobile document scanner from which you wish to extract textual information, or perhaps you’re running a service that scans paper medical records and you’re looking to put the information into a HIPAA-compliant database.
In the remainder of this blog post, we’ll learn how to install the Tesseract OCR + Python “bindings” followed by writing a simple Python script to call these bindings. By the end of the tutorial, you’ll be able to convert text in an image to a Python string data type.
To learn more about using Tesseract and Python together with OCR, just keep reading.
- Update Feb 2024: Added section detailing how Tesseract version can have huge impacts on OCR accuracy.
Using Tesseract OCR with Python
This blog post is divided into three parts.
First, we’ll learn how to install the pytesseract package so that we can access Tesseract via the Python programming language.
Next, we’ll develop a simple Python script to load an image, binarize it, and pass it through the Tesseract OCR system.
Finally, we’ll test our OCR pipeline on some example images and review the results.
To download the source code + example images to this blog post, be sure to use the “Downloads” section below.
Installing the Tesseract + Python “bindings”
Let’s begin by getting pytesseract installed. To install pytesseract, we’ll take advantage of pip.
If you’re using a virtual environment (which I highly recommend so that you can separate different projects), use the workon command followed by the appropriate virtual environment name. In this case, our virtualenv is named cv.
$ workon cv
Next, let’s install Pillow, a more Python-friendly fork of PIL (a dependency), followed by pytesseract.
$ pip install pillow
$ pip install pytesseract
Note: pytesseract does not provide true Python bindings. Rather, it simply provides an interface to the tesseract binary. If you take a look at the project on GitHub, you’ll see that the library writes the image to a temporary file on disk, calls the tesseract binary on that file, and captures the resulting output. This is definitely a bit hackish, but it gets the job done for us.
Let’s move forward by reviewing some code that segments the foreground text from the background and then makes use of our freshly installed pytesseract.
Applying OCR with Tesseract and Python
Let’s begin by creating a new file named ocr.py:
# import the necessary packages
from PIL import Image
import pytesseract
import argparse
import cv2
import os

# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
	help="path to input image to be OCR'd")
ap.add_argument("-p", "--preprocess", type=str, default="thresh",
	help="type of preprocessing to be done")
args = vars(ap.parse_args())
Lines 2-6 handle our imports. The Image class is required so that we can load our input image from disk in PIL format, a requirement when using pytesseract.
Our command line arguments are parsed on Lines 9-14. We have two command line arguments:
- --image: The path to the image we’re sending through the OCR system.
- --preprocess: The preprocessing method. This switch is optional and, for this tutorial, can accept two values: thresh (threshold) or blur.
Next we’ll load the image, binarize it, and write it to disk.
# load the example image and convert it to grayscale
image = cv2.imread(args["image"])
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# check to see if we should apply thresholding to preprocess the
# image
if args["preprocess"] == "thresh":
	gray = cv2.threshold(gray, 0, 255,
		cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]

# make a check to see if median blurring should be done to remove
# noise
elif args["preprocess"] == "blur":
	gray = cv2.medianBlur(gray, 3)

# write the grayscale image to disk as a temporary file so we can
# apply OCR to it
filename = "{}.png".format(os.getpid())
cv2.imwrite(filename, gray)
First, we load --image from disk into memory (Line 17) followed by converting it to grayscale (Line 18).
Next, depending on the pre-processing method specified by our command line argument, we will either threshold or blur the image. This is where you would want to add more advanced pre-processing methods (depending on your specific application of OCR) which are beyond the scope of this blog post.
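As one illustration, if your images suffer from uneven illumination, adaptive thresholding is a reasonable candidate to slot in here. The snippet below is only a minimal sketch; the file path and parameter values are placeholders and are not part of ocr.py.

# minimal sketch: adaptive thresholding as an alternative pre-processing step
# for images with uneven illumination (path and parameters are illustrative)
import cv2

image = cv2.imread("images/example_01.png")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
gray = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
	cv2.THRESH_BINARY, 31, 10)
cv2.imwrite("adaptive.png", gray)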
The if statement and body on Lines 22-24 perform a threshold in order to segment the foreground from the background. We do this using both the cv2.THRESH_BINARY and cv2.THRESH_OTSU flags. For details on Otsu’s method, see “Otsu’s Binarization” in the official OpenCV documentation.
We will see later in the results section that this thresholding method can be useful to read dark text that is overlaid upon gray shapes.
Alternatively, a blurring method may be applied. Lines 28-29 perform a median blur when the --preprocess flag is set to blur. Applying a median blur can help reduce salt and pepper noise, again making it easier for Tesseract to correctly OCR the image.
After pre-processing the image, we use os.getpid to derive a temporary image filename based on the process ID of our Python script (Line 33).
The final step before using pytesseract for OCR is to write the pre-processed image, gray, to disk, saving it with the filename from above (Line 34).
We can finally apply OCR to our image using the Tesseract Python “bindings”:
# load the image as a PIL/Pillow image, apply OCR, and then delete
# the temporary file
text = pytesseract.image_to_string(Image.open(filename))
os.remove(filename)
print(text)

# show the output images
cv2.imshow("Image", image)
cv2.imshow("Output", gray)
cv2.waitKey(0)
Using pytesseract.image_to_string on Line 38, we convert the contents of the image into our desired string, text. Notice that we passed a reference to the temporary image file residing on disk.
This is followed by some cleanup on Line 39 where we delete the temporary file.
Line 40 is where we print text to the terminal. In your own applications, you may wish to do some additional processing here such as spellchecking for OCR errors or Natural Language Processing rather than simply printing it to the console as we’ve done in this tutorial.
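For example, a lightweight spell-check pass over the OCR output might look like the sketch below. It assumes the third-party pyspellchecker package, which is not used anywhere else in this tutorial.

# minimal sketch: spell-check OCR output word by word
# (assumes: pip install pyspellchecker)
from spellchecker import SpellChecker

spell = SpellChecker()
ocr_text = "Tesseract Will Fail With Noisy Backgrounds"  # stand-in for `text`

corrected = []
for word in ocr_text.split():
	# correction() returns the most likely replacement, or None if unknown
	corrected.append(spell.correction(word) or word)

print(" ".join(corrected))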
Finally, Lines 43 and 44 handle displaying the original image and pre-processed image on the screen in separate windows. The cv2.waitKey(0) on Line 45 indicates that we should wait until a key on the keyboard is pressed before exiting the script.
Let’s see our handiwork in action.
Tesseract OCR and Python results
Now that ocr.py has been created, it’s time to apply Python + Tesseract to perform OCR on some example input images.
In this section, we will try OCR’ing three sample images using the following process:
- First, we will run each image through the Tesseract binary as-is.
- Then, we will run each image through ocr.py (which performs pre-processing before sending it through Tesseract).
- Finally, we will compare the results of both of these methods and note any errors.
Our first example is a “noisy” image. This image contains our desired foreground black text on a background that is partly white and partly scattered with artificially generated circular blobs. The blobs act as “distractors” to our simple algorithm.
Using the Tesseract binary, as we learned last week, we can apply OCR to the raw, unprocessed image:
$ tesseract images/example_01.png stdout
Noisy image to test Tesseract OCR
Tesseract performed well with no errors in this case.
Now let’s confirm that our newly made script, ocr.py, also works:
$ python ocr.py --image images/example_01.png
Noisy image to test Tesseract OCR
As you can see in this screenshot, the thresholded image is very clear and the background has been removed. Our script correctly prints the contents of the image to the console.
Next, let’s test Tesseract and our pre-processing script on an image with “salt and pepper” noise in the background:
We can see the output of the tesseract binary below:
$ tesseract images/example_02.png stdout
Detected 32 diacritics
" Tesséra‘c't Will Fail With Noisy Backgrounds
Unfortunately, Tesseract did not successfully OCR the text in the image.
However, by using the blur pre-processing method in ocr.py we can obtain better results:
$ python ocr.py --image images/example_02.png --preprocess blur
Tesseract Will Fail With Noisy Backgrounds
Success! Our blur pre-processing step enabled Tesseract to correctly OCR and output our desired text.
Finally, let’s try another image, this one with more text:
The above image is a screenshot from the “Prerequisites” section of my book, Practical Python and OpenCV — let’s see how the Tesseract binary handles this image:
$ tesseract images/example_03.png stdout
PREREQUISITES
In order In make the rnosi of this, you will need (a have a little bit of pregrarrmung experience. All examples in this book are in the Python programming language. Familiarity with Pyihon or other scriphng languages is suggesied, but mm required. You'll also need (a know some basic mathematics. This book is handson and example driven: leis of examples and lots of code, so even if your math skills are noi up to par. do noi worry! The examples are very damned and heavily documented (a help yuu follaw along.
Followed by testing the image with ocr.py:
$ python ocr.py --image images/example_03.png
PREREQUISITES
Lu order to make the most ol this, you will need to have a little bit ol programming experience. All examples in this book are in the Python programming language. Familiarity with Python or other scripting languages is suggested, but not requixed. You’ll also need to know some basic mathematics. This book is handson and example driven: lots of examples and lots ol code, so even ii your math skills are not up to par, do not worry! The examples are very detailed and heavily documented to help you tollow along.
Notice misspellings in both outputs including, but not limited to, “In”, “of”, “required”, “programming”, and “follow”.
The outputs of these two runs do not match; interestingly, however, the pre-processed version has only 8 word errors whereas the non-pre-processed image has 17 (over twice as many). Our pre-processing helps even on a clean background!
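If you want to reproduce that kind of count yourself, a rough word-level comparison can be scripted. The snippet below is a simple sketch using Python's difflib; it is an approximation, not a formal word-error-rate computation.

# minimal sketch: count word-level differences between a reference string
# and an OCR result (approximate, not a true word error rate)
import difflib

def word_errors(reference, hypothesis):
	ref_words = reference.split()
	hyp_words = hypothesis.split()
	matcher = difflib.SequenceMatcher(None, ref_words, hyp_words)
	matched = sum(block.size for block in matcher.get_matching_blocks())
	return len(ref_words) - matched

print(word_errors("In order to make the most of this",
	"In order In make the rnosi of this"))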
Python + Tesseract did a reasonable job here, but once again we have demonstrated the limitations of the library as an off-the-shelf classifier.
We may obtain good or acceptable results with Tesseract for OCR, but the best accuracy will come from training custom character classifiers on specific sets of fonts that appear in actual real-world images.
Don’t let the results of Tesseract OCR discourage you — simply manage your expectations and be realistic about Tesseract’s performance. There is no such thing as a true “off-the-shelf” OCR system that will give you perfect results (there are bound to be some errors).
Note: If your text is rotated, you may wish to do additional pre-processing as is performed in this previous blog post on correcting text skew. Otherwise, if you’re interested in building a mobile document scanner, you now have a reasonably good OCR system to integrate into it.
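If you do need to correct skew first, the core of that deskewing approach looks roughly like the sketch below. Note that OpenCV's minAreaRect angle convention changed in newer releases, so the angle adjustment may need tweaking for your version; the path is a placeholder.

# minimal sketch: estimate and correct text skew before running OCR
# (angle handling follows the older OpenCV convention; adjust for OpenCV >= 4.5)
import cv2
import numpy as np

image = cv2.imread("rotated_text.png")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255,
	cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)[1]

# rotated bounding box of all foreground (text) pixels
coords = np.column_stack(np.where(thresh > 0)).astype(np.float32)
angle = cv2.minAreaRect(coords)[-1]
angle = -(90 + angle) if angle < -45 else -angle

# rotate about the image center to deskew
(h, w) = image.shape[:2]
M = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
deskewed = cv2.warpAffine(image, M, (w, h),
	flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
cv2.imwrite("deskewed.png", deskewed)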
Tip: Improve OCR accuracy by upgrading your Tesseract version
Be sure to check the Tesseract version you have installed on your machine by using the tesseract -v command:
$ tesseract -v
tesseract 5.3.4
If you see Tesseract v5 or greater in your output, congrats, you are using the Long Short-Term Memory (LSTM) OCR model which is far more accurate than the previous versions of Tesseract!
If you see any version less than v5, then you should upgrade your Tesseract install — using the Tesseract v5 LSTM engine will lead to more accurate OCR results.
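If both engines are installed, you can also request the LSTM engine explicitly through pytesseract's config string. The sketch below uses Tesseract's --oem and --psm command-line flags; the image path is a placeholder.

# minimal sketch: explicitly request the LSTM engine (--oem 1) via pytesseract
import pytesseract
from PIL import Image

print(pytesseract.get_tesseract_version())  # confirm which Tesseract is in use

text = pytesseract.image_to_string(Image.open("images/example_02.png"),
	config="--oem 1 --psm 6")
print(text)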
What's next? We recommend PyImageSearch University.
86+ total classes • 115+ hours of on-demand code walkthrough videos • Last updated: January 2025
★★★★★ 4.84 (128 Ratings) • 16,000+ Students Enrolled
I strongly believe that if you had the right teacher you could master computer vision and deep learning.
Do you think learning computer vision and deep learning has to be time-consuming, overwhelming, and complicated? Or has to involve complex mathematics and equations? Or requires a degree in computer science?
That’s not the case.
All you need to master computer vision and deep learning is for someone to explain things to you in simple, intuitive terms. And that’s exactly what I do. My mission is to change education and how complex Artificial Intelligence topics are taught.
If you're serious about learning computer vision, your next stop should be PyImageSearch University, the most comprehensive computer vision, deep learning, and OpenCV course online today. Here you’ll learn how to successfully and confidently apply computer vision to your work, research, and projects. Join me in computer vision mastery.
Inside PyImageSearch University you'll find:
- ✓ 86+ courses on essential computer vision, deep learning, and OpenCV topics
- ✓ 86 Certificates of Completion
- ✓ 115+ hours of on-demand video
- ✓ Brand new courses released regularly, ensuring you can keep up with state-of-the-art techniques
- ✓ Pre-configured Jupyter Notebooks in Google Colab
- ✓ Run all code examples in your web browser — works on Windows, macOS, and Linux (no dev environment configuration required!)
- ✓ Access to centralized code repos for all 540+ tutorials on PyImageSearch
- ✓ Easy one-click downloads for code, datasets, pre-trained models, etc.
- ✓ Access on mobile, laptop, desktop, etc.
Summary
In today’s blog post we learned how to apply the Tesseract OCR engine with the Python programming language. This enabled us to apply OCR algorithms from within our Python script.
The biggest downside is with the limitations of Tesseract itself. Tesseract works best when there are extremely clean segmentations of the foreground text from the background.
Furthermore, these segmentations need to be at as high a resolution (DPI) as possible, and the characters in the input image cannot appear “pixelated” after segmentation. If characters do appear pixelated, then Tesseract will struggle to correctly recognize the text — we found this out even when applying OCR to images captured under ideal conditions (a PDF screenshot).
OCR, while no longer a new technology, is still an active area of research in the computer vision literature especially when applying OCR to real-world, unconstrained images. Deep learning and Convolutional Neural Networks (CNNs) are certainly enabling us to obtain higher accuracy, but we are still a long way from seeing “near perfect” OCR systems. Furthermore, as OCR has many applications across many domains, some of the best algorithms used for OCR are commercial and require licensing to be used in your own projects.
My primary suggestion to readers when applying OCR to their own projects is to first try Tesseract and if results are undesirable move on to the Google Vision API.
If neither Tesseract nor the Google Vision API obtain reasonable accuracy, you might want to reassess your dataset and decide if it’s worth it to train your own custom character classifier — this is especially true if your dataset is noisy and/or contains very specific fonts you wish to detect and recognize. Examples of specific fonts include the digits on a credit card, the account and routing numbers found at the bottom of checks, or stylized text used in graphic design.
I hope you are enjoying this series of blog posts on Optical Character Recognition (OCR) with Python and OpenCV!
To be notified when new blog posts are published here on PyImageSearch, be sure to enter your email address in the form below!
Download the Source Code and FREE 17-page Resource Guide
Enter your email address below to get a .zip of the code and a FREE 17-page Resource Guide on Computer Vision, OpenCV, and Deep Learning. Inside you'll find my hand-picked tutorials, books, courses, and libraries to help you master CV and DL!
Balint
Hi Adrian! This series is super useful! I’m wondering if there’s going to be a post about training an own custom character classifier.
Adrian Rosebrock
Hi Balint — I actually demonstrate how to train a classifier to recognize handwritten digits inside Practical Python and OpenCV. A more thorough review (with source code) of general machine learning and object detection techniques is covered inside PyImageSearch Gurus.
Todor Arnaudov
Hi, Adrian,
I think some of the mistakes could be corrected with a bit of NLP post-processing, too.
For example with NLTK: http://www.nltk.org/
For a start, it would use dictionaries and a corpus of texts with computed n-grams of words and sequences of characters and part-of-speech tagging. The unlikely sequences would be spotted, similar ones with high frequency may be used for replacement or suggested for the suspicious segments.
I’ll be more specific if/when I try to do it myself.
Adrian Rosebrock
Absolutely. Any type of natural language processing or domain-specific regex can help improve the accuracy.
cam
Please ignore my comment, I hadn’t installed the main package: brew install tesseract, but installed tesseract-py.
Neeraj Bisht
$ python ocr.py --image Downloads/10011050/1050.jpg
…
gray = cv2.threshold(gray, 0, 255,
^
IndentationError: expected an indented block
Got this error while doing. Help
Adrian Rosebrock
Make sure you use the “Downloads” section at the bottom of this page to download the source code and example images used in this post. During the copy and paste of the code you introduced an indentation error to the Python script, causing the error. Again, simply download the code using the “Downloads” section to use the code I have provided for you.
Anthony The Koala
Dear Dr Jason,
Have there been any experiments by super-imposing different kinds of noise such as Gaussian, Poisson, the level of noise and the degree of noise reduction in order to determine the Tesseract package will respond to a particular noise family (Gaussian & Poisson) and the threshold of noise reduction for the Tesseract package to process images correctly?
To put it another way:
That is if the particular noise cannot be completely/significantly reduced can the Tesseract package successfully decode the text with say 99% accuracy?
Also is there a particular noise distribution that the Tesseract OCR will successfully decode text 99%.
Thank you,
Anthony of Sydney NSW
Adrian Rosebrock
Hi Anthony — it’s Adrian actually, not Jason 😉
Regarding your questions, I think these are better suited for the Tesseract researchers and developers. I’m sure they have a bunch of benchmark tests they run (sort of like unit tests, only for the machine learning world). This is especially true with their new v4 release of Tesseract that will use LSTMs. I would suggest asking your specific question over at the Tesseract GitHub page as I do not know the answers to these questions off the top of my head.
Dave
which python version is this for?
when i try to run it, it says: ImportError: No module named cv2
Adrian Rosebrock
Make sure you install Tesseract into the same environment that your OpenCV bindings are installed in. Did you use one of my tutorials when installing OpenCV? If so, don’t forget to use the workon cv command to access the cv virtual environment and then install Tesseract.
Dani
Hi Adrian,
How can I split a text from scanned document (binarized image) into lines in order to do OCR on each line?
Adrian Rosebrock
The Tesseract binary will automatically attempt to OCR each individual line for you. Is there a particular reason you want to go line-by-line?
Dani
I’ve noticed that scanned document with different font sizes is a bit problematic (very poor OCR percentage), especially when the text is not accurately horizontal.
I thought, doing OCR line by line will solve this.
Adrian Rosebrock
I can see how this might be problematic. Instead of using Tesseract, perhaps try the Google Vision API and compare results.
Andrew
Is there a reason you wrote the image to a temp file instead of using: pytesseract.image_to_string(Image.fromarray(gray)?
Adrian Rosebrock
The temporary file is in OpenCV format, so it’s written to disk first and then loaded via PIL/Pillow so it can be OCR’d by Tesseract. It’s a bit of a hack.
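For readers following along: depending on your pytesseract version, you may be able to skip writing your own temporary file by converting the array with Image.fromarray (pytesseract still manages an intermediate file internally). A minimal sketch:

# minimal sketch: OCR an in-memory OpenCV image without writing your own temp file
# (behaviour may vary across pytesseract versions)
import cv2
import pytesseract
from PIL import Image

image = cv2.imread("images/example_01.png")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
print(pytesseract.image_to_string(Image.fromarray(gray)))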
Nitish Singh
How do I limit the pytesseract to alphanumeric or any other custom list?
Adrian Rosebrock
The easiest method is to consult the Tesseract FAQs. The page I linked to details how to return only digits, but you can modify it to return specific characters.
Matthew Hale
Is there a way to accomplish this in tesseract 4.x? I see how it’s supposed to work in previous version, but the same commands don’t work with LSTM, and there doesn’t seem to be a solution yet other than retraining on a dataset with limited characters
Adrian Rosebrock
I’m sure there is but I’m honestly not sure what the right command line parameters are. I would definitely suggest reaching out to the Tesseract devs.
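For reference, the option discussed in this thread is Tesseract's tessedit_char_whitelist config variable, which can be passed through pytesseract; whether it is honored depends on the Tesseract version and engine, as noted above. A minimal sketch (the image path is a placeholder):

# minimal sketch: restrict recognition to digits via a character whitelist
# (support depends on the Tesseract version/engine in use)
import pytesseract
from PIL import Image

config = "-c tessedit_char_whitelist=0123456789 --psm 7"
print(pytesseract.image_to_string(Image.open("digits.png"), config=config))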
Fernando
Hello there, ur code works fine in the sample test. But sometimes doesn’t print the result on different images.
I already tried to change DPI image and resize then and couldn’t solve the problem.
I uploaded 2 different images that i’m using for test and the first one has been identify and print correctly but the second one, doesn’t.
http://imgur.com/a/OT3TX
Any idea?
Adrian Rosebrock
You need to localize the font first. Binarize (via thresholding) the image and extract the text regions. Then pass the regions through Tesseract. It’s likely that you are not applying enough pre-processing to your images. As I mentioned in the blog post, Tesseract works best when you can extract just the text regions and ignore the rest of the image.
Ankur
Hey !
I was trying to implement your code but I am facing problem here:
args = vars(ap.parse_args())
While running this it gives me following error:
pydevconsole.py: error: argument -i/--image is required
I was thinking if you direct me in the right direction?
Adrian Rosebrock
Hi Ankur — please read up on command line arguments and how they work before continuing.
Soham
Hey Ankur, I am also getting the same error. How did you fix yours?
Please help.
Nipun
Thanks Adrian for such a nice article. I am trying to achieve this on a video, which actually did work, but this slows down the whole process and I want to do this on live stream.
Is there a way where we can avoid creating a temporary file and sending it to tesseract via Pillow ?
Can we simply pass the matrix directly to the teserract, after doing some pre-processing on it ?
Adrian Rosebrock
The Python + Tesseract “bindings” require that an intermediate file be written to disk. To speedup the process you could create a RAM filesystem, but as far as I know, you can’t pass the matrix directly into the Tesseract binary.
Phill
Using pytesseract might not be optimal due to disk I/O operations and subprocess calling of tesseract via os.syscall.
There is another Python package that offers API access to Tesseract.
https://pypi.python.org/pypi/tesserocr
Its docs are very well written.
Simple OCR of image in numpy array might be done like:
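(A minimal sketch of such a call, based on tesserocr's documented PyTessBaseAPI interface; the image path is illustrative.)

# minimal sketch: OCR a NumPy array with tesserocr's in-process API
# (no temporary files; the engine stays loaded between calls)
import numpy as np
from PIL import Image
from tesserocr import PyTessBaseAPI

img = np.asarray(Image.open("images/example_01.png").convert("L"))

with PyTessBaseAPI() as api:
	api.SetImage(Image.fromarray(img))
	print(api.GetUTF8Text())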
Profiled with profilehooks showed that 99% of time cost is due to api.GetUTF8Text() call.
Adrian Rosebrock
Thanks for sharing Phill!
Sébastien VINCENT
In the same vein as tesserocr, there is PyOCR, an other Python package which offers access to a more complete API access to Tesseract. It can be found at https://github.com/openpaperwork/pyocr
Example use:
print txt
Tarun Nanduri
how to solve invalid tessdata path?
Augustin
Hi Adrian,
Thanks for your articles, very useful!
I was wondering, I’ve seen that the next Tesseract version is going to use LSTM as a classifier. But, do you know what is implemented in the current version?
Adrian Rosebrock
The current version of Tesseract does not use the LSTM classifier by default. You would need to download the new release manually.
David
Hello,
Can you please help me and tell me where to find the different config options in
pytesseract.image_to_string(image, lang=None, boxes=False, config=None)? I know we can set the page segmentation mode in c++, is it possible with pytesseract?
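For reference, pytesseract forwards whatever you place in the config argument to the tesseract binary, so the page segmentation mode can be set the same way it is set on the command line. A minimal sketch (the image path is a placeholder):

# minimal sketch: set the page segmentation mode through pytesseract's config
# (--psm 6 treats the image as a single uniform block of text; Tesseract 4+ syntax)
import pytesseract
from PIL import Image

print(pytesseract.image_to_string(Image.open("receipt.png"), config="--psm 6"))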
Nasarudin
Hi Adrian, thank you for the article. It is great as always.
I wonder if you already tried using OCR on a screenshot. I read somewhere that screenshot only has 72 dpi which is insufficient to OCR that needs bigger dpi(300 and above if I am not mistaken).
My approach is to take a screenshot, process it by resizing/rescale up to 300%(already done), using the blur function to reduce noise(already done), and convert the image into black and white (have not try it yet)
I would like to know your opinion on this. Maybe you have better solution. Thank you.
Adrian Rosebrock
The larger the DPI, (normally) the better when it comes to OCR. As far as what DPI you are capturing a screenshot at, you would have to consult the documentation of your operating system/library used to take the screenshot.
shruthi
GOOD MORNING., SIR Is this tesseract can support any languages
Adrian Rosebrock
Tesseract supports a number of language packs.
Janderson
Thanks for your article, very useful! But I have a question. Is it possible use your script to make OCR PDF files? The Tesseract official docs explains well in C++ but I didn’t find anything in pytesseract. Any idea?
Dinesh Kumar
Thanks for your detailed article, Adrian. I am new to OCR, Tesseract and all. This helped me a lot. Thanks, man.
Adrian Rosebrock
Awesome, I’m glad to hear it Dinesh 🙂
James
Hi Adrian,
I’m new to OCR, it was a great help. But I’m curious to put this in web app, can you give me guidelines…
Adrian Rosebrock
I would suggest creating a REST-like API, as I do in this blog post.
James
Thank you. Sir
Ameer
Hi dear Adrian
Could we use this with other languages? If yes, may you point out to the main ideas how this is possible?
Thanks
Adrian Rosebrock
You can use Tesseract with C++ and C. See this link. You can also use the binary executable from any language where you can execute executables from within that language.
Soham Khapre
Hi Adrian! Thank you for your code. I am new to Python and OCR so I don’t understand much about it. I gave the image’s path address in line 14 of the code but still I am getting an error saying – argument -i/-C:\Python27\Lib\site-packages\pytesseract is required. The above path is where my image is stored.
Please help me at the earliest.
Adrian Rosebrock
You don’t actually need to modify Line 14. You need to supply it via command line argument. I would suggest reading up on command line arguments before continuing. I hope that helps!
Adrian Rosebrock
Another option would be to delete all code related to command line arguments and hard code your paths as separate variables.
Lucian
Hi
Is there a way to see the word which is being processed with a bounding box around ?
To make it simple, the goal I want to achieve is to create a bounding box and, given a word (by me), compare it to the one the OCR found. If the words are the same, delete/make blank the one with bounding box around (in the image).
It’s possible ?
Adrian Rosebrock
Tesseract accepts an input image and displays the output text. Tesseract does not draw any bounding boxes. I would suggest localizing each text region using an algorithm similar to this one and from there processing each bounding box.
Bill Runge
Hi Adrian
I have successfully installed and used tesseract-ocr and would like to experiment with the tessdata to see if I can improve the identification rate for a project that uses a relatively small number of dot matrix characters. In order to help with this I have been trying to install Qt Box Editor which appears to go well until the make command which fails because it appears that some of the tesseract and leptonica library files are not found. The install instructions suggest ensuring that the path to these is correct in the qtboxeditor.pro script, but I am not sure what the path is or where to insert it in the script. I would appreciate any insight you may have or a link to more detailed information. Thus far I have not found any useful help with this.
Thank you for some great blogs
Phil
Hi Adrian, I’m really keen to use this package but I can’t locate it (or tesserocr). I’m using Conda and have added bioconda channels. Are you aware of any availability change? I’ve even tried in 3.6 and a venv on 2.7 but no joy.
Oh wait, tesseract is only linux and tesserocr is only Windows (unusual?!). Maybe I’ll pip install on my venv and see if that works.
Have you used textract at all? This looks like another nifty piece of software but I’ve had deeper package compatibility issues instaling that
Adrian Rosebrock
Hey Phil, I have not used textract before. I haven’t used Windows in many years either so I’m not sure what the exact issue is here. I primarily recommend Linux and macOS for computer vision development. This enables me to replicate errors and provide guidance. I’m sorry I couldn’t be of more help!
Henry Saa
i have a problem when run the scrip
importerror no module named pytesseract
i find in the web but dont have solution please help me with this
Adrian Rosebrock
Hi Henry — please see the section “Installing the Tesseract + Python “bindings”” of this blog post.
imran
even i am facing the same problem
kayde
Hi Adrian,
Im using windows. I tried your code for this. it gave me “FileNotFoundError: [WinError 2] The system cannot find the file specified” error.
Btw i just need to pip install pillow and pytesseract ? what about tesseract-ocr ?
very new at this. appreciate the help. thx! 🙂
Jiri
Hi Kayde,
you can try to download teseract-ocr for windows https://digi.bib.uni-mannheim.de/tesseract/ (I used version 4.0)
+ you need to add PATH into local variables – see https://stackoverflow.com/questions/43262935/tesseract-python-the-system-cannot-find-the-file-specified/43264831
Then you will be able to use this script.
Manoj kumar
Thanks Jiri. Your post helped me in getting my job done. But i think we also need to have tesseract (apart from Tesseract-OCR) installed and also we have to add an extra line in the “ocr.py” file after the imports.
pytesseract.pytesseract.tesseract_cmd = "C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe"
Refer this Reddit post for more info.
https://www.reddit.com/r/learnpython/comments/51k06i/simple_way_to_go_about_ocr/
Paras
I tried your code but it fail my image has a lot of tables it in. Can you suggest how to do it ?
Sahil Aggarwal
got an error help please
usage: ocr.py [-h] -i IMAGE [-p PREPROCESS]
ocr.py: error: argument -i/--image is required
Adrian Rosebrock
Please read the comments and/or doing a ctrl + f and searching for your error before posting. I have addressed this question in my response to “Ankur” on July 14, 2017.
Isha
Hi! I know this is probably a very stupid query but I’m new to python so please bear with me. I’m getting the following error.
error: the following arguments are required: -i/--image
Adrian Rosebrock
Hey Isha — please see my reply to “Ankur” on July 14, 2017.
Ed
Writing to a file on disk is kind of awkward. A better way of doing it is to use temporary files using the python module ‘tempfile’:
import tempfile
filepath = tempfile.NamedTemporaryFile(suffix = ".png", delete = False).name
cv2.imwrite(filepath, img)
text= pytesseract.image_to_string(Image.open(filepath))
The module will automatically delete the temporary file.
Septian Mulyana
On windows environment after install Tesseract doesn’t work
maybe using this library opencv work https://pyimagesearch.com/2017/07/17/credit-card-ocr-with-opencv-and-python/
Raúl
Hello Adrian, thank you very much for your article, I would like to request the creation of a tutorial using tesseract 4 to use LSTM or the extraction of data by zones using uzn files. Thank you very much, your blog is spectacular!
Adrian Rosebrock
Thank you for the suggestion, Raúl!
Vishal
How can we do Document Processing Automation like retrieving data from a form using python ?
santhosh
how to install pil
Nachiket
I got this error when I run the code:
“WindowsError: [Error 2] The system cannot find the file specified”
I didn’t change anything at all…
Siddharth
brew install tesseract on mac works, Find equivalent in windows
Issue: You haven’t installed the main tesseract file: tesseract / tesseract-ocr
pytesseract is just a wrapper on original tesseract.
jyoti
Hi Adrian,
I am looking for a solution for my problem related to pdf to excel.
Problem statement- I have pdf files. These files are of varied size ie from 5-50 pages. I want to extract not all but few tables from the pdf. And write those tables into csv/excel file in the same table format as in pdf. Those tables can be images, tables or scanned pics.
Expectation- the table data from pdf should be written to excel automatically. Looking for any best possible solution. Hopefully python and ocr/tesseract would work.
Or any other solution/API.
Any advice would be really great.
Adrian Rosebrock
Documenting understanding and document OCR is a pretty challenging aspect of computer vision. We have solutions that work in some cases, but not others. I would recommend you first devise a method that can detect the tables that you would like to extract. If that’s not possible you would need to OCR the entire document and then perform text processing to discard data you do not need. Have you tried Tesseract yet? If not, give it a try. From there you might want to try the Google Vision API which includes OCR functionality.
ASH
Instead of using tesseract or vision API if you just want to transfer contents from pdf to csv…you can just use pyPDF or any such package. You are looking at wrong solution.
Ajwad
Hi Adrian, I am having an issue with the line no 34 of the code, It’s saying that file can not be found. I even tried hard-coding it but didn’t work. Need help urgently, Will be really thankful to you.
Kind Regard,
Ajwad
Adrian Rosebrock
Make sure you are using the “Downloads” section of this blog post to download the source code. You may have introduced an error by accident if you were copying and pasting. Line 34 writes an image to disk so if you modified that line you could have specified a path to an input directory that doesn’t actually exist.
Vismaya
This is awesome !! But I really would like to know whether pytessaract or any other ocr techniques can read the values of a table line by line and map the values like in a balance sheet ?? Or if we train a network to perform the task also how do we feed the ground truths/mapping of the table values to one another ?
Adrian Rosebrock
Whether or not Tesseract will work well in this case is really dependent on how cleanly you can segment the text (foreground) from the background. Tesseract works best with clean segmentations. You may want to try the latest Tesseract release which includes LSTM networks. You may also want to look into the Google Vision API.
Akshay
Hi Adrian
I’m trying to search for text in a document image (screenshot of a pdf document) and highlight it. I’m trying to do it in python. Any suggestions or solution would be great.
Adrian Rosebrock
You might want to take a look at the Google Vision API which has fairly robust text detection and recognition.
kim
usage: ocr.py [-h] -i IMAGE [-p PREPROCESS]
ocr.py: error: argument -i/--image is required
what is the error??
Adrian Rosebrock
Hey Kim — make sure you read this blog post on command line arguments to solve your error.
Bobby
Adrian a lot of people appreciate your help, but we also would appreciate a brief answer instead of “hey read this post to answer your question”, because reading that post doesn’t remove the error when we run the code you supplied in this article. If you could just explain to us what needs to be done in order to fix the error, that would be more appreciated. You live and learn I guess.
Adrian Rosebrock
Hey Bobby — I answer 100’s of questions per day her on the PyImageSearch blog. If I answer the question and provide a solution inside an article I previously wrote I will absolutely link to it. If you read the post I linked to it would resolve the error.
Bobby
I guess the point is, we don’t want to work in the command line. Coders did that in the 80s. We now use GUI coding. Cmon man.
Adrian Rosebrock
You don’t need to use the command line if you don’t want to. GUIs and IDEs change. Command lines don’t. I’m not here to dictate how others code but using a GUI is not a reason to not learn how to use the command line.
ASH
He got owned…lol
Reza
Hi Adrian
how about arabic characters?
Adrian Rosebrock
You’ll want to take a look at Tesseract’s language packs.
harshini tummala
hello….
where do execute this code?
in pycharm ?
or anconda?
Adrian Rosebrock
I would suggest executing the code via your terminal so you can apply any command line arguments.
SKR
Adrian, thanks for this tutorial. Can you tell what approaches or techniques one should follow if text detection and/or recognition is required in an engineering drawing image where clean text are generally written as numerals, letters, combination of numerals and letters, often encircled or slanted or written with/between arrows showing dimensions? I think before applying OCR I need to do some heavy pre-processing but I am not able to figure out which ones. I would have left a snapshot of an example image had there been a provision to upload one. Thus, explaining the image in words;-) Thanks a lot for your time and guidance.
Adrian Rosebrock
What do you mean by “clean text”? Could you upload the image to Imgur and then link to it?
SKR
Hi Adrian, please see the technical drawing at the link on Imgur https://imgur.com/a/hx0uJjf
What I mean by “clean text” is that generally in technical drawings the text is very legible and clear and NOT like in handwritten text or calligraphic text or fancy text with font variation.In brief, I want to pre-process this image in order to recognize the multiple shapes involved, with their dimensions indicated by arrows and extracted text. What approach should I take to segment this image and create a table of shapes in this image with their dimension and position in the image? Can you point me to some resources or existing work? Thanks for your advice.
Adrian Rosebrock
You might want to start with multi-scale template matching or training a custom object detector, such as HOG + Linear SVM to detect the shapes in the image. From there you could pass the text itself into an OCR library such as Tesseract or the Google Vision API.
For what it’s worth, I cover how to train your own object detectors inside the PyImageSearch Gurus course and Deep Learning for Computer Vision with Python.
SKR
Many thanks Adrian, I will follow your advice and read about these things. Let me check your courses for understanding these training of object detectors.
Karthik K
Hi,
I am stuck at running the code in a loop. Basically I have a folder containing multiple files. How can I modify the code to sequentially read the files in a folder, perform ocr as given in this tutorial and write the output back to a file in the folder?
Please help!!
Adrian Rosebrock
Take a look at the imutils library which includes a method to loop over files in a directory. This blog post contains an example of looping over images in a file.
Apurva
Hey i need to OCR the hindi text from an image. I downloaded the hin.traineddata file and pasted it in tessdata folder . I am not able figure it out how to use tesseract for the extraction of hindi data from the image. Please help!!!
Charlotte
Hi Adrian,
Thank you for your script, it is very helpful. Do you think it is possible to do the same thing with a PDF file ? I would like to get the text of a PDF using OCR.
Thank you for your reply,
Charlotte
Adrian Rosebrock
OpenCV doesn’t natively load PDFs but you could convert the PDF to an image first and then run this code. Or you could use Tesseract directly.
al.krinker
You need to convert your pdf into image, something that can be used by tesseract. The easiest option in my opinion is to install ImageMagick
yum install ImageMagick
and then convert your pdf
convert -density 300 file.pdf -depth 8 file.tiff
ashok
Hello,
I have a screenshot of one image. in that some text is in french language.
I want to write a code for extract text from that screen shot’s image.
please help.
Arijit
Hi Adrian,
This code cannot recognize small fonts in image, instead it returns some unwanted charecters. Plz let me know the api how to handle small fonts
Adrian Rosebrock
The Tesseract API itself may not be able to handle small fonts. It’s also hard to know what “small” means in this context without seeing an example image.
jer
Hello ! Thank you very much for this code !
But I have problem ! When I run here’s what I have :
usage: image_to_read.py [-h] -i IMG_2.PNG [-p PREPROCESS]
image_to_read.py: error: the following arguments are required: -i/--IMG_2.png
Can you help me please ?
Thank you !
Adrian Rosebrock
You’re not supplying the command line arguments to your script. Reading this post will solve your error.
Sanda
hi,
I want to know how I can check the text orientation of the image. Bcz words can be located upside down. I am implementing a reading system for the blind.
I tried it with template matching, but font size can be varied. So that method did not succeed.
Can you help me, please?
Thank you in Advance
Adrian Rosebrock
Recognizing text orientation can be a challenging problem. I’ll be addressing it in this coming Monday’s “text detection” blog post. Be sure to keep an eye out for it.
Adrian Rosebrock
You might want to try detecting each line of text individually, extracting the ROI, and then passing that ROI through Tesseract. As far as very old newspaper articles, the images may simply be too noisy for Tesseract. It’s hard to say without seeing examples of the images.
Sanda
Hi
can I know how to identify text orientation of the words? I am implementing a text reading system for the blind as my university project, How can I know wheater words are upside or not
can u plz give me a solution.
Ahmed
how good is this for production ?
Adrian Rosebrock
Try it and see!
Ahmed
I tried this and the results are very bad on real-time data, I think deep learning has to be implemented ! Any suggestions how to do it ?
Adrian Rosebrock
I’ll be covering a more accurate deep learning-based OCR method that uses Tesseract in a future blog post. Make sure you sign up for the PyImageSearch newsletter to be notified when the post is published.
Aveshin Naidoo
Hi, do you have any suggestions on using OCR on a live video feed for regular printed documents using Tesseract and possibly OpenCV via a Pi 3.
Thank you kindly.
Albert
Hi,
I didn’t really understand where or how to put the path to the image in the code. I tried several places but it came back with error messages. Could you give an example, please?
Adrian Rosebrock
The image path is actually supplied via command line argument. This tutorial will help you resolve any confusion.
Adam
I wonder how did you learn all this stuff. Even I just read confusing me.
Adrian Rosebrock
I studied computer vision in college and I did my PhD in computer vision and machine learning. I share what I’ve learned on this blog. If you’re brand new to computer vision and OpenCV I would recommend you read through my introductory book, Practical Python and OpenCV. The book is a gentle guide to getting started with computer vision. Give it a read, I think it will really help you.
Widhera
i got error in pytesseract.pytesseract.TesseractNotFoundError: tesseract is not installed or it’s not in your path. can u help me? i already installed pytesseract with pip
Adrian Rosebrock
Can you confirm that Tesseract is properly installed via “pip freeze”? It sounds like you may not have installed the pytesseract library correctly.
Chetan Patil
You need to install the binary files from the Github repo. Only then, the above error won’t be thrown. Because that’s how mine worked.
Chetan Patil
Hey Adrian. I have successfully installed the binary files as well as the pytesseract library. However, the image_to_string method always returns an empty string no matter what !.
Is there any way to solve this problem ?
Adrian Rosebrock
That is odd. Are you using the example image used here in this blog post? Or your own images?
Rob
Hello Adrian,
how do i Train Tesseract with handwritten Text? Did you try to Train Tesseract and do you have some advices for me?
Adrian Rosebrock
Tesseract really isn’t meant for handwritten text. To be honest I would suggest using something like Google’s Vision API which includes an OCR component. In my opinion it’s one of the best OCR engines available right now.
Mohammad Asad Khan
Hi Adrian,
I am new to this domain of programming, I’ve tried your above program in PyCharm and I am getting an error:
ocr.py: error: the following arguments are required: -i/--image
My Python version is 3.6.6 and I’ve already configure all of the listed packages. Please give me appropriate answer. Thanks.
Adrian Rosebrock
You’re not supplying the command line arguments to the script. Either execute the script via the command line or set the command line arguments via PyCharm. You can learn more about command line arguments here.
Srinath
Hi adrian,
I am working on a project which requires to detect number plate of a car. I am using tesseract OCR but the accuracy is very less. What would you suggest to do.(change OCR engine or train a neural net etc.)
1.The number plate will be cropped from a surveillance video so it is very difficult to get exact boundaries etc.
2.It’s size is very less.
So, I have tried
1.Increasing its DPI
2.Erosion, dilation
3.Thresholding,blurring
Adrian Rosebrock
Be sure to take a look at the PyImageSearch Gurus course where I have an entire set of lessons dedicated to building your own Automatic License/Number Plate Recognition system.
Shivani Modi
Hey Adrian,
I have a use case in which I need to extract url links present in the pdf. Can it be done using tesseract OCR. As of now the accuracy is not good enough as the underlying line cuts alphabets like p,f,g and the ocr detects them as some character resembling to the upper part of the letter. What would you suggest to do?
Adrian Rosebrock
You may want to try my other OCR guide which uses Tesseract v4 (which tends to be more accurate in some situations).
Albert
hello Adrian,
i just used Tesseract v4 with Image. In some cases it has better result than cv.
here is the code:
from PIL import Image
import pytesseract
print(pytesseract.image_to_string(Image.open('cap.jpg')))
Adrian Rosebrock
For other readers interested in Tesseract v4 refer to this post.
Jerome Diongon
I got error.
usage: ocr.py [-h] -i IMAGE [-p PREPROCESS]
ocr.py: error: the following arguments are required: -i/--image
Thats the error PLEASE HELP.
TIA.
Adrian Rosebrock
It’s okay if you are new to command line arguments but you need to educate yourself on them first. From there you will not have any errors.
paulz
Hello Adrian:
Your course is very helpful. I use python to do some work. But unfamiliar with tesseract. So I have some questions during learning.
Now I want to use tensseract to transfer some snapshots of log files produceted on linux (Eg: a result showed in vim)
1、In you articls. You said default config is enough (without pre-processing) to recognize extremely clean segmentations. Of course, snapshots of log files is definitly clean. But the final results is still have some errors apparently. And can’t recognized special characters like “*” in shapshots. And many blank line existed in final results.How should I improve results?
2、I know the DPI of a pic will decide final results. How about fonts? and front size?
Adrian Rosebrock
I’m glad you’re enjoying the course! As for your questions, try using the more powerful Tesseract v4 and see if your results improve.
irfan
hello adrian,
i have installed pillow, pytesseract by following your tutorials, but when i try to execute a script it says “no module pytesseract” what to do?
Adrian Rosebrock
Are you using Python virtual environments? If so, make sure you access your virtual environment before executing the code. You should also double-check that pytesseract was installed properly.
irfan
yes. when i execute by terminal it works but when i try to execute through python script using import then i am getting the error “no module pytesseract”
Adrian Rosebrock
You’re able to execute the script via terminal it works, correct? So what do you mean by “execute it through Python script using import”? For the terminal are you just opening a Python shell and importing?
irfan
by calling “tesseract image.jpg output” it works but when i try to execute a python script for tesseract it gives error as “no module pytesseract”
mohammed irfan
hello adrian
it works when i execute through terminal but when i try to execute through script using import pytesseract only then i get the error as “no module pytesseract”.
Adrian Rosebrock
Make sure you have properly installed “pytesseract” using the instructions I’ve covered in this tutorial (i.e., the “Installing the Tesseract + Python “bindings” section of the post).
Hanan Temam
Hello Adrian ,
I am new to tesseract and python thing.your course is very helpful and it works fine for me till now,and I want to learn more and do more on on my language so what I want to ask is
1. is it possible for me to develop a mob app or desktop app with this ?
2. can I include DPI into my code ?
3. how can I save the output text to the separate folder as aligned to the image?
Adrian Rosebrock
1. Python + OpenCV really aren’t meant to be used as mobile or desktop apps. There are too many library dependencies. You may want to try coding a C++ app for desktop or use a native language for the mobile device.
2. I assume so but you should refer to the Tesseract documents
3. Sorry, I’m not sure what you mean.
irfan
hello adrian
sorry for disturbing by asking again and again.
can i get step by step process for installing tesseract.
do i need to put any path for tesseract.
what are the minimum requirements for installing tesseract.
please help me
Adrian Rosebrock
Refer to this tutorial to install Tesseract on your system.
Daniel S.
Hello Adrian,
What is the best way to print out the x,y coordinates of words that have been detected from an image?
Adrian Rosebrock
Refer to this tutorial where we use the EAST text detector to detect the location of text in an image and then OCR it.
Sajeesh Namboothiri
Hi,
I have one question, I am able to get the text from some images but my requirement is, I have to read all the text from invoices. Could you help me for the same
Adrian Rosebrock
Sorry, I do not have any tutorials for invoice/document registration and understanding. I may consider it as a future tutorial though.
grish
hi adrian !
thanks for your time
i tries your code in a script but the window only opens and closes immediately.
i am new in this and maybe i didn’t notice something. i just copy/paste your code
i typed ocr.py –image sample.jpg
Adrian Rosebrock
Don’t copy and paste the code. Use the “Downloads” section of the tutorial to download the code (you likely accidentally introduced an error during the copy and paste).
grish
thanks for the reply.
when i launch ocr.py with the promp, i got this error : the following argument are required -i/ –image.
is the image supposed to be in the same folder than the ocr.py or not ?
Adrian Rosebrock
Make sure you read this tutorial on argparse first.
Jaydeep Dholakia
For all getting the error of “TesseractNotFoundError: tesseract is not installed or it’s not in your path”
Here is a solution to it:
Download the exe of it and then install it. After that, go to the environment variables and add the installed directory to the PATH variable.
I have explained it in more depth in the answer here: https://stackoverflow.com/a/56559289/10523024
If it helps then do upvote my answer on StackOverflow!
Adrian Rosebrock
I assume this is for Windows only, correct?
Naveen
Hi,
Please tell me how to detect the decimal reading.
Jasmeet Singh
Hi, thank you for the great tutorial!!
I am currently working on a project and want to detect the letter drawn by a user in white on a black screen. But whenever I pass a sample image with the letter ‘C’ drawn on it to tesseract binary, It always returns empty page and a warning which says: Warning. Invalid resolution 0 dpi. Using 70 instead. I have tried increasing the dpi of the image with the help of an online and also tried thresholding the image but still I am getting the same result. Please tell me what maybe the cause of the problem. Are their specific properties like size and dpi defined to be used in the input image?
Thank You
Adrian Rosebrock
Hey Jasmeet — I would actually refer you to Practical Python and OpenCV where I cover handwriting recognition.
Rafal Kosinski
Hi Adrian.
I really appreciate your knowledge and motivation to share it in the posts. These are helping me to enter the CV world and definitely I will study your books soon.
Regarding Tesseract, default language is english (‘eng’). To use an other language one needs to copy relevant data (eg. ‘pol.traineddata’ for polish) to a certain location. Then use:
text = pytesseract.image_to_string(Image.open(filename), lang="pol").
I am wondering how to use Tesseract (pytesseract) on text image with multiple languages? For example a foreign language lessons book contains instructions in the native language and examples in the foreign one. Or a literature text that contains quotes in a foreign language.
Neha
Hi Rafal,
Did you get how to use Tesseract (pytesseract) on text image with multiple languages?
If you got it kindly share it here.
Thank you in advance.
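For reference, Tesseract itself can combine language packs with a + separator, and pytesseract passes this through via the lang argument. A minimal sketch, assuming both eng.traineddata and pol.traineddata are installed (the image path is a placeholder):

# minimal sketch: OCR a mixed English/Polish page by combining language packs
import pytesseract
from PIL import Image

print(pytesseract.image_to_string(Image.open("lesson.png"), lang="eng+pol"))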
Ani
Hi Adrian,
I have 1 question for you.
What if the quality of image is very poor.(image which contains text).
But tesseract may not give 100% accurate result. So how can we extract the text from an image which is having poor quality
Adrian Rosebrock
In general, you can’t. You should always strive to work with higher quality images.
Ajay
Hi Adrian,
I have got a task to recognize the serial number on a note.
Can we use Tesseract for this.
Adrian Rosebrock
Potentially. Give it a try and see (it’s hard to now without seeing examples of your image files).
Ctibor
Dear Adrian.
Wouldn’t it be better to use cv2.text_OCRTesseract directly? Or some other function directly from opencv that is part of cv2 in python : https://docs.opencv.org/3.4/d1/d66/classcv_1_1text_1_1TextDetectorCNN.html
Adrian Rosebrock
I haven’t tried OpenCV’s OCR + Tesseract integration before so I cannot say.