Tesseract-OCR is unable to find text in SOME images - python-3.x

I have been using Tesseract to extract text from images. For some images, however, it fails to extract any text even though the image is clean, and I cannot find the reason. Is there anything I am missing in preprocessing?
from pytesseract import image_to_string
text = image_to_string('/path/to/image')
result: " "
This code works fine with other images, but not with this one.
original image:

Related

Using PDF file in Keras OCR or PyTesseract - Python, is it possible?

I am using Keras OCR and PyTesseract and was wondering if it is possible to use PDF files as the image input.
If not, does anyone have a suggestion as to how to convert a very massive PDF file into PNG or another acceptable format?
Thank you!
No, as far as I know PyTesseract works only with images. You'll need to convert your PDF to images first.
By "very massive PDF" I'm assuming you mean a PDF with lots of pages. This is not an issue. You can use the pdf2image library (see its documentation). The function convert_from_path has an output_folder argument that lets you specify the folder where all the generated images will be saved. The documentation describes it as follows:
Output directory for the generated files, should be seen more as a
“working directory” than an output folder. The converted images will
be written there to save system memory.
You can later use them one by one instead of your pdf to work with PyTesseract. If you don't assign the returned list of images from convert_from_path you don't risk filling up your memory.
Otherwise, if you are willing to keep everything in memory you can use the returned pages directly, like so:
pages = convert_from_path(pdf_path)
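For a PDF with many pages, one way to keep memory use bounded is to convert only a few pages at a time with the first_page and last_page arguments of convert_from_path. The following is a minimal sketch, assuming pdf2image (with poppler) and pytesseract are installed; the helper names page_chunks and ocr_pdf and the default chunk size are illustrative and not part of either library.

```python
def page_chunks(total_pages, chunk_size):
    """Yield (first_page, last_page) pairs, 1-indexed and inclusive."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for first in range(1, total_pages + 1, chunk_size):
        yield first, min(first + chunk_size - 1, total_pages)


def ocr_pdf(pdf_path, chunk_size=10, dpi=200):
    """Yield the OCR text of each page, converting chunk_size pages at a time."""
    # Third-party imports are local so page_chunks can be used without them.
    from pdf2image import convert_from_path, pdfinfo_from_path
    import pytesseract

    total_pages = pdfinfo_from_path(pdf_path)["Pages"]
    for first, last in page_chunks(total_pages, chunk_size):
        # Only the pages of the current chunk are held in memory.
        pages = convert_from_path(
            pdf_path, dpi=dpi, first_page=first, last_page=last
        )
        for page in pages:
            yield pytesseract.image_to_string(page)
```

Because ocr_pdf is a generator, the text can be consumed page by page, for example with `for text in ocr_pdf("/path/to/file.pdf"): ...`.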
For example, here is my code (Python 3.9, macOS Big Sur):
import cv2
import pytesseract
from PIL import Image
from pdf2image import convert_from_path
from fonctions_images import *  # the author's own helper module; provides del_lines

path = '/Users/yves/documents_1/'
fichier = path + 'TOUTOU.pdf'
pytesseract.pytesseract.tesseract_cmd = "/usr/local/bin/tesseract"
# custom_config is not defined in this snippet; it must be set elsewhere (it holds the Tesseract options)
images = convert_from_path(fichier, 500, transparent=True, grayscale=True, poppler_path='/usr/local/Cellar/poppler/21.12.0/bin')
for image in images:
    image.save(path + "image.png", format="png")
    img = cv2.imread(path + "image.png")  # load the page image into memory
    img = del_lines(path, img)  # remove the lines
    img = cv2.imread(path + "img_final_bin_1.png")  # presumably the file written by del_lines
    # OCR a cropped region (rows 3820-4050, columns 2340-4000) in French
    d = pytesseract.image_to_data(img[3820:4050, 2340:4000], lang='fra', config=custom_config, output_type='data.frame')

Convert images to Icons is giving errors

I'm converting images to icons using this code:
from PIL import Image
img = Image.open("imagepath.png")
img.save("iconpath.ico")
This is giving me an icon file as desired, but when I try to open it an error pops up:
Paint:
Microsoft photos error:
When I try to open other icons with the same programs they work perfectly, but it doesn't with the ones I made. Does anyone know any other way or library for doing this?
Try this:
img.save('iconpath.ico',format = 'ICO', sizes=[(32,32)])
You can change the size to (16, 16).
The first time I converted an image with PIL, I used this tutorial:
Tutorial
Everything worked fine.
The image being converted must have a 1:1 aspect ratio; otherwise, opening the generated icon will cause errors.
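If the source image is not square, one option is to center it on a transparent square canvas before saving. This is a minimal sketch, assuming Pillow is installed; the helper names square_box and save_square_icon and the default list of sizes are illustrative choices, not something the answers above prescribe.

```python
def square_box(width, height):
    """Return (side, x_offset, y_offset) for centering an image on a square canvas."""
    side = max(width, height)
    return side, (side - width) // 2, (side - height) // 2


def save_square_icon(src_path, ico_path, sizes=((16, 16), (32, 32), (48, 48))):
    """Pad the image to a 1:1 ratio with transparency, then save it as an ICO file."""
    from PIL import Image  # imported here so square_box has no dependencies

    img = Image.open(src_path).convert("RGBA")
    side, x_off, y_off = square_box(*img.size)
    canvas = Image.new("RGBA", (side, side), (0, 0, 0, 0))  # fully transparent
    canvas.paste(img, (x_off, y_off))
    canvas.save(ico_path, format="ICO", sizes=list(sizes))
```

Padding keeps the whole picture visible; cropping to a square would be the alternative if losing the edges is acceptable.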

Image type Python: loaded a jpg, showing a png

I have been playing around with images in Python, just trying to understand how things work basically. I have noticed something odd and was wondering if anyone else could explain it.
I have an image 'duck.jpg' -
If I look at the properties I can see that it is a jpg image.
However, after importing it into Python in the following convoluted way:
from PIL import Image
import io
with open('duck.jpg', 'rb') as f:
    im = Image.open(io.BytesIO(f.read()))
I get the following output after calling
im.format
'PNG'
Is there some sort of automatic conversion going on?
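Pillow identifies the format from the file's contents rather than from its name, so a result of 'PNG' suggests that the file named duck.jpg actually contains PNG data; no conversion happens on load. One way to check this independently is to look at the file's first bytes. The sketch below uses only the standard library; the function names sniff_format and sniff_file are illustrative, and only the PNG and JPEG signatures are handled.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file
JPEG_SIGNATURE = b"\xff\xd8\xff"      # start-of-image marker of a JPEG file


def sniff_format(data):
    """Guess the image format from the leading bytes, or return None if unknown."""
    if data.startswith(PNG_SIGNATURE):
        return "PNG"
    if data.startswith(JPEG_SIGNATURE):
        return "JPEG"
    return None


def sniff_file(path):
    """Read just enough of the file to identify a PNG or JPEG signature."""
    with open(path, "rb") as f:
        return sniff_format(f.read(8))
```

Calling `sniff_file('duck.jpg')` and comparing the result with `im.format` shows whether the extension and the actual contents disagree.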

pytesseract image_to_string not pulling strings, but no Error

I am using the image_to_string function in the pytesseract package to convert multiple parts of a single picture file to string. All parts are working except for this image:
Here is the script that I am using to convert it:
from PIL import Image
import pytesseract
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
im = Image.open('image.png')
text = pytesseract.image_to_string(im)
print(text)
Which gives the output:
—\—\—\N—\—\—\—\—\N
I have tried breaking up the image into smaller parts as well as processing the image as a jpg and as png. What can I do to have it output the values in the image?
Using a different page segmentation instead of the default one seems to work.
text = pytesseract.image_to_string(im, config='--psm 6')
According to the tesseract wiki, option 6 assumes a single uniform block of text. I tried with other options but only this one worked.
To check for other page segmentation methods read the tesseract wiki on how to improve quality of an image.
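Since the right page segmentation mode depends on the image, it can help to try several modes in turn. The following is a rough sketch, assuming pytesseract and the Tesseract binary are installed; the helper name try_psm_modes, the default list of modes, and the "first non-blank result wins" rule are illustrative assumptions.

```python
def try_psm_modes(image, modes=(6, 3, 4, 11), ocr=None):
    """Return (mode, text) for the first mode giving non-blank text, else (None, "")."""
    if ocr is None:
        import pytesseract  # needs the Tesseract binary to be installed
        ocr = pytesseract.image_to_string
    for mode in modes:
        text = ocr(image, config=f"--psm {mode}")
        if text.strip():
            return mode, text
    return None, ""
```

Note that garbage output still counts as non-blank here, so the returned text should be inspected before it is trusted.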

Getting error while running a classification code in keras

When I run the code from the following link:
https://gist.github.com/fchollet/f35fbc80e066a49d65f1688a7e99f069#file-classifier_from_little_data_script_2-py
I get the following error:
Using TensorFlow backend.
Found 2000 images belonging to 2 classes.
/home/nd/anaconda3/lib/python3.6/site-packages/PIL/TiffImagePlugin.py:692: UserWarning: Possibly corrupt EXIF data. Expecting to read 80000 bytes but only got 0. Skipping tag 64640
  "Skipping tag %s" % (size, len(data), tag))
I am using Ubuntu.
Tried solution: changing 'w' to 'wb' on lines 70 and 81.
Thanks in advance.
This is because some of the images have corrupted EXIF info. You can remove the EXIF info from all your images to get rid of this warning.
The Python package piexif can help. You can use the following code to remove the EXIF info from an image:
import piexif
# suppose im_path is a valid image path
piexif.remove(im_path)
You can find more discussion here.
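To apply this to a whole dataset, the call can be wrapped in a loop over the image folder. This is a minimal sketch, assuming piexif is installed and that the images are JPEG files under one root directory; the helper names find_jpegs and strip_exif are illustrative. piexif.remove rewrites each file in place, so working on a copy of the data is advisable.

```python
from pathlib import Path

JPEG_SUFFIXES = {".jpg", ".jpeg"}


def find_jpegs(root):
    """Return all JPEG files below root, matched by extension (case-insensitive)."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in JPEG_SUFFIXES
    )


def strip_exif(root):
    """Remove EXIF data in place from every JPEG below root; return the cleaned paths."""
    import piexif  # imported here so find_jpegs works without piexif

    cleaned = []
    for path in find_jpegs(root):
        try:
            piexif.remove(str(path))
            cleaned.append(path)
        except Exception as exc:  # e.g. a file that is not a valid JPEG
            print(f"skipped {path}: {exc}")
    return cleaned
```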
The error seems to imply that you are trying to use TIFF images (rather than JPEGs) and that the PIL library can't import these without an error (Possibly corrupt EXIF data).
I suggest you try some test JPEGs to make sure your images can be imported correctly.
