pytesseract image_to_string not pulling strings, but no Error - python-3.x

I am using the image_to_string function in the pytesseract package to convert multiple parts of a single picture file to string. All parts are working except for this image:
Here is the script that I am using to convert it:
from PIL import Image
import pytesseract
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
im = Image.open('image.png')
text = pytesseract.image_to_string(im)
print(text)
Which gives the output:
—\—\—\N—\—\—\—\—\N
I have tried breaking up the image into smaller parts as well as processing the image as a jpg and as png. What can I do to have it output the values in the image?

Using a different page segmentation instead of the default one seems to work.
text = pytesseract.image_to_string(im, config='--psm 6')
According to the tesseract wiki, option 6 assumes a single uniform block of text. I tried with other options but only this one worked.
For the other page segmentation modes, see the Tesseract wiki page on improving the quality of the output.
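As a rough sketch of the approach above, the helper below tries several page segmentation modes on the same image and collects what each one returns, so the results can be compared side by side. The names `psm_config` and `try_psm_modes` and the default list of modes are illustrative assumptions, not part of pytesseract; only `image_to_string` and its `config` argument come from the library.

```python
def psm_config(mode):
    """Build the Tesseract config string for one page segmentation mode."""
    if not 0 <= mode <= 13:
        raise ValueError("Tesseract page segmentation modes range from 0 to 13")
    return f"--psm {mode}"


def try_psm_modes(image, modes=(3, 4, 6, 7, 11)):
    """Run OCR once per mode and return {mode: recognized text}."""
    # Imported here so psm_config stays usable without Tesseract installed.
    import pytesseract

    results = {}
    for mode in modes:
        text = pytesseract.image_to_string(image, config=psm_config(mode))
        results[mode] = text.strip()
    return results
```

Calling `try_psm_modes(Image.open('image.png'))` and printing the returned dictionary should make it easy to see which mode, if any, reads the values correctly.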

Related

Using PDF file in Keras OCR or PyTesseract - Python, is it possible?

I am using Keras OCR and PyTesseract and was wondering if it is possible to use PDF files as the image input.
If not, does anyone have a suggestion as to how to convert a very massive PDF file into PNG or another acceptable format?
Thank you!
No, as far as I know PyTesseract works only with images. You'll need to convert your pdf to images first.
By "very massive PDF" I assume you mean a PDF with many pages. That is not a problem: you can use the pdf2image library. Its convert_from_path function has an output_folder argument that specifies the folder where the generated images are saved. The documentation describes it as:
Output directory for the generated files, should be seen more as a
“working directory” than an output folder. The converted images will
be written there to save system memory.
You can later use them one by one instead of your pdf to work with PyTesseract. If you don't assign the returned list of images from convert_from_path you don't risk filling up your memory.
Otherwise, if you are willing to keep everything in memory you can use the returned pages directly, like so:
pages = convert_from_path(pdf_path)
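For a PDF with many pages, one way to keep memory bounded is to convert and OCR a few pages at a time. The sketch below relies on pdf2image's `first_page` and `last_page` arguments and on `pdfinfo_from_path` for the page count; the helper names, the chunk size, and the DPI are illustrative choices rather than anything prescribed by the library.

```python
def page_ranges(total_pages, chunk_size):
    """Yield 1-based, inclusive (first_page, last_page) pairs covering the PDF."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for first in range(1, total_pages + 1, chunk_size):
        yield first, min(first + chunk_size - 1, total_pages)


def ocr_pdf(pdf_path, chunk_size=10, dpi=300):
    """Return the OCR text of each page, converting chunk_size pages at a time."""
    # Imported here so page_ranges can be used without poppler or Tesseract.
    import pytesseract
    from pdf2image import convert_from_path, pdfinfo_from_path

    total_pages = pdfinfo_from_path(pdf_path)["Pages"]
    texts = []
    for first, last in page_ranges(total_pages, chunk_size):
        # Only this chunk's pages are held in memory as PIL images.
        pages = convert_from_path(pdf_path, dpi=dpi, first_page=first, last_page=last)
        texts.extend(pytesseract.image_to_string(page) for page in pages)
    return texts
```

A smaller `chunk_size` lowers peak memory at the cost of more calls to poppler.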
For example, here is my code:
Python: 3.9
macOS: Big Sur
import cv2
import pytesseract
from PIL import Image
from fonctions_images import *  # the author's own helper module (presumably provides del_lines)
from pdf2image import convert_from_path
path = '/Users/yves/documents_1/'
fichier = path + 'TOUTOU.pdf'
pytesseract.pytesseract.tesseract_cmd = "/usr/local/bin/tesseract"
# Render every page at 500 dpi in grayscale
images = convert_from_path(fichier, 500, transparent=True, grayscale=True, poppler_path='/usr/local/Cellar/poppler/21.12.0/bin')
for image in images:
    image.save(path + "image.png", format="png")
    img = cv2.imread(path + "image.png")  # load the page image into memory
    img = del_lines(path, img)  # remove the ruled lines
    img = cv2.imread(path + "img_final_bin_1.png")
    # custom_config is not defined in this snippet; presumably it comes from fonctions_images
    d = pytesseract.image_to_data(img[3820:4050, 2340:4000], lang='fra', config=custom_config, output_type='data.frame')

Text recognition of an image with Tesseract

I would like to create a pdf file with text recognition from a scanned image.
But I don't want the original image in the PDF file, just plain text. The text should be visible so that it can be read, but the font doesn't matter that much.
This Tesseract command does almost what I want, but the text is invisible.
tesseract -c textonly_pdf=1 test.tif test pdf
How can I make the text visible?
Can I create a pdf file with another command-line or python tool?
I'm running Tesseract in Ubuntu.
Here is a snippet of code from a script I wrote in Python (on Windows) a year ago. It extracts the text into a data frame, which you can then save to CSV or other formats.
import cv2
import pytesseract as pya
pya.pytesseract.tesseract_cmd = r'D:\Programs\Tesseract_OCR\tesseract.exe'
from pytesseract import Output
imgcv = cv2.imread('foo.jpg')
# in text_df you have the extracted text, confidence and so on
text_df = pya.image_to_data(imgcv , output_type='data.frame')
text_df = text_df[text_df.conf != -1]
text_df = text_df[text_df.conf > 50]
conf = text_df['conf'].mean()
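Since the goal is plain, readable text, the word-level output of `image_to_data` can be folded back into lines. The following is a minimal sketch under the assumption that the data is requested as a dictionary (`output_type=Output.DICT`) rather than as a data frame; `words_to_lines` is a made-up helper, and its threshold of 50 simply mirrors the confidence filter above.

```python
def words_to_lines(data, min_conf=50):
    """Rebuild text lines from image_to_data(..., output_type=Output.DICT)."""
    lines = {}
    for i, word in enumerate(data["text"]):
        # conf is -1 for non-word rows; some versions return it as a string.
        if float(data["conf"][i]) <= min_conf or not str(word).strip():
            continue
        key = (
            data["page_num"][i],
            data["block_num"][i],
            data["par_num"][i],
            data["line_num"][i],
        )
        lines.setdefault(key, []).append(str(word))
    # Sorting the keys restores reading order: page, block, paragraph, line.
    return [" ".join(words) for _, words in sorted(lines.items())]
```

The returned list can be joined with newlines and written to a text file, or handed to whatever tool produces the final document.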

Tesseract-OCR is unable to find text in SOME images

I have been using Tesseract to extract text from images. For some images, though, it fails to extract any text even when the image is clean, and I cannot find the reason. Is there something I am missing in preprocessing?
from pytesseract import image_to_string
text = image_to_string('/path/to/image')
result: " "
This code works fine with other images, but not with this one.
original image:

Image type Python: loaded a jpg, showing a png

I have been playing around with images in Python, just trying to understand how things work basically. I have noticed something odd and was wondering if anyone else could explain it.
I have an image 'duck.jpg' -
If I look at the properties I can see that it is a jpg image.
However, after importing it into Python in the following convoluted way:
from PIL import Image
import io
with open('duck.jpg', 'rb') as f:
    im = Image.open(io.BytesIO(f.read()))
    f.close()
I get the following output after calling
im.format
'PNG'
Is there some sort of automatic conversion going on?
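One way to check what is going on is to look at the file's first bytes. Pillow identifies the format from the file contents, not from the extension, so a PNG saved under a `.jpg` name still reports `'PNG'`. The sketch below is a hypothetical helper (`sniff_format` and `sniff_file` are not Pillow functions) that recognizes only the PNG and JPEG signatures.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
JPEG_SIGNATURE = b"\xff\xd8\xff"


def sniff_format(header):
    """Guess the image format from the leading bytes of a file."""
    if header.startswith(PNG_SIGNATURE):
        return "PNG"
    if header.startswith(JPEG_SIGNATURE):
        return "JPEG"
    return None


def sniff_file(path):
    """Read just enough of the file to check its signature."""
    with open(path, "rb") as f:
        return sniff_format(f.read(8))
```

If `sniff_file('duck.jpg')` returns `'PNG'`, the file was a PNG all along and only its name says otherwise.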

How to overlay images on each other in python and opencv?

I am trying to write images over each other. Ideally, what I want to do is to write every image in one folder over every image in another folder and output every unique image to another folder. So far, I am just working on having one image write over one image, but I can't seem to get that to work.
import numpy as np
import cv2
import matplotlib
def opencv_createsamples():
    mask = ('resized_pos/2')
    img = cv2.imread('neg/1')
    new_img = img * (mask.astype(img.dtype))
    cv2.imwrite('samp', new_img)

opencv_createsamples()
It would be helpful to have more information about your errors.
Something that stands out immediately is the lack of file type extensions. Your images are probably not being read correctly, to begin with. Also, image sizes would be a good thing to consider so you could resize as required.
If the goal is to blend images, it is important to consider the alpha channel. Here is a relevant question on Stack Overflow: How to overlay images in python
Some other OpenCV docs that have helped me in the past: https://docs.opencv.org/trunk/d0/d86/tutorial_py_image_arithmetics.html
https://docs.opencv.org/3.1.0/d5/dc4/tutorial_adding_images.html
Hope this helps!
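Putting the points above together, here is a minimal sketch of the folder-by-folder overlay described in the question. It assumes the files carry real extensions, that a plain weighted blend with `cv2.addWeighted` is acceptable (no alpha channel handling), and that the foreground can simply be resized to the background's size. The function names and the output naming scheme are made up for illustration.

```python
import itertools
import os


def overlay_jobs(fg_names, bg_names, out_dir):
    """Pair every foreground file with every background file and name the output."""
    jobs = []
    for fg, bg in itertools.product(sorted(fg_names), sorted(bg_names)):
        fg_stem = os.path.splitext(os.path.basename(fg))[0]
        bg_stem = os.path.splitext(os.path.basename(bg))[0]
        jobs.append((fg, bg, os.path.join(out_dir, f"{fg_stem}_on_{bg_stem}.png")))
    return jobs


def blend_files(fg_path, bg_path, out_path, alpha=0.5):
    """Blend one foreground over one background and write the result."""
    import cv2  # imported here so overlay_jobs works without OpenCV

    fg = cv2.imread(fg_path)
    bg = cv2.imread(bg_path)
    if fg is None or bg is None:
        raise FileNotFoundError("cv2.imread returned None; check paths and extensions")
    # addWeighted needs both images to have the same size and channel count.
    fg = cv2.resize(fg, (bg.shape[1], bg.shape[0]))
    cv2.imwrite(out_path, cv2.addWeighted(fg, alpha, bg, 1.0 - alpha, 0))
```

Looping over `overlay_jobs(...)` and passing each triple to `blend_files` produces one output image per foreground and background pair.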
