OpenCV Node.js: prepare an image for OCR with Tesseract.js, remove dots

I'm trying to read data with Tesseract from an image captured by a webcam. Here is an example of the image used:
I'm working on a Node.js server, and I tried a lot of techniques in Jimp, including inverting/grayscaling, sharpening the image, and filtering specific colors (yellow/blue). After all that, I built a separate Docker container using opencv4nodejs and applied a few techniques to extract text from the image.
I mostly need the big text (the small text is not necessary, and it is not sharp in this image anyway). So I applied this:
const src = cv.imread('./970f5b45-9f24-41d5-91f0-ef3f8b9d8914.jpeg');
// Convert to grayscale before thresholding
let src2 = src.cvtColor(cv.COLOR_BGR2GRAY);
// blockSize must be odd; 12 throws an error, so use 11
let dst = src2.adaptiveThreshold(255, cv.ADAPTIVE_THRESH_GAUSSIAN_C, cv.THRESH_BINARY, 11, 2);
// morphologyEx requires a structuring element as its first argument
const kernel = cv.getStructuringElement(cv.MORPH_RECT, new cv.Size(3, 3));
let dst2 = dst.morphologyEx(kernel, cv.MORPH_OPEN);
After that I have this result, which is almost ready for OCR; the problem is the large number of dots in the image. Is there any way to remove those dots but keep the quality of the result (readable text), in OpenCV or with another technique?
The result right now:
Is it possible to extract just the text from that result? If I feed this result to Tesseract, it takes a really long time to extract the text, and the output contains a huge number of garbage characters (probably because of the dots/shapes).
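One common way to remove speckle like this is to filter connected components by area after thresholding: text glyphs are large blobs, while the dots are tiny ones. A minimal sketch in Python (opencv4nodejs exposes the same calls); the filename and area threshold are placeholders to tune:

import cv2
import numpy as np

# Hypothetical input: the binarized image produced by the steps above,
# assumed to have white foreground on a black background (invert if not).
img = cv2.imread('thresholded.png', cv2.IMREAD_GRAYSCALE)

# Label connected components and keep only the sufficiently large ones.
n, labels, stats, _ = cv2.connectedComponentsWithStats(img, connectivity=8)
min_area = 50  # tune to your resolution; dots are far below this
cleaned = np.zeros_like(img)
for i in range(1, n):  # label 0 is the background
    if stats[i, cv2.CC_STAT_AREA] >= min_area:
        cleaned[labels == i] = 255

cv2.imwrite('cleaned.png', cleaned)

A small median blur (cv2.medianBlur) is a cheaper alternative that kills isolated pixels, but it also rounds glyph corners, so the component filter usually preserves text quality better.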

Related

Compare images to find whether they are visibly the same

I need to find whether two images are the same, even if they have different resolutions and sizes. A pixel-to-pixel comparison is not needed; I need to find whether all the images, text, colors, etc. are the same in the other image as well.
I tried different Python packages for comparison, but all of them require the same resolution. One of my images is screenshotted on a Mac and the other on Ubuntu. Even though both render the same HTML, the contrast and resolution differences between the machines cause the images to differ when compared.
Tried: perceptual diff, image hashing, etc.
• PDIFFER – a Python wrapper for the perceptualdiff tool (https://pypi.org/project/pdiffer/)
Problem - pip install pdiffer failed to install on the Macs, for the latest as well as older versions.
• NEEDLE - installed needle (https://needle.readthedocs.io/en/latest/). This one has an option to specify the comparison engine as perceptualdiff/ImageMagick instead of the default PIL.
Problem found - it has an option to save a baseline image first and then run assertions against it, and that works when a baseline is saved and then compared. I didn't find anything that would compare screenshots against existing images.
• OPENCV – histogram-based comparison. This converts the images to grayscale and then to histograms, and compares the histograms. Returns a value between -1 and 1 (-1 means not similar at all, 1 means highly similar) (https://www.pyimagesearch.com/2014/07/14/3-ways-compare-histograms-using-opencv-python/)
Findings - I tested two images by converting them into histograms and comparing them, which returned a value of 0.8 (meaning somewhat similar); a sketch of this approach follows.
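For reference, a minimal sketch of that histogram comparison in OpenCV (the filenames are reused from the imagehash code below and are placeholders):

import cv2

# Load both screenshots as grayscale; histograms have a fixed number of
# bins, so differing image resolutions don't matter here.
a = cv2.imread('result.png', cv2.IMREAD_GRAYSCALE)
b = cv2.imread('not-found-02.png', cv2.IMREAD_GRAYSCALE)

hist_a = cv2.calcHist([a], [0], None, [256], [0, 256])
hist_b = cv2.calcHist([b], [0], None, [256], [0, 256])
cv2.normalize(hist_a, hist_a)
cv2.normalize(hist_b, hist_b)

# Correlation method: 1.0 = identical histograms, -1.0 = fully dissimilar.
score = cv2.compareHist(hist_a, hist_b, cv2.HISTCMP_CORREL)
print(score)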
Below is the code I tried using imagehash:
from PIL import Image
import imagehash

image_one = 'result.png'
img = Image.open(image_one)
image_one_hash = imagehash.whash(img)
print(image_one_hash)

image_two = 'not-found-02.png'
img2 = Image.open(image_two)
image_two_hash = imagehash.whash(img2)
print(image_two_hash)

# The hash difference is a Hamming distance: 0 means identical hashes,
# larger values mean more visual difference.
similarity = image_one_hash - image_two_hash
print(similarity)
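Since the core obstacle is differing resolutions, another option is to resize one screenshot to the other's dimensions and compute a structural similarity score, which tolerates mild contrast differences better than a pixel-wise diff. A sketch using scikit-image; the 0.9 cutoff is an assumption to calibrate on known pairs:

import cv2
from skimage.metrics import structural_similarity as ssim

a = cv2.imread('result.png', cv2.IMREAD_GRAYSCALE)
b = cv2.imread('not-found-02.png', cv2.IMREAD_GRAYSCALE)

# Resize b to a's dimensions so the arrays are directly comparable.
b = cv2.resize(b, (a.shape[1], a.shape[0]))

score = ssim(a, b)  # 1.0 for identical images
print('visibly same' if score > 0.9 else 'different')  # 0.9 is a guess to tune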

Is there a way in Tesseract to capture text meta-data along with text?

I am trying to figure out whether text metadata like font size, font family, bold/italic, etc. can be captured using Tesseract. Below is the code I used to try it, but it did not work and returned "None". Using Tesseract version = 4.1.1, Tesseract-OCR engine version = 5.0.0.
import io
import tesserocr
from PIL import Image

with open(Image_file_location, "rb") as image:
    f = image.read()
    b = bytearray(f)

with tesserocr.PyTessBaseAPI() as api:
    image = Image.open(io.BytesIO(b))
    api.SetImage(image)
    api.Recognize()
    iterator = api.GetIterator()
    print(iterator.WordFontAttributes())
Currently, using Tesseract, I was able to capture the text properly but not the metadata. I have attached a sample image file and example expected output.
Expected Output:
[Font:"some_font", Font_family:"some_font_family", Bold, font_size:"some_font_size] GCEO Review
[Font:"some_font", Font_family:"some_font_family", Bold, font_size:"some_font_size] Dear Shareholders,
[Font:"some_font", Font_family:"some_font_family", Bold, font_size:"some_font_size] TURNING THE....
[Font:"some_font", Font_family:"some_font_family", Bold, font_size:"some_font_size] We have executed well and gained mobile share in our core.........
So, basically, wherever there is a change in metadata, we should be able to capture that information and prepend it before the sentence.
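One likely cause of the None result: font attributes are only computed by Tesseract's legacy (pre-4.0) recognition engine; the default LSTM engine in Tesseract 4+ does not report them. A sketch that forces the legacy engine via tesserocr (this assumes the installed eng.traineddata still contains the legacy model, which the lstm-only data packages do not):

from tesserocr import PyTessBaseAPI, OEM, RIL, iterate_level

Image_file_location = 'sample.png'  # placeholder path

# OEM.TESSERACT_ONLY selects the legacy engine, the only one that
# fills in WordFontAttributes().
with PyTessBaseAPI(oem=OEM.TESSERACT_ONLY) as api:
    api.SetImageFile(Image_file_location)
    api.Recognize()
    for word in iterate_level(api.GetIterator(), RIL.WORD):
        print(word.WordFontAttributes(), word.GetUTF8Text(RIL.WORD))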

How to make tesseract work for different kinds of images?

I am working on a digit recognition task using Tesseract and OpenCV. I used them and arrived at a solution that is specific to one particular image; if I change the image, I no longer get correct results, and I have to change the threshold value to match the image. The steps I took were:
Cropping the image to find an appropriate region
Converting the image to grayscale
Applying a Gaussian blur
Taking an appropriate threshold
Passing the image through Tesseract
So, my question is: how can I make my code generic, i.e. usable for different images without updating the code?
While working on one image I processed it as:
import cv2
import numpy as np

imggray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
imgBlur = cv2.GaussianBlur(imggray, (5, 5), 0)
imgDil = cv2.dilate(imgBlur, np.ones((5, 5), np.uint8), iterations=1)
imgEro = cv2.erode(imgDil, np.ones((5, 5), np.uint8), iterations=2)
ret, imgthresh = cv2.threshold(imgEro, 28, 255, cv2.THRESH_BINARY)
And for another image as:
imggray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
imgBlur = cv2.GaussianBlur(imggray, (5, 5), 0)
imgDil = cv2.dilate(imgBlur, np.ones((5, 5), np.uint8), iterations=0)
imgEro = cv2.erode(imgDil, np.ones((5, 5), np.uint8), iterations=0)
ret, imgthresh = cv2.threshold(imgEro, 37, 255, cv2.THRESH_BINARY)
I had to change the number of iterations and the threshold value to obtain proper results. What can I do so that I don't have to change these values?
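One way to avoid hand-tuning is Otsu's method, which derives the threshold from the image's own histogram, so the same code adapts to different images. A minimal sketch; the filename and blur kernel are placeholders:

import cv2

img = cv2.imread('digits.png')  # placeholder filename
imggray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
imgBlur = cv2.GaussianBlur(imggray, (5, 5), 0)

# Pass 0 as the threshold and add THRESH_OTSU: OpenCV then picks the
# threshold per image; the chosen value is returned in ret.
ret, imgthresh = cv2.threshold(imgBlur, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
print('Otsu chose threshold:', ret)

cv2.adaptiveThreshold is the other common choice when lighting varies across a single image.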

pytesseract not recognizing numbers in picture using OCR

I'm trying to use Python-tesseract to extract the digits from this picture using optical character recognition (OCR). For some reason pytesseract won't recognize the digits, and I don't fully understand why (the spacing between the numbers?).
Can someone assist me in understanding how to properly extract the digits from this image?
The code below doesn't print anything:
import pytesseract

# im is the PIL image loaded earlier in the script
im.save("sudo.png")
text = pytesseract.image_to_string(im)
print(text)
A little bit of pre-processing, plus using ROIs to specify where the words are, will help. By default, OCR uses page layout analysis to determine blocks of text; in this case, the image doesn't look like a normal page of text (like a PDF article, for example).
To make it easier for the OCR, you can first find the locations of the words using regionprops and then pass those locations (as bounding boxes) to the ocr function. See the code below (MATLAB) and the results; they look accurate. You may have to experiment more with the pre-processing to make this robust across a collection of different images, but hopefully this gives you an idea of how to proceed:
capture = imread('Captura.PNG');
% Increase image size by 3x
my_image = imresize(capture, 3);
figure
imshow(my_image)
% Localize words
BW = imbinarize(rgb2gray(my_image));
BW1 = imdilate(BW,strel('disk',6));
s = regionprops(BW1,'BoundingBox');
bboxes = vertcat(s(:).BoundingBox);
% Sort boxes top-to-bottom (by each box's y-coordinate)
[~,ord] = sort(bboxes(:,2));
bboxes = bboxes(ord,:);
% Pre-process image to make letters thicker
BW = imdilate(BW,strel('disk',1));
% Call OCR and pass in location of words. Also, set TextLayout to 'word'
ocrResults = ocr(BW,bboxes,'CharacterSet','.0123456789','TextLayout','word');
words = {ocrResults(:).Text}';
words = deblank(words)
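Since the question itself uses pytesseract, here is a rough Python translation of the same idea (a sketch, not a drop-in solution: the upscale factor, kernel size, and inverse threshold assume dark digits on a light background):

import cv2
import pytesseract

img = cv2.imread('Captura.PNG')
img = cv2.resize(img, None, fx=3, fy=3)  # upscale 3x, as above

gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_, bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Dilate to merge characters into word blobs, then take their boxes.
blobs = cv2.dilate(bw, cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (13, 13)))
contours, _ = cv2.findContours(blobs, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
boxes = sorted((cv2.boundingRect(c) for c in contours), key=lambda b: b[1])

# --psm 8 treats each crop as a single word; whitelist digits and '.'.
cfg = '--psm 8 -c tessedit_char_whitelist=.0123456789'
for x, y, w, h in boxes:
    crop = gray[y:y + h, x:x + w]
    print(pytesseract.image_to_string(crop, config=cfg).strip())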

pytesseract image_to_string not pulling strings, but no Error

I am using the image_to_string function in the pytesseract package to convert multiple parts of a single picture file to strings. All parts work except for this image:
Here is the script that I am using to convert it:
from PIL import Image
import pytesseract
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
im = Image.open('image.png')
text = pytesseract.image_to_string(im)
print(text)
Which gives the output:
—\—\—\N—\—\—\—\—\N
I have tried breaking the image into smaller parts, as well as processing it as a JPG and as a PNG. What can I do to have it output the values in the image?
Using a different page segmentation mode instead of the default one seems to work.
text = pytesseract.image_to_string(im, config='--psm 6')
According to the Tesseract wiki, option 6 assumes a single uniform block of text. I tried the other options, but only this one worked.
To check the other page segmentation methods, read the Tesseract wiki's page on improving the quality of the output.
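Since trying modes one by one is cheap, a quick loop helps pick the right one; running tesseract --help-psm on recent versions lists all the modes. A sketch (the candidate modes are just common choices):

import pytesseract
from PIL import Image

im = Image.open('image.png')

# 3 = default auto, 6 = uniform block, 7 = single line, 11 = sparse text
for psm in (3, 6, 7, 11):
    text = pytesseract.image_to_string(im, config=f'--psm {psm}')
    print(f'--psm {psm}: {text!r}')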
