Increase Accuracy of text recognition through pytesseract & PIL - python-3.x

I am trying to extract text from an image, but because the image is small and of poor quality, the results are inaccurate. I tried a few enhancements with PIL, but they only made the image quality worse.
Can someone suggest image enhancements that would give better results? A few example images:

The text in the provided example image looks quite good visually, so why does OCR give inaccurate results?
To illustrate the conclusions in the rest of this answer, let's run the given image through Tesseract. Below is the result of Tesseract OCR:
"fhpgearedmomrs©gmachom"
Now let's enlarge the image to four times its size and apply thresholding. I did the resizing and thresholding manually in GIMP, but with an appropriate resizing method and threshold value the same steps can be automated in PIL, giving an image similar to the enhanced one I obtained:
Run through Tesseract OCR, the improved image gives the following text:
"fhpgearedmotors©gmail.com"
This demonstrates that enlarging an image can help to achieve 100% accuracy on the provided text-image example.
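The resize-and-threshold step could be automated roughly like this. It is a minimal sketch, not the exact GIMP procedure: the 4x scale matches the answer, but the Lanczos filter, the threshold of 128 and the function name enlarge_and_threshold are assumptions you would tune for your own images.

```python
from PIL import Image


def enlarge_and_threshold(img, scale=4, threshold=128):
    """Upscale a PIL image and binarize it to pure black and white.

    scale and threshold are assumed starting values, not tuned constants.
    """
    gray = img.convert("L")  # grayscale first, so a single threshold applies
    big = gray.resize((gray.width * scale, gray.height * scale), Image.LANCZOS)
    # Pixels at or above the threshold become white (255), the rest black (0).
    return big.point(lambda p: 255 if p >= threshold else 0)
```

The result can then be handed to pytesseract.image_to_string(), for example on an image loaded with Image.open("input.png"), where input.png is a placeholder file name.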
It may seem odd that enlarging an image improves OCR accuracy, but OCR was developed to convert scans of printed media to text and by design expects 300 dpi images. This explains why some OCR programs do not resize the text themselves and do badly on small fonts: they expect a higher dpi resolution, which enlarging the image provides.
Here is an excerpt from the Tesseract FAQ on github.com that supports the statement above:
[There is a minimum text size for reasonable accuracy. You have to consider resolution as well as point size. Accuracy drops off below 10pt x 300dpi, rapidly below 8pt x 300dpi. A quick check is to count the pixels of the x-height of your characters. (X-height is the height of the lower case x.) At 10pt x 300dpi x-heights are typically about 20 pixels, although this can vary dramatically from font to font. Below an x-height of 10 pixels, you have very little chance of accurate results, and below about 8 pixels, most of the text will be "noise removed".]
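The x-height rule from the FAQ excerpt can be turned into a quick estimate of how much to enlarge an image. This is only a sketch: it assumes you measure the x-height in pixels yourself, it takes the "about 20 pixels" figure from the excerpt as the target, and the name upscale_factor is made up for illustration.

```python
import math

# Typical x-height at 10pt x 300dpi according to the FAQ excerpt.
TARGET_X_HEIGHT_PX = 20


def upscale_factor(x_height_px, target_px=TARGET_X_HEIGHT_PX):
    """Smallest whole-number factor that brings the x-height up to target_px."""
    if x_height_px <= 0:
        raise ValueError("x-height must be positive")
    return max(1, math.ceil(target_px / x_height_px))


print(upscale_factor(5))   # 5 px x-height needs a 4x enlargement
print(upscale_factor(25))  # already large enough, factor 1
```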

Related

Convert PDF to image with high resolution to fit in page

I regularly get tree-drilling data out of a machine, and it has to go into reports.
The PDFs contain too much empty space and useless information.
With convert I already managed to convert the PDF to PNG, cut out parts and rebuild the image I want. It is nicely sharp, just too large:
Output 1: Nice, just too large
For my reports I need it at 45% of that size, or 660 pixels wide.
The best output I have managed so far is this:
Output 2: Perfect size but unsharp
Now, this is far from the quality of the picture before shrinking.
Of course I have read this article here, which already helped.
But I think it must be possible to get an image as fine as the too-large one in Output 1.
I have spent hours trying convert -scale, -resize and -resample, playing with values for density, sharpen, unsharp, quality... and got nothing better than what I have, using
convert -density 140 -trim input.pdf -quality 100 -sharpen 0x1.0 step1.png
then processing it into the new picture (Output 1, see above), which I bring to the correct size with
convert output1.png -resize 668x289! -unsharp 0x0.75+0.75+0.01 output2.png
I also tried "-resize 668x" so as not to distort the aspect ratio; no difference.
I am at a loss.
I am not an IT expert; I am a computer-savvy tree consultant.
My understanding of image processing is limited.
Maybe it would make sense to stay with a vector-based format (I tried .gif and .svg ... brrrr).
I would prefer to stay with convert/ImageMagick and not install additional software.
It has to run from the command line, as it is part of a bash script processing multiple files. I am using SUSE Linux.
Grateful for your help!
I realize you said no other software, but it can be easier to get good results from other PDF rendering engines.
ImageMagick renders PDFs by shelling out to ghostscript. This is terrific software, but it's designed for print rather than screen output. As a result, it generates very hard edges, because that's what you need if you are intending to control ink on paper. The tricks you see for rendering PDF at higher res and then resizing them fix this, but it can be tricky to get the parameters just right (as you know).
There are PDF rendering libraries which target screen output and will produce nice edges immediately. You don't need to render at high res and sample down, they just render correctly for screen in the first place. This makes them easier to use (obviously!) and a lot faster.
For example, vipsthumbnail comes with suse and includes a direct PDF rendering system. Install with:
zypper install vips-tools
Regarding the size, your 660 pixels across is too low. Some characters in your PDF will come out at only 3 or 4 pixels across and you simply can't make them sharp, there are just too few dots.
Instead, think about the size you want them printed on the paper, and the level of detail you need. The number of pixels across sets the detail, and the resolution controls the physical size of those dots when you print.
I would at least double that 668. Try:
vipsthumbnail P3_M002.pdf --size 1336 -o x.png
With your sample image I get:
Now when you print, you want those 1336 pixels to fill 17cm of paper. libvips lets you set resolution in pixels per millimetre, so you need 1336 pixels in 170 mm, or 1336 / 170, or 7.86. Try:
vips copy x.png y.png[palette] --xres 7.86 --yres 7.86
Now y.png should load into librecalc at 17cm across and be nice and sharp when printed. The [palette] option after y.png enables palettised PNG, which shrinks the image to around 50kb.
The resolution setting is also called DPI (dots per inch). I find the name confusing myself -- you'll also see it called "pixels per printed inch", which I think is much clearer.
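The arithmetic behind the 7.86 figure, and the equivalent DPI value, can be checked with a few lines of Python. The function names are only illustrative; the 1336 pixels and 170 mm come from the answer above.

```python
MM_PER_INCH = 25.4


def pixels_per_mm(width_px, printed_width_mm):
    """Resolution needed so width_px pixels span printed_width_mm on paper."""
    return width_px / printed_width_mm


def pixels_per_inch(width_px, printed_width_mm):
    """The same resolution expressed as DPI (pixels per printed inch)."""
    return pixels_per_mm(width_px, printed_width_mm) * MM_PER_INCH


xres = pixels_per_mm(1336, 170)
dpi = pixels_per_inch(1336, 170)
print(f"{xres:.2f} px/mm, {dpi:.1f} DPI")  # prints "7.86 px/mm, 199.6 DPI"
```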
In ImageMagick, set a higher density, then trim, then resize, then apply unsharp. The higher the density, the sharper your result, but the slower it will be. Note that a PNG quality of 100 is not on the proper scale: for PNG, -quality does not take values from 0 to 100 as it does for JPG. See https://imagemagick.org/script/command-line-options.php#quality. I cannot tell you the "best" numbers to use, as they are image dependent. You can use another tool, such as those listed at https://imagemagick.org/Usage/formats/#png_non-im, to optimize your PNG output.
So try,
convert -density 300 input.pdf -trim +repage -resize 668x289 -unsharp 0x0.75+0.75+0.01 output.png
Or remove the -unsharp if you find that it is not needed.
ADDITION
Here is what I get with
convert -density 1200 P3_M002.pdf -alpha off -resize 660x -brightness-contrast -35,35 P3_M002.png
I am not sure why the graph itself lost brightness and contrast. (I suspect it is due to an embedded image for the graph.) So I added -brightness-contrast to bring out the detail, but it made the background slightly gray. You can try reducing those values; you may not need them quite so strong.
Great, @fmw42,
pngcrush -res 213 graphc.png done.png
from your link did the job, as can be seen here:
perfect size and sharp graph
Thank you a lot.
Now I'll try to get the file size down, as the original PDF is 95 KiB and now I am at 350 KiB. With 10 or more graphs in a document that would be unnecessarily large, and working on the document might also get slow.
-- Addition -- 2023-02-04
@fmw42: Thanks for all your effort!
Your solution with the PDF you show does not really work: it is too gray for a good report and also lacks the required sharpness.
@jcupitt: Thanks as well, vips is quick and looks interesting. The vipsthumbnail output is unsharp; I experimented a bit, but the documentation is too abstract for me to get the syntax right. I could not find documentation readable for an amateur; maybe you know of one?
General: from all my beginner's trials so far I find:
the PDF contains all the information needed to produce a large, absolutely sharp output (typical for vector formats, I guess)
it is no problem to convert it to a PNG of the same size without losing quality
any way of shrinking the PNG afterwards results in significant (a) quality loss or (b) file-size increase.
So I (a beginner) think that the PDF should be processed directly to the correct PNG size, without downsampling the PNG later.
This could be done by
(a) telling the conversion process the output size (if that is possible?) or
(b) first creating a smaller PDF, e.g. making it A5 instead of A4, so that a fitting PNG is created directly (I need approx. 6.5 inches wide).
For both solutions I lack the ability to investigate sensibly, as it takes me hours and hours to try things out and learn about the mysteries of image processing.
The solution with pngcrush works for the moment, although I'm not really happy about the file size (CPU and fan power are not really important factors here).
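For option (a), the render density that yields a given pixel width directly can be estimated from the page width, since PDF page sizes are measured in points (72 per inch) and -density is given in dots per inch. This is only a sketch: the A4 page width is an assumption (the thread does not state the page size), -trim will change the final width, and nothing in this thread shows that rendering directly at a low density looks sharper than downsampling.

```python
POINTS_PER_INCH = 72  # PDF page dimensions are expressed in points


def density_for_width(target_px, page_width_pt):
    """DPI at which a page of page_width_pt points renders target_px wide."""
    return target_px * POINTS_PER_INCH / page_width_pt


# Assumed A4 portrait page: 210 mm is about 595.28 pt.
A4_WIDTH_PT = 595.28
density = density_for_width(660, A4_WIDTH_PT)
print(round(density, 1))
```

The resulting value would go into convert -density in place of the 140 used so far.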
--- Addition II --- final one 2023-02-05
convert -density 140 -trim "$datei" -sharpen 0x1.0 rgp-kopie0.png
magick rgp-kopie0.png +dither PNG8:rgp-kopie.png ## less colours
## some convert -crop and -composite here to arrange new image
pngcrush -s -res 213 graphc.png "$namenr.png"
The new image looks like this, at around 50 KiB, and is definitely satisfying for me in quality and file size.
Thank you all very much for contributing; this makes my work easier from now on!
... and even if I do not completely understand everything, I learnt a bit.

How to detect image brightness and sharpness in python?

I tried applying Tesseract OCR to an image, but before applying OCR I want to improve the image quality so that OCR accuracy increases.
How can I detect image brightness and increase or decrease it as required?
How can I detect image sharpness?
There is no easy way to do that. If your images are similar to each other, you can define a "correct" brightness and adjust unprocessed images to it.
But what is "correct" brightness? You can use a histogram to decide this. See the figure below. Once you establish your correct histogram, you can calibrate other images to it.
Richard Szeliski, Computer Vision Algorithms and Applications
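A very reduced version of this calibration idea is to match only the mean brightness rather than the whole histogram. The sketch below assumes 8-bit grayscale values in a flat sequence (for a PIL image, img.convert("L").tobytes() gives one); the function names and the simple gain model are illustrative assumptions, not full histogram matching.

```python
def mean_brightness(pixels):
    """Average of 8-bit grayscale values given as a flat sequence."""
    return sum(pixels) / len(pixels)


def match_brightness(pixels, target_mean):
    """Scale values so their mean moves towards target_mean, clipped to 0..255."""
    current = mean_brightness(pixels)
    if current == 0:
        return list(pixels)  # an all-black image cannot be scaled up
    gain = target_mean / current
    return [min(255, max(0, round(p * gain))) for p in pixels]


dark = [50, 100, 150]
print(mean_brightness(dark))        # 100.0
print(match_brightness(dark, 120))  # [60, 120, 180]
```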
I think you can use the autofocus method: check the contrast histogram of the image and define what counts as sharp for you.
source
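One common autofocus-style focus measure is the variance of the Laplacian: sharp images have strong local intensity changes, so the variance is high. This is a swapped-in technique, not necessarily the exact method the answer has in mind, and the pure-Python sketch below assumes a flat, row-major sequence of grayscale values. What score counts as "sharp" is still something you have to define for your own images.

```python
def laplacian_variance(gray, width, height):
    """Variance of a 4-neighbour Laplacian over the interior pixels.

    gray is a flat, row-major sequence of grayscale values.
    """
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            i = y * width + x
            neighbours = gray[i - 1] + gray[i + 1] + gray[i - width] + gray[i + width]
            responses.append(neighbours - 4 * gray[i])
    if not responses:
        return 0.0  # image too small to have interior pixels
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


flat = [100] * 25                                            # 5x5, no detail
checker = [255 * ((x + y) % 2) for y in range(5) for x in range(5)]  # 5x5, max detail
print(laplacian_variance(flat, 5, 5), laplacian_variance(checker, 5, 5))
```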
*Basic recommendation: using grayscale images works better for OCR.
You can use BRISQUE image quality assessment to score image quality; it is available as a library. Check it out; a smaller score means better quality.

Reducing colors in a PNG image is making the file size bigger

I am using ImageMagick to programmatically reduce the size of a PNG image by reducing the number of colors in it. I get the image's unique-color count and divide it by 2. Then I pass this value to the -colors option as follows:
variable = unique-colors / 2
convert image.png -colors variable -depth 8
I thought this would substantially reduce the size of the image, but instead it increases the image's size on disk. Can anyone shed any light on this?
Thanks.
EDIT: Turns out the problem was dithering. Dithering helps your reduced color images look more like the originals but adds to the image size. To remove dithering in ImageMagick add +dither to your command.
Example
convert CandyBar.png +dither -colors 300 -depth 8 smallerCandyBar.png
ImageMagick probably uses a dithering algorithm to make the image appear as though it has the original number of colors. This increases the "randomness" of the image data (single pixels are recolored in places to blend into other colors), so the data no longer compresses as well. Look further into how the convert command does its dithering. You can also see this effect by adding the second image as a layer in GIMP or an equivalent program and adjusting the transparency.
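The effect of that "randomness" on compression can be reproduced without any image library. The sketch below compresses two synthetic pixel rows with zlib (DEFLATE, the compression PNG is built on): flat colour bands, and the same bands with random jitter at the boundaries as a crude stand-in for dithering. It ignores PNG's row filters and does not reproduce ImageMagick's actual dithering algorithm, so treat the numbers as illustrative only.

```python
import random
import zlib


def compressed_size(pixels):
    """Size in bytes of the raw pixel data after DEFLATE compression."""
    return len(zlib.compress(bytes(pixels), 9))


random.seed(0)  # fixed seed so the comparison is repeatable
width, height = 256, 64

# Four flat bands: roughly what colour reduction without dithering produces.
banded = [(x // 64) * 64 for _ in range(height) for x in range(width)]

# The same four levels, but each pixel picks its band from a randomly
# shifted position, which scatters pixels between neighbouring levels.
dithered = [
    min(192, max(0, ((x + random.randint(-32, 32)) // 64) * 64))
    for _ in range(height)
    for x in range(width)
]

print(compressed_size(banded), compressed_size(dithered))
```

Both images use the same four grey levels, yet the jittered one compresses far worse.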
You should use pngquant for this.
You don't need to guess number of colors, it has actual --quality setting:
pngquant --verbose --quality=70 image.png
The above will automatically choose the number of colors needed to match the given quality, on the same scale as JPEG quality (100 = perfect, 70 = OK, 20 = awful).
pngquant has a substantially better quantization algorithm, and the better the quantization, the better the quality/file-size ratio.
pngquant also doesn't dither areas that look good without dithering, which avoids adding unnecessary noise/randomness to the file.
The "new" PNG's compression is not as good as the one of the original.

ImageMagick: convert image to B&W, high contrast!

I'm trying to OCR certain images, but am having problems with the accuracy. I would like to see if I can improve accuracy by converting images to B&W, high contrast. Any ideas how I can do that with ImageMagick?
Fred's two colour threshold ImageMagick script might help.

Why converting jpeg to colour profile in GIMP reduce the size so much?

I have a 2MB JPEG image, and when I use the option Image > Mode > Convert to Colour Profile, the size is reduced to 50KB without too much quality loss.
Could somebody explain why the size is reduced so much? Am I missing some important point?
When you reduce the number of colors within an area you reduce the amount of math needed to describe that area, which results in a smaller file size.
I observed the same thing a long time ago while converting JPG images with ImageMagick.
JPEG contains information for reproducing color components on different media. "Convert to color profile" reduces this information by choosing a concrete profile. This does not degrade quality. More info at ICC

Resources