Image preprocessing in Python for OCR

I'm pre-processing images for OCR in Python. I converted the PDF to binary images. The output I get looks like this
I want the output to look something like this
Any idea how to go about this?

You can use the Tesseract library to extract text from the image.
I am on Windows, so I downloaded it from https://sourceforge.net/projects/tesseract-ocr-alt/files/.
Suppose you have installed it at "E:\w\Tesseract-OCR".
Then put your image in the same location. Let's call your image question.png.
Now open a command prompt and run:
E:\w\Tesseract-OCR>tesseract.exe question.png answer
Here question.png is your image and answer is the output base name. Tesseract appends .txt to it, so the text is written to answer.txt; you can use any other name instead.
Once the command has run successfully, check the output in answer.txt.
For your image I got the following output:
Investment Type: Customer Owned
System Information
Fire III
Video I]
So in this case only the plain text is recognized correctly.
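Since the question is about Python, the same command-line call can be sketched with the standard library. This is only a minimal sketch: the function names are made up for illustration, and it assumes a tesseract executable is on the PATH (or that its full path is passed as exe). Tesseract appends .txt to the output base name, which is why the result is read back from output_base plus ".txt".

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, exe="tesseract"):
    """Return the argument list for: tesseract <image> <output_base>."""
    # Tesseract adds the ".txt" extension to output_base by itself.
    return [exe, str(image_path), str(output_base)]


def ocr_to_text(image_path, output_base, exe="tesseract"):
    """Run Tesseract on one image and return the recognized text."""
    subprocess.run(build_tesseract_command(image_path, output_base, exe), check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

Calling ocr_to_text("question.png", "answer") would then return the contents of answer.txt, provided Tesseract is installed.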

Related

apply same curve-color to tiff in batch in GIMP

I would like to apply a specific color curve to some 2000 .tif files.
I am a Windows user and so far I have used GIMP for photo editing.
Using GIMP 2.10 I was able to perform this task on .jpg files with the Batch Image Manipulation Plugin (BIMP v2.6; https://alessandrofrancesconi.it/projects/bimp/).
Workflow so far for JPEGs in GIMP 2.10:
I created a color curve while working on a jpg file (Colors -> Curves).
Once happy with the corrections, I saved the curve to an external file ("myset"), which was stored in '\User\appData\roaming\gimp\2.10\curves'.
In the BIMP plug-in I chose Add -> Color correction.
In the window that pops up I selected only the checkbox "change the color curve from external file" (or similar; my menus are not in English, sorry) and navigated to my "myset" curve file.
Finally, I ran the batch.
When I tried to do the same with the .tif files, I got warnings of the kind "unknown field tag encountered" when importing the images into BIMP.
That said, I can open the individual tif files in GIMP (File -> Open...).
When I do, I still get the "unknown field tag encountered" warnings, but I can click "OK" in the message window and continue importing the file.
The "Import TIFF" window then shows a "Page 1" icon in the top part, and I can choose whether to open the file as "layers" or "image". Both choices seem to give the same result.
At that point I can apply my "myset" curve to the file from the tool Colors-> curves.
One potential solution I've been thinking of is to write a script to do this and call it from the command line. I found something along that line here: https://www.gimpusers.com/forums/gimp-user/11100-curves-spline-batch .
Unfortunately:
I have no experience writing Script-Fu scripts and very little with the command line.
Looking at the example in the link above, I cannot figure out how or where to point to the "myset" curve in the script.
Looking in the Procedure Browser, I do not know which procedure corresponds to the Colors -> Curves tool (possibly something like gimp-drawable-curves-splines, but again I don't know how to make it refer to "myset").
A copy of my "myset" curve and some sample .tif files can be found here.
Does anyone have a suggestion on how to perform batch color-curve changes on these tif files, similar to what I describe for the jpgs? I am open to solutions other than GIMP (but, for example, I cannot open those tifs in RawTherapee, I don't know why, so that is less of an option).
IMPORTANT: I need to preserve the TIFF metadata (the files are georeferenced).

Tesseract Batch Convert Images to Searchable PDF And Multiple Corresponding Text Files

I’m using tesseract to batch convert a list of images to both a searchable PDF as well as a TXT file containing the OCRd text.
tesseract infile outfile -l eng myconfig
infile contains a list of image paths to process
myconfig contains tesseract preferences to specify the output types (tessedit_create_text 1 and tessedit_create_pdf 1)
This leaves me with outfile.pdf and outfile.txt, the latter of which contains page separators for delimiting text between images.
What I’m really looking to do, however, is to output multiple TXT files on a per-image basis, using the same corresponding image name. For example, Image1.jpg.txt, Image2.jpg.txt, Image3.jpg.txt...
Does tesseract have the option to support this behavior natively? I realize that I can loop through the image file list and execute tesseract on a per-image basis, but this is not ideal as I’d also have to run tesseract a second time to generate the merged PDF. Instead, I’d like to run both options at the same time, with less overall execution time.
I also realize that I can split the merged TXT file on the page separator into multiple text files, but then I have to introduce less elegant code to map and rename all of those split files to correspond to their original image names: Rename 0001.txt to Image1.jpg.txt...
I’m working with both Python 3 and Linux commands at my disposal.

You can prepare a batch file that loops through the input images and outputs both txt and pdf at the same time. This is more efficient: a single OCR operation instead of two. You can then split the output .txt file into pages.
tesseract inimagefile outfile txt pdf

Converting multiple images to a single PDF file.
On Linux, you can list all images and then pipe them to tesseract
ls *.jpg | tesseract - yourFileName txt pdf
Where:
yourFileName: is the name of the output file.
txt pdf: are the output formats, you can also use only one of them.
Converting images to individual text files
On Linux, you can use the for loop to go through files and execute an action for every file.
for FILE in *.jpg; do tesseract "$FILE" "${FILE::-4}"; done
Where:
for FILE in *.jpg : loop through all JPG files (you can change the extension based on your format)
$FILE: is the name of the image file, e.g. 001.jpg
${FILE::-4}: is the name of the image but without the extension, e.g. 001.jpg will be 001 because we removed the last 4 characters.
We need this to name the text files to the corresponding names, e.g.
001.jpg will be converted to 001.txt
002.jpg will be converted to 002.txt

Since Tesseract doesn't seem to handle this natively, I've just developed a function to split the merged TXT file on the page separator into multiple text files. Although from my observations, I'm not sure that Tesseract runs any faster by simultaneously converting batch images to both PDF and TXT (versus running it twice - once for PDF, and once for TXT).
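The splitting function itself is not shown, so here is a rough Python sketch of what it could look like. It rests on assumptions: the page separator is taken to be a form feed (Tesseract's default for its page_separator setting, as far as I know), one separator is expected after every page, and the function names are made up. The image list is the same file that was passed to tesseract as infile.

```python
from pathlib import Path

PAGE_SEPARATOR = "\f"  # assumption: Tesseract's default page separator (form feed)


def split_merged_text(merged_text, image_paths, separator=PAGE_SEPARATOR):
    """Pair each image path with its page of OCR text, in order."""
    pages = merged_text.split(separator)
    # A separator after the last page leaves one empty trailing piece.
    if pages and not pages[-1].strip():
        pages.pop()
    if len(pages) != len(image_paths):
        raise ValueError(f"{len(pages)} pages but {len(image_paths)} images")
    return list(zip(image_paths, pages))


def write_per_image_text(merged_file, list_file, out_dir="."):
    """Write Image1.jpg.txt, Image2.jpg.txt, ... into out_dir."""
    lines = Path(list_file).read_text(encoding="utf-8").splitlines()
    images = [line.strip() for line in lines if line.strip()]
    merged = Path(merged_file).read_text(encoding="utf-8")
    for image, text in split_merged_text(merged, images):
        target = Path(out_dir) / (Path(image).name + ".txt")
        target.write_text(text, encoding="utf-8")
```

Pairing pages with the image list by position avoids the separate rename step (0001.txt to Image1.jpg.txt), and the length check fails loudly if the separator assumption does not hold for a given file.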

Thank you!
BTW, I'm using 4.1.1.
I also discovered another traineddata file for Spanish that does a better job than the standard one. It actually recognizes the "o" character well. The only problem is the processing time, but I let the PC work overnight.
Honestly, I don't know how the new traineddata file does the job better. I downloaded it from:
https://github.com/tesseract-ocr/tessdata_best

Convert a folder of PDFs into a csv of CMYK values

tldr: How can I convert a folder of pdfs into a list of CMYK values (or RGB or any kind of colour scale values), preferably in python.
I have a folder with around ~100,000 documents in it. To make sampling these documents easier I want to run data analysis on the documents (clustering and anomaly detection), and one metric I want to have is the CMYK coverage. Is there any method or package in (preferably) python that will calculate the CMYK coverage of the PDF?
****edit****
After some research I have found out that GhostScript should provide the functionality I require, if anyone could help me with the implementation I would still really appreciate it.

./gs -sDEVICE=inkcov -sOutputFile=out.txt input.pdf should give you each page's CMYK coverage in a file.
You could use -dQUIET -o - instead of -sOutputFile to send the output to stdout.
You then need some batch scripting which will depend on your Operating System. On Windows something like:
for %s in (folder/*.pdf) do gswin64c -dQUIET -sDEVICE=inkcov -o - "%s" >> coverage.txt
This ought to take every file from the folder, run it through the inkcov device and send the output to stdout. The >> redirect appends each file's output to coverage.txt instead of overwriting the previous output. If you put the line in a batch file rather than typing it at the prompt, use %%s instead of %s.
Because the output is appended, you will need to delete coverage.txt before each new batch run.
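Since the question prefers Python, the same loop can be sketched with subprocess and csv. Treat this as a sketch under assumptions: the parser expects each inkcov line to hold four fractions followed by "CMYK OK", the function names are made up, and the executable is assumed to be gs (on Windows it would be gswin64c, as in the command above).

```python
import csv
import subprocess
from pathlib import Path


def parse_inkcov(output):
    """Return one (C, M, Y, K) tuple of floats per page of inkcov output."""
    pages = []
    for line in output.splitlines():
        fields = line.split()
        # Assumed line shape: four fractions, then "CMYK OK".
        if len(fields) >= 5 and fields[4] == "CMYK":
            pages.append(tuple(float(value) for value in fields[:4]))
    return pages


def folder_to_csv(folder, csv_path, gs="gs"):
    """Run the inkcov device on every PDF in folder; write one CSV row per page."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "page", "C", "M", "Y", "K"])
        for pdf in sorted(Path(folder).glob("*.pdf")):
            result = subprocess.run(
                [gs, "-dQUIET", "-sDEVICE=inkcov", "-o", "-", str(pdf)],
                capture_output=True, text=True, check=True,
            )
            for number, cmyk in enumerate(parse_inkcov(result.stdout), start=1):
                writer.writerow([pdf.name, number, *cmyk])
```

With check=True a single unreadable PDF stops the whole run; for a folder of around 100,000 documents it may be preferable to catch subprocess.CalledProcessError and log the file name instead.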

How can I display an image sequence as a volume in ParaView?

As a replacement for ImageJ's 3D-Viewer I'm trying to display a sequence of microscopic images as a volume in ParaView 5.4.1. I tried following this guide which suggests to save the image sequence as a .raw file with ImageJ, open that in ParaView and manually enter the image dimensions. I'm not seeing the fields where I could enter image dimensions in ParaView though, and clicking "Apply" after loading the .raw file does nothing. Is there another way?

When you open the file, you are probably getting a dialog box titled "Open Data With..." and given a list of file formats that potentially match the file. Make sure you select "Raw (binary) Files". That is the one that reads images as a raw binary array of data and gives you lots of options to specify the size of the array (including reading the files as a 3D stack).
Don't use the one that says "RAW Files". That is a different mesh format used by some CAD programs.

Ghostscript /CropBox not printing correctly in Linux

I'm using the USPS Domestic Shipping Label API to generate domestic shipping labels in PDF format. I managed to crop the PDF to its top section, which is the label USPS needs, and drop the bottom section, which is a receipt that is not needed for shipping.
I use the Ghostscript /CropBox pdfmark to crop to the section I want, and that works. But when I print the cropped PDF through CUPS on Linux, the whole uncropped page is printed instead. Why is it still printing the whole page instead of just the cropped section?
Here's the script I'm using to crop the usps Shipping label.
gs -o cropped.pdf -sDEVICE=pdfwrite -c "[/CropBox [50.4 460.5 484.4 750.5] /PAGES pdfmark" -f uncropped.pdf
Then, to change its orientation to portrait, I use pdftk:
pdftk cropped.pdf cat 1L output cropped_portrait.pdf
To print it through CUPS on Linux I use the command:
lp cropped_portrait.pdf
But when I print it, the output is the uncropped.pdf content instead of cropped_portrait.pdf.
Why is it doing that? I even deleted uncropped.pdf and tried printing again, but it still prints the uncropped page.
Here are the two files, the uncropped and cropped USPS shipping labels.
Uncropped PDF file
Cropped PDF file
Hope you can help me on this one,
Thank you

Presumably the reduced PDF file displays correctly, so there is no problem with Ghostscript producing the PDF file.
As to why the printing process doesn't respect the CropBox, there is no reason really why it should. There are many Boxes in PDF and no real way for a print application to know which one you want to use. As a result printing applications often default to the MediaBox, which you haven't altered (Note that altering the CropBox doesn't change the content of the PDF file, just what is displayed).
Now, if your CUPS chain is using Ghostscript to render the PDF file, or to convert it to PostScript, then this can be solved: you need to add -dUseCropBox to the command line. However, I'm not a CUPS expert, so I can't tell you how to do that. If CUPS isn't using Ghostscript, then it's probably still possible to instruct whatever is doing the conversion to use the CropBox, but you're going to have to find out which application is involved and alter the command appropriately for that application.
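If changing the CUPS filter chain is not practical, one possible workaround is to run the cropped file through Ghostscript once more with -dUseCropBox before printing. This is only a sketch, and it assumes that the pdfwrite device then writes pages whose size matches the CropBox, so the printer no longer sees the larger MediaBox. The function names and the output file name are made up for illustration.

```python
import subprocess


def usecropbox_command(src, dst, gs="gs"):
    """Build a Ghostscript call that rewrites src using its CropBox as the page area."""
    return [gs, "-o", str(dst), "-sDEVICE=pdfwrite", "-dUseCropBox", str(src)]


def rewrite_with_cropbox(src, dst, gs="gs"):
    """Run the command; raises CalledProcessError if Ghostscript fails."""
    subprocess.run(usecropbox_command(src, dst, gs), check=True)
```

For the files in the question this would be rewrite_with_cropbox("cropped_portrait.pdf", "label.pdf"), followed by lp label.pdf. Whether the result prints as intended still depends on the printer driver, so it is worth checking one label before batch printing.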
