Tesseract Batch Convert Images to Searchable PDF And Multiple Corresponding Text Files - linux

I’m using tesseract to batch-convert a list of images to both a searchable PDF and a TXT file containing the OCR'd text.
tesseract infile outfile -l eng myconfig
infile contains a list of image paths to process
myconfig contains tesseract preferences to specify the output types (tessedit_create_text 1 and tessedit_create_pdf 1)
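That is, myconfig holds just those two lines:
tessedit_create_text 1
tessedit_create_pdf 1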
This leaves me with outfile.pdf and outfile.txt, the latter of which contains page separators for delimiting text between images.
What I’m really looking to do, however, is to output multiple TXT files on a per-image basis, using the same corresponding image name. For example, Image1.jpg.txt, Image2.jpg.txt, Image3.jpg.txt...
Does tesseract have the option to support this behavior natively? I realize that I can loop through the image file list and execute tesseract on a per-image basis, but this is not ideal as I’d also have to run tesseract a second time to generate the merged PDF. Instead, I’d like to run both options at the same time, with less overall execution time.
I also realize that I can split the merged TXT file on the page separator into multiple text files, but then I have to introduce less elegant code to map and rename all of those split files to correspond to their original image names: Rename 0001.txt to Image1.jpg.txt...
I’m working with both Python 3 and Linux commands at my disposal.

You can prepare a batch script that loops through the input images and outputs both txt and pdf at the same time -- more efficient, since each image goes through a single OCR operation instead of two. If you instead keep a single merged run, you can then split the output .txt file into pages.
tesseract inimagefile outfile txt pdf
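A minimal Python 3 sketch of such a loop (Python 3 is mentioned in the question; the *.jpg pattern and current working directory are my assumptions):

import pathlib
import subprocess

# One OCR pass per image; the trailing "txt pdf" arguments select both
# output renderers, so Image1.jpg yields Image1.txt and Image1.pdf.
for img in sorted(pathlib.Path('.').glob('*.jpg')):
    subprocess.run(
        ['tesseract', str(img), str(img.with_suffix('')), '-l', 'eng', 'txt', 'pdf'],
        check=True,
    )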

Converting multiple images to a single PDF file.
On Linux, you can list all images and then pipe them to tesseract
ls *.jpg | tesseract - yourFileName txt pdf
Where:
yourFileName: is the name of the output file.
txt pdf: are the output formats, you can also use only one of them.
Converting images to individual text files
On Linux, you can use the for loop to go through files and execute an action for every file.
for FILE in *.jpg; do tesseract "$FILE" "${FILE::-4}"; done
Where:
for FILE in *.jpg : loop through all JPG files (you can change the extension based on your format)
$FILE: is the name of the image file, e.g. 001.jpg
${FILE::-4}: is the name of the image but without the extension, e.g. 001.jpg will be 001 because we removed the last 4 characters.
We need this to name the text files to the corresponding names, e.g.
001.jpg will be converted to 001.txt
002.jpg will be converted to 002.txt

Since Tesseract doesn't seem to handle this natively, I've just developed a function to split the merged TXT file on the page separator into multiple text files. That said, from my observations, I'm not sure that Tesseract runs any faster when converting batch images to both PDF and TXT simultaneously (versus running it twice: once for PDF, and once for TXT).
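For reference, a minimal sketch of such a splitting function (assuming tesseract's default form-feed page separator, and that infile lists one image path per line):

def split_merged_txt(merged_txt, image_list):
    # Tesseract separates pages in the merged TXT with a form feed by default.
    with open(merged_txt, encoding='utf-8') as f:
        pages = f.read().split('\f')
    with open(image_list, encoding='utf-8') as f:
        images = [line.strip() for line in f if line.strip()]
    # Pair each page with its source image (Image1.jpg -> Image1.jpg.txt);
    # the trailing empty piece left by the final form feed is ignored
    # because zip stops at the shorter list.
    for image, page in zip(images, pages):
        with open(image + '.txt', 'w', encoding='utf-8') as out:
            out.write(page)

split_merged_txt('outfile.txt', 'infile')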

Thank you!
BTW, I'm using 4.1.1.
I also discovered another traineddata file for the Spanish language that does a better job than the standard one; it actually recognizes the "o" character well. The only problem is the processing time, but I let the PC work overnight.
Honestly, I don't know how the new traineddata file does the job better. I downloaded it at:
https://github.com/tesseract-ocr/tessdata_best

Related

CLI- soffice convert csv to pdf with semicolon as delimiter

I want to convert a csv file to a pdf file from the command line using the soffice command, but my csv file is semicolon-separated instead of comma-separated.
If I use command:
soffice --convert-to pdf ./sampleCSVFile.csv
This gives me a pdf file, but the ; separators show up in the output. I found an article about converting ods to csv with a semicolon as the delimiter: https://ask.libreoffice.org/t/cli-convert-ods-to-csv-with-semicolon-as-delimiter/5021
So similar to that I tried:
unoconv -f pdf -e FilterOptions="59,34,0,1" ./sampleCSVFile.csv
But it didn't help.
sampleCSVFile.csv as follow:
Level 1;Level2
Level 1;Level2
Level 1 ;Level2
Level 1;Level2
Level 1 ;Level2
Level 1;Level2
Level 1;Level2
Level 1;Level2
Level 1;Level2
Is there a way to convert this semicolon-separated csv file to pdf?
(without first changing the delimiter from semicolon to comma)
Traditionally in DOS you used Edlin to write a text file, then either Copy or Type it to the CON, COM or LPT device (Line PrinTer).
Windows still allows the print command to do that, and it's possible to echo text via Notepad to a PDF virtual printer as a port. I will skip that, as it's not quite suited to your usage; it's also not cross-platform, and there are simpler ways to convert text to pdf per platform.
You ask about soffice, and the principle is much the same as it has been since before PDFs were invented:
soffice --infilter="calc_pdf_export" --convert-to pdf sampleCSVFile.csv
The text you export is the same as the text you import, although printing blind like this can add default print headers, footers (Page 1) and styles, because it is the most basic of methods: whatever is in your character-separated-values file will produce similar output. The only difference is that there is no such thing as a tab or line wrap in a PDF, since it acts as a virtual laser printer, not a mechanical line-feed one.
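Untested, but building on the FilterOptions string from the linked LibreOffice answer (59 is the ASCII code for the semicolon and 34 for the double quote), passing the same options to the CSV import filter may split the columns correctly before the PDF export:
soffice --infilter="Text - txt - csv (StarCalc):59,34,0,1" --convert-to pdf ./sampleCSVFile.csv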

Search substring in binary file

Friends, please help me with my issue. I have an application which processes data and generates output files (different formats, but mostly images). In every generated file, that application puts its watermark: a string that looks like "03-24-5532 [some cyrillic text]".
Every time I use that application, I need to edit each file in Photoshop to replace the watermark string with the required one, and that takes a lot of time.
Is it possible to search for that substring in the application's binary data files (using a hex editor or something else) and replace it? What is the best way to solve this problem?
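If the string is stored as plain bytes, a hex editor can do it, and so can a few lines of Python 3. A minimal sketch (the file names, the ascii codec, and the replacement value are placeholder assumptions; the key constraint is that the replacement must have exactly the same byte length, since changing the length corrupts offsets inside a binary):

# Overwrite a byte string in a binary file without changing its length.
old = '03-24-5532'.encode('ascii')   # the visible part of the watermark
new = '99-99-0000'.encode('ascii')   # placeholder replacement, same length
assert len(old) == len(new), 'replacement must keep the exact byte length'

with open('app.bin', 'rb') as f:     # 'app.bin' is a placeholder file name
    data = f.read()
with open('app_patched.bin', 'wb') as f:
    f.write(data.replace(old, new))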

How to generate pdf file of text and image in linux?

I am generating a logfile on one of my servers, storing a lot of data, then sending it to my mail once a month as a pdf file.
The process I am using is to cat a lot of command output into a text file, then convert it and send it.
Are there any Linux programs or an easy way to do something similar and also add an image I have stored on the server to the pdf file?
This answer assumes that you just want to put the image at the end of the PDF.
You could first convert the image to a PDF using ImageMagick (this also works with other file types):
convert image.jpg image.pdf
Then, you can use a tool like stapler or pdftk to combine your generated text PDF and the image.pdf (you can add multiple images):
stapler cat text.pdf image.pdf combined.pdf
pdftk text.pdf image.pdf output combined.pdf
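If the text PDF itself still needs to be produced from the logfile, one possibility (assuming LibreOffice is installed; logfile.txt stands in for your generated text file) is:
soffice --convert-to pdf logfile.txt
after which the stapler or pdftk commands above can merge the result with the image pages.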

Split large file into small files with particular extension

I want to split a large file into small files of 10000 lines each. I know I can do the same using:
split --lines=10000
However, the above command does not give extensions to the split files. I want to give all my split files the extension .txt. Is it possible to do this using split in Linux? If yes, then how?
Also, is it possible to number the files such that the first file is named a1.txt, the second file a2.txt, and so on? I know split names the files aa, ab, etc., but I want a1.txt, a2.txt, a3.txt, and so on instead.
Use the -d parameter, which switches to numeric suffixes:
split --lines=10000 -d <file>
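On newer GNU coreutils, split can also append the extension and start numbering at 1 (my reading of the flags, so verify against your version's man page):
split --lines=10000 --numeric-suffixes=1 --additional-suffix=.txt <file> a
This produces a01.txt, a02.txt, and so on; the numeric suffixes are zero-padded, so getting exactly a1.txt would still need a rename pass.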

Linux PdfToText function return blank text file

I've used a Linux tool to convert a list of PDF files to text.
Command:
pdftotext -htmlmeta
This works well for most of my files, but for a small number of them it returns a blank text file.
The unsuccessful pdf files were not encrypted, not password-protected, and not read-only.
Converting PDFs to text is not a well-defined process. It can work wonderfully or not at all, depending on the PDF input.
Why is this? Because a PDF's job is mainly to represent the optics of a document, not its textual contents. PDFs can be anything from pure text with positional information up to pure graphics of the glyphs of the letters of the text. In the latter case, one would need to run OCR on the input in order to recover text information; this is not done by tools like pdftotext.
Sometimes the text in the PDF is scattered throughout the file, e.g. because all standard-font letters come first and all italic-font letters appear later in the file (each with positional information, so a reader of the optical representation won't notice, even if standard and italics are mixed throughout the text on the page). Rearranging this mess into fluent text is a major task that not very many converters are capable of.
So I guess all you can do is try some more converters for PDF to text (some are better than others, and some are better just for some specific input) or see that you can get the text from another source than the PDF files.
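For the glyphs-as-graphics case, one hedged workaround in the spirit of the tesseract answers above is to rasterize the pages and OCR the images (output names follow pdftoppm's page-1.png, page-2.png convention):
pdftoppm -png input.pdf page
Then run tesseract on the resulting PNG files, for example with the for loop shown earlier on this page.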
