How can I add a new font to Tesseract 4.0? - text

I'm making a text identification program and I want to train my Tesseract 4.0 to identify a specific font (in Hebrew). How can I do it?
I tried "trainyourtesseract.com" (that didn't work at all) and "jTessBoxEditor" (I couldn't figure out how to make it work properly).
I would love to get some help with that issue.
Thanks.

Did you try reading this link? https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#tutorial-guide-to-lstmtraining
The rough approach is that you have to prepare your own language files (and most importantly your own .trainingtext file), then run tesstrain.sh to generate the dataset. After that, you can run combine_tessdata to extract the .lstm file from the original Hebrew model and use it as a parameter in the lstmtraining tool to finetune the original model with your new font.
UPDATE: the documentation link has changed: https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00
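To make those steps more concrete, here is a rough sketch, written as a dry run that only prints the commands so you can review them before running anything. The directory names (langdata, tessdata_best, train_out, finetuned), the output prefix newfont and the function name are placeholders I made up; the flags are the ones used in the fine-tuning section of the tutorial linked above, so check them against your installed version.

```shell
#!/bin/bash
# Dry run: print the fine-tuning commands instead of executing them.
# $1 = language code (e.g. heb), $2 = font name as known to the system.
# All directory names and the "newfont" prefix are hypothetical placeholders.
print_finetune_commands() {
    local lang=$1 font=$2
    local out=train_out tuned=finetuned

    # 1. Render the training text in the new font and build the dataset.
    echo "tesstrain.sh --fonts_dir /usr/share/fonts --lang $lang --linedata_only" \
         "--noextract_font_properties --langdata_dir langdata" \
         "--tessdata_dir tessdata_best --fontlist \"$font\" --output_dir $out"

    # 2. Extract the .lstm model from the existing traineddata file.
    echo "combine_tessdata -e tessdata_best/$lang.traineddata $tuned/$lang.lstm"

    # 3. Continue training from the extracted model on the new dataset.
    echo "lstmtraining --model_output $tuned/newfont" \
         "--continue_from $tuned/$lang.lstm" \
         "--traineddata tessdata_best/$lang.traineddata" \
         "--train_listfile $out/$lang.training_files.txt --max_iterations 400"

    # 4. Convert the resulting checkpoint into a usable traineddata file.
    echo "lstmtraining --stop_training --continue_from $tuned/newfont_checkpoint" \
         "--traineddata tessdata_best/$lang.traineddata" \
         "--model_output $tuned/$lang.traineddata"
}
```

Calling print_finetune_commands heb "My Hebrew Font" prints the four commands in order; once they look right for your paths, run them one at a time.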

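For a detailed walkthrough, watch this video: https://www.youtube.com/watch?v=N5Y6gZgvryQ
Here is a shell script for custom Tesseract training. Note that box.train, mftraining and cntraining train the legacy character classifier, not the LSTM engine.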
N=3 # number of training images
# Image names follow the pattern languagename.fontname.expN.filetype
# Step 1: Make the box files
for i in $(seq 1 $N)
do
tesseract testlan.arial.exp$i.png testlan.arial.exp$i batch.nochop makebox
done
# Edit the box files manually before continuing with the steps below.
# Step 2: Create the .tr files (combining each image file with its box file)
for i in $(seq 1 $N)
do
tesseract testlan.arial.exp$i.png testlan.arial.exp$i box.train
done
# Step 3: Extract the charset from all box files at once (output is the unicharset file)
unicharset_extractor testlan.arial.exp*.box
# Step 4: Create a font_properties file based on your needs.
# Format: fontname italic bold monospace serif fraktur (each flag is 0 or 1)
echo "arial 0 0 0 0 0" > font_properties
# Steps 5 and 6: Train on all .tr files at once, so later files do not overwrite earlier results
mftraining -F font_properties -U unicharset -O testlan.unicharset testlan.arial.exp*.tr
cntraining testlan.arial.exp*.tr
# After steps 5 and 6 the shapetable, inttemp, pffmtable and normproto files exist.
# Step 7: Rename the four files to [langname].shapetable, [langname].inttemp, [langname].pffmtable and [langname].normproto, then combine them
mv inttemp testlan.inttemp
mv normproto testlan.normproto
mv pffmtable testlan.pffmtable
mv shapetable testlan.shapetable
combine_tessdata testlan.
# Finally, move testlan.traineddata to the tessdata directory, e.g. C:\Program Files\Tesseract-OCR\tessdata

How to apply Praat script to an audio file?

I'm trying to change the formants of an audio file with Praat in Colab. I found a script that does that, along with its code and the code for calculating formants. I installed Praat:
!sudo apt-get update -y -qqq --fix-missing && apt-get install -y -qqq praat > /dev/null
!wget -qqq http://www.praatvocaltoolkit.com/downloads/plugin_VocalToolkit.zip
!unzip -qqq /content/plugin_VocalToolkit.zip > /dev/null
with open('/content/script.praat', 'w') as f:
    f.write(r"""writeInfoLine: preferencesDirectory$""")
!praat /content/script.praat
/root/.praat-dir
!mv /content/plugin_VocalToolkit/* /root/.praat-dir
!praat --version
Praat 6.0.37 (February 3 2018)
How can I apply this script to multiple wav files without UI, using linux command line or python?
The general answer
You don't. You run a script, and it's entirely up to the script how it works, what objects it works on, where those objects are fetched, how they are fetched, etc.
So you always have to look at how to apply a specific script, and that always entails figuring out how that script wants its input, and how to get to that point.
The specific answer
The page for the script you want says
This command [does something on] each selected Sound
so the first thing will be to open the files you want and select them.
Let's assume you'll be working with a small enough number of sounds to open them all in one go. If you are working on a lot of sound files, or files that are too large to hold in memory, you'll have to batch the job into smaller chunks.
One way to do this would be with a wrapper script that opened your files, selected them, and executed the other script you want:
# Get a list of all your files
files = Create Strings as file list: "list", "/some/path/*.wav"
total_files = Get number of strings
# Open each of them
for i to total_files
selectObject: files
filename$ = Get string: i
sounds[i] = Read from file: "/some/path/" + filename$
endfor
# Clear the selection
nocheck selectObject(undefined)
# Add each sound to your selection
for i to total_files
plusObject: sounds[i]
endfor
# Run your script. The general form is
#     runScript: path_to_script$, <list of arguments your script expects>
# In your specific case, it would be something like
runScript: preferencesDirectory$ + "/plugin_VocalToolkit/changeformants.praat",
... 500, 1500, 2500, 0, 0, 5500, "yes", "yes"
# Arguments, in order:
#   500, 1500, 2500, 0, 0   New F1, F2, F3, F4, and F5 means
#   5500                    Max formant
#   "yes"                   Process only voiced parts
#   "yes"                   Retrieve intensity contour
# Do something with whatever the script gives you
My Praat is pretty rusty, but this should at least give you an idea of what to do (disclaimer: I haven't run any of the above, but the concepts should be fine).
With that "wrapper" script stored somewhere, you can then execute it from the command line:
$ praat /path/to/wrapper.praat

Regarding comparison of 2 image sequences in Linux/Ubuntu

I have images in 2 different folders, 100 images in each. They are photographs taken from 2 different simulations, and the 100 images are the 100 time steps of each simulation. I wish to compare the images frame by frame. Can they be displayed on the screen with some software, such that I just need to press the arrow keys (up/down) and the images from BOTH sequences move forward/backward by one step, so that I can compare the 2 images frame by frame simultaneously? I do not wish to mathematically subtract the images, just compare them visually.
I have learned that Windows has avisynth and pdplayer for this. avxsynth is the Linux version of avisynth, but it is unstable on my computer.
This is the only question I found before posting this, and it is off-topic:
How to list an image sequence in an efficient way? Numerical sequence comparison in Python
Can anyone please suggest any other option?
Have you considered using ImageMagick, specifically montage, and a simple shell script to create a new set of images? (Where each new image consists of your two previous images glued (montaged) together side by side.)
ImageMagick will also support subtraction of images, which can be montaged into the new set too, should you change your mind about that.
ImageMagick can also create animations in which the images oscillate between your two runs so you can compare them more easily (i.e. create a 3rd folder with 100 new images, each of which is an animation alternating between the two runs).
You may want to consider generating a little html with a shell-script, and putting each image or set of images into its own webpage along with forward and backward buttons. It's pretty trivial, and gives you a nice little web-browser slideshow. You can pick up the necessary HTML in a couple of minutes. You wouldn't need much more than the A HREF and IMG tags. Webbrowsers support local file:// URLs. (Again, a 3rd directory with 100 html files linking to images in the first 2 directories.)
Or you could generate one big webpage with all the images in it, and just scroll up and down...
I am new to shell scripting and could not work out how to write the script myself. I could make use of montage. Suppose the 2 directories are dir1 and dir2, and each of them has five files: file_001, file_002, file_003, file_004, file_005. Can you please post the shell script?
Sure. I happen to like TCSH (or CSH) for this, just for the :t option...
Note: montage output filename needs an extension to tell montage what the output graphics filetype is... (e.g. .jpeg or .gif or whatever...)
% mkdir dir3
% ls -a *
dir1:
./ ../ file_001 file_002 file_003 file_004 file_005
dir2:
./ ../ file_001 file_002 file_003 file_004 file_005
dir3:
./ ../
% foreach VAR ( dir1/file* )
montage -background '#000000' -geometry +4+4 $VAR dir2/$VAR:t dir3/out_$VAR:t.jpeg
end
% ls d*
dir1:
./ ../ file_001 file_002 file_003 file_004 file_005
dir2:
./ ../ file_001 file_002 file_003 file_004 file_005
dir3:
./ out_file_001.jpeg out_file_003.jpeg out_file_005.jpeg
../ out_file_002.jpeg out_file_004.jpeg
Nothing to it... For HTML you could just echo text into a file...
% set Q = '"'
% mkdir dirfoo
% foreach VAR ( dir1/file* )
echo "<html><head></head><body><img src=${Q}../$VAR${Q}></img></body></html>" > dirfoo/$VAR:t.html
end
That sort of thing...
Perhaps:
% foreach VAR ( dir1/file* )
echo "<html><head></head><body><table><tr><td><img src=${Q}../$VAR${Q}></img></td><td><img src=${Q}../dir2/$VAR:t${Q}></img></td></tr></table></body></html>" > dirfoo/$VAR:t.html
end
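If you would rather use bash than tcsh, the HTML idea can be sketched roughly like this, with Prev/Next links added to each page. This is only a sketch under a few assumptions: both directories hold identically named files, they are given as relative names sitting next to the output directory, and the function name make_slideshow is made up for illustration.

```shell
#!/bin/bash
# Write one HTML page per frame into outdir, showing the matching images
# from dir1 and dir2 side by side, with Prev/Next links between pages.
# Assumes dir1 and dir2 contain identically named files and are siblings
# of outdir (the image paths are written relative to outdir).
make_slideshow() {
    local dir1=$1 dir2=$2 outdir=$3
    local files=() f i p n last name
    mkdir -p "$outdir"
    # Collect the frame names from dir1 (glob results are sorted).
    for f in "$dir1"/*; do
        files+=("${f##*/}")
    done
    last=$(( ${#files[@]} - 1 ))
    for ((i = 0; i <= last; i++)); do
        name=${files[i]}
        # Clamp the neighbours at both ends of the sequence.
        p=$(( i > 0 ? i - 1 : i ))
        n=$(( i < last ? i + 1 : i ))
        cat > "$outdir/$name.html" <<EOF
<html><head></head><body>
<a href="${files[p]}.html">Prev</a> <a href="${files[n]}.html">Next</a>
<table><tr>
<td><img src="../$dir1/$name"></td>
<td><img src="../$dir2/$name"></td>
</tr></table>
</body></html>
EOF
    done
}
```

Run make_slideshow dir1 dir2 dir3 from the directory that contains dir1 and dir2, then open dir3/file_001.html in a browser. The links need a click rather than an arrow key, but both sequences step together.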

creating multiple copies of a file in bash with a script

I am starting to learn how to use bash shell commands and scripting in Linux.
I want to create a script that will take a source file, and create a chosen number of named copies.
for example, I have the source as testFile, and I choose 15 copies, so it creates testFile1, 2, 3 ... 14, 15 in the same location.
To try and achieve this I have tried to make the following command:
for LABEL in {$X..$Y}; do cp $INPUT $INPUT$LABEL; done
However, instead of creating files from X to Y, it makes just one file with (for example) {1..5} appended instead of files 1, 2, 3, 4 and 5
How can I change it so it properly uses the variable as a number for the loop?
The brace expansion mechanism is a bit limited: it is performed before variable expansion, so {$X..$Y} is never seen as a range. It only works with literals.
For what you want, you probably have the seq command, and could write:
INPUT=testFile
for num in $(seq 1 15)
do
cp "$INPUT" "$INPUT$num"
done
Using a C-style for loop (here the bounds can be variables):
$ x=1 y=15
$ for ((i=x; i<=y; i++)); do cp "$INPUT" "$INPUT$i"; done
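Since the question asks for a script that takes a source file and a number of copies, the loop can be wrapped up roughly as follows. A minimal sketch, assuming bash; the function name make_copies and the usage message are made up for illustration.

```shell
#!/bin/bash
# Create COUNT numbered copies of SOURCE next to it: SOURCE1, SOURCE2, ...
make_copies() {
    local src=$1 count=$2 i
    # Refuse to run unless the source exists and the count is a plain number.
    if [[ ! -f $src || ! $count =~ ^[0-9]+$ ]]; then
        echo "usage: make_copies SOURCE COUNT" >&2
        return 1
    fi
    # C-style loop, so the upper bound can be a variable.
    for ((i = 1; i <= count; i++)); do
        cp -- "$src" "$src$i"
    done
}
```

For example, make_copies testFile 15 creates testFile1 through testFile15 in the same location.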

Tool to copy file from given 'x' (starting) offset to given 'y' (ending) offset

Is there any tool to copy a file from a given starting offset to a given end offset? I also want to confirm that the tool has copied the specified bytes correctly by running md5sum. Something like this:
1) Copy the source file starting from the 100th byte up to the 250th byte
$cp /path/to/source/file /path/to/dest/file -s 100 -e 250
2) Create the md5sum of the source file starting from the 100th byte up to the 250th byte
$md5sum /path/of/src/file -s 100 -e 250
xxxxxx-xxxxx-xxxxx-xxxx-xx
3) Confirm that destination file created from step 1 is right by comparing the md5sum generated from step 2.
$md5sum /path/of/dest/file
xxxxxx-xxxxx-xxxxx-xxxx-xx
I know md5sum doesn't have the option of -s and -e but I would like to confirm by some tool given the source file and the destination file. Thanks in advance
For 1) you can use dd. Note that count is the number of bytes to copy, not the end offset, so for offsets 100 to 250 it is 250 - 100 = 150:
# dd if=/path/to/source/file of=/path/to/destination/file bs=1 skip=100 count=150
For 2) I'm not really sure if that's achievable with standard tools.
[edit]
Aha, found a way:
For 2)
# dd if=/path/to/source/file bs=1 skip=100 count=150 | md5sum
And for 3)
# md5sum /path/to/destination/file
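The three steps can be wrapped into small helpers, roughly like this. It is a sketch rather than a polished tool: the function names are made up, start is treated as a zero-based offset and end as exclusive (so the byte count is end minus start), and bs=1 is slow on large ranges.

```shell
#!/bin/bash
# Copy bytes [start, end) of src into dest.
copy_range() {
    local src=$1 dest=$2 start=$3 end=$4
    dd if="$src" of="$dest" bs=1 skip="$start" count="$((end - start))" 2>/dev/null
}

# Print the MD5 of bytes [start, end) of src.
range_md5() {
    local src=$1 start=$2 end=$3
    dd if="$src" bs=1 skip="$start" count="$((end - start))" 2>/dev/null |
        md5sum | cut -d' ' -f1
}

# Succeed only if dest matches bytes [start, end) of src.
verify_range() {
    local src=$1 dest=$2 start=$3 end=$4
    [ "$(range_md5 "$src" "$start" "$end")" = "$(md5sum < "$dest" | cut -d' ' -f1)" ]
}
```

Usage would be copy_range /path/to/source/file /path/to/dest/file 100 250 followed by verify_range with the same four arguments; the second command returns a non-zero status if the checksums differ.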

GhostScript Batch Mode?

I'm using GhostScript to convert PDFs to PNGs, the problem is that for each page I'm calling:
gs -sDEVICE=pnggray -dBATCH -dNOPAUSE -dFirstPage=10 -dLastPage=10 -r600 -sOutputFile=image_10.png pdf_file.pdf
Which is not good, I want to pass dFirstPage=10 dLastPage=30 for example and make GhostScript automatically extract each page in a SEPARATE png file WITH the PAGE-NUMBER in the filename, without starting it again with different sOutputFile...
I know it's probably something simple but I'm missing it...
Also, it would be great if someone can tell me what parameter I need to pass to make ghostscript run in total silence, without any output to the console.
EDIT: Adding %d to the output parameter adds the number of the run, instead of the number of the page. For example:
-dFirstPage=10 -dLastPage=15 -sOutputFile=image_%d.png
results in:
image_1.png, image_2.png, image_3.png etc... instead of:
image_10.png, image_11.png, image_12.png ...
Save this as a file, e.g. batch_gs.sh:
#!/bin/bash
# Usage: batch_gs.sh stPage endPage File
if (( $# != 3 )) ; then printf 'usage : %s stPage endPage File\n' "${0##*/}" >&2 ; exit 1 ; fi
stPage=$1
endPage=$2
file=$3
i=$stPage
while (( i <= endPage )) ; do
gs -sstdout=/dev/null -sDEVICE=pnggray -dBATCH -dNOPAUSE -dFirstPage=$i -dLastPage=$i -r600 -sOutputFile=image_$i.png "$file"
(( i ++ ))
done
This restarts Ghostscript once per page with -dFirstPage=${i} -dLastPage=${i}, so each output file carries its real page number.
Then make it executable: chmod 755 batch_gs.sh
Finally run it with arguments:
./batch_gs.sh 3 5 fileName
(Lightly tested).
I hope this helps.
Unfortunately, what you want to do is not possible. See also my answers here and here.
If you want to do all PNG conversions in one go (without restarting Ghostscript for each new page), you have to live with the fact that the %d macro always numbers the first output page as 1, but of course you will gain much better performance.
If you do not like this naming convention in your end result, you have to add a second step that renames the resulting files to their final names.
Assuming your initial output files are named image_1.png ... image_15.png, but you want them named image_25.png ... image_39.png, your core command to do this would be:
for i in $(seq 1 15); do
mv image_${i}.png image_$(( ${i} + 24)).png
done
Note, this will go wrong if the two ranges of numbers intersect, as the command would then overwrite one of your not-yet-renamed input files. To be safe, don't use mv but use cp to copy the files under their new names into a temporary subdirectory first (create it with mkdir temp):
for i in $(seq 1 15); do
cp -a image_${i}.png temp/image_$(( ${i} + 14)).png
done
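Another way around the overlap problem, when the numbers are being shifted upward, is to rename from the highest number down, so a target name is never one that still has to be moved. A minimal sketch, assuming the image_N.png naming from above and a positive offset; the function name is made up for illustration.

```shell
#!/bin/bash
# Rename image_<first>.png .. image_<last>.png to image_<N+offset>.png.
# Works in place for positive offsets because it starts with the highest
# number, so no target name collides with a file that is still waiting.
shift_page_numbers() {
    local first=$1 last=$2 offset=$3 i
    for ((i = last; i >= first; i--)); do
        mv -- "image_${i}.png" "image_$((i + offset)).png"
    done
}
```

For example, shift_page_numbers 1 15 14 turns image_1.png ... image_15.png into image_15.png ... image_29.png without needing a temporary directory.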
