How to make exiftool use data not in the file's tags - gpx

https://www.exiftool.org/exiftool_pod.html#Advanced-formatting-feature describes how exiftool's advanced formatting feature can be used to create more complicated output files. I use that feature to create a GPX file from the GPS data of all images in a directory.
I would like a way to specify on the exiftool command line two strings for exiftool to put into the <name> and <desc> elements that can appear once at the beginning of the GPX file.
A less desirable approach would be for the format file to create a dummy string instead of the standard starting <gpx> tag and replace that afterward.  But putting it on the command line would be better. Or in a shell variable that exiftool can access.
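One possible route (a sketch, not tested against every exiftool version): exiftool's -userparam option defines named values that a -p format file can reference through the UserParam tag group, so the two strings can come straight from the command line or from shell variables. The format-file excerpt below is adapted from the gpx.fmt example distributed with exiftool; TrackName and TrackDesc are hypothetical parameter names.

```shell
# Build a format file whose HEAD section emits <name> and <desc>
# from user parameters (TrackName/TrackDesc are made-up names).
cat > gpx.fmt <<'EOF'
#[HEAD]<?xml version="1.0" encoding="utf-8"?>
#[HEAD]<gpx version="1.1" creator="exiftool">
#[HEAD]<name>$UserParam:TrackName</name>
#[HEAD]<desc>$UserParam:TrackDesc</desc>
#[HEAD]<trk><trkseg>
#[BODY]<trkpt lat="$gpslatitude#" lon="$gpslongitude#"><time>$gpsdatetime</time></trkpt>
#[TAIL]</trkseg></trk>
#[TAIL]</gpx>
EOF

# The two strings are then supplied on the command line (or taken
# from shell variables), e.g.:
#   exiftool -userparam TrackName="Morning ride" \
#            -userparam TrackDesc="$DESC" \
#            -p gpx.fmt -d %Y-%m-%dT%H:%M:%SZ dir/ > out.gpx
```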

Related

Tesseract Batch Convert Images to Searchable PDF And Multiple Corresponding Text Files

I’m using tesseract to batch convert a list of images to both a searchable PDF as well as a TXT file containing the OCR'd text.
tesseract infile outfile -l eng myconfig
infile contains a list of image paths to process
myconfig contains tesseract preferences to specify the output types (tessedit_create_text 1 and tessedit_create_pdf 1)
This leaves me with outfile.pdf and outfile.txt, the latter of which contains page separators for delimiting text between images.
What I’m really looking to do, however, is to output multiple TXT files on a per-image basis, using the same corresponding image name. For example, Image1.jpg.txt, Image2.jpg.txt, Image3.jpg.txt...
Does tesseract have the option to support this behavior natively? I realize that I can loop through the image file list and execute tesseract on a per-image basis, but this is not ideal as I’d also have to run tesseract a second time to generate the merged PDF. Instead, I’d like to run both options at the same time, with less overall execution time.
I also realize that I can split the merged TXT file on the page separator into multiple text files, but then I have to introduce less elegant code to map and rename all of those split files to correspond to their original image names: Rename 0001.txt to Image1.jpg.txt...
I’m working with both Python 3 and Linux commands at my disposal.
You can prepare a batch file that loops through the input images and outputs both TXT and PDF at the same time -- more efficient: one single OCR operation instead of two. You can then split the output .txt file into pages.
tesseract inimagefile outfile txt pdf
Converting multiple images to a single PDF file.
On Linux, you can list all images and then pipe them to tesseract
ls *.jpg | tesseract - yourFileName txt pdf
Where:
yourFileName: is the name of the output file.
txt pdf: are the output formats, you can also use only one of them.
Converting images to individual text files
On Linux, you can use the for loop to go through files and execute an action for every file.
for FILE in *.jpg; do tesseract "$FILE" "${FILE::-4}"; done
Where:
for FILE in *.jpg : loop through all JPG files (you can change the extension based on your format)
$FILE: is the name of the image file, e.g. 001.jpg
${FILE::-4}: is the name of the image but without the extension, e.g. 001.jpg will be 001 because we removed the last 4 characters.
We need this to name the text files to the corresponding names, e.g.
001.jpg will be converted to 001.txt
002.jpg will be converted to 002.txt
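A note on the slicing: ${FILE::-4} assumes a three-letter extension. Bash's suffix-stripping expansion removes the extension regardless of its length, which is a bit more robust (a small sketch):

```shell
# ${name%.*} drops everything from the last dot onward,
# so it works for .jpg, .jpeg, .tif, ...
f1=001.jpg
f2=002.jpeg
echo "${f1%.*}"   # 001
echo "${f2%.*}"   # 002
```

so the loop could equally be written as: for FILE in *.jpg; do tesseract "$FILE" "${FILE%.*}"; done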
Since Tesseract doesn't seem to handle this natively, I've just developed a function to split the merged TXT file on the page separator into multiple text files. Although from my observations, I'm not sure that Tesseract runs any faster by simultaneously converting batch images to both PDF and TXT (versus running it twice - once for PDF, and once for TXT).
Thank you!
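For reference, tesseract's merged text output separates pages with a form-feed character (\f), so the splitting-and-renaming step can also be sketched in shell, mapping each page back to the matching line of the image list (infile and outfile.txt as in the question; the demo data below stands in for the real files):

```shell
# Demo inputs standing in for the real ones:
#   infile      - the image list fed to tesseract (one path per line)
#   outfile.txt - the merged text output, pages separated by \f
printf 'Image1.jpg\nImage2.jpg\n' > infile
printf 'text of page one\ftext of page two\f' > outfile.txt

# First pass reads the image list; the second splits the merged
# text on form feeds and writes each page to <image name>.txt.
awk '
    NR == FNR { names[FNR] = $0; next }
    FNR in names {
        out = names[FNR] ".txt"
        printf "%s", $0 > out
        close(out)
    }
' infile RS="$(printf '\f')" outfile.txt
```

This produces Image1.jpg.txt and Image2.jpg.txt directly, with no intermediate 0001.txt-style names to rename.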
BTW I'm using 4.1.1.
And I discovered another traineddata file for the Spanish language that does a better job than the standard one. It actually recognizes the "o" character well. The only problem is the processing time, but I let the PC work overnight.
Honestly, I don't know why the new traineddata file does the job better. I downloaded it at:
https://github.com/tesseract-ocr/tessdata_best

Split large file into small files with particular extension

I want to split a large file into small files of 10000 lines each. I know I can do the same using:
split --lines=10000
However, the above command does not give extensions to the split files. I want to give all my split files the extension .txt. Is it possible to do this using split on Linux? If yes, then how?
Also, is it possible to number the files such that the first file has the name a1.txt, the second file has the name a2.txt, and so on? I know split names the files aa, ab, etc., but I want to replace this with a1.txt, a2.txt, a3.txt, a4.txt, a5.txt, a6.txt, a7.txt, etc.
Use the -d parameter, as in:
split --lines=10000 -d <file>
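With GNU split (coreutils), -d switches to numeric suffixes and --additional-suffix appends an extension, so a prefix of a gives a00.txt, a01.txt, ... Exact a1.txt-style names need a suffix length of 1 (-a 1) plus --numeric-suffixes=1, which caps the run at nine output files, so the two-digit form below is the safer default (a sketch against GNU coreutils; check split --help on your system):

```shell
# Demo input: 25 lines, split into chunks of 10.
seq 25 > bigfile

# -d                  -> numeric suffixes (00, 01, ...)
# --additional-suffix -> appended after the numeric suffix
# trailing "a"        -> output name prefix
split --lines=10 -d --additional-suffix=.txt bigfile a

ls a*.txt    # a00.txt a01.txt a02.txt
```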

Can gulp collect lines grep:ed into single output file?

I'm trying to filter out lines from all .js source files and put them into a separate file. (Specifically, I'm trying to grep all calls to a string translation function and post-process them.)
I think I have the different parts figured out but can't make them fit together.
For each file, process it
Write each file's grep:ed lines to output.
Append the result to a file
I've tried to through.push(<output per file>) from the plugin, but the following step expects a file, not a string.
From there, I expect I could do something like gulp-concat or a stream merge on the results and pipe it on to gulp.dest, but there's a bit missing here.
I figured out a way: simply replace the Vinyl file's contents with the lines to output, and pass that along via through.push.

How to identify line endings on a large number of files

Given a medium-size tree of files (a few hundred), is there some utility that can scan the whole tree (recursively) and display the name of each file and whether the file currently contains CRLF, LF, or mixed line terminators?
A GUI that can both display the current status and also selectively change specific files is preferred, but not essential.
Also prefer a solution for Windows, but I have access to both Bash for Windows and a Linux box that has access to the same file tree, so I can use something Linux-y if necessary.
Related Question: https://unix.stackexchange.com/questions/118959/how-to-find-files-that-contain-newline-in-filename
You can use Linux's find to look recursively for file names containing newline characters:
find . -name $'*[\n\r]*'
From there you can proceed to do what you need to do.
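The find above targets newlines in file names; for line terminators inside files, a content-based check can be sketched with grep: any line that ends in CRLF contains a carriage return, so listing files that contain a CR flags the CRLF (and mixed) ones. The -r/-I/-l flags are GNU grep; file(1), where available, also reports "with CRLF line terminators" per file.

```shell
# Demo tree: one CRLF file, one LF file.
mkdir -p demo
printf 'one\r\ntwo\r\n' > demo/crlf.txt
printf 'one\ntwo\n'     > demo/lf.txt

# -r recurse, -I skip binaries, -l print matching file names;
# $'\r' is a literal carriage return (bash ANSI-C quoting).
grep -rIl $'\r' demo
```

This prints demo/crlf.txt but not demo/lf.txt.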

Piping SVG file into ImageMagick

Sorry if this belongs on serverfault
I'm wondering what the proper way is to use an SVG(xml) string as standard input
for a "convert msvg:- jpeg:- 2>&1" command (using linux)
Currently I'm just saving a temp file to use as input,
but the data originates from an API in my case, so feeding
the string directly to the command would obviously be most efficient.
I appreciate everyone's help. Thanks!
This should work:
convert - output.jpg
Example:
convert logo: logo.svg
cat logo.svg | convert - logo.jpg
Explanation:
The example's first line creates an SVG file and writes it to disk. This is only a preparatory step so that we can run the second line.
The second line is a pipeline of two commands: cat streams the bytes of the file to stdout (standard output).
The first line served only as preparation for the next command in the pipeline, so that this next command has something to read in.
This next command is convert.
The - character is a way to tell convert to read its input data not from disk, but from stdin (standard input).
So convert reads its input data from its stdin and writes its JPEG output to the file logo.jpg.
So my first command/line is similar to your step described as 'currently I'm just saving a temp file to use as input'.
My second command/line does not use your API (I don't have access to it, do I?), but it demonstrates a different method to 'feeding a string directly to the command'.
So the most important lesson is this: wherever convert would usually read input from a file and you would write the file's name on the command line, you can replace the filename with - to tell convert it should read from stdin. (But you need to make sure that there is actually something offered on convert's standard input which it can digest...)
Sorry, I can't explain better than this...
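Applied to the original question (SVG text coming from an API rather than a file), the same stdin trick avoids the temp file entirely. The sketch below assumes ImageMagick's convert is installed (hence the guard) and uses msvg:- to force the stdin bytes to be parsed as SVG:

```shell
# An in-memory SVG string standing in for the API response.
svg='<svg xmlns="http://www.w3.org/2000/svg" width="8" height="8"><rect width="8" height="8" fill="red"/></svg>'

# Pipe the string straight into convert: msvg:- reads SVG from
# stdin, jpg:- writes JPEG to stdout (redirected to a file here).
if command -v convert >/dev/null 2>&1; then
    printf '%s' "$svg" | convert msvg:- jpg:- > out.jpg
fi
```

In the API case, the printf stage would simply be replaced by whatever command fetches the SVG string.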
