How to generate a PDF file containing text and an image in Linux?

I am generating a log file on one of my servers.
It stores a lot of data, which I send to my email once a month as a PDF file.
The process I am using is to 'cat' the output of a lot of commands to a text file, then convert it and send it.
Are there any Linux programs, or some easy way, to do something similar and also add an image I have stored on the server to the PDF file?

This answer assumes that you just want to put the image at the end of the PDF.
You could first convert the image to a PDF using ImageMagick (this also works with other image formats):
convert image.jpg image.pdf
Then, you can use a tool like stapler or pdftk to combine your generated text PDF and the image.pdf (you can add multiple images):
stapler cat text.pdf image.pdf combined.pdf
pdftk text.pdf image.pdf cat output combined.pdf
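Putting the two steps together, a small wrapper could convert each image and merge everything in one go. This is only a sketch: append_images_to_pdf is a hypothetical helper name, and it assumes ImageMagick's convert and pdftk are both installed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append any number of images to an existing PDF.
# Assumes ImageMagick's `convert` and `pdftk` are available on PATH.
append_images_to_pdf() {
    local text_pdf=$1 output=$2
    shift 2
    local tmpdir pages=() i=0 img
    tmpdir=$(mktemp -d) || return 1
    for img in "$@"; do
        i=$((i + 1))
        # One temporary PDF per image, numbered to preserve the order.
        convert "$img" "$tmpdir/$i.pdf" || { rm -rf "$tmpdir"; return 1; }
        pages+=("$tmpdir/$i.pdf")
    done
    # Merge the text PDF followed by all image PDFs.
    pdftk "$text_pdf" "${pages[@]}" cat output "$output"
    local status=$?
    rm -rf "$tmpdir"
    return "$status"
}
```

It could be called as append_images_to_pdf text.pdf combined.pdf image.jpg logo.png.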

Related

Tesseract Batch Convert Images to Searchable PDF And Multiple Corresponding Text Files

I’m using tesseract to batch convert a list of images to both a searchable PDF as well as a TXT file containing the OCRd text.
tesseract infile outfile -l eng myconfig
infile contains a list of image paths to process
myconfig contains tesseract preferences to specify the output types (tessedit_create_text 1 and tessedit_create_pdf 1)
This leaves me with outfile.pdf and outfile.txt, the latter of which contains page separators for delimiting text between images.
What I’m really looking to do, however, is to output multiple TXT files on a per-image basis, using the same corresponding image name. For example, Image1.jpg.txt, Image2.jpg.txt, Image3.jpg.txt...
Does tesseract have the option to support this behavior natively? I realize that I can loop through the image file list and execute tesseract on a per-image basis, but this is not ideal as I’d also have to run tesseract a second time to generate the merged PDF. Instead, I’d like to run both options at the same time, with less overall execution time.
I also realize that I can split the merged TXT file on the page separator into multiple text files, but then I have to introduce less elegant code to map and rename all of those split files to correspond to their original image names: Rename 0001.txt to Image1.jpg.txt...
I’m working with both Python 3 and Linux commands at my disposal.
You can prepare a batch file that loops through the input images and outputs both TXT and PDF at the same time. This is more efficient: a single OCR operation instead of two. You can then split the output .txt file into pages.
tesseract inimagefile outfile txt pdf
Converting multiple images to a single PDF file.
On Linux, you can list all images and then pipe them to tesseract
ls *.jpg | tesseract - yourFileName txt pdf
Where:
yourFileName: is the name of the output file.
txt pdf: are the output formats, you can also use only one of them.
Converting images to individual text files
On Linux, you can use the for loop to go through files and execute an action for every file.
for FILE in *.jpg; do tesseract "$FILE" "${FILE::-4}"; done
Where:
for FILE in *.jpg : loop through all JPG files (you can change the extension based on your format)
$FILE: is the name of the image file, e.g. 001.jpg
${FILE::-4}: is the name of the image but without the extension, e.g. 001.jpg will be 001 because we removed the last 4 characters.
We need this to name the text files to the corresponding names, e.g.
001.jpg will be converted to 001.txt
002.jpg will be converted to 002.txt
Since Tesseract doesn't seem to handle this natively, I've just developed a function to split the merged TXT file on the page separator into multiple text files. Although from my observations, I'm not sure that Tesseract runs any faster by simultaneously converting batch images to both PDF and TXT (versus running it twice - once for PDF, and once for TXT).
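The splitting approach described here can be sketched in shell with awk. This is not the poster's function, just an illustration. It assumes that pages in the merged file are separated by a form feed (Tesseract's default page separator) and that the list file names one image per line, in the same order the pages were written. split_ocr_text is a made-up name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split a merged OCR text file into one file per image.
# Usage: split_ocr_text outfile.txt infile outdir
split_ocr_text() {
    local merged=$1 list=$2 outdir=$3
    mkdir -p "$outdir"
    # First file (NR == FNR): remember the base name of every listed image.
    # Second file: RS is switched to form feed, so each record is one page,
    # written to <image name>.txt (e.g. Image1.jpg.txt).
    awk -v outdir="$outdir" '
        NR == FNR { n = split($0, parts, "/"); name[++count] = parts[n]; next }
        FNR <= count { out = outdir "/" name[FNR] ".txt"; printf "%s", $0 > out; close(out) }
    ' "$list" RS='\f' "$merged"
}
```

Because the names come straight from the image list, no separate renaming step is needed afterwards.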
Thank you!
BTW, I'm using 4.1.1.
I also discovered another traineddata file for Spanish that does a better job than the standard one. It actually recognizes the "o" character well. The only problem is the processing time, but I let the PC work overnight.
Honestly, I don't know how the new traineddata file does the job better. I downloaded it from:
https://github.com/tesseract-ocr/tessdata_best

Embedded file in jpeg

I'm searching for an easy way to embed a file in a JPEG. I'm not trying to hide anything; I just want the additional information "built in" to the JPEG, so I don't need to encrypt anything. I found the EXIF interface, but there is no "additional file" tag; I can only add metadata such as the date.
The easiest way would be to just create an archive (e.g. a 7z file) and append the archive file to the end of the JPEG using the Windows copy command:
copy /b image.jpg + data.7z image_with_data.jpg
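On Linux, the same concatenation can be done with cat. A minimal sketch follows; the function names embed_file and extract_file are made up for illustration, and the extraction step assumes you recorded the size of the original image in bytes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append a payload (e.g. an archive) to the end of a JPEG.
embed_file() {
    local image=$1 payload=$2 output=$3
    cat "$image" "$payload" > "$output"
}

# Hypothetical helper: recover the payload, given the original image size in bytes.
extract_file() {
    local combined=$1 image_size=$2 output=$3
    # tail -c +N starts output at byte N (1-based), i.e. right after the image.
    tail -c "+$((image_size + 1))" "$combined" > "$output"
}
```

For example, embed_file image.jpg data.7z image_with_data.jpg, and later extract_file image_with_data.jpg "$(wc -c < image.jpg)" data.7z.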
Alternatively you could embed the information as IPTC data
See
How to Embed in JPEG
Hide files inside of JPEG images

Forwarding the results of text processing commands to certain locations in an .ods file

I am looking for an efficient way to import the data from a bunch of text files into an .ods file. I have no problem in processing the text files with commands like grep and sed, however, I do not know if it is possible to redirect the results of these commands into a certain location in an ods file.
The .ods file format is basically an xml file format. In the case of .fods it is straight xml. In the case of .ods it is zipped xml. So directly inserting content from text files will likely require some xml tools. I'm using Ubuntu and found xml2/2xml could be useful for converting between xml and xml-path-style text. (sudo apt-get install xml2)
So you will have to do the following:
unzip the .ods file - the cell data will be in a file called content.xml
xml2 < content.xml to get raw text out of the xml
Edit the raw text with your content
Convert the edited raw text back to xml using 2xml
Rezip up the previously unzipped .ods, including your edited content
This may be quite an involved/cumbersome process. Alternatively I'd suggest simply saving your .ods file as a .csv file instead and directly editing the comma-separated-values.

How do you print a multipage tiff file using CUPS (lp command)?

On a Linux system (Ubuntu) I have a multipage TIFF file (file.tiff).
When I send it to a printer using "lp file.tiff" command, only the first page prints.
How do I print all the pages?
I have the following known options:
Split the file to single-page TIFFs
Convert TIFF to PDF
I'd like to keep the multi-page TIFF and avoid creating other formats. Is there a way to make CUPS print all the pages from the multipage TIFF file?
(Please do not offer "convert the file" as an answer as I know those, I'm looking for a CUPS method, lpprintmultipagetiff --please?).
Use tiff2ps. The link is below. You could also set up a crude loop to print each page manually with CUPS.
for((i=1;i<=884;i++)); do <your lpr print command>; done
Note: 884 is just a guess at the last page number. Use $i in your lpr print command to select the page to print.
http://linux.about.com/library/cmd/blcmdl1_tiff2ps.htm
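Since tiff2ps writes PostScript to standard output, its output can be piped straight into lp without leaving a converted file on disk. A sketch, assuming libtiff's tiff2ps is installed and the print queue accepts PostScript; print_all_tiff_pages is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every page of a multipage TIFF through CUPS.
# Extra arguments (e.g. -d printer_name) are passed on to lp.
print_all_tiff_pages() {
    local tiff=$1
    shift
    # -a makes tiff2ps emit all pages (IFDs) instead of only the first one;
    # lp reads the job from standard input when no file is given.
    tiff2ps -a "$tiff" | lp "$@"
}
```

It would be called as print_all_tiff_pages file.tiff, which keeps the TIFF as the only file on disk.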

Ghostscript /CropBox not printing correctly in Linux

I'm using the USPS domestic shipping label API to generate domestic shipping labels in PDF format. I managed to crop out the top section of the PDF file, which is the label USPS needs, and ignore the bottom section, which is the receipt and is not needed for shipping.
I use the Ghostscript /CropBox to crop just the section I want, and that works. But when I try to print the cropped PDF file through CUPS on Linux, the whole uncropped PDF is printed instead of the cropped one. Why is it still printing the whole file instead of just the cropped section?
Here's the script I'm using to crop the usps Shipping label.
gs -o cropped.pdf -sDEVICE=pdfwrite -c "[/CropBox [50.4 460.5 484.4 750.5] /PAGES pdfmark" -f uncropped.pdf
Then, to change its orientation to portrait, I use pdftk:
pdftk cropped.pdf cat 1L output cropped_portrait.pdf
To print it through CUPS on Linux, I'm using this command:
lp cropped_portrait.pdf
But when I print it, the output looks like uncropped.pdf instead of cropped_portrait.pdf.
Why is it doing that? I even deleted uncropped.pdf and tried printing again, but it still prints the uncropped version.
Here's the two files the uncropped and cropped usps shipping labels.
Uncropped PDF file
Cropped PDF file
Hope you can help me on this one,
Thank you
Presumably the reduced PDF file displays correctly, so there is no problem with Ghostscript producing the PDF file.
As to why the printing process doesn't respect the CropBox, there is no reason really why it should. There are many Boxes in PDF and no real way for a print application to know which one you want to use. As a result, printing applications often default to the MediaBox, which you haven't altered. (Note that altering the CropBox doesn't change the content of the PDF file, just what is displayed.)
Now, if your CUPS chain is using Ghostscript to render the PDF file, or convert it to PostScript, then this can be solved: you need to add -dUseCropBox to the command line. However, I'm not a CUPS expert, so I can't tell you how to do that. If CUPS isn't using Ghostscript, then it's probably still possible to instruct whatever is doing the conversion to use the CropBox, but you're going to have to find out what application is involved and alter the command appropriately for that application.
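If the goal is only to get the cropped label onto paper, one possible workaround is to run the cropped PDF through Ghostscript once more with -dUseCropBox, so that the output page size is taken from the CropBox, and then print that result. This is a sketch under that assumption and has not been checked against any particular CUPS setup; bake_cropbox and label.pdf are placeholder names.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite a PDF so its page size follows the CropBox.
bake_cropbox() {
    local input=$1 output=$2
    # -dUseCropBox tells Ghostscript to size pages from the CropBox
    # instead of the MediaBox when it reads the input PDF.
    gs -q -o "$output" -sDEVICE=pdfwrite -dUseCropBox "$input"
}

# Example (not run here): bake_cropbox cropped_portrait.pdf label.pdf && lp label.pdf
```

This sidesteps the question of which box the CUPS filter chain uses, at the cost of one extra intermediate file.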

Resources