how to merge pdf as a table with pdftk or convert - linux

How can one use convert or pdftk to merge several pdfs organized as a table?
For example, given 4 files: file1.pdf, file2.pdf, file3.pdf, file4.pdf, each of a single page, I would like to have a single-page pdf like
file1.pdf file2.pdf
file3.pdf file4.pdf
That is, the files are arranged like an array.

By far the easiest way to convert 4 PDF pages to 1 page on any OS is by N-Up imposition/printing with output to a virtual PDF printer such as Ghostscript. For the most basic 4-Up command line usage see https://stackoverflow.com/a/72850245/10802527
Thus, to combine 4 pages per sheet (other counts such as 2, 6, 9, or 16 are possible), a GUI print dialog lets you very easily set the order.
On Linux or MacOS you can use, along with other options, the CUPS command
lp -o number-up=4 filename
see https://www.cups.org/doc/options.html
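If the default placement order doesn't match the desired grid, CUPS also accepts a layout option; for the question's arrangement something like this should work (a sketch, assuming the four pages are already merged into one merged.pdf; lrtb = left-to-right, top-to-bottom):
lp -o number-up=4 -o number-up-layout=lrtb merged.pdf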
The major advantage over using tools such as PDFtk with convert is that it handles the scaling while preserving most PDF structures, without degrading to inferior down-scaled imagery, because the pages are NOT passed in and out of images before calling Ghostscript.
If you have single-page PDFs, you can merge them before printing using PDFtk or poppler's pdfunite; note that with either, the original PDF format is preserved.
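For the four single-page files from the question, either of these produces a multi-page input (a sketch; merged.pdf is an arbitrary name):
pdftk file1.pdf file2.pdf file3.pdf file4.pdf cat output merged.pdf
pdfunite file1.pdf file2.pdf file3.pdf file4.pdf merged.pdf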
If you want to convert the pages to half-size images, stitch them together, and then reprint to one PDF page, that can easily be done using ImageMagick convert and other commands that call Ghostscript directly to suit your requirements. However, the results will in many ways be degraded by the translation to image output.
Since all of the above pass through Ghostscript, it makes sense, where possible, to install it as a PDF printer driver.
If you want to avoid installing Ghostscript for printing, you can use the cross-platform Coherent cpdf (it only uses Ghostscript if the files need repairs).
Note these are "Windows double-quoted names"; adjust as required. This is based on 4 sequential pages in one file being placed 4 at a time on each new page, so it can be used with any multiple of 4 pages in input.pdf. Each -twoup pass places two pages side by side and rotates the result 90 degrees, which is why each is followed by a -rotate 90 step; two passes give the 2x2 grid.
cpdf -twoup "input.pdf" -o "in-2-Up-tmp.pdf"
cpdf "in-2-Up-tmp.pdf" -rotate 90 -o "out-2-Up.pdf"
cpdf -twoup "out-2-Up.pdf" -o "out-4-Up-tmp.pdf"
cpdf "out-4-Up-tmp.pdf" -rotate 90 -o "out-4-Up.pdf"

Related

Tesseract Batch Convert Images to Searchable PDF And Multiple Corresponding Text Files

I’m using tesseract to batch convert a list of images to both a searchable PDF and a TXT file containing the OCR'd text.
tesseract infile outfile -l eng myconfig
infile contains a list of image paths to process
myconfig contains tesseract preferences to specify the output types (tessedit_create_text 1 and tessedit_create_pdf 1)
This leaves me with outfile.pdf and outfile.txt, the latter of which contains page separators for delimiting text between images.
What I’m really looking to do, however, is to output multiple TXT files on a per-image basis, using the same corresponding image name. For example, Image1.jpg.txt, Image2.jpg.txt, Image3.jpg.txt...
Does tesseract have the option to support this behavior natively? I realize that I can loop through the image file list and execute tesseract on a per-image basis, but this is not ideal as I’d also have to run tesseract a second time to generate the merged PDF. Instead, I’d like to run both options at the same time, with less overall execution time.
I also realize that I can split the merged TXT file on the page separator into multiple text files, but then I have to introduce less elegant code to map and rename all of those split files to correspond to their original image names: Rename 0001.txt to Image1.jpg.txt...
I’m working with both Python 3 and Linux commands at my disposal.
You can prepare a batch file that loops through the input images and outputs both txt and pdf at the same time -- more efficient, one single OCR operation instead of two. You can then split the output .txt file into pages.
tesseract inimagefile outfile txt pdf
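A minimal sketch of that batch approach on Linux (assuming JPEG inputs; pdfunite from poppler-utils is one way to merge the per-image PDFs afterwards, in lexical filename order):
for f in *.jpg; do
    tesseract "$f" "$f" txt pdf   # one OCR pass per image writes f.jpg.txt and f.jpg.pdf
done
pdfunite *.jpg.pdf merged.pdf
Note this also yields exactly the Image1.jpg.txt naming pattern asked for, since the output base name is the full image filename.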
Converting multiple images to a single PDF file.
On Linux, you can list all images and then pipe them to tesseract
ls *.jpg | tesseract - yourFileName txt pdf
Where:
yourFileName: is the name of the output file.
txt pdf: are the output formats; you can also use only one of them.
Converting images to individual text files
On Linux, you can use the for loop to go through files and execute an action for every file.
for FILE in *.jpg; do tesseract "$FILE" "${FILE::-4}"; done
Where:
for FILE in *.jpg : loop through all JPG files (you can change the extension based on your format)
$FILE: is the name of the image file, e.g. 001.jpg
${FILE::-4}: is the name of the image but without the extension, e.g. 001.jpg will be 001 because we removed the last 4 characters.
We need this to name the text files to the corresponding names, e.g.
001.jpg will be converted to 001.txt
002.jpg will be converted to 002.txt
Since Tesseract doesn't seem to handle this natively, I've just developed a function to split the merged TXT file on the page separator into multiple text files. From my observations, though, I'm not sure that Tesseract runs any faster when converting batch images to both PDF and TXT simultaneously (versus running it twice -- once for PDF, and once for TXT).
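For reference, a splitter along those lines can also be sketched in shell, assuming tesseract's default page separator (a form feed, \f) and a file infile listing the images in order (all names here are illustrative; the merged file is re-read once per image for simplicity):
i=1
while IFS= read -r img; do
    awk -v RS='\f' -v n="$i" 'NR == n { printf "%s", $0 }' outfile.txt > "$img.txt"
    i=$((i+1))
done < infile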
Thank you!
BTW, I'm using 4.1.1.
I also discovered another traineddata file for the Spanish language that does a better job than the standard one; it actually recognizes the "o" character well. The only problem is the processing time, but I let the PC work overnight.
Honestly, I don't know why the new traineddata file does a better job. I downloaded it at:
https://github.com/tesseract-ocr/tessdata_best

Convert a folder of PDFs into a csv of CMYK values

tl;dr: How can I convert a folder of PDFs into a list of CMYK values (or RGB or any kind of colour-scale values), preferably in Python?
I have a folder with around 100,000 documents in it. To make sampling these documents easier I want to run data analysis on them (clustering and anomaly detection), and one metric I want to have is the CMYK coverage. Is there any method or package, preferably in Python, that will calculate the CMYK coverage of a PDF?
Edit:
After some research I have found that Ghostscript should provide the functionality I require; if anyone could help me with the implementation I would still really appreciate it.
./gs -sDEVICE=inkcov -sOutputFile=out.txt input.pdf should give you each page's CMYK coverage in a file.
You could use -dQUIET -o - instead of -sOutputFile to send the output to stdout.
You then need some batch scripting which will depend on your Operating System. On Windows something like:
for %s in (folder/*.pdf) do gswin64c -dQUIET -sDEVICE=inkcov -o - "%s" >> coverage.txt
ought to take every file from the folder, run it through the inkcov device and send the output to stdout, which we redirect to a file and use >> so that each execution appends to the file instead of overwriting the previous output.
You will need to delete the output file after each run of course.
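On Linux, the equivalent loop might look like this (a sketch; gs is the usual binary name there, and each page prints one line of C, M, Y, K coverage values followed by "CMYK OK"):
for f in folder/*.pdf; do
    gs -dQUIET -sDEVICE=inkcov -o - "$f" >> coverage.txt
done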

Ghostscript /CropBox not printing correctly in linux

I'm using the domestic shipping label API from USPS to generate domestic shipping labels in PDF format. I managed to crop the top section of the PDF file, which is the label needed by USPS, and ignored the bottom section, which is the receipt and is not needed for shipping.
I use the Ghostscript /CropBox to crop the section I want, which is successful, but when I try to print the cropped PDF file via Linux CUPS I get the whole uncropped PDF printed instead. Why is it still printing the whole file instead of just the cropped section?
Here's the script I'm using to crop the USPS shipping label.
gs -o cropped.pdf -sDEVICE=pdfwrite -c "[/CropBox [50.4 460.5 484.4 750.5] /PAGES pdfmark" -f uncropped.pdf
Then, to change its orientation to portrait, I use pdftk:
pdftk cropped.pdf cat 1L output cropped_portrait.pdf
To print it via Linux CUPS I'm using the command:
lp cropped_portrait.pdf
But when I print it, it prints the uncropped.pdf file instead of cropped_portrait.pdf.
Why is it doing that? I even deleted uncropped.pdf and tried printing again, but it still prints uncropped.pdf.
Here are the two files, the uncropped and cropped USPS shipping labels.
Uncropped PDF file
Cropped PDF file
Hope you can help me on this one,
Thank you
Presumably the reduced PDF file displays correctly, so there is no problem with Ghostscript producing the PDF file.
As to why the printing process doesn't respect the CropBox, there is no reason really why it should. There are many Boxes in PDF and no real way for a print application to know which one you want to use. As a result printing applications often default to the MediaBox, which you haven't altered (Note that altering the CropBox doesn't change the content of the PDF file, just what is displayed).
Now, if your CUPS chain is using Ghostscript to render the PDF file, or convert it to PostScript, then this can be solved: you need to add -dUseCropBox to the command line. However, I'm not a CUPS expert, so I can't tell you how to do that. If CUPS isn't using Ghostscript, then it's probably still possible to instruct whatever is doing the conversion to use the CropBox, but you're going to have to find out what application is involved and alter the command appropriately for that application.
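If changing the CUPS configuration isn't practical, one workaround (a sketch, assuming Ghostscript is available) is to re-run the cropped file through the pdfwrite device with -dUseCropBox; the resulting file's MediaBox matches the crop, so it should print the same everywhere:
gs -o cropped_fixed.pdf -sDEVICE=pdfwrite -dUseCropBox -f cropped.pdf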

print a big, landscape PDF drawing to a3 LaserJet4 (portrait) using ghostscript 9.0

We want to print big drawings (up to A0 and sometimes longer) to A3 printers using Ghostscript:
gs -o - -sDEVICE=pdfwrite -r1200x1200 -sPAPERSIZE=a3 -f /S/tmp/SamplePDFnewStamp.pdf | gs -o resized.pcl -sDEVICE=ljet4 -g7012x4961 -dPDFFitPage -
I get A4 landscape on A3 portrait paper. I also tried to rotate:
gs -sOutputFile="-" -sDEVICE=pdfwrite -r1200x1200 -sPAPERSIZE=a3 -dBATCH -dNOPAUSE -dAutoRotatePages=/None -dPDFFitPage -c "<</Orientation 1>> setpagedevice 90 rotate 0 -595 translate" -f /S/tmp/SamplePDFnewStamp.pdf -c quit | gs -o resized.pcl -sDEVICE=ljet4 -g7012x4961 -dPDFFitPage -
getting the same result.
It's not really possible to comment without seeing a PDF file, but a number of the command-line options you are using don't make sense in the combination you have.
The first thing I would do is stop piping the commands like that, at least while investigating the problem. Do it as 2 stages, that will allow you (and others) to look at the intermediate PDF file.
Secondly, I don't believe you can do what you seem to be trying to do. It looks like you are trying to pipe the PDF produced by the first invocation of gs through the second invocation. I don't see any way that will work: the pdfwrite device needs to seek around the file in order to create the xref table, so it cannot use stdout, at least in the current version. What version of Ghostscript are you using?
I also can't see the point of this: why take a PDF, make a new PDF from it, and then render the second PDF? Why not just render the original?
None of the media size switches you are specifying will have any effect, because you haven't told Ghostscript that the media size is fixed (using -dFIXEDMEDIA). As a result the PDF interpreter will set the media size to be the same as the MediaBox in the PDF file. Similar problems apply with sending PostScript and expecting it to alter the behaviour of Ghostscript when rendering a PDF file.
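For illustration, fixing the media at A3 while rendering might look like this (a sketch combining the switches discussed, not the asker's exact pipeline):
gs -o out.pcl -sDEVICE=ljet4 -sPAPERSIZE=a3 -dFIXEDMEDIA -dPDFFitPage input.pdf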
Setting the resolution for pdfwrite is not a good idea and will in general have no effect. Even if it does have an effect, you probably don't want to set it to the resolution of the device (and the -g values suggest this is not a 1200 dpi device either). The only effect the resolution has is when objects have to be rendered to images because they can't be represented in PDF. You don't want to create images at the printer resolution; somewhere between one quarter and one half of that resolution is usually sufficient.
If you'd care to share an example PDF file, I may be able to tell you how to solve your orientation problem. You will need to explain why you are running it through pdfwrite before going to PCL though, I can't see any reason for that.
This:
gs -sDEVICE=pdfwrite -sOutputFile=\temp\out.pdf -dDEVICEHEIGHTPOINTS=2386.08 -dDEVICEWIDTHPOINTS=1685.7600 -dFIXEDMEDIA -dPDFFitPage SamplePDFnewStamp.pdf
Will take your original PDF file and produce a PDF file rotated by 90 degrees. If I then do:
gs -sDEVICE=ljet4 -sOutputFile=\temp\out.pcl \temp\out.pdf
I get a PCL file that, when processed by GhostPDL with appropriate media size, seems to do what you want.
I haven't tried it, due to lack of an actual device to print on, but I would expect that:
gs -sDEVICE=ljet4 -sOutputFile=\temp\out.pcl -dDEVICEHEIGHTPOINTS=2386.08 -dDEVICEWIDTHPOINTS=1685.7600 -dFIXEDMEDIA -dPDFFitPage SamplePDFnewStamp.pdf
would produce the same file in one step.
I found a solution:
gs -q -sDEVICE=ljet4 -sOutputFile=out.pcl -dFIXEDMEDIA -dPDFFitPage -sPAPERSIZE=a3 -c "<</Install {-1 -1 scale -843 -1192 translate}>> setpagedevice" -f SamplePDFnewStamp.pdf -c quit

What are the best parameters to run ImageMagick to convert low-quality PDF to images (for OCR)

I have several low-quality PDFs. I would like to use OCR -- to be more precise, Ocropus -- to get text from them. To do so, I first use ImageMagick -- a command-line tool to convert PDF to images -- to transform these PDFs into JPG or PNG.
However, ImageMagick produces very low-quality images and Ocropus hardly recognizes anything. I would like to learn the best parameters for handling low-quality PDFs so as to provide as-good-as-possible images to the OCR.
I have found this page, but I do not know where to start.
You can learn about the detailed settings of ImageMagick's "delegates" (external programs IM uses, such as Ghostscript) by typing
convert -list delegate
(On my system that's a list of 32 different commands.) Now to see which commands are used to convert to PNG, use this:
convert -list delegate | findstr /i png
Ok, this was for Windows. You didn't say which OS you use. [*] If you are on Linux, try this:
convert -list delegate | grep -i png
You'll discover that IM does produce PNG only from PS or EPS input. So how does IM get (E)PS from your PDF? Easy:
convert -list delegate | findstr /i PDF
convert -list delegate | grep -i PDF
Ah! It uses Ghostscript to make a PDF => PS conversion, then uses Ghostscript again to make a PS => PNG conversion. Works, but isn't the most efficient way if you know that Ghostscript can do PDF => PNG in one go. And faster. And in much better quality.
About IM's handling of PDF conversion to images via the Ghostscript delegate you should know two things first and foremost:
By default, if you don't give an extra parameter, Ghostscript will output images at a resolution of 72 dpi. That's why Karl's answer suggested adding -density 600, which tells Ghostscript to use a 600 dpi resolution for its image output.
The detour IM takes, calling Ghostscript twice to convert first PDF => PS and then PS => PNG, is a real blunder: you never gain and hardly keep quality in the first step, and very often lose some. Reasons:
PDF can handle transparencies, which PostScript cannot.
PDF can embed TrueType fonts, which PostScript cannot, etc.
(Conversion in the direction PS => PDF is not that critical.)
That's why I'd suggest you convert your PDFs in one go to PNG (or JPEG) using Ghostscript directly. And use the most recent version 8.71 (soon to be released: 9.01) of Ghostscript! Here are example commands:
gswin32c.exe ^
-sDEVICE=pngalpha ^
-o output/page_%03d.png ^
-r600 ^
d:/path/to/your/input.pdf
(This is the commandline for Windows. On Linux, use gs instead of gswin32c.exe, and \ instead of ^.) This command expects to find an output subdirectory where it will store a separate file for each PDF page. To produce JPEGs of good quality, try
gs \
-sDEVICE=jpeg \
-o output/page_%03d.jpeg \
-r600 \
-dJPEGQ=95 \
/path/to/your/input.pdf
(Linux command version). This direct conversion avoids the intermediate PostScript format, which may have lost your TrueType font and transparency object's information that were in the original PDF file.
[*] D'oh! I missed your "linux" tag at first...
-density 600 or so should give you what you need.
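For instance, at its simplest (a sketch; filenames are illustrative):
convert -density 600 input.pdf page_%03d.png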
At least two other tools you may want to consider:
pdfimages, which comes with the poppler-utils package, makes it easy to extract the images from a PDF without degrading them (see the example after this list).
pdfsandwich, which can give you an OCR'd file by simply running pdfsandwich inputfile.pdf. You may need to tweak the options to get a decent result. See the official page for more info.
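For example, a basic extraction run might look like this (a sketch; img is an arbitrary output prefix, and the -png flag, available in recent poppler versions, writes PNG instead of the default PPM/PBM):
pdfimages -png input.pdf img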
