Optimize PDF files (with Ghostscript or other) - linux

Is Ghostscript the best option if you want to optimize a PDF file and reduce its size?
I need to store a lot of PDF files, so I need to reduce their file size as much as possible.
Does anyone have experience with Ghostscript and/or other tools?
Command line:
exec('gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dCompatibilityLevel=1.4'
    . ' -dPDFSETTINGS=/screen -sOutputFile=' . escapeshellarg($file_new)
    . ' ' . escapeshellarg($file));

If you are looking for Free (as in 'libre') Software, Ghostscript is surely your best choice. However, it is not always easy to use -- some of its (very powerful) processing options are hard to find documented.
Have a look at this answer, which explains how to exercise more detailed control over image resolution downsampling than the generic -dPDFSETTINGS=/screen allows (that setting defines a few overall defaults, which you may want to override):
How to downsample images within pdf file?
Basically, it tells you how to make Ghostscript downsample all images to a resolution of 72dpi (this value is what -dPDFSETTINGS=/screen uses -- you may want to go even lower):
-dDownsampleColorImages=true \
-dDownsampleGrayImages=true \
-dDownsampleMonoImages=true \
-dColorImageResolution=72 \
-dGrayImageResolution=72 \
-dMonoImageResolution=72 \
If you want to see whether Ghostscript can also 'un-embed' the fonts used (sometimes it works, sometimes not -- depending on the complexity of the embedded font, and also on the font type used), you can try adding the following to your gs command:
gs \
-o output.pdf \
[...other options...] \
-dEmbedAllFonts=false \
-dSubsetFonts=true \
-dConvertCMYKImagesToRGB=true \
-dCompressFonts=true \
-c ".setpdfwrite <</AlwaysEmbed [ ]>> setdistillerparams" \
-c ".setpdfwrite <</NeverEmbed [/Courier /Courier-Bold /Courier-Oblique /Courier-BoldOblique /Helvetica /Helvetica-Bold /Helvetica-Oblique /Helvetica-BoldOblique /Times-Roman /Times-Bold /Times-Italic /Times-BoldItalic /Symbol /ZapfDingbats /Arial]>> setdistillerparams" \
-f input.pdf
Note: Be aware that downsampling image resolution will reduce quality irreversibly, and un-embedding fonts will make it difficult or impossible to display and print the PDFs unless the same fonts are installed on the viewing machine.
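The downsampling switches listed above can be wrapped into one reusable command. This is only a minimal sketch: the function name shrink_pdf, the GS override variable and the 72 dpi default are assumptions made for illustration, while the Ghostscript switches are the ones quoted in this answer.

```shell
# Hypothetical helper: re-distill a PDF with all images downsampled to one resolution.
# GS may be overridden (e.g. GS=echo) to print the arguments instead of running gs.
shrink_pdf() {
    src=$1
    dst=$2
    dpi=${3:-72}
    "${GS:-gs}" -o "$dst" -sDEVICE=pdfwrite \
        -dDownsampleColorImages=true \
        -dDownsampleGrayImages=true \
        -dDownsampleMonoImages=true \
        -dColorImageResolution="$dpi" \
        -dGrayImageResolution="$dpi" \
        -dMonoImageResolution="$dpi" \
        -f "$src"
}

# Usage: shrink_pdf input.pdf output.pdf 72
```

Keeping the resolution as a parameter makes it easy to try several values on a copy of the file and compare the resulting sizes before committing to one.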
Update
One option which I had overlooked in my original answer is to add
-dDetectDuplicateImages=true
to the command line. This parameter makes Ghostscript try to detect any images which are embedded in the PDF multiple times. This can happen if you use an image as a logo or page background and the PDF-generating software is not optimized for this situation. This used to be the case with older versions of OpenOffice/LibreOffice (I tested the latest release of LibreOffice, v4.3.5.2, and it no longer does this).
It also happens if you concatenate PDF files with the help of pdftk. To show you the effect, and how you can discover it, let's look at a sample PDF file:
pdfinfo p1.pdf
Producer: libtiff / tiff2pdf - 20120922
CreationDate: Tue Jan 6 19:36:34 2015
ModDate: Tue Jan 6 19:36:34 2015
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 1
Encrypted: no
Page size: 595 x 842 pts (A4)
Page rot: 0
File size: 20983 bytes
Optimized: no
PDF version: 1.1
Recent versions of Poppler's pdfimages utility have added support for a -list parameter, which can list all images included in a PDF file:
pdfimages -list p1.pdf
page num type width height color comp bpc enc interp objectID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------
1 0 image 423 600 rgb 3 8 jpeg no 7 0 52 52 19.2K 2.6%
This sample PDF is a 1-page document, containing an image, which is compressed with JPEG-compression, has a width of 423 pixels and a height of 600 pixels and renders at a resolution of 52 PPI on the page.
If we concatenate 3 copies of this file with the help of pdftk like so:
pdftk p1.pdf p1.pdf p1.pdf cat output p3.pdf
then the result shows these image properties via pdfimages -list:
pdfimages -list p3.pdf
page num type width height color comp bpc enc interp objectID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------
1 0 image 423 600 rgb 3 8 jpeg no 4 0 52 52 19.2K 2.6%
2 1 image 423 600 rgb 3 8 jpeg no 8 0 52 52 19.2K 2.6%
3 2 image 423 600 rgb 3 8 jpeg no 12 0 52 52 19.2K 2.6%
This shows that there are 3 identical PDF objects (with the IDs 4, 8 and 12) which are embedded in p3.pdf now. p3.pdf consists of 3 pages:
pdfinfo p3.pdf | grep Pages:
Pages: 3
Optimize PDF by replacing duplicate images with references
Now we can apply the above-mentioned optimization with the help of Ghostscript:
gs -o p3-optim.pdf -sDEVICE=pdfwrite -dDetectDuplicateImages=true p3.pdf
Checking:
pdfimages -list p3-optim.pdf
page num type width height color comp bpc enc interp objectID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------
1 0 image 423 600 rgb 3 8 jpeg no 10 0 52 52 19.2K 2.6%
2 1 image 423 600 rgb 3 8 jpeg no 10 0 52 52 19.2K 2.6%
3 2 image 423 600 rgb 3 8 jpeg no 10 0 52 52 19.2K 2.6%
There is still one image listed per page -- but the PDF object ID is always the same now: 10.
ls -ltrh p1.pdf p3.pdf p3-optim.pdf
-rw-r--r-- 1 kp staff 20K Jan 6 19:36 p1.pdf
-rw-r--r-- 1 kp staff 60K Jan 6 19:37 p3.pdf
-rw-r--r-- 1 kp staff 16K Jan 6 19:40 p3-optim.pdf
As you can see, the "dumb" concatenation made with pdftk tripled the file size (from 20K to 60K). The optimization by Ghostscript brought it down to 16K, which is smaller than even the single-page original.
The most recent versions of Ghostscript may even apply the -dDetectDuplicateImages by default. (AFAIR, v9.02, which introduced it for the first time, didn't use it by default.)
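The check done by eye above (same object ID on every page) can be scripted. The following is a rough sketch, assuming the column layout of the pdfimages -list output shown in this answer, where the 11th and 12th fields are the object number and generation; the function name image_object_summary is made up for illustration.

```shell
# Hypothetical helper: read `pdfimages -list` output on stdin and print
# "<image entries> <distinct PDF objects>". The first two lines are the
# header and the dashed separator, so they are skipped.
image_object_summary() {
    awk 'NR > 2 {
             total++
             key = $11 " " $12            # object number + generation
             if (!(key in seen)) { seen[key] = 1; distinct++ }
         }
         END { print total + 0, distinct + 0 }'
}

# Usage: pdfimages -list p3.pdf | image_object_summary
```

For the listings above this prints "3 3" for p3.pdf and "3 1" for p3-optim.pdf. A first number much larger than the second suggests the images are already shared; equal numbers on a PDF with a repeated logo or background suggest that -dDetectDuplicateImages=true is worth trying.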

You can obtain good results by converting from PDF to PostScript, then back to PDF:
pdf2ps file.pdf file.ps
ps2pdf -dPDFSETTINGS=/ebook file.ps file-optimized.pdf
The value of the -dPDFSETTINGS argument defines the quality of the images in the resulting PDF. Options are, from low to high quality: /screen, /default, /ebook, /printer, /prepress, see http://milan.kupcevic.net/ghostscript-ps-pdf/ for a reference.
The PostScript file can become quite large, but the results are worth it. I went from a 60 MB PDF to a 140 MB PostScript file, but ended up with a 1.1 MB optimized PDF.

I use Ghostscript with the following options, taken from here.
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen \
-dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf

You may find that pdftocairo (from Poppler) can make smaller PDFs but beware that it will strip some features (such as hyperlinks) away.

This worked for me.
Convert your PDF to PS (this creates a large file):
pdf2ps large.pdf very_large.ps
Convert the new PS back to a PDF:
ps2pdf very_large.ps small.pdf
Source:
https://pandemoniumillusion.wordpress.com/2008/05/07/compress-a-pdf-with-pdftk/

You will lose quality, but if that's not an issue then ImageMagick's convert may prove helpful:
convert original.pdf reduced.pdf
Note that it doesn't always work: I once converted a 126 MB file into a 14 MB one using this command, but another time it doubled the size of a 350 KB file.
Anyway, it's worth a try.
As mentioned in the comments, there is no point in applying this command to a vector-based PDF; it is only useful for PDFs made of rasterized images.
See also this post for related options.

Ghostscript comes with the ps2pdf14 utility, which can be used to optimise PDF files, but on some occasions the "optimised" file may be bigger than the original.
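Since several answers here report that the "optimised" file can come out larger than the input, a batch job may want a guard that only accepts a result when it is actually smaller. A minimal sketch, with the function name keep_smaller invented for illustration; note that it overwrites the original, so try it on copies first.

```shell
# Hypothetical helper: replace the original with the candidate only if the
# candidate is non-empty and strictly smaller; otherwise discard the candidate.
keep_smaller() {
    original=$1
    candidate=$2
    [ -f "$original" ] && [ -f "$candidate" ] || return 1
    # Arithmetic expansion strips any padding that wc adds on some systems.
    orig_size=$(( $(wc -c < "$original") ))
    cand_size=$(( $(wc -c < "$candidate") ))
    if [ "$cand_size" -gt 0 ] && [ "$cand_size" -lt "$orig_size" ]; then
        mv -- "$candidate" "$original"
    else
        rm -f -- "$candidate"
    fi
}

# Usage: ps2pdf14 input.pdf input.tmp.pdf && keep_smaller input.pdf input.tmp.pdf
```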

For PDFs whose size is mainly due to embedded images (pdfimages -list is your friend), typically scanned documents, I would recommend ocrmypdf, which is quite good at optimizing, with an optional OCR layer as a bonus.

Related

Use several threads when rendering PDF to image using MuPDF

Is it possible to run mutool.exe draw using several threads to increase PDF to Image conversion speed?
The command help says something about the -B and -T parameters, but I do not understand what the maximum band_height does. What values should I set for -B?
-B - maximum band_height (pXm, pcl, pclm, ocr.pdf, ps, psd and png output only)
-T - number of threads to use for rendering (banded mode only)
Executing mutool with -B 100 -T 6 increased conversion speed only slightly, by about 10%. CPU usage rose from 6% to 11%, but why not to 60%?
mutool.exe draw -r 300 -B 100 -T 6 -o "C:\test%d.png" "C:\test-large.pdf"
Every system and PDF is different, but let's use a single page without text for timings on my system.
I know this file is complex, but it is not too unusual: without text, the other objects behave much as text would, minus the complexity of font look-up and so on, so rendering time is generally fairly consistent between runs.
Let's start with a low resolution, since I know the file well enough to have seen it fail with a malloc error on this machine at around 300 dpi.
mutool draw -Dst -r 50 -o complex.png complex.pdf
page complex.pdf 1 1691ms
total 1691ms (0ms layout) / 1 pages for an average of 1691ms
mutool draw -Dst -r 100 -o complex.png complex.pdf
page complex.pdf 1 3299ms
total 3299ms (0ms layout) / 1 pages for an average of 3299ms
mutool draw -Dst -r 200 -o complex.png complex.pdf
page complex.pdf 1 7959ms
total 7959ms (0ms layout) / 1 pages for an average of 7959ms
mutool draw -Dst -r 400 -o complex.png complex.pdf
page complex.pdf 1error: malloc of 2220451350 bytes failed
error: cannot draw 'complex.pdf'
So this is where banding is required to avoid memory issues, since my target is 400 dpi output.
You may have noticed that I used -D above. I need to remove it for threading, since multiple threads cannot be used without the display list. Let's start small, since bands that are too large or too many threads can also cause a malloc error.
mutool draw -st -B 32 -T 2 -r 400 -o complex.png complex.pdf
page complex.pdf 1 14111ms
total 14111ms (0ms layout) / 1 pages for an average of 14111ms
14 seconds for this file is not a bad result given the progression of timings above, but perhaps I could do better on this 8-thread device? Let's try bigger bands and more threads.
mutool draw -st -B 32 -T 3 -r 400 -o complex.png complex.pdf
page complex.pdf 1 12726ms
total 12726ms (0ms layout) / 1 pages for an average of 12726ms
mutool draw -st -B 256 -T 3 -r 400 -o complex.png complex.pdf
page complex.pdf 1 12234ms
total 12234ms (0ms layout) / 1 pages for an average of 12234ms
mutool draw -st -B 256 -T 6 -r 400 -o complex.png complex.pdf
page complex.pdf 1 12258ms
total 12258ms (0ms layout) / 1 pages for an average of 12258ms
So increasing the thread count to 3 helps and increasing the band size helps, but 6 threads is no better. Is there another tweak we can consider? After playing around with many runs, the best I got on this kit/configuration was 12 seconds:
mutool draw -Pst -B 128 -T 4 -r 400 -o complex.png complex.pdf
page complex.pdf 1 1111ms (interpretation) 10968ms (rendering) 12079ms (total)
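The trial-and-error search over band heights and thread counts above can be scripted so that the timings for one file are collected in a single run. This is only a sketch: the function name bench_mutool, the MUTOOL override variable and the bench.png output name are assumptions, and the band and thread values are simply the ones tried in this answer.

```shell
# Hypothetical helper: time one render for several band-height/thread
# combinations. MUTOOL may be overridden (e.g. MUTOOL=echo) for a dry run.
bench_mutool() {
    pdf=$1
    dpi=${2:-400}
    for band in 32 128 256; do
        for threads in 2 3 4 6; do
            echo "== -B $band -T $threads =="
            # -st prints the timing summary; the output image is overwritten each time.
            "${MUTOOL:-mutool}" draw -st -B "$band" -T "$threads" \
                -r "$dpi" -o bench.png "$pdf"
        done
    done
}

# Usage: bench_mutool complex.pdf 400
```

The best combination depends on the file and the machine, so the loop is meant to be rerun per document type rather than to yield one universal setting.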

Squashed then re-squashed gives a different size?

I extracted a firmware.bin using the Firmware Mod Kit (fmk), which gave me 3 files: header.img, rootfs.img and footer.img.
Whenever I cat all the files back together into firmware2.bin, it works and upgrades the router.
But when I unsquash rootfs.img into squashfs-root/ using this command:
unsquashfs rootfs.img
and then squash it again using:
mksquashfs rootfs-root/ squash_new.img -comp lzma -b 131072
(which, by the way, is the same compression method and block size as the original rootfs.img), the result is smaller than rootfs.img and the router reports that the upgrade failed.
Here are the sizes of the 2 files:
squash_new.img (9,945,088 bytes)
rootfs.img (9,945,232 bytes)
Is there a problem with unsquashfs or mksquashfs?
I ask because when I compared the files in a hex editor, I noticed that some entries are different although I have not changed anything.

ImageMagick takes too much time on a Linux server

I am facing an issue with the time ImageMagick takes to execute commands on my server. I also tried changing the thread limit from 20 (the default) to 1, but saw no improvement.
Here are some of the commands we run and the time they took on the server. Is there any way to reduce this execution time?
/usr/bin/convert source1.jpeg -resize 4518x3013! output.png
real 0m13.150s
user 0m18.320s
sys 0m2.029s
/usr/bin/convert output.png -crop 2408x3010+1053+0 +repage cropped.png
real 0m5.978s
user 0m5.043s
sys 0m0.881s
/usr/bin/convert destination.png -draw "image over 564,564 2408,3010 'cropped.png'" output.png
real 0m10.085s
user 0m11.160s
sys 0m1.710s
Updated Information
identify -version command output:
Version: ImageMagick 6.8.9-1 Q16 x86_64 2014-08-16 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2014 ImageMagick Studio LLC
Features: DPC OpenMP
Delegates: bzlib freetype gslib jng jpeg png ps tiff zlib
Server configuration:
OS version is centos 6
RAM 32GB
source1.jpeg(link)
The first command, executed with -bench 5, returned the output below.
Performance[1]: 5i 0.095ips 1.000e 90.970u 0:52.550
Performance[2]: 5i 0.104ips 0.522e 92.310u 0:48.110
Performance[3]: 5i 0.090ips 0.485e 93.420u 0:55.770
Performance[4]: 5i 0.086ips 0.474e 91.180u 0:58.230
Performance[5]: 5i 0.091ips 0.488e 94.850u 0:55.030
Thanks,
Sagar
Does this get you any faster?
convert input.png -quality 80% -resize 4518x3013! \
\( +clone -crop 2408x3010+1053+0 +repage \) \
-geometry +564+564 -composite output.png
For really fast PNG writing, use -quality 10 for drawings, -quality 11 for photos. This should cut your PNG-writing time by a factor of five.
The "quality" number doesn't affect image quality when writing a PNG. It only affects the compression effectiveness.
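To make the meaning of the PNG "quality" number concrete: as I understand the ImageMagick documentation for -quality, the tens digit is the zlib compression level and the ones digit selects the PNG filter type. The small helper below (png_quality_parts is a made-up name) just splits a value into those two parts.

```shell
# Split an ImageMagick PNG -quality value into its two parts:
# tens digit = zlib compression level, ones digit = PNG filter type.
png_quality_parts() {
    q=$1
    echo "zlib_level=$((q / 10)) filter=$((q % 10))"
}

# Usage: png_quality_parts 11
```

So -quality 10 and -quality 11 both ask for the lightest zlib compression level (1) and differ only in the filter, which is why they write quickly at the cost of a larger file.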

No output file from GhostScript PDF to PNG conversion

I have a two-page PDF I'm trying to convert to a PNG file. When I run:
gs -sDevice=pngalpha -o=gs-output-%d.png -r400 test1-0.pdf
I get:
GPL Ghostscript 9.07 (2013-02-14)
Copyright (C) 2012 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 2.
Page 1
%%BoundingBox: 35 35 577 757
%%HiResBoundingBox: 35.910001 35.910001 576.090022 756.090029
Page 2
%%BoundingBox: 35 35 577 757
%%HiResBoundingBox: 35.910001 35.910001 576.090022 756.090029
And then... nothing. No output files at all. Where am I going wrong?
You're so close you'll be mad ;-)
After a bit of manpage reading here's what worked for me:
gs -sDEVICE=pngalpha -ogs-output-%d.png -r400 test1-0.pdf
i.e. DEVICE instead of Device, and -o instead of -o=
In case it matters, my gs version is:
GPL Ghostscript 9.05 (2012-02-08)
Try this, which works perfectly for me and gives very good results:
gs -sDEVICE=pngalpha -o "$OUTPUTIMAGEFILE" -dFirstPage=1 -dLastPage=2 -dNOPAUSE -dGraphicsAlphaBits=4 -dTextAlphaBits=4 "$INPUTPDFFILE"
Note that -r400 sets the rendering resolution in dpi; it applies to any raster output device, including PNG, not only to JPEG.

Imagemagick use-trimbox doesn't work at all

I'm desperately trying to convert a PDF to JPG with ImageMagick (the convert command) while preserving the trimbox.
I run the following command (it converts only the first page).
convert -verbose -define pdf:use-trimbox=true "test_org.pdf[0]" cropped.jpg
Here is the output. It looks like ImageMagick doesn't pass the use-trimbox parameter to Ghostscript. Could that be the reason? At the moment the converted image has the MediaBox size, not the TrimBox size. The ImageMagick version is 6.0.7, and Ghostscript is GPL Ghostscript 8.64.
convert: "gs" -q -dBATCH -dSAFER -dMaxBitmap=500000000 -dNOPAUSE -dAlignToPixels=0 "-sDEVICE=bmpsep8" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-g652x935" "-r72x72" -dFirstPage=1 -dLastPage=1 "-sOutputFile=/tmp/magick-XXgTtZZG" "-f/tmp/magick-XXs4Kjq2" "-ftest_org.pdf".
/tmp/magick-XXgTtZZG[0] BMP 652x935 PseudoClass 256c 2.3mb 0.050u 0:01
/tmp/magick-XXgTtZZG[1] BMP 652x935 PseudoClass 256c 2.3mb 0.040u 0:01
/tmp/magick-XXgTtZZG[2] BMP 652x935 PseudoClass 256c 2.3mb 0.020u 0:01
/tmp/magick-XXgTtZZG[3] BMP 652x935 PseudoClass 256c 2.3mb 0.010u 0:01
test_org.pdf PDF 652x935 652x935+0+0 DirectClass 2.3mb 0.040u 0:01
test_org.pdf PDF 652x935 652x935+0+0 DirectClass 2.3mb 0.040u 0:01
test_org.pdf=>cropped.jpg PDF 652x935 652x935+0+0 DirectClass 202kb 0.120u 0:01
Your ImageMagick is possibly too old. It works fine in my case.
Version: ImageMagick 6.6.0-4 2010-11-16 Q16 http://www.imagemagick.org
-define pdf:use-trimbox=true makes convert invoke gs with -dUseTrimBox option, which I don't see in output provided by you. Consider updating ImageMagick.
Your ImageMagick 6.0.7 is more than 6 years old (dozens of releases back). Current is 6.7.0-9.
Your Ghostscript 8.64 is also more than 2 years old already (5 releases back). Current is 9.02.
My recommendation is to upgrade.
On my (Windows) system I have IM 6.7.0-8 and GS 9.02. Running -define pdf:use-trimbox=true works fine here and translates to a Ghostscript commandline parameter of -dUseTrimBox=true.
However (and this is important!): one should take into account that in many real-world PDFs, TrimBox is undefined or explicitly set to the same values as MediaBox. Both have the same effect: -dUseTrimBox=true will make no difference to the output compared with -dUseTrimBox=false.
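Before blaming ImageMagick or Ghostscript, it can help to check whether the TrimBox in a given PDF differs from the MediaBox at all. A rough sketch, assuming Poppler's pdfinfo is available and that pdfinfo -box prints lines of the form "MediaBox: x0 y0 x1 y1" and "TrimBox: x0 y0 x1 y1" for the first page (that layout is from memory, so verify it against your version); the function name trimbox_status is made up.

```shell
# Hypothetical helper: read `pdfinfo -box` output on stdin and report whether
# TrimBox differs from MediaBox. Assumes lines like
# "MediaBox:  0.00 0.00 595.00 842.00" and "TrimBox:  ...".
trimbox_status() {
    awk '$1 == "MediaBox:" { media = $2 " " $3 " " $4 " " $5 }
         $1 == "TrimBox:"  { trim  = $2 " " $3 " " $4 " " $5 }
         END {
             if (trim == "")         print "no TrimBox reported"
             else if (trim == media) print "TrimBox equals MediaBox"
             else                    print "TrimBox differs from MediaBox"
         }'
}

# Usage: pdfinfo -box test_org.pdf | trimbox_status
```

If this reports that the two boxes are equal, upgrading will not change the output size, because there is nothing for -dUseTrimBox to crop to.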

Resources