Repair an apparently damaged PDF and reduce file size - Linux

I have a PDF file (4.6 MB) which was made by combining 6 different PDFs (containing both text and bitmap graphics) using pdftk on Ubuntu 12.04. I wish to compress this file to something close to 2 MB without affecting its quality.
I have tried pdftk's "compress" option (it couldn't compress the file to 2 MB). I also tried converting it to PS first and then back to PDF, which gives the following warning:
****Warning: considering '0000000000 XXXXX n' as a free entry.
and then hangs. qpdf also failed, saying that the file is damaged.
Could someone help me out?

What result does Ghostscript give you? Try this command:
gs \
-o output.pdf \
-sDEVICE=pdfwrite \
-dPDFSETTINGS=/screen \
input.pdf
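
For reference, the -dPDFSETTINGS presets trade quality for size: /screen downsamples images to 72 DPI, /ebook to 150 DPI, and /printer to 300 DPI. If /screen hurts the bitmap graphics too much, try the middle preset:
gs \
-o output.pdf \
-sDEVICE=pdfwrite \
-dPDFSETTINGS=/ebook \
input.pdf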

Does this PDF file contain reserved info? If it has no confidential data, it would be interesting to see it.
Anyway, in many cases where qpdf fails, Multivalent works.
You can try to use its Compress tool (it also attempts to repair the PDF file).
Multivalent
https://rg.to/file/c6bd7f31bf8885bcaa69b50ffab7e355/Multivalent20060102.jar.html
(this is the latest free version that includes the tools; the current version no longer ships them)
java -cp path/to/Multivalent.jar tool.pdf.Compress file.pdf

This works for me to repair a damaged PDF:
sudo apt-get install mupdf-tools
mutool clean input.pdf output.pdf
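
If you also need to shrink the file, mutool clean can garbage-collect unused objects (-g, repeatable) and recompress streams (-z) while it repairs. A sketch, worth checking against mutool clean -h on your version:
mutool clean -g -g -z input.pdf output.pdf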

Related

How to convert a PDF into JPG with command line in Linux? [closed]

What are fast and reliable ways for converting a PDF into a (single) JPEG using the command line on Linux?
For the life of me, over the last 5 years, I cannot get ImageMagick to work consistently (if at all) for me, and I don't know why people continually recommend it again and again. I just googled how to convert a PDF to a JPEG today, found this answer, tried convert, and it doesn't work at all for me:
Broken command (doesn't work for me):
# BROKEN cmd
$ convert in.pdf out.jpg
convert-im6.q16: not authorized `in.pdf' @ error/constitute.c/ReadImage/412.
convert-im6.q16: no images defined `out.jpg' @ error/convert.c/ConvertImageCommand/3258.
(Update 24 Feb. 2022: there is a fix for ImageMagick's PDF security policy so that convert will work; see the sketch just below. I still like pdftoppm, below, much better, however.)
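For anyone who wants convert to work anyway: the "not authorized" error comes from ImageMagick's security policy, which disables the PDF coder by default on many distros. A sketch of the usual fix, assuming the Ubuntu ImageMagick 6 path (back up policy.xml first):
sudo sed -i 's/rights="none" pattern="PDF"/rights="read|write" pattern="PDF"/' /etc/ImageMagick-6/policy.xml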
Then, I remembered there was another tool I use and wrote about, so I googled "linux convert pdf to jpg Gabriel Staples", clicked the first hit, and scrolled down to my answer. Here's what works perfectly for me. This is the basic command format:
Good command--use this instead:
# GOOD cmd
pdftoppm -jpeg -r 300 input.pdf output
The -jpeg option sets the output image format to JPG, -r 300 sets the output image resolution to 300 DPI, and the word output is the prefix for all pages of images, which will be numbered and placed into the directory you are working in. A better way, in my opinion, is to run mkdir -p images first to create an "images" directory, then set the output to images/pg so that all output images are placed cleanly into the images dir you just created, with the prefix pg in front of each of their numbers.
Therefore, here are my favorite commands:
[Produces ~1MB-sized files per pg] Output in .jpg format at 300 DPI:
mkdir -p images && pdftoppm -jpeg -r 300 mypdf.pdf images/pg
[Produces ~2MB-sized files per pg] Output in .jpg format at highest quality (least compression) and still at 300 DPI:
mkdir -p images && pdftoppm -jpeg -jpegopt quality=100 -r 300 mypdf.pdf images/pg
If you need more resolution, you can try 600 DPI:
mkdir -p images && pdftoppm -jpeg -r 600 mypdf.pdf images/pg
...or 1200 DPI:
mkdir -p images && pdftoppm -jpeg -r 1200 mypdf.pdf images/pg
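Since the question asks for a single JPEG: pdftoppm can render just one page with -f (first page) and -l (last page), and -singlefile drops the page-number suffix. For page 1 only:
mkdir -p images && pdftoppm -jpeg -r 300 -f 1 -l 1 -singlefile mypdf.pdf images/pg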
See the references below for more details and options.
References:
[my answer] Convert PDF to image with high resolution
[my answer] Extracting embedded images from a PDF: https://askubuntu.com/questions/150100/extracting-embedded-images-from-a-pdf/1187844#1187844
Keywords: ubuntu linux convert pdf to images; pdf to jpeg; pdf to tiff; pdf2images; pdf2tiff; pdftoppm; pdftoimages; pdftotiff; pdftopng; pdf2png
You can try ImageMagick's convert utility.
On Ubuntu, you can install it with this command:
$ sudo apt-get install imagemagick
Use convert like this:
$ convert input.pdf output.jpg
# For good quality use these parameters
$ convert -density 300 -quality 100 in.pdf out.jpg
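Note that -density must come before the input file, since it sets the resolution at which the PDF is rasterized before conversion. To convert a single page, append its zero-based index in brackets:
$ convert -density 300 -quality 100 in.pdf[0] out.jpg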
libvips can convert PDF -> JPEG quickly. It comes with most Linux distributions, it's in Homebrew on macOS, and you can download a Windows binary from the libvips site.
This will render the PDF to a JPG at the default DPI (72):
vips copy somefile.pdf somefile.jpg
You can use the dpi option to set some other rendering resolution, eg.:
vips copy somefile.pdf[dpi=600] somefile.jpg
You can pick out pages like this:
vips copy somefile.pdf[dpi=600,page=12] somefile.jpg
Or render five pages starting from page three like this:
vips copy somefile.pdf[dpi=600,page=3,n=5] somefile.jpg
The docs for pdfload have all the options.
With this benchmark image, I see:
$ /usr/bin/time -f %M:%e convert -density 300 r8.pdf[3] x.jpg
276220:2.17
$ /usr/bin/time -f %M:%e pdftoppm -jpeg -r 300 -f 3 -l 3 r8.pdf x.jpg
91160:1.24
$ /usr/bin/time -f %M:%e vips copy r8.pdf[page=3,dpi=300] x.jpg
149572:0.53
So libvips is about 4x faster and needs half the memory, on this test at least.
convert from ImageMagick seems to do a good job:
convert file.pdf test.jpg
and in case multiple files were generated (one per page), you can stack them vertically:
convert test-*.jpg -append one.jpg
to generate a single file where all pages are concatenated.

ERROR -12 closing pdfwrite device in Ghostscript

In our module we are using Ghostscript to compress a large PDF down to a smaller size, with this command:
gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
-sOutputFile=output.pdf input.pdf
While converting, we get the error shown below:
GPL Ghostscript 9.10: Unrecoverable error, exit code 1
GPL Ghostscript 9.10: ERROR -12 closing pdfwrite device. See gs/psi/ierrors.h for code explanation
More Information:
We are using Ubuntu 14.04.
Thanks, Praveen Ravipati
Thanks KenS, I found the cause of the issue. I watched the tmp directory while compressing the PDF: compression uses a huge amount of temporary space, and at some point there was not enough tmp space left, which caused the I/O error. To fix this I added extra space for the tmp directory, then checked again, and it works fine.
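If adding disk space isn't an option: Ghostscript's documentation says it honors the TEMP/TMPDIR environment variables, so you should be able to point its temporary files at a partition that has room. A sketch (the path /mnt/bigdisk/tmp is a placeholder):
TMPDIR=/mnt/bigdisk/tmp gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
-dCompatibilityLevel=1.4 -sOutputFile=output.pdf input.pdf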

How do we merge 2 pages in a PDF file on Linux?

I have a PDF file of 10 pages, and I want to merge every two pages of it into a single page, like 1,2 -> 1; 3,4 -> 2; and so on. I learnt about Ghostscript, but those are tools for compressing the PDF. There are also utilities to merge two or more PDFs together into one, but unfortunately I could not find any to merge pages within the same PDF. Kindly help!
You can do this with cpdf.
cpdf -twoup in.pdf -o out.pdf
Installing pdfjam adds a command called pdfjoin that can be used to join multiple PDF files into one. If your distribution doesn't come with pdfjam, you could also try pdftk.
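pdfjam can also do the 2-up imposition asked about here directly, through its pdfnup wrapper; a sketch, assuming a standard pdfjam install:
pdfnup --nup 2x1 in.pdf --outfile out.pdf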
pdfunite is part of the poppler package
pdfunite 01.pdf 02.pdf 03.pdf out.pdf
You can do this with pdftk
pdftk test.pdf test1.pdf cat output output.pdf

An efficient way to detect corrupted PNG files?

I've written a program to process a bunch of PNG files that are generated by a separate process. The capture mostly works; however, there are times when the process dies and is restarted, which leaves a corrupted image. I have no way to detect when the process dies or which file it died on (there are ~3000 PNG files).
Is there a good way to check for a corrupted png file?
I know this is a question from 2010, but I think this is a better solution: pngcheck.
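For example, to scan a whole directory and report only the files that fail (-q makes pngcheck quiet for files that pass):
pngcheck -q *.png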
Since you're on a Linux system you probably already have Python installed.
An easy way would be to try loading and verifying the files with PIL (Python Imaging Library) (you'd need to install that first).
from PIL import Image

v_image = Image.open(file)  # file is the path of the PNG to test
v_image.verify()  # raises an exception if the file is corrupted
(taken verbatim from my own answer in this thread)
A different possible solution would be to slightly change how your processor processes the files: Have it always create a file named temp.png (for example), and then rename it to the "correct" name once it's done. That way, you know if there is a file named temp.png around, then the process got interrupted, whereas if there is no such file, then everything is good.
(A variant naming scheme would be to do what Firefox's downloader does -- append .partial to the real filename to get the temporary name.)
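A minimal shell sketch of that rename pattern, where capture_one_frame stands in for whatever your capture process actually runs:
capture_one_frame temp.png && mv temp.png "frame_$(date +%s).png"
Since mv (rename) is atomic on the same filesystem, any file carrying its final name is guaranteed complete.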
Kind of a hack, but works
If you are running on Linux or something like it, you might have the convert command:
$ convert --help
Version: ImageMagick 5.5.6 04/01/03 Q16 http://www.imagemagick.org
Copyright: Copyright (C) 2003 ImageMagick Studio LLC
Usage: convert [options ...] file [ [options ...] file ...] [options ...] file
If you make an invalid png, and then try to convert, you'll get an error:
$ date> foo.png
$ convert foo.png foo.gif
convert: NotAPNGImageFile (foo.png).
Find all non-PNG files:
find . -type f -print0 | xargs -0 file --mime | grep -vF image/png
Find all corrupted PNG files:
find . -type f -print0 | xargs -0 -P0 sh -c 'magick identify +ping "$@" > /dev/null' sh
The file command only checks the magic number, but having the PNG magic number doesn't mean a file is a well-formed PNG.
magick identify is a tool from ImageMagick. By default it only checks the headers of the file, for better performance. Here we use +ping to disable that feature and make identify read the whole file.

Problems with Linux ImageMagick converting PDFs to JPGs

The system I'm using uses the Linux utility convert to convert PDFs to JPGs. My box gives me the following error:
>$ convert Badge-1114044091.pdf Badge-1114044091.jpg
convert: Postscript delegate failed `Badge-1114044091.pdf'.
convert: missing an image filename `Badge-1114044091.jpg'.
But the production machine does not. According to
>$ convert -version
my version is the same as the production machine's. I'm not sure exactly how to check if PostScript needs to be updated. Not really a huge Linux guru.
EDIT: Upon suggestion, I checked Ghostscript. The following was already installed.
>$ gs -version
ESP Ghostscript 8.15.3 (2006-08-25)
Copyright (C) 2004 artofcode LLC, Benicia, CA. All rights reserved.
Install Ghostscript.
http://www.ghostscript.com/
ImageMagick (the convert utility) doesn't actually convert PDFs; it invokes Ghostscript using an arcane command like
gs -q -sDEVICE=jpeg -dBATCH -dNOPAUSE -dFirstPage=1 -dLastPage=1 -r<OUTPUT RESOLUTION> -sOutputFile=<OUTPUT>.jpg <INPUT>.pdf 2>&1
You might want to try that command directly if you want more control.
