PDF on Linux: Combine font subsets and replace Type 3 with Type 1

I have a PDF file that I'd like to post-process on Linux. In particular I'd like to:
Replace Type 3 fonts with Type 1 fonts
Replace multiple subsets of the same font with a single subset (the subsets are the result of including figures in LaTeX, where each figure contains a subsetted font)
With Windows these two steps are possible with the Adobe Distiller (open the document file and print it into a new PDF document with the respective settings).
On Linux I'm able to subset fonts with Ghostscript [1], but it does not seem to be able to replace (all?) Type 3 fonts with Type 1 fonts or to combine multiple subsets of the same font.
Any hints on how I can achieve these two tasks with free tools?
(I am aware of the reply to How to convert Type 3 font to Type 1 font in PDF. However, I don't really care if I theoretically lose information about the font, as this conversion seems to work fine in Distiller.)
[1] With the arguments:
gs -dPDFA -dSAFER -dNOPLATFONTS -dNOPAUSE -dBATCH \
-sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
-dPDFSETTINGS=/printer \
-dMaxSubsetPct=100 -dSubsetFonts=true \
-dEmbedAllFonts=true -sOutputFile=/tmp/tmp.pdf -f "$1"

Somehow I doubt your statement "With Windows these two steps are possible with the Adobe Distiller". I'd need to see with my own eyes that this works before I can believe it. This is especially true for "replace multiple subsets of the same font with a single subset". (But I'm not in a position to verify or falsify the statement myself right now, so I'll just take it as fact for the time being.)
Type 3 fonts are described in a fully-fledged version of PostScript. Type 1 fonts are described by using a subset of the PostScript language.
Replacements of embedded fonts are a non-trivial task when processing PDF files. I'm not familiar with any Ghostscript-related utility that could do that.
callassoftware.com sells a very powerful command-line utility called pdfToolbox CLI 4. It is available for Windows, Linux, Mac OS X and Solaris. pdfToolbox4 is capable of achieving practically everything you can imagine in so-called PDF preflighting jobs. This includes un-embedding font subsets and re-embedding them with their full sets (do it in two separate steps, which might produce the result you want).
That's about the only tool I can think of which could help you. (BTW, a part of callas' PDF preflighting technology is licensed by Adobe to pose in Acrobat 9 Pro as its own preflighting tool...)

Related

how to merge pdf as a table with pdftk or convert

How can one use convert or pdftk to merge several pdfs organized as a table?
For example, given 4 files: file1.pdf, file2.pdf, file3.pdf, file4.pdf, each of a single page, I would like to have a single-page pdf like
file1.pdf file2.pdf
file3.pdf file4.pdf
That is, the files are arranged like an array.

By far the easiest way to convert 4 PDF pages to 1 page on any OS is by N-Up imposition/printing with output to a virtual PDF printer such as Ghostscript. For the most basic 4-Up command line usage see https://stackoverflow.com/a/72850245/10802527
To combine 4 pages (other counts such as 2, 6, 9 or 16 are also possible), a GUI print dialog lets me set the page order very easily.
On Linux or MacOS you can use, along with other options, the CUPS command
lp -o number-up=4 filename
see https://www.cups.org/doc/options.html
The major advantage over using tools such as PDFtk with convert is that this handles scaling and preserves most PDF structures, because the pages are NOT passed in and out of images (and so degraded to inferior down-scaled imagery) before Ghostscript is called.
If you have single-page PDFs, you can merge them before printing using PDFtk (uses Ghostscript) instead of poppler's pdfunite. Note that with either, the original PDF format is preserved.
If you want to convert to half-size images, stitch them together and then reprint to one PDF page, that can easily be done using ImageMagick convert and other commands that call Ghostscript directly to suit your requirements. However, the results will in many ways be degraded by the translation to image output.
Since all of the above pass through GS it makes sense, where possible, to install GS as a PDF printer driver.
If you want to avoid installing Ghostscript printing, you can use the cross-platform Coherent cpdf (it only uses GS if the files need repairs).
Note that the names below use Windows-style double quotes; adjust as required. The commands assume that 4 sequential pages in one file are to be placed 4 at a time on each new page, so they can be used with any multiple of 4 pages in input.pdf.
cpdf -twoup "input.pdf" -o "in-2-Up-tmp.pdf"
cpdf "in-2-Up-tmp.pdf" -rotate 90 -o "out-2-Up.pdf"
cpdf -twoup "out-2-Up.pdf" -o "out-4-Up-tmp.pdf"
cpdf "out-4-Up-tmp.pdf" -rotate 90 -o "out-4-Up.pdf"

Barcodes too wide in a GoDEX printer with CUPS in Linux

I have a GoDEX RT700i printer (203 DPI) and I want to print barcodes in Linux (Ubuntu 16.04).
The barcodes I have are in PDF format. There is an 8-digit number below the barcode.
In Windows, there is no problem with GoDEX drivers. The barcodes and the number are printed perfectly.
Note: If I print the PDF from google chrome it looks fine, but if I print the PDF from Adobe Acrobat Reader, it looks like in Linux.
In Linux, when I print the barcode, the digits of the number are okay, the same as in Windows, and the height of the bars is okay too, but every bar is wider than it is displayed in the PDF.
How can I fix this?
Here is a photo of the printed barcodes.
The left one was printed in Linux and the right one was printed in Windows.
There is some additional information:
For Linux I have compiled and installed the GoDEX driver for CUPS and then I have added the printer via AppSocket/HP JetDirect with the IP and Port (9100).
Then, I select the PPD file godex-rt-700i.ppd
These two lines are in the PPD file. Maybe they are related to the problem:
*TTRasterizer: Type42
*cupsFilter: "application/vnd.cups-raster 50 rastertoezpl"
When I send the print order, I realized that there are 3 filters for the job:
pdftopdf (application/pdf to application/vnd.cups-pdf, cost 66)
gstoraster (application/vnd.cups-pdf to application/vnd.cups-raster, cost 99)
rastertoezpl (application/vnd.cups-raster to printer/GODEX-RT700i, cost 50)
In the rastertoezpl.c file I saw that there is a function (GDXCompress) that compresses the output lines for the Godex printer. I thought that the compression might somehow affect the barcode, so I tried to deactivate that function (CompBuffer = NULL) and recompile the driver, but that didn't fix anything.
These are the outputs of every filter:
All files (original and intermediate outputs)
When I send the original PDF file to print, these 2 files are generated by cups in /var/spool/cups/:
d00122-001 (pdf)
c00122 (unknown)
1. pdftopdf (/usr/lib/cups/filter/pdftopdf):
/usr/lib/cups/daemon/cups-exec -g 7 -n 0 -u 7 none /usr/lib/cups/filter/pdftopdf MY_PRINTER 122 my_user 00000378 1 "PageSize=Custom.56.69x65.20 Collate ColorModel=Grayscale Duplex=None job-uuid=urn:uuid:7f84fc46-1965-35d2-6a72-e2e73ab0264b job-originating-host-name=localhost date-time-at-creation= date-time-at-processing= time-at-creation=1488464765 time-at-processing=1488464765" /var/spool/cups/d00122-001 > output_pdf2pdf.pdf
output_pdf2pdf.pdf (pdf)
2. gstoraster (/usr/lib/cups/filter/gstoraster):
/usr/lib/cups/daemon/cups-exec -g 7 -n 0 -u 7 none /usr/lib/cups/filter/gstoraster MY_PRINTER 122 my_user 00000378 1 "PageSize=Custom.56.69x65.20 Collate ColorModel=Grayscale Duplex=None job-uuid=urn:uuid:7f84fc46-1965-35d2-6a72-e2e73ab0264b job-originating-host-name=localhost date-time-at-creation= date-time-at-processing= time-at-creation=1488464765 time-at-processing=1488464765"
output_gstoraster.ras (ras)
This file can be opened by rasterview program
3. rastertoezpl (/usr/lib/cups/filter/rastertoezpl):
/usr/lib/cups/daemon/cups-exec -g 7 -n 0 -u 7 none /usr/lib/cups/filter/rastertoezpl MY_PRINTER 122 my_user 00000378 1 "PageSize=Custom.56.69x65.20 Collate ColorModel=Grayscale Duplex=None job-uuid=urn:uuid:7f84fc46-1965-35d2-6a72-e2e73ab0264b job-originating-host-name=localhost date-time-at-creation= date-time-at-processing= time-at-creation=1488464765 time-at-processing=1488464765"
It doesn't create any file. It sends the printer commands directly to the printer.
Versions:
Ghostscript = GPL Ghostscript 9.18 Artifex Software
cups = 2.1.3-4
pdftopdf = cups-filters 1.8.3-2ubuntu3.1

Which versions of the various components are you using (CUPS, pdftopdf and Ghostscript)?
Have you checked the intermediate file produced by pdftopdf to see what that PDF file looks like?
Have you examined the CUPS raster produced by gstoraster to see if it is correct?
Exactly how big a difference are we discussing? A pixel, an inch? Bear in mind that this is apparently a 203 dpi device, so a pixel is quite a lot.
Given that there are 3 stages in the pipeline the first thing you should do is attempt to isolate which step is causing your problem. First capture the output at every stage; the PDF resulting from pdftopdf, then the CUPS raster file resulting from gstoraster. You can examine each of these individually to see if they show your problem. If they do not then the problem must arise from the final step 'rastertoezpl' and you'll need someone who knows that code. Otherwise you'll be able to decide whether the problem is the pdftopdf step, or the gstoraster step. In any event you can then ask for specific help.
It's most unlikely that the content of the PPD file has any impact here (other than specifying the final filter required to drive the printer). Of course, without seeing the original file it's hard to tell; possibly the barcode is a TrueType font.
[edit]
Well I still can't see a Ghostscript command line in your question. I'm not able to run CUPS and I can't build RasterView either since it requires a bunch of dependencies I simply don't have.
However, I can run it to TIFF. The result is the same as your photo when the resolution is low enough.
Your problem is the one described in comments 17 and 18 in the bug thread I posted in my comment below. The PostScript (and PDF) imaging model says that when any part of a pixel is touched, that whole pixel is rendered to the output.
Your PDF draws the barcodes as a series of (vector) rectangles, using co-ordinates and sizes which are not precisely aligned on the underlying pixels of the device.
If you use Adobe Acrobat and 'save as' TIFF you will see exactly the same problem there (you need to set the resolution of the output to 203 dpi using the 'Settings' button on the 'save as' dialog).
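To make the 'any part of pixel' rule concrete, here is a minimal Python sketch (not Ghostscript's actual scan-conversion code) that counts how many device pixel columns a vertical bar touches. It assumes an unrotated page whose origin sits on a pixel boundary; the 0.75 point bar width and the start positions are hypothetical values chosen only for illustration.

```python
import math

def painted_width(start_pt, width_pt, dpi):
    """Number of device pixel columns touched by a bar, 'any part of pixel' rule."""
    scale = dpi / 72.0                 # PDF points to device pixels
    left = start_pt * scale
    right = (start_pt + width_pt) * scale
    first = math.floor(left)           # first column with any coverage
    last = math.ceil(right) - 1        # last column with any coverage
    return last - first + 1

dpi = 203                              # resolution of the printer in the question
bar_pt = 0.75                          # hypothetical bar width in points
ideal_px = bar_pt * dpi / 72.0         # width the bar "should" have in pixels
for start_pt in (10.0, 10.2, 10.4):    # hypothetical bar positions
    print(start_pt, round(ideal_px, 2), painted_width(start_pt, bar_pt, dpi))
```

At 72 dpi, for example, a 2 point bar starting at 1.0 touches two pixel columns, while the same bar starting at 1.5 touches three. On a 203 dpi device one extra pixel is a large fraction of a narrow bar.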
There is a long discussion on the bug thread about this. There are a number of possible solutions:
Write the PostScript (or PDF) so that the co-ordinates are precisely clamped to the device grid. This may be difficult to do, especially if you run the file through pdftopdf.
Draw the bars by first drawing a big rectangle, then draw the spaces between bars as white. That might make the bars 'skinny' but they won't merge. If the printer is thermal then the thermal spread will reduce the effect.
Generate the barcode as an image instead of vectors. Images don't follow the 'any part of pixel rule', they use 'centre of pixel' instead, which may give (at least slightly) better results.
Use a barcode font. Fonts also use a different method for drawing, because if you reduce the font size it quickly turns into a series of black blobs if you use any part of pixel.
Basically, you are trying to draw shapes to a tolerance which simply isn't possible on a low-resolution device like this, when using PostScript/PDF.
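The first suggestion, clamping co-ordinates to the device grid, can be sketched as follows. This is only an illustration of the arithmetic, assuming the generator of the PDF knows the final device resolution and that no later step (such as pdftopdf) rescales or shifts the page; the function names are made up for this example.

```python
def snap_to_grid(x_pt, dpi):
    """Round a coordinate in points to the nearest device pixel boundary."""
    px = round(x_pt * dpi / 72.0)
    return px * 72.0 / dpi

def snap_bar(start_pt, width_pt, dpi):
    """Return (start, width) in points with both edges on pixel boundaries."""
    left = snap_to_grid(start_pt, dpi)
    # Keep at least one device pixel so that thin bars do not vanish.
    width_px = max(1, round(width_pt * dpi / 72.0))
    return left, width_px * 72.0 / dpi

# Hypothetical bar: starts at 10.2 pt, 0.75 pt wide, printed at 203 dpi
start, width = snap_bar(10.2, 0.75, 203)
print(start * 203 / 72.0, width * 203 / 72.0)  # both close to whole pixels
```

A bar whose edges sit on pixel boundaries should touch only the pixels it is meant to cover, although rounding inside the renderer can still matter at exact boundaries.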

ps2pdf creates a very big pdf file from paps-created-ps file

On Linux, I use ps2pdf in a bash script to convert a text-file report to PDF.
To produce the PS file that ps2pdf needs, I use the paps command because of the UTF-8 encoding.
The problem is that the PDF file from ps2pdf is about 30 times bigger than the PS file created by paps.
Previously, I used a2ps to convert the text to PS and then fed that to ps2pdf, and the PDF output was of normal size.
Is there any way to reduce the size of the PDF produced via paps and ps2pdf? Or what am I doing wrong?
The command I used is as below.
paps --landscape --font="Freemono 10" textfile.txt > textfile.ps
ps2pdf textfile.ps textfile.pdf
Thank you very much.

As the author of paps, I agree with @Kens's description of paps' inner workings. Indeed, I chose to create my own font mechanism in the PostScript language. That is history though, as I have just released a new version of paps that uses cairo for its PostScript, PDF, or SVG rendering. Its output is much more compact than the previous paps output, especially after running ps2pdf. Please check out http://github.com/dov/paps .

For ps2pdf, the easiest way to control the output size is by designating the paper size.
An example command is:
ps2pdf -sPAPERSIZE=a4 -dOptimize=true -dEmbedAllFonts=true YourPSFile.ps
ps2pdf is a wrapper around Ghostscript (ps2pdf is part of the ghostscript package).
-sPAPERSIZE=something defines the paper size. For the valid PAPERSIZE values, see http://ghostscript.com/doc/current/Use.htm#Known_paper_sizes
-dOptimize=true lets the created PDF be optimised for loading.
-dEmbedAllFonts=true makes the fonts always look nice.
All of this is from : https://wiki.archlinux.org/index.php/Ps2pdf

I think he means the size on disk, rather than the size of the output media. The 'most likely' scenario normally is that the source contains a large DCT encoded image (JPEG) which is decoded and then compressed losslessly into the PDF file using something like flate.
But that can't be the case here, as it's apparently only text. So the next most likely problem is that the text is being rasterised, which suggests some odd fonts in the PostScript. That is possible if you are using UTF-8 text; it's probably constructing something daft like a CIDFont with TrueType descendant fonts.
However, since the version of Ghostscript isn't given, and we don't have a file to look at, it's really impossible to tell. Older versions of the pdfwrite device did less well on creating optimal files, especially from CIDFonts.
Setting 'Optimize=true' won't actually do anything with the current version of pdfwrite, that's an Acrobat Distiller parameter we no longer implement. Older versions of Ghostscript did use it, but the output wasn't correctly Linearised.
The correct parameter for newer versions is '-dFastWebView', which is supposed to be faster when loading from the web if the client can deal with this format. Given the crazy way it's specified, practically no viewer in the world does. However, the file is properly constructed in recent versions, so if you can find a viewer which supports it, you can use this (at the expense of making the PDF file slightly larger).
If you would like to post a URL to a PostScript file exhibiting problems I can look at it, but without it there's really nothing much I can say.
Update
The problem is the paps file; it doesn't actually contain any text at all, in a PostScript sense.
Each character is stored as a procedure, where a path is drawn and then filled. This is NOT stored in a font, just in a dictionary. All the content on the page is stored in strings in a paps 'language'. In the case of text this simply calls the procedure for the relevant glyph(s).
Now, because this isn't a font, the repeated procedures are simply seen by pdfwrite (and pretty much all other PostScript consumers) as a series of paths and fills, and that's exactly what gets written to the output in the PDF file.
Now normally a PDF file would contain text that looks like :
/Helvetica 20 Tf
(AAA) Tj
which is pretty compact, the font would contain the program to draw the 'A' so we only include it once.
The output from paps for the same text would look like (highly truncated) :
418.98 7993.7 m
418.98 7981.84 l
415.406 7984.14 411.82 7985.88 408.219 7987.04 c
...
... 26 lines omitted
...
410.988 7996.3 414.887 7995.19 418.98 7993.7 c
f
418.98 7993.7 m
418.98 7981.84 l
415.406 7984.14 411.82 7985.88 408.219 7987.04 c
...
... 26 lines omitted
...
410.988 7996.3 414.887 7995.19 418.98 7993.7 c
f
418.98 7993.7 m
418.98 7981.84 l
415.406 7984.14 411.82 7985.88 408.219 7987.04 c
...
... 26 lines omitted
...
410.988 7996.3 414.887 7995.19 418.98 7993.7 c
f
which as you can clearly see is much larger. Whereas with a font we would only include the instructions to draw the glyph once, and then use only a few bytes to draw each occurrence, with the paps output we include the drawing instructions for the glyph each and every time it is drawn.
So the problem is the way paps emits PostScript, and there is nothing that pdfwrite can do about it.
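The size difference can be estimated with a small Python sketch that builds both kinds of page content as plain strings and compares their lengths. It is a rough model only: the outline below is a made-up stand-in of 30 curve segments for one glyph, and the size of the embedded font itself (which is stored once) is not counted.

```python
def text_with_font(text, font="/Helvetica", size=20):
    """Page content when the text is shown with a font: one short operator pair."""
    return f"{font} {size} Tf\n({text}) Tj\n"

def text_as_paths(text, outline):
    """Page content when every glyph occurrence repeats its full outline and a fill."""
    return "".join(outline + "f\n" for _ in text)

# Hypothetical outline: a moveto plus 30 identical curve segments for one glyph
glyph_outline = (
    "418.98 7993.7 m\n"
    + "415.406 7984.14 411.82 7985.88 408.219 7987.04 c\n" * 30
)

text = "A" * 1000
with_font = text_with_font(text)
as_paths = text_as_paths(text, glyph_outline)
print(len(with_font), len(as_paths), len(as_paths) // len(with_font))
```

With a font, each extra character costs about one byte of page content; with filled paths, each extra character costs the whole outline again, which is why compression alone cannot close the gap.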
That said, I see that you are using Ghostscript 8.71 which is now 4 years old, you should probably consider upgrading.

What are best parameters to run ImageMagick to convert low quality pdf to images (for OCR)

I have several low-quality PDFs. I would like to use OCR (to be more precise, Ocropus) to get the text from them. To do so, I first use ImageMagick, a command-line tool that converts PDFs to images, to transform these PDFs into JPG or PNG.
However, ImageMagick produces very low-quality images and Ocropus hardly recognizes anything. I would like to learn what the best parameters are for handling low-quality PDFs, so that the OCR gets images of the best possible quality.
I have found this page, but I do not know where to start.

You can learn about the detailed settings of ImageMagick's "delegates" (external programs IM uses, such as Ghostscript) by typing
convert -list delegate
(On my system that's a list of 32 different commands.) Now to see which commands are used to convert to PNG, use this:
convert -list delegate | findstr /i png
Ok, this was for Windows. You didn't say which OS you use. [*] If you are on Linux, try this:
convert -list delegate | grep -i png
You'll discover that IM produces PNG only from PS or EPS input. So how does IM get (E)PS from your PDF? Easy:
convert -list delegate | findstr /i PDF
convert -list delegate | grep -i PDF
Ah! It uses Ghostscript to make a PDF => PS conversion, then uses Ghostscript again to make a PS => PNG conversion. Works, but isn't the most efficient way if you know that Ghostscript can do PDF => PNG in one go. And faster. And in much better quality.
About IM's handling of PDF conversion to images via the Ghostscript delegate you should know two things first and foremost:
By default, if you don't give an extra parameter, Ghostscript will output images with a 72dpi resolution. That's why Karl's answer suggested to add -density 600 which tells Ghostscript to use a 600 dpi resolution for its image output.
The detour of IM to call Ghostscript twice, first to convert PDF => PS and then PS => PNG, is a real blunder, because you never gain and hardly ever keep quality in the first step, but very often lose some. Reasons:
PDF can handle transparencies, which PostScript cannot.
PDF can embed TrueType fonts, which PostScript cannot, and so on.
(Conversion in the direction PS => PDF is not that critical.)
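Why the resolution matters so much for OCR follows from simple arithmetic: one PDF point is 1/72 inch, so the pixel size of a rendered page scales linearly with the dpi value. The Python sketch below shows this for an A4 page (595 x 842 points, rounded); the function name is made up for this example.

```python
def page_pixels(width_pt, height_pt, dpi):
    """Pixel dimensions of a page rendered at the given resolution."""
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)

a4 = (595, 842)               # A4 in points, rounded
print(page_pixels(*a4, 72))   # (595, 842)
print(page_pixels(*a4, 600))  # (4958, 7017)
```

At 72 dpi a 10 point character is only about 10 pixels tall, which leaves an OCR engine very little to work with; at 600 dpi the same character is about 83 pixels tall.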
That's why I'd suggest you convert your PDFs in one go to PNG (or JPEG) using Ghostscript directly. And use the most recent version 8.71 (soon to be released: 9.01) of Ghostscript! Here are example commands:
gswin32c.exe ^
-sDEVICE=pngalpha ^
-o output/page_%03d.png ^
-r600 ^
d:/path/to/your/input.pdf
(This is the commandline for Windows. On Linux, use gs instead of gswin32c.exe, and \ instead of ^.) This command expects to find an output subdirectory where it will store a separate file for each PDF page. To produce JPEGs of good quality, try
gs \
-sDEVICE=jpeg \
-o output/page_%03d.jpeg \
-r600 \
-dJPEGQ=95 \
/path/to/your/input.pdf
(Linux command version). This direct conversion avoids the intermediate PostScript format, in which the TrueType font and transparency information of the original PDF file may already have been lost.
[*] D'oh! I missed your "linux" tag at first...

-density 600 or so should give you what you need.

At least two other tools you may want to consider:
pdfimages, which comes with the package poppler-utils, makes it easy to extract the images from a PDF without degrading them.
pdfsandwich, which can give you an OCR'd file by simply running pdfsandwich inputfile.pdf. You may need to tweak the options to get a decent result. See the official page for more info.

beamer includegraphics with screenshots

I'm using the LaTeX-Beamer class for making presentations. Every once in a while I need to include screenshots. Those graphics are pixel-based, of course. I use includegraphics like this:
\begin{figure}
\includegraphics[width= \paperwidth]{img/analyzer.png}
\end{figure}
or usually something like this:
\begin{figure}
\includegraphics[width= 0.8\linewidth]{img/analyzer.png}
\end{figure}
This leads to pretty bad readability of the contained text, so I'm asking for your best practices: How would you include screenshots containing text, considering that I will produce the output PDF with pdflatex?
EDIT: I suppose I'm looking for something like a 1:1 presentation of the image within beamer. However, [scale = 1.0] doesn't achieve what I'm looking for.

Your best bet is to scale the image outside of LaTeX for inclusion, and include it in a 1:1 ratio. The scaling done by graphics packages in LaTeX isn't going to be anywhere near as good as what is possible with other tools. LaTeX (TeX) has limited floating-point arithmetic capabilities, whereas an external tool can use sophisticated algorithms to get the scaling better.
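As a rough guide to what "1:1" means in numbers, the Python sketch below computes the pixel width to which a screenshot could be pre-scaled, and the pixel density at which one image pixel maps to one projector pixel. It assumes beamer's default 128 mm x 96 mm slide shown full-screen, and that pdflatex takes the image's natural size from the density stored in the PNG; the projector width of 1024 pixels and the function names are hypothetical.

```python
MM_PER_INCH = 25.4

def target_width_px(screen_width_px, fraction_of_slide):
    """Pixel width to pre-scale a screenshot to, for a given share of the slide width."""
    return round(screen_width_px * fraction_of_slide)

def density_for_slide(screen_width_px, slide_width_mm=128.0):
    """Pixels per inch at which one image pixel covers one screen pixel."""
    return screen_width_px / (slide_width_mm / MM_PER_INCH)

# Hypothetical 1024-pixel-wide projector, image covering 80% of the slide width
print(target_width_px(1024, 0.8))
print(round(density_for_slide(1024), 1))
```

Scaling the screenshot to that pixel width in an image editor, and storing it with that density, lets it be included without a width or scale option, so the viewer does not have to resample it again.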
Another option is to use only a part of the screenshot, the one you want to concentrate on.
Edit: If you can change the font size before taking the screenshot, that's another option—just increase the font size for the screenshots.
Of course, you can combine the two methods.

I have done exactly what you do and, for example, defined
\newcommand{\screenshot}[1]{\centerline{%
\includegraphics[height=7.8cm,transparent]{#1}}} % 7.8in
which worked with whatever style I was using at the time. The files included with this macro were all PNGs created with one of the usual Linux screen capture tools.
Edit: You may have to play with the size (height and width) of your input files. It came out rather nice for me (and this was from a presentation in 2006).

How about scaling it as follows:
\includegraphics[scale=0.5]{images/myimage.jpg}
This works for me.

Have you tried to convert the image to .eps or .pdf file and use this file in LaTeX?
Maybe try also latex, dvips and ps2pdf.
The problem might be in the viewer you use. On Linux I use Document Viewer or ePDFViewer, and the output is much worse than in Adobe Reader or Acrobat, which I use on Windows.
