ps2pdf creates a very big PDF file from a paps-created PS file - Linux

On Linux, I use ps2pdf in a bash script to convert a text-file report to PDF.
To produce the PS file that I feed to ps2pdf, I use the paps command because my text is UTF-8 encoded.
The problem is that the PDF file from ps2pdf is about 30 times bigger than the PS file created by paps.
Previously, I used a2ps to convert the text to PS and then fed that to ps2pdf, and the resulting PDF was a normal size, not big.
Is there any way to reduce the PDF size from paps and ps2pdf? Or what am I doing wrong?
The commands I used are below.
paps --landscape --font="Freemono 10" textfile.txt > textfile.ps
ps2pdf textfile.ps textfile.pdf
Thank you very much.

As the author of paps, I agree with KenS's description of paps' inner workings. Indeed, I chose to create my own font mechanism in the PostScript language. That is history though, as I have just released a new version of paps that uses cairo for its PostScript, PDF, or SVG rendering. This is much more compact than the old paps output, especially with respect to the result after running ps2pdf. Please check out http://github.com/dov/paps .

For ps2pdf, the easiest way to control the output size is by designating the paper size.
An example command is:
ps2pdf -sPAPERSIZE=a4 -dOptimize=true -dEmbedAllFonts=true YourPSFile.ps
ps2pdf is a wrapper around Ghostscript (ps2pdf is provided by the ghostscript package).
With -sPAPERSIZE=something you define the paper size. Valid PAPERSIZE values are listed at http://ghostscript.com/doc/current/Use.htm#Known_paper_sizes
-dOptimize=true lets the created PDF be optimised for loading
-dEmbedAllFonts=true embeds the fonts so the text always looks right
All of this is from: https://wiki.archlinux.org/index.php/Ps2pdf

I think he means the size on disk, rather than the size of the output media. The most likely scenario normally is that the source contains a large DCT-encoded image (JPEG) which is decoded and then compressed losslessly into the PDF file using something like Flate.
But that can't be the case here, as it's apparently only text. So the next most likely problem is that the text is being rasterised, which suggests some odd fonts in the PostScript. That is possible if you are using UTF-8 text; it's probably constructing something daft like a CIDFont with TrueType descendant fonts.
However, since the version of Ghostscript isn't given, and we don't have a file to look at, it's really impossible to tell. Older versions of the pdfwrite device did less well at creating optimal files, especially from CIDFonts.
Setting 'Optimize=true' won't actually do anything with the current version of pdfwrite; that's an Acrobat Distiller parameter we no longer implement. Older versions of Ghostscript did use it, but the output wasn't correctly linearised.
The correct parameter for newer versions is '-dFastWebView', which is supposed to be faster when loading from the web if the client can deal with this format. Given the crazy way it's specified, practically no viewer in the world does. However, the file is properly constructed in recent versions, so if you can find a viewer which supports it, you can use this (at the expense of making the PDF file slightly larger).
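For reference, a typical invocation with that parameter (the file names here are just placeholders) looks like:
ps2pdf -dFastWebView=true input.ps output.pdf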
If you would like to post a URL to a PostScript file exhibiting problems I can look at it, but without it there's really nothing much I can say.
Update
The problem is the paps file: it doesn't actually contain any text at all, in a PostScript sense.
Each character is stored as a procedure, where a path is drawn and then filled. This is NOT stored in a font, just in a dictionary. All the content on the page is stored in strings in a paps 'language'. In the case of text this simply calls the procedure for the relevant glyph(s).
Now, because this isn't a font, the repeated procedures are simply seen by pdfwrite (and pretty much all other PostScript consumers) as a series of paths and fills, and that's exactly what gets written to the output in the PDF file.
Now, normally, a PDF file would contain text that looks like:
/Helvetica 20 Tf
(AAA) Tj
which is pretty compact; the font would contain the program to draw the 'A', so we only include it once.
The output from paps for the same text would look like this (highly truncated):
418.98 7993.7 m
418.98 7981.84 l
415.406 7984.14 411.82 7985.88 408.219 7987.04 c
...
... 26 lines omitted
...
410.988 7996.3 414.887 7995.19 418.98 7993.7 c
f
418.98 7993.7 m
418.98 7981.84 l
415.406 7984.14 411.82 7985.88 408.219 7987.04 c
...
... 26 lines omitted
...
410.988 7996.3 414.887 7995.19 418.98 7993.7 c
f
418.98 7993.7 m
418.98 7981.84 l
415.406 7984.14 411.82 7985.88 408.219 7987.04 c
...
... 26 lines omitted
...
410.988 7996.3 414.887 7995.19 418.98 7993.7 c
f
which, as you can clearly see, is much larger. Whereas with a font we would include the instructions to draw the glyph only once and then use only a few bytes to draw each occurrence, with the paps output we include the drawing instructions for the glyph each and every time it is drawn.
So the problem is the way paps emits PostScript, and there is nothing that pdfwrite can do about it.
That said, I see that you are using Ghostscript 8.71, which is now 4 years old; you should probably consider upgrading.

Related

Remove images (with transparency/alpha channel) from PDF

How to remove images with alpha channel (transparency) in a PDF file?
I need to remove all images with transparency from a PDF file because it needs to be optimized with pdf2ps and ps2pdf (to reduce file size). PostScript doesn't work properly when the PDF contains images with transparency, and the PDF will be converted to one big image.
I have not managed to reproduce your problem.
However, I did the same treatment to compress my PDF, except that I used pdftops instead of pdf2ps.
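Roughly, the commands for that treatment look like this (the file names are just placeholders):
pdftops input.pdf intermediate.ps
ps2pdf intermediate.ps output.pdf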
I hope it will help.
Sorry for my english (translate.google)
Clark,
It sounds like www.pstill.com will do everything you need and more in one tool. There is a Linux command-line version available for a very reasonable price. I have used the tool on a few different PDFs for different reasons and it has always worked as advertised.
From their website.
Putting the 'Portable' back in PDF - PDF to PDF Transcoding
Your PDF cannot be printed on some printers or processed with some applications? PStill can sanitize, simplify, reprocess, flatten transparency and recompress PDF-Files, this process also known as 'transcoding' create a new PDF that has better compatibility, is often smaller in file size, can be optional encrypted/secured and contain only a uniform set of font types. Fonts can be normalized to plain PostScript Type 1 formats, can be subsetted, missing fonts included and bad fonts repaired/replaced. PStill can detect and remove duplicate elements in the PDF. Text can be converted to outlines which makes it perfect for creating 'fontless' PDF. Transcoding can be used to repair bad PDF or simplify the PDF structure so more limited output devices can process it.
Andrew.

Hidden/Open words in an image file such as PNG or JPG

As far as I can tell my question is not related to topics involved in steganography or the WinRAR solutions I've seen, where you are essentially hiding messages.
I am trying to figure out if there is a way to insert a simple message into a file such as a JPG or PNG that could later be extracted by a program reading the file, without having it encoded into the file via slight differences in pixels or whatever else steganography does.
I basically just want a tag-along message that is part of the file itself, that is not brought up by the image reader but could perhaps be seen by a text reader of some kind.
I'm not sure how possible this is because I, for the most part, don't understand the order/layout of a PNG/JPG/etc. file aside from the RGB pixel data: how it starts, how the image display tool knows when to stop displaying, etc.
The way I'm envisioning it would be something like:
pngStartCode -> RGBinfo --> png end code so image reader knows to stop -> start sequence that some kind of reader will recognize (possibly a new text reader) -> written text wanted to be communicated -> endcodeforreader
I may just be rambling about something ridiculous here but please let me know if this is at least possible.
You can use the following command (Windows command prompt).
Create a text file with your message, say "message.txt".
Now choose a target file (it can be any file like a.jpg, a.png, a.exe, etc.), say "image.jpg".
Now execute the following command:
copy /b "image.jpg"+"message.txt" "NewImage.jpg"
The above command combines the files (in binary mode) and creates a new file (in this case NewImage.jpg). Now if anyone opens the image they will just see the normal image. If you want to look at the text, you have to open it with any text editor (Notepad) and scroll to the end, where you will find the text.
This won't change any pixels or anything else in the image; it just appends the text to the image file.
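If you want to read the message back programmatically instead of scrolling in Notepad, here is a minimal Python sketch (it assumes the carrier file is a JPEG, which ends with the FF D9 end-of-image marker; the file name is a placeholder):
# print whatever was appended after the JPEG data
with open("NewImage.jpg", "rb") as f:
    data = f.read()

# a JPEG stream ends with the End-Of-Image marker FF D9;
# anything after its last occurrence is the appended message
end = data.rfind(b"\xff\xd9")
if end == -1:
    print("no JPEG end-of-image marker found")
else:
    print(data[end + 2:].decode("utf-8", errors="replace"))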
It sounds like OP is asking about comment tags in the PNG specification (i.e. adding data, but without intent to hide it).
PNG files are broken into "chunks". The image data is usually divided into several IDAT chunks; the colour type, size, etc. are stored in an IHDR chunk, and so on.
The iTXt, tEXt, and zTXt chunks are used for conveying text information associated with the image, so typically you'd look into using a tool to add those types of chunks. tEXt is for plain text, zTXt is compressed.
More info on the PNG specification, including what kinds of chunks are available, can be found in the spec itself, and you can find chunk viewers on Google.
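If you would rather add such a chunk programmatically, here is a minimal Python sketch using the Pillow library (the key name "Comment" and the file names are just examples):
# add a tEXt chunk to a PNG without altering the pixels
from PIL import Image
from PIL.PngImagePlugin import PngInfo

img = Image.open("image.png")

meta = PngInfo()
meta.add_text("Comment", "this text rides along inside the PNG")  # stored as a tEXt chunk

img.save("image_tagged.png", pnginfo=meta)

# read it back: text chunks show up in the info dictionary
print(Image.open("image_tagged.png").info.get("Comment"))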
For convenience, at present time (January 2021), here are a couple of tools that will let you view, edit, and add chunks:
Windows 10: http://entropymine.com/jason/tweakpng/
Linux: https://www.systutorials.com/docs/linux/man/n-png/
Mac: https://apps.apple.com/us/app/inspectpng/id498851708?mt=12
NOTE: I do not vouch for the safety of any of the above links. Please use standard caution when downloading any file from the internet. If you don't have your own anti-virus, Virustotal has one online you can upload individual files to for free.

Converting PostScript to Text Using GhostScript

I want to extract Text data out of PostScript documents. The problem is when I use GhostScript to do that, some texts would be extracted normally while others would be converted to weird symbolic characters.
I realized that the texts which were extracted normally were in fonts that Ghostscript would NOT embed in the PDF because of licensing restrictions. And, ironically, the fonts without licensing restrictions, which were embedded in the PDF normally, weren't being converted back correctly.
I tried both the txtwrite device, to convert the PostScript directly to text, and the pdfwrite device, to first convert the PS to PDF and then extract the text out of the PDF document, but neither of them worked.
I thought maybe I could substitute all fonts with the unsupported fonts so that the text data would be extracted correctly, but it turned out there is no simple way to do that.
What do you think I should do?
The cause of this is usually that the characters are encoded in a non-standard fashion. I'm afraid there is not a lot you can do, except possibly for finding out, by comparing the readable PostScript with the extracted text, which "weird symbolic characters" correspond to which actual characters. Then you might be able to reconstruct the original text by replacing the weird characters with the intended ones.
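For reference, a typical txtwrite invocation (file names are placeholders) looks like:
gs -q -dNOPAUSE -dBATCH -sDEVICE=txtwrite -sOutputFile=output.txt input.ps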

beamer includegraphics with screenshots

I'm using the LaTeX-Beamer class for making presentations. Every once in a while I need to include screenshots. Those graphics are pixel-based, of course. I use includegraphics like this:
\begin{figure}
\includegraphics[width= \paperwidth]{img/analyzer.png}
\end{figure}
or usually something like this:
\begin{figure}
\includegraphics[width= 0.8\linewidth]{img/analyzer.png}
\end{figure}
This leads to pretty bad readability of the contained text, so I'm asking for your best practices: how would you include screenshots containing text, considering that I will produce the output PDF with pdflatex?
EDIT: I suppose I'm looking for something like a 1:1 presentation of the image within beamer. However, [scale = 1.0] doesn't achieve what I'm looking for.
Your best bet is to scale the image outside of LaTeX for inclusion, and include it at a 1:1 ratio. The scaling done by graphics packages in LaTeX isn't going to be anywhere near as good as what is possible with other tools. LaTeX (TeX) has limited floating-point arithmetic capabilities, whereas an external tool can use sophisticated algorithms to scale better.
Another option is to use only a part of the screenshot, the one you want to concentrate on.
Edit: If you can change the font size before taking the screenshot, that's another option—just increase the font size for the screenshots.
Of course, you can combine the two methods.
I have done exactly what you do and, for example, defined
\newcommand{\screenshot}[1]{\centerline{%
\includegraphics[height=7.8cm,transparent]{#1}}} % 7.8in
which worked with whatever style I was using at the time. The files included with this macro were all PNGs created with one of the usual Linux screen capture tools.
Edit: You may have to play with the size (height and width) of your input files. It came out rather nice for me (and this was from a presentation in 2006).
How about scaling it as follows:
\includegraphics[scale=0.5]{images/myimage.jpg}
This works for me.
Have you tried converting the image to an .eps or .pdf file and using that file in LaTeX?
Maybe also try latex, dvips and ps2pdf.
The problem might also be in the viewer used: on Linux I use Document Viewer or ePDFViewer and the output is much worse than in Adobe Reader or Acrobat, which I use on Windows...

Getting R plots into LaTeX?

I'm a newbie to both R and LaTeX and have just recently found how to plot a standard time series graph using R and save it as a png image. What I'm worried about is that saving it as an image and then embedding it into LaTeX is going to scale it and make it look ugly.
Is there a way to make R's plot() function output a vector graphic and embed that into LaTeX? I'm a total beginner in both so please be gentle :) Code snippets are highly appreciated!
I would recommend the tikzDevice package for producing output for inclusion in LaTeX documents:
http://cran.r-project.org/web/packages/tikzDevice/index.html
The tikzDevice converts graphics produced in R to code that can be interpreted by the LaTeX package tikz. TikZ provides a very nice vector drawing system for LaTeX. Some good examples of TikZ output are located at:
http://www.texample.net/
The tikzDevice may be used like any other R graphics device:
require( tikzDevice )
tikz( 'myPlot.tex' )
plot( 1, 1, main = '\\LaTeX\\ is $\\int e^{xy}$' )
dev.off()
Note that the backslashes in LaTeX macros must be doubled as R interprets a single backslash as an escape character. To use the plot in a LaTeX document, simply include it:
\input{path/to/myPlot.tex}
The pgfSweave package contains Sweave functionality that can handle the above step for you. Make sure that your document contains \usepackage{tikz} somewhere in the LaTeX preamble.
http://cran.r-project.org/
The advantages of tikz() function as compared to pdf() are:
The font of labels and captions in your figures always matches the font used in your LaTeX document. This provides a unified look to your document.
You have all the power of the LaTeX typesetter available for creating mathematical annotation and can use arbitrary LaTeX code in your figure text.
Disadvantages of the tikz() function are:
It does not scale well to handle plots with lots of components, such as persp() plots of large matrices. The sheer number of graphic elements can cause LaTeX to slow to a crawl or run out of memory.
The package is currently flagged as beta. This means that the interface or functionality of the package is subject to change if the authors find a compelling reason to do so.
I should end this post by disclaiming that I am an author of both the tikzDevice and pgfSweave packages so my opinion may be biased. However, I have used both packages to produce several academic reports in the last year and have been very satisfied with the results.
Shane is spot-on: you do want Sweave. Eventually.
As a newbie, you may be better off separating the tasks, though. For that, do this (a consolidated sketch of these steps follows after the list):
open a device: pdf("figures/myfile.pdf", height=6, width=6).
plot your R object: plot(1:10, type='l', main='boring') -- and remember that lattice and ggplot need an explicit print around the plot call.
important: close your device: dev.off() to finalize the file.
optional: inspect the pdf file.
in LaTeX, use \usepackage{graphicx} in the document header, then use
\includegraphics[width=0.98\textwidth]{figures/myfile} to include the figure created earlier; note that the file extension is optional.
run this through pdflatex and enjoy.
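Putting the R side of those steps together, a minimal sketch (file names and plot contents are just placeholders) looks like:
# create the figure as a vector PDF for later inclusion via \includegraphics
pdf("figures/myfile.pdf", height = 6, width = 6)  # 1. open the device
plot(1:10, type = "l", main = "boring")           # 2. draw the plot
                                                  #    (lattice/ggplot objects need an explicit print())
dev.off()                                         # 3. close the device to finalize the file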
You might want to consider using Sweave. There is a lot of great documentation available for this on the Sweave website (and elsewhere). It has very simple syntax: just put your R code between <<>>= and @.
Here's a simple example that ends up looking like this:
\documentclass[a4paper]{article}
\title{Sweave Example 1}
\author{Friedrich Leisch}
\begin{document}
\maketitle
In this example we embed parts of the examples from the
\texttt{kruskal.test} help page into a \LaTeX{} document:
<<>>=
data(airquality)
library(ctest)
kruskal.test(Ozone ~ Month, data = airquality)
@
which shows that the location parameter of the Ozone
distribution varies significantly from month to month. Finally we
include a boxplot of the data:
\begin{center}
<<fig=TRUE,echo=FALSE>>=
boxplot(Ozone ~ Month, data = airquality)
@
\end{center}
\end{document}
To build the document, you can just call R CMD Sweave file.Rnw or run Sweave(file) from within R.
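For example, a typical build from the shell (assuming the file is called file.Rnw) is:
R CMD Sweave file.Rnw
pdflatex file.tex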
This is a dupe of a question on SO that I can't find.
But:
http://r-forge.r-project.org/projects/tikzdevice/ -- tikz output from R
and
http://www.rforge.net/pgfSweave/ -- tikz code via Sweave.
Using tikz will give you a look consistent with the rest of your document, plus it will use latex to typeset all the text in your graphs.
EDIT: Getting LaTeX into R Plots
