GhostScript - ImageMagick converts pdf to image to odd letters when converting Microsoft Print to PDF files - node.js

NOTICE: See the updates at the bottom.
I am building an API which is supposed to convert PDFs to base64 images (the image type doesn't matter: jpg, jpeg, png...).
The API is built with NodeJS on CentOS 7.5 x64.
I have searched all over the web for npm packages which convert PDFs to images; the vast majority of them use ImageMagick and GhostScript (the others don't seem to work). These packages work fine in code, but the problem starts when GhostScript does its job.
For example, a simple pdf page with text will look like this after conversion:
This is the output in shell:
**** Warning: can't process font stream, loading font by the name.
**** This file had errors that were repaired or ignored.
**** The file was produced by:
**** >>>> Microsoft: Print To PDF <<<<
**** Please notify the author of the software that produced this
**** file that it does not conform to Adobe's published PDF
**** specification.
I have tried converting the files with shell commands directly and ended up with the same output.
Thanks in advance.
UPDATE:
Converting a sample PDF file which was probably not produced by Microsoft Print to PDF worked fine. Maybe this is the problem?
UPDATE 2:
After converting a few more PDFs, it turns out that only Microsoft Print to PDF files cause this problem.

This was reported as a bug to the Ghostscript Bugzilla here
As can be seen from the thread, this is due to using an old version of Ghostscript, and has been fixed at some point in the past. So the problem is due to using old (in this case more than 5 years old) software.
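Whether a given server is running an old Ghostscript can be checked before any conversion is attempted. The following is a minimal sketch in Python, assuming the `gs` binary is on the PATH. The threshold `MINIMUM_VERSION` is a hypothetical placeholder, not the release that actually contains the fix; take the real value from the bug report thread.

```python
import re
import subprocess

# Hypothetical placeholder threshold; replace it with the release named in the bug report.
MINIMUM_VERSION = (9, 50)


def parse_version(text):
    """Turn a version string such as '9.07' into a comparable tuple like (9, 7)."""
    return tuple(int(part) for part in re.findall(r"\d+", text))


def ghostscript_version(binary="gs"):
    """Return the installed Ghostscript version as a tuple, or None if it cannot be run."""
    try:
        result = subprocess.run(
            [binary, "--version"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_version(result.stdout.strip())


if __name__ == "__main__":
    installed = ghostscript_version()
    if installed is None:
        print("Ghostscript not found")
    elif installed < MINIMUM_VERSION:
        print("Ghostscript", installed, "is older than", MINIMUM_VERSION, "- consider upgrading")
    else:
        print("Ghostscript", installed, "looks recent enough")
```

The comparison works on integer tuples, so a string such as "9.07" sorts before "9.26" as expected.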

Related

How to embed an MP4 inside a PDF?

I am a happy user of img2pdf. This tool does the minimal amount of work to put a series of JPEG 2000/JPEG/PNG images into a PDF "envelope". However, I am now faced with a new challenge: embedding an MP4 file into a PDF "envelope".
I see that commercial tools can do it, as seen at:
Add audio, video, and interactive objects to PDFs
Here is one such sample PDF file (no Flash required on windows in this sample):
https://gitlab.com/agrahn/media9/-/issues/9#note_345903962
https://gitlab.com/agrahn/media9/uploads/90fddd777e0ec514c39c924cd8d3b688/video_test.pdf
It seems to have been introduced in ISO 32000-1 (PDF 1.7 Extension Level 5)
I am looking for a solution which will use the Rich Media annotation inside the PDF stream.
There are dozens of duplicate questions on superuser/stackoverflow, which all pretty much refer to the imagemagick/convert command line tool. But in my case, convert expands the video frames into a multi-page PDF (which is not my desired behavior):
$ convert input.mp4 output.pdf
$ pdfinfo output.pdf
Title: out
Producer: https://imagemagick.org
CreationDate: Wed Aug 19 15:38:01 2020 CEST
ModDate: Wed Aug 19 15:38:01 2020 CEST
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 1601
Encrypted: no
Page size: 352 x 288 pts
Page rot: 0
File size: 534407296 bytes
Optimized: no
PDF version: 1.3
with:
$ convert --version
Version: ImageMagick 6.9.10-23 Q16 x86_64 20190101 https://imagemagick.org
Copyright: © 1999-2019 ImageMagick Studio LLC
License: https://imagemagick.org/script/license.php
Features: Cipher DPC Modules OpenMP
Delegates (built-in): bzlib djvu fftw fontconfig freetype jbig jng jpeg lcms lqr ltdl lzma openexr pangocairo png tiff webp wmf x xml zlib
and
$ file input.mp4
input.mp4: ISO Media, MP4 Base Media v1 [IS0 14496-12:2003]
$ ffprobe -v quiet -print_format json -show_streams input.mp4 | grep codec_long_name
"codec_long_name": "H.264 / AVC / MPEG-4 AVC / MPEG-4 part 10",
How would you embed an MP4 inside a PDF now that Flash support is being removed from Acrobat (Dec 2020)? The solution should work on the command line (Linux-based system).
It was common, and is still possible, to use a Rich Media Annotation to include 3D animations or media files within a PDF. Generally you need a top-end editor such as Acrobat Pro, but there are a few LaTeX packages that sometimes work, so the PDF can be compiled with pdfLaTeX from the Linux command line. For an outdated example app see http://www.acrotex.net/blog/?cat=22, for an Overleaf example see https://www.overleaf.com/project/5ff76fa5686edd3e034cfedb, and for a prior Adobe reply (which did not work for a while) see
> Embedded media, as well as referenced media outside a PDF file, may be played with a variety of player software. (In some situations, the player software may be the conforming reader itself.)
[Later comments] Adobe shot themselves in the foot with the poor closure of their buggy, insecure SWF/Flash, and only improved some rich media handling in more recent Windows Reader versions (Acrobat DC 21.001.20135 and later), having turned their back on maintaining Portable Document Format readers for Linux/Mac. What is needed is a push to use HTMLZ as the ideal rich media format, but that would need Google Chrome to run with the bouton (pun).
It's NOT recommended except for 3D PDF, as most methods require manually overriding the security measures that STOP runtime applications within a PDF.
SWF/Flash is no longer acceptable for that reason. Microsoft Edge (pre-Chromium) made an attempt at embedding PDF links to YouTube videos, but AFAIK that was abandoned. Thus RMA PDFs cannot run in the more common browsers; they need specialist viewers. The best viewer for Linux is possibly Okular, but I cannot run that file in my version.
When opening a 3D file you need to jump through multiple hoops to allow it, and a floating video may not even run. Your links lead to this media example, which can run inline (seen above) or, better, pop out with system application controls.
However, in some viewers it needs to be manually exported from the PDF archive as an attachment to be run in a system media player, and for browser presentations it can be a local hyperlink like this.
Using Okular on Windows, it does not ask before running content, but that could be because it found no suitable player; however, it allows me to link to a local file and run that in the system viewer.
For everyday presentations it's easier to include the media file in a zip with the PDF or PowerPoint presentation, for running locally from the PDF.
[Updated PoC for building using alternative raw mp4's]
It is possible to write a complex PDF as text on the console (here is a 203-line example in Windows CMD). I would not normally suggest that as an answer for a structure as complex as Rich Media, but a simpler all-platforms approach is possible with a small nip and tuck of the header and trailer, plus a variable mp4 body.
Source modified example spliced as 3 parts https://transfer.sh/8kQIbB/all.pdf
Different body with a small amount of command line math https://transfer.sh/Uqmv6t/all2.pdf
Method
Store the text header as a preset template, append the mp4 unchanged, and finally add the PDF trailer with values modified according to the file lengths. It is too long to describe in full: it is now 216 lines (with comments and notes) and works well for PDFs in XChange as a drag and drop of any 1-9 MB .mp4 (I need to raise that math value to 1999 MB), as a "send file to" target, or as a CLI command for single files. The programming can be done simply, as I did, using CMD or any OS script; the result was generated in Windows with a script built roughly around
copy /b head2.bin + pixels.mp4 + tail2.bin all2.pdf /b
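The same binary concatenation can be done on any platform. Below is a minimal sketch in Python, reusing the file names from the `copy /b` line above; the function name `splice_pdf` is made up for illustration. It only joins the three parts byte for byte. It does not compute the lengths and offsets that the trailer must carry, so `tail2.bin` is assumed to have been prepared for this exact body already.

```python
import shutil
from pathlib import Path


def splice_pdf(head, body, tail, output):
    """Concatenate head + body + tail byte for byte, like `copy /b a + b + c out /b`.

    Returns the size of the resulting file in bytes.
    """
    with open(output, "wb") as dst:
        for part in (head, body, tail):
            with open(part, "rb") as src:
                # Stream in chunks so large videos are not loaded into memory at once.
                shutil.copyfileobj(src, dst)
    return Path(output).stat().st_size


if __name__ == "__main__":
    size = splice_pdf("head2.bin", "pixels.mp4", "tail2.bin", "all2.pdf")
    print("wrote all2.pdf,", size, "bytes")
```

Because the body is copied unchanged, the output size is simply the sum of the three input sizes, which is the figure the trailer values have to agree with.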
The secondary part is how to use text to overlay a cover image in as few lines as possible.
That is now scripted to add any video.mp4 of variable length, so I simply run it or drag and drop from the console; a dialog can show progress and feedback and collect inputs such as names or dimensions via mp4toPDF video.mp4 [output.pdf]. The next step is for the user to add the caption (and perhaps other scalars) as variable argument(s).
The number of PDF viewers supporting Rich Media is dwindling. I can't use Acrobat or Edge either, so it seems I need to use Tracker (below), which is much more versatile and has many other advantages, but is Windows only.
Or cross-platform Foxit. However, on Windows Foxit is far inferior, with no resize, search bar, or other floating controls.
So currently I can add via script, and run in either editor/viewer, an mp4, wmv, or other video under 2 GB. The field (locked aspect in Foxit) has no cover (plain white). If I move an image over the top it seems to block the action, but under white it is unseen, so I need to resolve that transparency issue; for now I have settled on stamping a bigger white area to keep the run button visible. I am still having some issues with auto stamping: it affects the run button even when the two are not overlapping.
Breaking news, OMG, no idea in which way this will help anyone, other than the "Revenue Men":
Microsoft Edge’s new PDF viewer is powered by Adobe, and it won’t let you forget that. In an announcement on its website, Microsoft says it’s replacing Edge’s existing PDF viewer with one from Adobe Acrobat, which includes some “advanced” features that are available if you’re willing to pay for them.
Video controls are disabled in Adobe Acrobat and are not supported by web browsers. If you want to add video with video controls, you can use Adobe Acrobat DC Pro, and you can automate it using the Action Wizard.
Check this out https://helpx.adobe.com/acrobat/using/action-wizard-acrobat-pro.html
You can write a Python script to embed your video.
With the PyPDF2 API, you can use the addAttachment method to embed your video.
https://github.com/mstamy2/PyPDF2

ImageMagick issue on AppEngine Standard (PDFs and NodeJS)

I am using App Engine Standard. Since ImageMagick is available on it, I tried a few PDF manipulation libraries; basically, all I would like to do is convert a PDF into an image.
The issue I am getting is this:
'convert-im6.q16: not authorized /tmp/ygM1sF-Txq00JkGbpal8YWBQ.pdf\'
# error/constitute.c/ReadImage/412.\nconvert-im6.q16: no images
defined/tmp/ygM1sF-Txq00JkGbpal8YWBQ-0.png\' #
error/convert.c/ConvertImageCommand/3258.\n' }
After some research, I found this post: Fix for ImageMagick convert errors with pdf files. Here is what it says:
PDF files on Linux systems are usually handled by ghostscript (via the
terminal command gs). And, ImageMagick (done through the terminal
convert command) uses ghostscript for reading and writing PDF files.
Because the security problems are serious and numerous, ImageMagick’s
access to PDF files is then cut off.
Granted, through these security flaws in PDF someone could craft a
malicious image file that, when converted by ImageMagick into a PDF,
will then do very nasty things to your computer.
But, ghostscript has since been updated once and once again with
security fixes. How about a fix for ImageMagick to get PDF
functionality back? Or, at least an explanation of progress towards
fixing this issue?
I can't change the ImageMagick configuration on App Engine Standard, but I wonder if there is something else I can do. Or maybe the engineers at Google would be able to update ImageMagick instead and remove that limitation?
I really need to convert PDFs into images, so I wonder if it is worth waiting, or if I need to find another solution.
Thanks for your ideas.
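Since the quoted post says ImageMagick only delegates PDF reading to Ghostscript, one thing to try is calling Ghostscript directly, which does not go through ImageMagick's policy check. The sketch below is in Python and rests on an assumption that the question does not confirm: that a `gs` binary is present and executable in the runtime. The helper names `gs_png_command` and `render_pdf` are made up for illustration.

```python
import subprocess


def gs_png_command(pdf_path, out_pattern, dpi=150, binary="gs"):
    """Build a Ghostscript command line that renders each PDF page to a PNG file.

    out_pattern should contain a printf-style page counter, e.g. '/tmp/page-%03d.png'.
    """
    return [
        binary,
        "-dSAFER",            # restrict file access while interpreting the PDF
        "-dBATCH",            # exit when done instead of waiting at a prompt
        "-dNOPAUSE",          # do not pause between pages
        "-sDEVICE=png16m",    # 24-bit colour PNG output device
        f"-r{dpi}",           # rendering resolution in dots per inch
        f"-sOutputFile={out_pattern}",
        pdf_path,
    ]


def render_pdf(pdf_path, out_pattern, dpi=150):
    """Run Ghostscript; raises CalledProcessError if the conversion fails."""
    subprocess.run(gs_png_command(pdf_path, out_pattern, dpi), check=True)


if __name__ == "__main__":
    print(" ".join(gs_png_command("/tmp/input.pdf", "/tmp/page-%03d.png")))
```

If `gs` turns out not to be available in the sandbox, this approach will not help and a different conversion route is needed.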

Setting up an automated pdf comparison on a server

Documentation is generated for every new build. I want to automate the process of comparing the new PDF with the old one and writing the differences to a text or image file. I also have the option of comparing large HTML folders or CHM files (which is complicated, as they are compiled files). How should I go about doing this? (I am looking for freeware Python tools.)
I have looked into pdf-diff, a Python tool that does exactly what I want. But since I am working on a Windows machine without Visual Studio, when I try to install it using pip I get the error "Unable to find vcvarsall.bat".
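A text-only comparison needs nothing beyond the Python standard library once the text of each PDF has been extracted. The sketch below assumes the extraction is done by a separate tool (for example `pdftotext` from Poppler) and only shows the diff step; the function name `text_diff` is made up for illustration, and layout or image changes will not be detected this way.

```python
import difflib


def text_diff(old_text, new_text, old_name="old.pdf", new_name="new.pdf"):
    """Return a unified diff (list of lines) between two extracted document texts."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_name,
            tofile=new_name,
            lineterm="",
        )
    )


if __name__ == "__main__":
    old = "Chapter 1\nInstall the tool.\nRun the tool."
    new = "Chapter 1\nInstall the tool.\nRun the updated tool."
    for line in text_diff(old, new):
        print(line)
```

An empty result means the extracted texts are identical, which makes the function easy to use as a pass/fail check in a build script.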

Pentaho 7.1 PDF wrong diacritic

We have installed a new Linux server for our Pentaho installation, but I am having a problem with diacritics in generated PDF files.
For web (HTML), I have set encoding to utf-8 which is working perfectly.
BUT for PDF, the utf-8 encoding is not working. I "fixed" it on the old server by setting the CP-1250 encoding, but I don't want to use the old standard anymore, so I have been trying to fix it properly.
I have set option in pentaho-server/tomcat/webapps/pentaho/WEB-INF/classes/classic-engine.properties to
org.pentaho.reporting.engine.classic.core.modules.output.pageable.pdf.Encoding=UTF-8
but PDF versions of the report are still dropping letters with diacritics.
So my thought is that there must be some PDF encoding setting above this, perhaps some global PDF generator setting, or perhaps Java or Linux itself?
Is anyone able to give me a hint on where I should look and what to check?

Convert .xls to .pdf using LibreOffice via Command Line

I'm trying to convert a .xls file to .pdf using LibreOffice via command line on Ubuntu. I have a kind of report on the .xls file with some colors in the background of the cells and etc.
The problem is that when I convert the .xls file, the .pdf loses the original format. Each page is split almost in half, and the content of one page is displayed on two different pages.
Does anybody know how to convert the .xls file to .pdf via command line while keeping the original format?
Or some trick to set the size of the .pdf page so that pages are not broken? (Also via command line.)
The code I used to make the conversion was:
soffice --headless --convert-to pdf:"impress_pdf_Export" filename.xls
If you use LibreOffice to convert Microsoft Excel (XLS) files to PDF documents, this is a two-step process (even if your command looks like a one-step process):
1. Import the XLS into LibreOffice (even if started with --headless).
2. Export the PDF from LibreOffice.
If the result does not look like you expect (not similar enough to Excel's native PDF export), then start by debugging the first step above:
1. Open the XLS file with LibreOffice in a GUI. Does it look like you expect it to look? Or do some formatting options look weird?
2. Export the PDF from there (with the GUI). Are the page dimensions as you expect? Did you set them up how you prefer? Are the margins like you want them? Etc.
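Once the file looks right in the GUI, the headless export can be scripted. A minimal sketch in Python follows, assuming `soffice` is on the PATH; the helper name `soffice_pdf_command` is made up for illustration. Note that it names the Calc export filter, `calc_pdf_Export`, whereas the command in the question names the Impress filter. Whether that alone fixes the page breaks is not certain, because the page layout comes from the page style stored in the spreadsheet itself.

```python
import subprocess


def soffice_pdf_command(xls_path, outdir=".", binary="soffice"):
    """Build a headless LibreOffice command that exports a spreadsheet to PDF."""
    return [
        binary,
        "--headless",
        "--convert-to", "pdf:calc_pdf_Export",  # Calc's PDF export filter
        "--outdir", outdir,                     # directory that receives the PDF
        xls_path,
    ]


def convert_to_pdf(xls_path, outdir="."):
    """Run the conversion; raises CalledProcessError if soffice reports failure."""
    subprocess.run(soffice_pdf_command(xls_path, outdir), check=True)


if __name__ == "__main__":
    print(" ".join(soffice_pdf_command("filename.xls", outdir="out")))
```

Page size and scaling (for example fitting the sheet to one page width) are best set in the spreadsheet's page style and saved with the file, so that the headless import picks them up.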
If you are working on Windows, you may also want to consider OfficeToPDF.exe. It is hosted on CodePlex, licensed with the Apache 2.0 License and available in binary and in source code.
It requires a working Office 2013, Office 2010 or Office 2007 installation. But then it can commandline- and batch-convert to PDF various MS Office-based file formats, including XLS(X), PPT(X), DOC(X), VSD(X) and PUB as well as Libre/OpenOffice-based ODT, ODS and ODC files.
Although this is a little off from the initial question (you don't really need LibreOffice if you have the Office suite and are on a Windows machine), I do appreciate the follow-up provided by Kurt. It prompted me to post the following Gist, offering some clear instructions on how to use the .exe in a for loop.
https://gist.github.com/einsty/2189cae4175f619cff0f
Try copying the appropriate font file (for me it's a simsun.ttc file) to your LibreOffice installation directory, e.g. '/opt/libreoffice4.2/share/fonts/truetype'. But if a single Excel sheet is too wide for the print page (something like A4), it will still break across pages.
