Can't load PDF with Wand/ImageMagick in Google Cloud Function - python-3.x

Trying to load a PDF from the local file system and getting a "not authorized" error.
"File "/env/local/lib/python3.7/site-packages/wand/image.py", line 4896, in read self.raise_exception() File "/env/local/lib/python3.7/site-packages/wand/resource.py", line 222, in raise_exception raise e wand.exceptions.PolicyError: not authorized `/tmp/tmp_iq12nws' # error/constitute.c/ReadImage/412
The PDF file is successfully saved from GCS to the local 'server', but Wand won't load it. Loading images into OpenCV isn't an issue; the error only occurs when loading PDFs with Wand/ImageMagick.
The code that downloads the PDF from GCS to the local file system and loads it into Wand/ImageMagick is below:
import tempfile
from google.cloud import storage
from wand.image import Image

STORAGE_CLIENT = storage.Client()
_, temp_local_filename = tempfile.mkstemp()
gcs_blob = STORAGE_CLIENT.bucket('XXXX').get_blob(results["storedLocation"])
gcs_blob.download_to_filename(temp_local_filename)
# load the pdf into a set of images using imagemagick
with Image(filename=temp_local_filename, resolution=200) as source:
    pass  # run through pages and save images etc.
ImageMagick should be authorised to access files on the local filesystem, so it should load the file rather than raise this 'not authorized' error.

PDF reading by ImageMagick has been disabled because of a security vulnerability in Ghostscript. The behaviour is by design, a security mitigation from the ImageMagick team, and it will remain until ImageMagick re-enables Ghostscript processing of PDFs and Google Cloud Functions updates to that new version of ImageMagick with PDF processing enabled again.
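For context, on systems you control this restriction shows up as a coder rule in ImageMagick's policy.xml (the location varies by install, e.g. /etc/ImageMagick-6/policy.xml). An entry like the following is what triggers the 'not authorized' PolicyError; on your own machine you could relax it to rights="read|write", but the Cloud Functions runtime image isn't yours to edit:
<policy domain="coder" rights="none" pattern="PDF" />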
There's no fix for the ImageMagick/Wand issue in GCF that I could find, but as a workaround for converting PDFs to images in Google Cloud Functions you can use the ghostscript Python wrapper to request the PDF-to-image conversion from Ghostscript directly, bypassing ImageMagick/Wand. You can then load the PNGs into ImageMagick or OpenCV without issue.
requirements.txt
google-cloud-storage
ghostscript==0.6
opencv-python-headless
main.py
import glob
import locale
import tempfile

import cv2
import ghostscript

# create a temp filename and save a local copy of pdf from GCS
_, temp_local_filename = tempfile.mkstemp()
gcs_blob = STORAGE_CLIENT.bucket('XXXX').get_blob(results["storedLocation"])
gcs_blob.download_to_filename(temp_local_filename)

# create a temp folder to hold the exported pages
temp_local_dir = tempfile.mkdtemp()

# use ghostscript to export the pdf into pages as pngs in the temp dir
args = [
    "pdf2png",  # actual value doesn't matter
    "-dSAFER",
    "-sDEVICE=pngalpha",
    "-o", temp_local_dir + "/page-%03d.png",
    "-r300", temp_local_filename
]
# ghostscript requires the arguments as bytes, so encode them
encoding = locale.getpreferredencoding()
args = [a.encode(encoding) for a in args]

# run the request through ghostscript
ghostscript.Ghostscript(*args)

# read the files in the tmp dir and process the pngs individually
for png_file_loc in glob.glob(temp_local_dir + "/*.png"):
    # load each saved PNG into OpenCV and do what you want
    cv_image = cv2.imread(png_file_loc, cv2.IMREAD_UNCHANGED)
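One housekeeping note: in Cloud Functions, /tmp is an in-memory filesystem that counts against your function's memory allocation, so it's worth removing the intermediates once you're done with them. A minimal sketch using the names above:
import os
import shutil

os.remove(temp_local_filename)  # the downloaded PDF
shutil.rmtree(temp_local_dir)   # the exported PNGs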
Hope this helps someone facing the same issue.

Related

Python ZipFile module problem when file is encrypted

I have the following short program
from zipfile import ZipFile
procFile1 ="C:\\Temp\\XLFile-Demo.zip"
procFile2 ="C:\\Temp2\\XLFile-Demo-PW123.zip"
# Unencrypted file
print ("Unencrypted file")
myzip1 = ZipFile(procFile1)
print (myzip1.infolist())
myzip1.extractall("C:\\Temp")
# Encrypted File
print ("Encrypted file")
myzip2 = ZipFile(procFile2)
print (myzip2.infolist())
myzip2.setpassword(bytes('123', 'utf-8'))
myzip2.extractall("C:\\Temp2")
At this Amazon Drive link are the two files. They are identical except that one zip is protected with the password 123.
Executing the above code successfully extracts the unencrypted one but raises the error NotImplementedError: That compression method is not supported for the other.
Unencrypted file
[<ZipInfo filename='XLFile-Demo.xlsx' compress_type=deflate external_attr=0x20 file_size=31964 compress_size=29252>]
Encrypted file
[<ZipInfo filename='XLFile-Demo.xlsx' compress_type=99 external_attr=0x20 file_size=31964 compress_size=29280>]
Am I doing anything wrong on my end?
The error came up because the file was zipped using WinRAR's ZIP option. I installed 7-Zip, created the archive with it, and it is working.
The infolist() for the 7-Zip file is the following:
[<ZipInfo filename='XLFile-Demo.xlsx' compress_type=deflate external_attr=0x20 file_size=31964 compress_size=29340>]
Incidentally, WinRAR can handle this file, and 7-Zip can correctly process the encrypted ZIP archive created by WinRAR.
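For background on why it fails: compress_type=99 in the infolist is the marker for the WinZip AES encryption extension, which the standard-library zipfile module does not implement; it only handles the legacy ZipCrypto scheme that 7-Zip used here (hence compress_type=deflate). If re-zipping isn't an option, the third-party pyzipper package can read AES-encrypted archives; a minimal sketch with the same file and password:
import pyzipper

with pyzipper.AESZipFile("C:\\Temp2\\XLFile-Demo-PW123.zip") as myzip:
    myzip.setpassword(bytes('123', 'utf-8'))
    myzip.extractall("C:\\Temp2")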

How can I generate a PDF with custom fonts using AWS Lambda?

I have an AWS Lambda function that generates PDFs using the html-pdf library with custom fonts.
At first, I imported my fonts externally from Google Fonts, but then the PDF's size grew about tenfold.
So I tried to import my fonts locally with src('file:///var/task/fonts/...ttf/woff2'), but still no luck.
Lastly, I tried to create a fonts folder in the main project, added all of my fonts, plus the file fonts.config:
<?xml version="1.0"?>
<!DOCTYPE fontconfig SYSTEM "fonts.dtd">
<fontconfig>
  <dir>/var/task/fonts/</dir>
  <cachedir>/tmp/fonts-cache/</cachedir>
  <config></config>
</fontconfig>
and set the following env:
FONTCONFIG_PATH = /var/task/fonts
but still no luck (I haven't installed fontconfig, since I'm not sure how to, or whether I need to).
My runtime env is Node.js 8.1.0.
You can upload your fonts into an S3 bucket and then download them to the Lambda's /tmp directory during its execution. In case your lib creates .pkl files, you should first change your working directory to /tmp (Lambda is not allowed to write in the default root directory).
The following Python code downloads your files from a /fonts directory in an S3 bucket to /tmp/fonts "local" directory.
import os
import boto3

os.chdir('/tmp')
os.mkdir(os.path.join('/tmp/', 'fonts'))

s3 = boto3.resource('s3')
s3_client = boto3.client('s3')
my_bucket = s3.Bucket("bucket_name")

for file in my_bucket.objects.filter(Prefix="fonts/"):
    filename = file.key
    short_filename = filename.replace('fonts/', '')
    if len(short_filename) > 0:
        s3_client.download_file(
            "bucket_name",  # was an undefined name `bucket` in the original
            filename,
            "/tmp/fonts/" + short_filename,
        )
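If fontconfig is what resolves your fonts, you will likely also need to point it at the downloaded directory before rendering. A one-line sketch (assumption: a fonts.conf listing /tmp/fonts as a <dir> was copied there along with the fonts):
import os
os.environ['FONTCONFIG_PATH'] = '/tmp/fonts'  # assumes fonts.conf lives here too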

How to get WKHTMLTOPDF working on Heroku?

I created a website which generates PDFs using pdfkit, and I know how to install it and set up the environment-variable path on Windows. I managed to deploy my first website on Heroku, but now I'm getting the error No wkhtmltopdf executable found: "b''" when trying to generate the PDF.
I have no idea how to install and set up wkhtmltopdf on Heroku, because this is the first time I'm dealing with Linux.
I really tried everything before asking this, but even following this did not work for me:
Python 3 flask install wkhtmltopdf on heroku
If possible, please guide me step by step on how to install and set this up.
I followed all the resources and everything but couldn't make it work. Every time I get the same error.
I'm using Django version 2 and Python version 3.7.
This is what I get if I run heroku stack:
Available Stacks
cedar-14
container
heroku-16
* heroku-18
The error I'm getting when generating the PDF:
No wkhtmltopdf executable found: "b''"
If this file exists please check that this process can read it. Otherwise please install wkhtmltopdf - https://github.com/JazzCore/python-pdfkit/wiki/Installing-wkhtmltopdf
My website works very well on localhost without any problem, so I'm sure that I have done something wrong in installing wkhtmltopdf.
Thank you
It's non-trivial. If you want to avoid all of the headache below, you can just use my service, api2pdf: https://github.com/api2pdf/api2pdf.python. Otherwise, if you want to try and work through it, see below.
1) Add this to your requirements.txt to install a special wkhtmltopdf pack for Heroku, as well as pdfkit.
git+git://github.com/johnfraney/wkhtmltopdf-pack.git
pdfkit==0.6.1
2) I created a pdf_manager.py in my Flask app. In pdf_manager.py I have a method:
import os, platform, subprocess
import pdfkit

def _get_pdfkit_config():
    """wkhtmltopdf lives and functions differently on Windows and Linux. We
    need to support both since we develop on Windows but deploy on Heroku.

    Returns:
        A pdfkit configuration
    """
    if platform.system() == 'Windows':
        return pdfkit.configuration(
            wkhtmltopdf=os.environ.get('WKHTMLTOPDF_BINARY',
                                       'C:\\Program Files\\wkhtmltopdf\\bin\\wkhtmltopdf.exe'))
    else:
        WKHTMLTOPDF_CMD = subprocess.Popen(
            ['which', os.environ.get('WKHTMLTOPDF_BINARY', 'wkhtmltopdf')],
            stdout=subprocess.PIPE).communicate()[0].strip()
        return pdfkit.configuration(wkhtmltopdf=WKHTMLTOPDF_CMD)
The reason I have the platform check in there is that I develop on a Windows machine with the local wkhtmltopdf binary on my PC, but when I deploy to Heroku it runs in their Linux containers, so I need to detect which platform we're on before invoking the binary.
3) Then I created two more methods - one to convert a url to pdf and another to convert raw html to pdf.
def make_pdf_from_url(url, options=None):
    """Produces a pdf from a website's url.

    Args:
        url (str): A valid url
        options (dict, optional): for specifying pdf parameters like landscape
            mode and margins

    Returns:
        pdf of the website
    """
    return pdfkit.from_url(url, False, configuration=_get_pdfkit_config(), options=options)


def make_pdf_from_raw_html(html, options=None):
    """Produces a pdf from raw html.

    Args:
        html (str): Valid html
        options (dict, optional): for specifying pdf parameters like landscape
            mode and margins

    Returns:
        pdf of the supplied html
    """
    return pdfkit.from_string(html, False, configuration=_get_pdfkit_config(), options=options)
I use these methods to convert to PDF.
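For illustration, a call site might look like this (the URL, options, and output path are hypothetical; option names follow wkhtmltopdf's flags):
pdf_bytes = make_pdf_from_url('https://example.com', options={'orientation': 'Landscape'})
with open('/tmp/out.pdf', 'wb') as f:
    f.write(pdf_bytes)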
Just follow these steps to deploy a Django app using pdfkit on Heroku:
Step 1: Add the following packages to your requirements.txt file
wkhtmltopdf-pack==0.12.3.0
pdfkit==0.6.0
Step 2: Add the lines below to views.py to set the path of the wkhtmltopdf binary
import os, sys, subprocess, platform
import pdfkit

if platform.system() == "Windows":
    pdfkit_config = pdfkit.configuration(
        wkhtmltopdf=os.environ.get('WKHTMLTOPDF_BINARY',
                                   'C:\\Program Files\\wkhtmltopdf\\bin\\wkhtmltopdf.exe'))
else:
    os.environ['PATH'] += os.pathsep + os.path.dirname(sys.executable)
    WKHTMLTOPDF_CMD = subprocess.Popen(
        ['which', os.environ.get('WKHTMLTOPDF_BINARY', 'wkhtmltopdf')],
        stdout=subprocess.PIPE).communicate()[0].strip()
    pdfkit_config = pdfkit.configuration(wkhtmltopdf=WKHTMLTOPDF_CMD)
Step 3: Then pass pdfkit_config as an argument, as below:
pdf = pdfkit.from_string(html, False, options, configuration=pdfkit_config)

Zip files in the cloud (Amazon S3) without writing them to a local file first (no write privileges)

I need to zip some files in Amazon S3 without writing them to a local file first. My code worked in development, but I don't have many write privileges in production.
import os
import zipfile
from io import BytesIO

folder = output_dir
files = fs.glob(folder)
f = BytesIO()
zip = zipfile.ZipFile(f, 'a', zipfile.ZIP_DEFLATED)
for file in files:
    filename = os.path.basename(file)
    image = fs.get(file, filename)  # downloads to a local file
    zip.write(filename)
zip.close()
The problem in production is at this line:
image = fs.get(file, filename)
because I don't have write privileges.
My last resort is to write to the /tmp/ directory, which I do have privileges to.
Is there a way to zip files from a URL path, or directly in the cloud?
I ended up using Python's tempfile module, which turned out to be a perfect solution.
Using NamedTemporaryFile gave me the guarantee of named, system-visible temporary files that are deleted automatically. No manual work.
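For anyone who lands here, a minimal sketch of that approach (assuming fs and output_dir from the question, with fs being an s3fs-style filesystem object; NamedTemporaryFile defaults to /tmp, which is writable):
import os
import tempfile
import zipfile

with tempfile.NamedTemporaryFile(suffix='.zip', delete=False) as tmp:
    with zipfile.ZipFile(tmp, 'w', zipfile.ZIP_DEFLATED) as zf:
        for file in fs.glob(output_dir):
            with tempfile.NamedTemporaryFile() as img:
                fs.get(file, img.name)  # download into a writable temp file
                zf.write(img.name, arcname=os.path.basename(file))
    archive_path = tmp.name  # the finished zip; upload it back to S3 from here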

AWS Lambda function - convert PDF to Image

I am developing an application where users can upload drawings in PDF format. Uploaded files are stored on S3. After uploading, the files have to be converted to images. For this purpose I have created a Lambda function which downloads the file from S3 to the /tmp folder in the Lambda execution environment and then calls the convert command from ImageMagick:
convert sourceFile.pdf targetFile.png
The Lambda runtime environment is Node.js 4.3. Memory is set to 128 MB, timeout 30 sec.
Now the problem is that some files are converted successfully while others fail with the following error:
{ [Error: Command failed: /bin/sh -c convert /tmp/sourceFile.pdf /tmp/targetFile.png
convert: `%s' (%d) "gs" -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000
  -dAlignToPixels=0 -dGridFitTT=2 "-sDEVICE=pngalpha" -dTextAlphaBits=4 -dGraphicsAlphaBits=4
  "-r72x72" "-sOutputFile=/tmp/magick-QRH6nVLV--0000001" "-f/tmp/magick-B610L5uo" "-f/tmp/magick-tIe1MjeR"
  # error/utility.c/SystemCommand/1890.
convert: Postscript delegate failed `/tmp/sourceFile.pdf': No such file or directory # error/pdf.c/ReadPDFImage/678.
convert: no images defined `/tmp/targetFile.png' # error/convert.c/ConvertImageCommand/3046. ]
  killed: false, code: 1, signal: null,
  cmd: '/bin/sh -c convert /tmp/sourceFile.pdf /tmp/targetFile.png' }
At first I did not understand why this happens, so I tried to convert the problematic files on my local Ubuntu machine with the same command. This is the output from the terminal:
**** Warning: considering '0000000000 XXXXX n' as a free entry.
**** This file had errors that were repaired or ignored.
**** The file was produced by:
**** >>>> Mac OS X 10.10.5 Quartz PDFContext <<<<
**** Please notify the author of the software that produced this
**** file that it does not conform to Adobe's published PDF
**** specification.
So the message was very clear, but the file gets converted to PNG anyway. If I first do convert source.pdf target.pdf and after that convert target.pdf image.png, the file is repaired and converted without any errors. This doesn't work with Lambda.
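(For reference, on a machine with a newer Ghostscript the same repair can be done in one step; the pdfwrite device rewrites the file as a clean PDF. Paths are illustrative:)
gs -o /tmp/repaired.pdf -sDEVICE=pdfwrite /tmp/sourceFile.pdf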
Since the same thing works in one environment but not the other, my best guess is that the version of Ghostscript is the problem. The installed version on the AMI is 8.70; on my local machine the Ghostscript version is 9.18.
My questions are:
1. Is the version of Ghostscript the problem? Is this a bug in the older version of Ghostscript? If not, how can I tell Ghostscript (with or without using ImageMagick) to repair or ignore errors like it does in my local environment?
2. If the old version is the problem, is it possible to build Ghostscript from source, create a Node.js module, and then use that version of Ghostscript instead of the one that is installed?
3. Is there an easier way to convert a PDF to an image without using ImageMagick and Ghostscript?
UPDATE
Relevant part of the Lambda code:
var exec = require('child_process').exec;
var AWS = require('aws-sdk');
var fs = require('fs');
...
var localSourceFile = '/tmp/sourceFile.pdf';
var localTargetFile = '/tmp/targetFile.png';

var writeStream = fs.createWriteStream(localSourceFile);
writeStream.write(body);
writeStream.end();

writeStream.on('error', function (err) {
    console.log("Error writing data from s3 to tmp folder.");
    context.fail(err);
});

writeStream.on('finish', function () {
    var cmd = 'convert ' + localSourceFile + ' ' + localTargetFile;
    exec(cmd, function (err, stdout, stderr) {
        if (err) {
            console.log("Error executing convert command.");
            context.fail(err);
        }
        if (stderr) {
            console.log("Command executed successfully but returned error.");
            context.fail(stderr);
        } else {
            // file converted successfully - do something...
        }
    });
});
You can find a compiled version of Ghostscript for Lambda in the following repository; add its files to the zip that you upload as your source code to AWS Lambda:
https://github.com/sina-masnadi/lambda-ghostscript
This is an npm package to call Ghostscript functions:
https://github.com/sina-masnadi/node-gs
After copying the compiled Ghostscript files to your project and adding the npm package, you can use the executablePath('path to ghostscript') function to point the package to the compiled Ghostscript files that you added earlier.
It's almost certainly a bug, or perhaps a limitation, of the older version of Ghostscript.
Many PDF producers create PDF files which do not conform to the specification, yet open without complaint in Adobe Acrobat. Ghostscript endeavours to do the same, but obviously we can't know what Acrobat is going to allow, so we are continually chasing a nebulous target. (FWIW, that warning does indicate a genuinely out-of-spec PDF file.)
There's nothing you can do with the old version other than replace it.
Yes, you can build Ghostscript from source. I have no idea about a Node.js module; I'm not sure why that's relevant.
There are numerous other applications which will render a PDF file; MuPDF is another one I know of. And, of course, you can use Ghostscript directly without using ImageMagick. If you can load another application, then you should be able to simply replace your Ghostscript installation too.
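For example, a direct Ghostscript invocation along these lines renders each page to a PNG without involving ImageMagick (paths and resolution are illustrative):
gs -dSAFER -sDEVICE=pngalpha -r150 -o /tmp/page-%03d.png /tmp/sourceFile.pdf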
The version of Ghostscript on AWS is an old one with known bugs. We can get around this by uploading an x64 Ghostscript binary compiled specifically for Linux, attached via the new AWS Lambda layers. I have written a Node function that does just this:
https://github.com/rcastoro/PDFImagine
Make sure you have that Ghostscript layer attached to your Lambda, however!
