pytest fails when using io.BytesIO stream instead of PDF file - python-3.x

I'm running pytest to check a function that uses pdfminer to convert PDF to text. The function works when doing $ python function.py and the result is what I expect it to be. I should also point out that I'm using a stream when parsing the file (io.BytesIO) and this stream is the reason my test fails.
Running pytest the function fails with a PDFSyntaxError.
# function.py
...
from pdfminer.pdfparser import PDFParser
from pdfminer.document import PDFDocument
req = requests.get(url_pointing_to_pdf_file)
pdf = io.BytesIO(req.content)
parser = PDFParser(pdf)
document = PDFDocument(parser, password=None) # this fails
...
pytest calls the init method in pdfdocument.py (part of the pdfminer library) and stops here:
for xref in xrefs:
trailer = xref.get_trailer()
...
if 'Root' in trailer:
self.catalog = dict_value(trailer['Root'])
break
else:
raise PDFSyntaxError('No /Root object! - Is this really a PDF?')
...
And this is what pytest shows when testing the function fails:
tests/test_function.py:11:
----------------------------------------------------
.../function.py:157: in function
**document = PDFDocument(parser, password=None)**
...
E pdfminer.pdfparser.PDFSyntaxError: No /Root object! - Is this really a PDF?
lib/python3.6/site-packages/pdfminer/pdfdocument.py:583:PDFSyntaxError
Running the test with a PDF file stored in the same directory as function.py is successful, so the culprit is the io.BytesIO format of the downloaded PDF file. Since I want to use a stream with function.py I would like to know if there is a better way to do this.

Related

Alternative methods for deprecated gym.wrappers.Monitor()

The example is from Chapter 8 of Grokking Deep Reinforcement Learning written by Miguel Morales.
Please click here
The wrappers.Monitor is deprecated after the book is published. The code in question is as below:
env = wrappers.Monitor(
env, mdir, force=True,
mode=monitor_mode,
video_callable=lambda e_idx: record) if monitor_mode else env
I searched the internet and tried 2 methods but failed.
1- gnwrapper.Monitor
I pip install gym-notebook-wrapper and import gnwrapper, then rewrite the code
env = gnwrapper.Monitor(gym.make(env_name),directory="./")
A [WinError 2] The system cannot find the file specified error message is returned.
2- gym.wrappers.RecordVideo
I from gym.wrappers import RecordVideo then rewrite the code
env = RecordVideo(gym.make(env_name), "./")
A AttributeError: 'CartPoleEnv' object has no attribute 'videos' error message is returned.
Is there any way to solve the problem?
After some experiments, the best way to work around the problem is to downgrade the gym version to 0.20.0, in order to preserve the wrappers.Monitor function.
Moreover, the subprocess.Popen and subprocess.check_output does not work as the original author suggested, so I use moviepy (please pip install moviepy if you do not have this libarary and from moviepy.editor import VideoFileClip at the very beginning) to change MP4 to GIF files. The code in get_gif_html is amended as follows.
gif_path = basename + '.gif'
if not os.path.exists(gif_path):
videoClip = VideoFileClip(video_path)
videoClip.write_gif(gif_path, logger=None)
gif = io.open(gif_path, 'r+b').read()
The program now completely works.
Edit (24 Jul 2022):
The following solution is for gym version 0.25.0, as I hope the program works for later versions. Windows 10 and Juptyer Notebook is used for the demonstration.
(1) Maintain the moviepy improvement stated above
(2) import from gym.wrappers.record_video import RecordVideo and substitute
env = wrappers.Monitor(
env, mdir, force=True,
mode=monitor_mode,
video_callable=lambda e_idx: record) if monitor_mode else
with this
if monitor_mode:
env = RecordVideo(env, mdir, episode_trigger=lambda e_idx:record)
env.reset()
env.start_video_recorder()
else:
env = env
RecordVideo function takes (env, video_folder, episode_trigger) arguments, and video_callable and episode_trigger are the same thing.
(3) Loss of env.videos in get_gif_html function in new versions.
In the old setting, after the video is recorded, the MP4 and meta.json files are stored in the following address with a random file name
C:\Users\xxx\AppData\Local\Temp
https://i.stack.imgur.com/Za3Sd.jpg
This env.videos function returns a list of tuples, each tuples consisted of address path of MP4 and meta.json.
I tried to recreate the env.videos by
(a) declare the mdir as global variable
def get_make_env_fn(**kargs):
def make_env_fn(env_name, seed=None, render=None, record=False,
unwrapped=False, monitor_mode=None,
inner_wrappers=None, outer_wrappers=None):
global mdir
mdir = tempfile.mkdtemp()
(b) in the demo_progression and demo_last function, grouping all MP4 and meta_json files, zipping them and use list comprehension to make a new env_videos variable (as opposed to env.videos)
env.close()
mp4_files = ([(mdir + '\\' + file) for file in os.listdir(mdir) if file.endswith('.mp4')])
meta_json_files = ([(mdir + '\\' + file) for file in os.listdir(mdir) if file.endswith('.meta.json')])
env_videos = [(mp4, meta_json) for mp4, meta_json in zip(mp4_files, meta_json_files)]
data = get_gif_html(env_videos=env_videos,
These are all the changes.

Google cloud function with wand stopped working

I have set up 3 Google Cloud Storge buckets and 3 functions (one for each bucket) that will trigger when a PDF file is uploaded to a bucket. Functions convert PDF to png image and do further processing.
When I am trying to create a 4th bucket and similar function, strangely it is not working. Even if I copy one of the existing 3 functions, it is still not working and I am getting this error:
Traceback (most recent call last): File "/env/local/lib/python3.7/site-packages/google/cloud/functions_v1beta2/worker.py", line 333, in run_background_function _function_handler.invoke_user_function(event_object) File "/env/local/lib/python3.7/site-packages/google/cloud/functions_v1beta2/worker.py", line 199, in invoke_user_function return call_user_function(request_or_event) File "/env/local/lib/python3.7/site-packages/google/cloud/functions_v1beta2/worker.py", line 196, in call_user_function event_context.Context(**request_or_event.context)) File "/user_code/main.py", line 27, in pdf_to_img with Image(filename=tmp_pdf, resolution=300) as image: File "/env/local/lib/python3.7/site-packages/wand/image.py", line 2874, in __init__ self.read(filename=filename, resolution=resolution) File "/env/local/lib/python3.7/site-packages/wand/image.py", line 2952, in read self.raise_exception() File "/env/local/lib/python3.7/site-packages/wand/resource.py", line 222, in raise_exception raise e wand.exceptions.PolicyError: not authorized/tmp/tmphm3hiezy' # error/constitute.c/ReadImage/412`
It is baffling me why same functions are working on existing buckets but not on new one.
UPDATE:
Even this is not working (getting "cache resources exhausted" error):
In requirements.txt:
google-cloud-storage
wand
In main.py:
import tempfile
from google.cloud import storage
from wand.image import Image
storage_client = storage.Client()
def pdf_to_img(data, context):
file_data = data
pdf = file_data['name']
if pdf.startswith('v-'):
return
bucket_name = file_data['bucket']
blob = storage_client.bucket(bucket_name).get_blob(pdf)
_, tmp_pdf = tempfile.mkstemp()
_, tmp_png = tempfile.mkstemp()
tmp_png = tmp_png+".png"
blob.download_to_filename(tmp_pdf)
with Image(filename=tmp_pdf) as image:
image.save(filename=tmp_png)
print("Image created")
new_file_name = "v-"+pdf.split('.')[0]+".png"
blob.bucket.blob(new_file_name).upload_from_filename(tmp_png)
Above code is supposed to just create a copy of image file which is uploaded to bucket.
Because the vulnerability has been fixed in Ghostscript but not updated in ImageMagick, the workaround for converting PDFs to images in Google Cloud Functions is to use this ghostscript wrapper and directly request the PDF conversion to png from Ghostscript (bypassing ImageMagick).
requirements.txt
google-cloud-storage
ghostscript==0.6
main.py
import locale
import tempfile
import ghostscript
from google.cloud import storage
storage_client = storage.Client()
def pdf_to_img(data, context):
file_data = data
pdf = file_data['name']
if pdf.startswith('v-'):
return
bucket_name = file_data['bucket']
blob = storage_client.bucket(bucket_name).get_blob(pdf)
_, tmp_pdf = tempfile.mkstemp()
_, tmp_png = tempfile.mkstemp()
tmp_png = tmp_png+".png"
blob.download_to_filename(tmp_pdf)
# create a temp folder based on temp_local_filename
# use ghostscript to export the pdf into pages as pngs in the temp dir
args = [
"pdf2png", # actual value doesn't matter
"-dSAFER",
"-sDEVICE=pngalpha",
"-o", tmp_png,
"-r300", tmp_pdf
]
# the above arguments have to be bytes, encode them
encoding = locale.getpreferredencoding()
args = [a.encode(encoding) for a in args]
#run the request through ghostscript
ghostscript.Ghostscript(*args)
print("Image created")
new_file_name = "v-"+pdf.split('.')[0]+".png"
blob.bucket.blob(new_file_name).upload_from_filename(tmp_png)
Anyway, this gets you around the issue and keeps all the processing in GCF for you. Hope it helps. Your code works for single page PDFs though. My use-case was for multipage pdf conversion, ghostscript code & solution in this question.
This actually seems to be a show stopper for ImageMagick related functionalities using PDF format. Similar code deployed by us on Google App engine via custom docker is failing with the same error on missing authorizations.
I am not sure how to edit the policy.xml file on GAE or GCF but a line there has to be changed to:
<policy domain="coder" rights="read|write" pattern="PDF" />
#Dustin: Do you have a bug link where we can see the progress ?
Update:
I fixed it on my Google app engine container by adding a line in docker image. This directly changes the policy.xml file content after imagemagick gets installed.
RUN sed -i 's/rights="none"/rights="read|write"/g' /etc/ImageMagick-6/policy.xml
This is an upstream bug in Ubuntu, we are working on a workaround for App Engine and Cloud Functions.
While we wait for the issue to be resolved in Ubuntu, I followed #DustinIngram's suggestion and created a virtual machine in Compute Engine with an ImageMagick installation. The downside is that I now have a second API that my API in App Engine has to call, just to generate the images. Having said that, it's working fine for me. This is my setup:
Main API:
When a pdf file is uploaded to Cloud Storage, I call the following:
response = requests.post('http://xx.xxx.xxx.xxx:5000/makeimages', data=data)
Where data is a JSON string with the format {"file_name": file_name}
On the API that is running on the VM, the POST request gets processed as follows:
#app.route('/makeimages', methods=['POST'])
def pdf_to_jpg():
file_name = request.form['file_name']
blob = storage_client.bucket(bucket_name).get_blob(file_name)
_, temp_local_filename = tempfile.mkstemp()
temp_local_filename_jpeg = temp_local_filename + '.jpg'
# Download file from bucket.
blob.download_to_filename(temp_local_filename)
print('Image ' + file_name + ' was downloaded to ' + temp_local_filename)
with Image(filename=temp_local_filename, resolution=300) as img:
pg_num = 0
image_files = {}
image_files['pages'] = []
for img_page in img.sequence:
img_page_2 = Image(image=img_page)
img_page_2.format = 'jpeg'
img_page_2.compression_quality = 70
img_page_2.save(filename=temp_local_filename_jpeg)
new_file_name = file_name.replace('.pdf', 'p') + str(pg_num) + '.jpg'
new_blob = blob.bucket.blob(new_file_name)
new_blob.upload_from_filename(temp_local_filename_jpeg)
print('Page ' + str(pg_num) + ' was saved as ' + new_file_name)
image_files['pages'].append({'page': pg_num, 'file_name': new_file_name})
pg_num += 1
try:
os.remove(temp_local_filename)
except (ValueError, PermissionError):
print('Could not delete the temp file!')
return jsonify(image_files)
This will download the pdf from Cloud Storage, create an image for each page, and save them back to cloud storage. The API will then return a JSON file with the list of image files created.
So, not the most elegant solution, but at least I don't need to convert the files manually.

File Partially Download with urllib.request.urlretrieve

I have this code Which is trying to retrieve file from Git Hub Repositories.
import os
import tarfile
from six.moves import urllib
import urllib.request
DOWNLOAD_ROOT = "https://github.com/ageron/handson-ml/tree/master/"
HOUSING_PATH = os.path.join("datasets", "housing").replace("\\","/")
print(HOUSING_PATH)
HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH
print(HOUSING_URL)
print(os.getcwd())
def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
if not os.path.isdir(housing_path):
os.makedirs(housing_path)
tgz_path = os.path.join(housing_path, "housing.tgz").replace("\\","/")
print(tgz_path)
urllib.request.urlretrieve(housing_url, tgz_path)
housing_tgz = tarfile.open(tgz_path)
housing_tgz.extractall(path=housing_path)
housing_tgz.close()
fetch_housing_data()
After Executing the code I got this Error ReadError: file could not be opened successfully. I did checked the actual file size and the file which is download after executing this code and I came to know that file is downloaded partially.
So is their any way to download the whole file ? Thanks in Advance
Finally I got the problem. It was with the link that I was using to retrieve the file. I didn't knew that RAW link should be used along with the file name (Not using file name will give you 404 Error) in Git Hub Repositories.
So I little bit of modification is needs to be done in actual code posted in my question.
That is :
Change the link from
DOWNLOAD_ROOT = "https://github.com/ageron/handson-ml/tree/master/"
To this :
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
And this
HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH
to
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz" \\**( Actual File name is needed)**
Thank you !

MafftCommandline and io.StringIO

I've been trying to use the Mafft alignment tool from Bio.Align.Applications. Currently, I've had success writing my sequence information out to temporary text files that are then read by MafftCommandline(). However, I'd like to avoid redundant steps as much as possible, so I've been trying to write to a memory file instead using io.StringIO(). This is where I've been having problems. I can't get MafftCommandline() to read internal files made by io.StringIO(). I've confirmed that the internal files are compatible with functions such as AlignIO.read(). The following is my test code:
from Bio.Align.Applications import MafftCommandline
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
import io
from Bio import AlignIO
sequences1 = ["AGGGGC",
"AGGGC",
"AGGGGGC",
"AGGAGC",
"AGGGGG"]
longest_length = max(len(s) for s in sequences1)
padded_sequences = [s.ljust(longest_length, '-') for s in sequences1] #padded sequences used to test compatibilty with AlignIO
ioSeq = ''
for items in padded_sequences:
ioSeq += '>unknown\n'
ioSeq += items + '\n'
newC = io.StringIO(ioSeq)
cLoc = str(newC).strip()
cLocEdit = cLoc[:len(cLoc)] #create string to remove < and >
test1Handle = AlignIO.read(newC, "fasta")
#test1HandleString = AlignIO.read(cLocEdit, "fasta") #fails to interpret cLocEdit string
records = (SeqRecord(Seq(s)) for s in padded_sequences)
SeqIO.write(records, "msa_example.fasta", "fasta")
test1Handle1 = AlignIO.read("msa_example.fasta", "fasta") #alignIO same for both #demonstrates working AlignIO
in_file = '.../msa_example.fasta'
mafft_exe = '/usr/local/bin/mafft'
mafft_cline = MafftCommandline(mafft_exe, input=in_file) #have to change file path
mafft_cline1 = MafftCommandline(mafft_exe, input=cLocEdit) #fails to read string (same as AlignIO)
mafft_cline2 = MafftCommandline(mafft_exe, input=newC)
stdout, stderr = mafft_cline()
print(stdout) #corresponds to MafftCommandline with input file
stdout1, stderr1 = mafft_cline1()
print(stdout1) #corresponds to MafftCommandline with internal file
I get the following error messages:
ApplicationError: Non-zero return code 2 from '/usr/local/bin/mafft <_io.StringIO object at 0x10f439798>', message "/bin/sh: -c: line 0: syntax error near unexpected token `newline'"
I believe this results due to the arrows ('<' and '>') present in the file path.
ApplicationError: Non-zero return code 1 from '/usr/local/bin/mafft "_io.StringIO object at 0x10f439af8"', message '/usr/local/bin/mafft: Cannot open _io.StringIO object at 0x10f439af8.'
Attempting to remove the arrows by converting the file path to a string and indexing resulted in the above error.
Ultimately my goal is to reduce computation time. I hope to accomplish this by calling internal memory instead of writing out to a separate text file. Any advice or feedback regarding my goal is much appreciated. Thanks in advance.
I can't get MafftCommandline() to read internal files made by
io.StringIO().
This is not surprising for a couple of reasons:
As you're aware, Biopython doesn't implement Mafft, it simply
provides a convenient interface to setup a call to mafft in
/usr/local/bin. The mafft executable runs as a separate process
that does not have access to your Python program's internal memory,
including your StringIO file.
The mafft program only works with an input file, it doesn't even
allow stdin as a data source. (Though it does allow stdout as a
data sink.) So ultimately, there must be a file in the file system
for mafft to open. Thus the need for your temporary file.
Perhaps tempfile.NamedTemporaryFile() or tempfile.mkstemp() might be a reasonable compromise.

os.path.getsize returns 0 although folder has files in it (Python 3.5)

I am trying to create a program which auto-backups some folders under certain circumstances.
I try to compare the size of two folders (source and dest), source has files in it, a flac file and a subfolder with a text file whereas dest is empty.
This is the code I've written so far:
import os.path
sls = os.path.getsize('D:/autobu/source/')
dls = os.path.getsize('D:/autobu/dest/')
print(sls)
print(dls)
if sls > dls:
print('success')
else:
print('fail')
And the output is this:
0
0
fail
What have I done wrong? Have I misunderstood how getsize functions?
os.path.getsize('D:/autobu/source/') is used for getting size of a file
you can folder size you can use os.stat
src_stat = os.stat('D:/autobu/source/')
sls = src_stat.st_size

Resources