Passing base64 .docx to docx.Document results in BadZipFile exception - python-3.x

I'm writing an Azure function in Python 3.9 that needs to accept a base64 string created from a known .docx file which will serve as a template. My code will decode the base64, pass it to a BytesIO instance, and pass that to docx.Document(). However, I'm receiving an exception BadZipFile: File is not a zip file.
Below is a slimmed down version of my code. It fails on document = Document(bytesIODoc). I'm beginning to think it's an encoding/decoding issue, but I don't know nearly enough about it to get to the solution.
from docx import Document
from io import BytesIO
import base64
class ParseBody():
    def __init__(self, body):
        self.template = str(body['template'])
        self.contents = body['data']
    def _decode_template(self):
        b64Doc = base64.b64decode(self.template)
        bytesIODoc = BytesIO(b64Doc)
        document = Document(bytesIODoc)
        return document
    def run(self):
        self.document = self._decode_template()
var = {
    'template': 'Some_base64_from_docx_file',
    'data': {'some': 'data'}
}
run_stuff = ParseBody(body=var)
output = run_stuff.run()
I've also tried the following change to _decode_template and am getting the same exception. This is running base64.decodebytes() on the b64Doc object and passing that to BytesIO instead of directly passing b64Doc.
def _decode_template(self):
    b64Doc = base64.b64decode(self.template)
    bytesDoc = base64.decodebytes(b64Doc)
    bytesIODoc = BytesIO(bytesDoc)
I have successfully tried the following on the same exact .docx file to be sure that this is possible. I can open the document in Python, base64 encode it, decode into bytes, pass that to a BytesIO instance, and pass that to docx.Document successfully.
file = r'WordTemplate.docx'
doc = open(file, 'rb').read()
b64Doc = base64.b64encode(doc)
bytesDoc = base64.decodebytes(b64Doc)
bytesIODoc= BytesIO(bytesDoc)
newDoc = Document(bytesIODoc)
I've tried countless other solutions to no avail, and they have only led me further away from a resolution. This is the closest I've gotten. Any help is greatly appreciated!

The answer to the question "How to generate a DOCX in Python and save it in memory?" helped me resolve my own issue.
All I had to do was change document = Document(bytesIODoc) to the following:
document = Document()
document.save(bytesIODoc)
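For anyone else debugging the same exception: a .docx file is a ZIP archive, so BadZipFile means the bytes that reached Document() were not a ZIP. That usually points at the payload (double encoding, a prefix, truncation) rather than at python-docx. Below is a minimal sketch of a check to run before calling Document(). The helper name decode_docx_payload and the data-URI handling are my own assumptions, not something from the original code.

```python
import base64
import binascii
import io
import zipfile


def decode_docx_payload(payload: str) -> io.BytesIO:
    """Decode a base64 string and verify that it holds a ZIP archive."""
    # Assumption: some clients prepend a data-URI header such as
    # "data:...;base64,". Strip it if present.
    if payload.startswith("data:") and "," in payload:
        payload = payload.split(",", 1)[1]
    # Remove line breaks and other whitespace that MIME-style base64 may contain.
    payload = "".join(payload.split())
    try:
        raw = base64.b64decode(payload, validate=True)
    except binascii.Error as exc:
        raise ValueError(f"payload is not valid base64: {exc}") from exc
    buffer = io.BytesIO(raw)
    if not zipfile.is_zipfile(buffer):
        # A real .docx starts with b"PK"; anything else hints at the cause.
        raise ValueError(
            f"decoded bytes are not a ZIP archive; they start with {raw[:4]!r}"
        )
    buffer.seek(0)  # rewind so the caller can read from the start
    return buffer
```

If this raises, printing the first few decoded bytes usually shows what went wrong. For example, a payload that was base64 encoded twice decodes to more base64 text instead of bytes starting with b"PK".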

Related

How to get a download link which requires checkboxes checking in additional dialog box

I want to download the last publicly available file from https://sam.gov/data-services/Exclusions/Public%20V2?privacy=Public
When I download the file manually, the real download link looks like this:
https://falextracts.s3.amazonaws.com/Exclusions/Public%20V2/SAM_Exclusions_Public_Extract_V2_22150.ZIP?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20220530T143743Z&X-Amz-SignedHeaders=host&X-Amz-Expires=2699&X-Amz-Credential=AKIAY3LPYEEXWOQWHCIY%2F20220530%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=3eca59f75a4e1f6aa59fc810da8f391f1ebfd8ca5a804d56b79c3eb9c4d82e32
My function retrieves only the initial link, which points to the real one:
import json
import requests
from operator import itemgetter
files_url = 'https://sam.gov/api/prod/fileextractservices/v1/api/listfiles?random=1653676394983&domain=Exclusions/Public%20V2&privacy=Public'
def get_file():
    response = requests.get(files_url, stream=True)
    links_resp = json.loads(response.text)
    links_dicts = [d for d in links_resp['_embedded']['customS3ObjectSummaryList'] if d['displayKey'].count('SAM_Exclus')]
    sorted_links = sorted(links_dicts, key=itemgetter('dateModified'), reverse=True)
    return sorted_links[0]['_links']['self']['href']
get_file()
Result:
'https://s3.amazonaws.com/falextracts/Exclusions/Public V2/SAM_Exclusions_Public_Extract_V2_22150.ZIP'
But when I follow that link, I get Access Denied.
I would appreciate any hints on how to get the real download link.
I've edited your code to make it easier to follow. The requests library can parse the JSON response itself.
Imports that are not at the top of the file also make the code harder to read.
import requests as req
from operator import itemgetter
files_url = "https://sam.gov/api/prod/fileextractservices/v1/api/listfiles?random=1653676394983&domain=Exclusions/Public%20V2&privacy=Public"
down_url = "https://sam.gov/api/prod/fileextractservices/v1/api/download/Exclusions/Public%20V2/{}?privacy=Public"
def get_file():
    # requests parses the JSON body for us
    response = req.get(files_url, stream=True).json()
    links_dicts = [d for d in response["_embedded"]["customS3ObjectSummaryList"]]
    sorted_links = sorted(links_dicts, key=itemgetter('dateModified'), reverse=True)
    # the newest entry's displayKey is the file name used by the download endpoint
    key = sorted_links[0]['displayKey']
    down = req.get(down_url.format(key))
    if down.status_code != 200:
        return False
    print(key)
    with open(key, 'wb') as f:
        f.write(down.content)
    return True
get_file()
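The selection step can also be pulled out into a small function that is testable without hitting the network. This is only a sketch: the field names displayKey and dateModified and the download endpoint come from the code above, while the function name latest_download_url, the prefix filter, and the assumption that dateModified values sort chronologically are mine.

```python
from operator import itemgetter
from urllib.parse import quote

# Download endpoint taken from the answer above.
DOWNLOAD_URL = (
    "https://sam.gov/api/prod/fileextractservices/v1/api/download/"
    "Exclusions/Public%20V2/{}?privacy=Public"
)


def latest_download_url(listing: dict, prefix: str = "SAM_Exclus") -> str:
    """Return the download URL for the most recently modified extract."""
    entries = [
        entry
        for entry in listing["_embedded"]["customS3ObjectSummaryList"]
        if entry["displayKey"].startswith(prefix)
    ]
    if not entries:
        raise LookupError(f"no entries with prefix {prefix!r} in listing")
    # Assumption: dateModified values compare in chronological order.
    newest = max(entries, key=itemgetter("dateModified"))
    # Percent-encode the key in case it contains spaces or other unsafe characters.
    return DOWNLOAD_URL.format(quote(newest["displayKey"]))
```

Pass it the parsed JSON from the listfiles call, then fetch the returned URL with requests.get as in the code above.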

How to send a pdf object from Databricks to Sharepoint?

INTRO: I have a Databricks notebook where I create a pdf file based on some data.
In order to generate the file I am using the fpdf library:
from fpdf import FPDF, HTMLMixin
Thanks to the library I generate a pdf file which is of type: <__main__.HTML2PDF at 0x7f3b73720fd0>.
My goal now is to send this pdf to a sharepoint folder. To do so I am using the following lines of code:
from office365.runtime.auth.user_credential import UserCredential
from office365.sharepoint.client_context import ClientContext
from office365.sharepoint.files.file import File
# paths
sharepoint_site = "MySharepointSite"
sharepoint_folder = "Shared Documents/General/PDFs/"
sharepoint_user = "aaa@bbb.onmicrosoft.com"
sharepoint_user_pw = "xyz"
sharepoint_folder = sharepoint_folder.strip("/")
# set environment variables
SITE_URL = f"https://sharepoint.com/sites/{sharepoint_site}"
RELATIVE_URL = f"/sites/{sharepoint_site}/{sharepoint_folder}"
# connect to sharepoint
ctx = ClientContext(SITE_URL).with_credentials(UserCredential(sharepoint_user, sharepoint_user_pw))
web = ctx.web
ctx.load(web).execute_query()
# Generate PDF
pdf = generate_pdf(ctx, row['ServerRelativeUrl'])
# HERE IS MY ISSUE!
ctx.web.get_folder_by_server_relative_url(sharepoint_folder).upload_file('test.pdf', pdf).execute_query()
PROBLEM: When I reach the last line I get the following error message:
TypeError: Object of type HTML2PDF is not JSON serializable
I believe that pdf objects cannot be serialized to JSON, so I am stuck and do not know how to send the PDF to SharePoint.
QUESTION: Could you suggest a smart and elegant way to achieve my goal, i.e. sending the pdf file to SharePoint?
I was able to solve this problem by saving the pdf as a string, then encoding it and finally pushing it to the sharepoint:
pdf_binary = pdf.output(dest='S').encode("latin1")
ctx.web.get_folder_by_server_relative_url(sharepoint_folder).upload_file("test.pdf", pdf_binary).execute_query()
Note: If it does not work, try to change the encoding type.
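A note on why "latin1" is the usual choice here. As far as I know, older PyFPDF releases return a str from output(dest='S') in which every character stands for exactly one byte, and Latin-1 maps code points 0 to 255 one-to-one onto bytes. Newer fpdf2 releases reportedly return a bytearray instead. Treat both statements as assumptions and check your installed version. The helper below is a hypothetical sketch that handles either case:

```python
def to_pdf_bytes(output) -> bytes:
    """Normalize the return value of FPDF.output(dest='S') to bytes.

    Hypothetical helper: accepts either a str holding one character per
    byte or an object that is already bytes-like.
    """
    if isinstance(output, (bytes, bytearray)):
        return bytes(output)
    if isinstance(output, str):
        # Latin-1 maps code points 0..255 one-to-one onto byte values,
        # so this recovers the raw PDF bytes without altering them.
        return output.encode("latin-1")
    raise TypeError(f"unsupported PDF output type: {type(output).__name__}")
```

With that in place, the upload call would receive to_pdf_bytes(pdf.output(dest='S')) instead of the pdf object itself.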

Saving (.svg) images using Scrapy

I'm using Scrapy and I want to save some of the .svg images from a webpage locally on my computer. The URLs for these images have the structure '__.com/svg/4/8/3/1425.svg' (each is a full working URL, https included).
I've defined the item in my items.py file:
class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
I've added the following to my settings:
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = '../Data/Silks'
MEDIA_ALLOW_REDIRECTS = True
In the main parse function I'm calling:
imageItem = ImageItem()
imageItem['image_urls'] = [url]
yield imageItem
But it doesn't save the images. I've followed the documentation and tried numerous things but keep getting the following error:
StopIteration: <200 https://www.________.com/svg/4/8/3/1425.svg>
During handling of the above exception, another exception occurred:
......
......
PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x1139233b0>
Am I missing something? Can anyone help? I am fully stumped.
Gallaecio was right! Scrapy was having an issue with the .svg file type. I changed the ImagesPipeline to the FilesPipeline and it works!
For anyone stuck the documentation is here
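For reference, here is roughly what the switch looks like. It is a sketch based on the settings from the question: FilesPipeline reads its URLs from a file_urls field and stores into FILES_STORE, instead of image_urls and IMAGES_STORE. The helper build_file_item is a made-up name for illustration.

```python
# settings.py (sketch): swap the images pipeline for the files pipeline
ITEM_PIPELINES = {
    "scrapy.pipelines.files.FilesPipeline": 1,
}
FILES_STORE = "../Data/Silks"  # same folder the question used for IMAGES_STORE
MEDIA_ALLOW_REDIRECTS = True


def build_file_item(url: str) -> dict:
    """Build the item to yield from parse(); Scrapy accepts plain dicts.

    FilesPipeline downloads every URL listed under "file_urls" and
    records the results under "files".
    """
    return {"file_urls": [url]}
```

In the spider you would then yield build_file_item(url) where the question yielded imageItem.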
Python Imaging Library (PIL), which is used by the ImagesPipeline, does not support vector images.
If you still want to benefit from the ImagesPipeline capabilities and not switch to the more general FilesPipeline, you can do something along these lines:
import scrapy
from scrapy.pipelines.images import ImagesPipeline
from svglib.svglib import svg2rlg
from reportlab.graphics import renderPM
from io import BytesIO
class SvgCompatibleImagesPipeline(ImagesPipeline):
    def get_images(self, response, request, info, *, item=None):
        """
        Add processing of SVG images to the standard images pipeline
        """
        if isinstance(response, scrapy.http.TextResponse) and response.text.startswith('<svg'):
            # render the SVG to PNG in memory so that PIL can open it
            b = BytesIO()
            renderPM.drawToFile(svg2rlg(BytesIO(response.body)), b, fmt='PNG')
            res = response.replace(body=b.getvalue())
        else:
            res = response
        return super().get_images(res, request, info, item=item)
This replaces the SVG image in the response body with a PNG version of it, which the regular ImagesPipeline can then process.

SoapUI to get attachment from response in Groovy

I tried to use the following code to get an attachment from the response as text in Groovy.
def testStep = testRunner.testCase.getTestStepByName("getData")
def response = testStep.testRequest.response
def ins = response.attachments[0].inputStream
log.info(ins);
The attachment also contains some binary data, so it is not fully human-readable, but all I got in the output was the following:
java.io.ByteArrayInputStream@5eca74
It is easy to simply encode it to base64 and store it as a property value.
def ins = response.attachments[0].inputStream
String encoded = ins.bytes.encodeBase64().toString()

Opening Image from Website

I was trying to make a simple program to pull an image from the website xkcd.com, and I'm running into a problem where it reports that a 'list' object has no attribute 'show'. Does anyone know how to fix this?
import requests
from lxml import html
r = requests.get("http://imgs.xkcd.com/comics/self_driving_issues.png")
tree = html.fromstring(r.content)
final = tree.xpath("""//*[@id="comic"]/img""")
final.show()
Your call to requests.get is retrieving the actual image, i.e. the raw bytes of the PNG. There is no HTML to parse or to search with XPath.
Note here, the content is bytes:
r = requests.get("http://imgs.xkcd.com/comics/self_driving_issues.png")
print(r.content)
b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x02\xe4\x00\x00\x01#\x08\x03\x00\x00\x00M\x7f\xe4\xc6\x00\x00\x00\x04gAMA\x00\x00\xb1\x8f
Here you see that you can save the results directly to disk.
import requests
r = requests.get("http://imgs.xkcd.com/comics/self_driving_issues.png")
with open("myimage.png", "wb") as f:
    f.write(r.content)
[Edit] And to show the image (you will need to install Pillow):
import requests
from PIL import Image
import io
r = requests.get("http://imgs.xkcd.com/comics/self_driving_issues.png")
img = Image.open(io.BytesIO(r.content))
img.show()
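If you are not sure whether a URL returns a page or the image itself, you can look at the body before deciding how to handle it. The sketch below checks for the 8-byte PNG signature, which is the same prefix visible in the printed bytes above. The function name classify_body and the fallback heuristics are only illustrative assumptions.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # fixed 8-byte header of every PNG file


def classify_body(body: bytes, content_type: str = "") -> str:
    """Guess whether a response body is an image or an HTML page."""
    if body.startswith(PNG_SIGNATURE) or content_type.startswith("image/"):
        return "image"
    # Rough heuristic: HTML is served as text/html or starts with a tag.
    if "html" in content_type or body.lstrip()[:1] == b"<":
        return "html"
    return "unknown"
```

With requests you would call classify_body(r.content, r.headers.get("Content-Type", "")) and only hand the body to lxml when the result is "html".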
