tmp file in Google Cloud Functions for Python - python-3.x

Python runs like a charm on Google Cloud Functions, except for the tmp files. Here's my simplified code:
FILE_PATH = "{}/report.pdf".format(tempfile.gettempdir())
pdf.output(FILE_PATH)
...
with open(FILE_PATH, 'rb') as f:
    data = f.read()
encoded = base64.b64encode(data).decode()
attachment = Attachment()
attachment.content = str(encoded)
attachment.type = "application/pdf"
attachment.filename = "report"
attachment.disposition = "attachment"
attachment.content_id = "Report"
mail = Mail(from_email, subject, to_email, content)
mail.add_attachment(attachment)
The error is: [Errno 2] No such file or directory: '/tmp/report.pdf'
It works perfectly fine locally. Unfortunately, the docs only show the Node version. Workarounds for sending that PDF would also be fine.

It is a little difficult to find official Google documentation for writing to the temporary folder. In my case, I needed to write a file to a temporary directory and upload it to Google Cloud Storage using a Cloud Function.
Note that anything written to the temporary directory of a Cloud Function consumes the memory provisioned for the function, so after creating the file and using it, it is recommended to remove it from the temporary directory. I used this snippet to write a CSV into the temp dir of a Cloud Function (Python 3.7):
import pandas as pd
import os
import tempfile
from werkzeug.utils import secure_filename

def get_file_path(filename):
    file_name = secure_filename(filename)
    return os.path.join(tempfile.gettempdir(), file_name)

def write_temp_dir():
    data = [['tom', 10], ['nick', 15]]
    df = pd.DataFrame(data, columns=['Name', 'Age'])
    name = 'example.csv'
    path_name = get_file_path(name)
    df.to_csv(path_name, index=False)
    os.remove(path_name)
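The same answer mentions uploading the file from the temp directory to Cloud Storage. A minimal sketch of that step, assuming the google-cloud-storage client library and a placeholder bucket name (not from the original post):
import os
import tempfile
from google.cloud import storage

def upload_temp_file(local_name: str, bucket_name: str, blob_name: str):
    # Upload a file from the function's temp dir to Cloud Storage, then delete it
    # to free the memory-backed /tmp space.
    path = os.path.join(tempfile.gettempdir(), local_name)
    client = storage.Client()
    client.bucket(bucket_name).blob(blob_name).upload_from_filename(path)
    os.remove(path)

# e.g. upload_temp_file('example.csv', 'my-bucket', 'reports/example.csv')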

Related

Download PDF file (not restricted) from Google Drive through URL

import os
import requests

def download_file(download_url: str, filename: str):
    """
    Download a resume PDF file from storage.

    :param download_url: URL of the resume to be downloaded
    :type download_url: str
    :param filename: name and location of the file to be stored
    :type filename: str
    :return: None
    :rtype: None
    """
    file_request = requests.get(download_url)
    with open(f'{filename}.pdf', 'wb+') as file:
        file.write(file_request.content)

cand_id = "101"
time_current = "801"
file_location = f"{cand_id}_{time_current}"
download_file("https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf", file_location)

cand_id = "201"
time_current = "901"
file_location = f"{cand_id}_{time_current}"
download_file("https://drive.google.com/file/d/0B1HXnM1lBuoqMzVhZjcwNTAtZWI5OS00ZDg3LWEyMzktNzZmYWY2Y2NhNWQx/view?hl=en&resourcekey=0-5DqnTtXPFvySMiWstuAYdA", file_location)
----------
The first file works perfectly fine (i.e. 101_801.pdf), but the second one cannot be opened in any PDF reader (i.e. 201_901.pdf; error: "We can't open this file").
What I understand is that I'm not properly reading and writing the file from Drive, even though it is open to everyone. How can I read and write that file? I could use the Google Drive API, but is there a better solution without using it?
I tried out the code and couldn't open the PDF file either. I suggest trying out the gdown package. It is easy to use, and you can download even large files from Google Drive. I used it in my class to download .sql db files (+-20 GB) for my assignments.
If you want to build more on this code, then you should probably check out the Drive API. It is a well-documented, fast API.
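A minimal sketch of the gdown suggestion, assuming the package is installed (pip install gdown); the file ID is taken from the share URL in the question and the output filename is illustrative:
import gdown

# Build the classic uc?id= download URL from the Drive file ID.
file_id = "0B1HXnM1lBuoqMzVhZjcwNTAtZWI5OS00ZDg3LWEyMzktNzZmYWY2Y2NhNWQx"
gdown.download(f"https://drive.google.com/uc?id={file_id}", "201_901.pdf", quiet=False)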
I was able to find a solution for it using wget in Python. Answering it here so that it may help someone in the future.
import os
import re
import wget
from datetime import datetime

def download_candidate_resume(email: str, resume_url: str):
    """
    Download a resume from Google Drive and store it on the local system.

    :param email: candidate email
    :type email: str
    :param resume_url: URL of the resume on Google Drive
    :type resume_url: str
    """
    file_extension = "pdf"
    current_time = datetime.now()
    file_name = f'{email}_{int(current_time.timestamp())}.{file_extension}'
    temp_file_path = os.path.join(os.getcwd(), file_name)
    downloadable_resume_url = re.sub(
        r"https://drive\.google\.com/file/d/(.*?)/.*?\?usp=sharing",
        r"https://drive.google.com/uc?export=download&id=\1",
        resume_url,
    )
    wget.download(downloadable_resume_url, out=temp_file_path)
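For example, a hypothetical call (the email and sharing URL are placeholders, and the URL must end in ?usp=sharing for the regex above to rewrite it):
download_candidate_resume(
    "jane.doe@example.com",
    "https://drive.google.com/file/d/FILE_ID/view?usp=sharing",  # FILE_ID is a placeholder
)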

File streaming in python [duplicate]

I need to create a CSV and upload it to an S3 bucket. Since I'm creating the file on the fly, it would be better if I could write it directly to the S3 bucket as it is being created, rather than writing the whole file locally and then uploading it at the end.
Is there a way to do this? My project is in Python and I'm fairly new to the language. Here is what I tried so far:
import csv
import io
import boto
from boto.s3.key import Key

conn = boto.connect_s3()
bucket = conn.get_bucket('dev-vs')
k = Key(bucket)
k.key = 'foo/foobar'

fieldnames = ['first_name', 'last_name']
writer = csv.DictWriter(io.StringIO(), fieldnames=fieldnames)
k.set_contents_from_stream(writer.writeheader())
I received this error: BotoClientError: s3 does not support chunked transfer
UPDATE: I found a way to write directly to S3, but I can't find a way to clear the buffer without actually deleting the lines I already wrote. So, for example:
conn = boto.connect_s3()
bucket = conn.get_bucket('dev-vs')
k = Key(bucket)
k.key = 'foo/foobar'

testDict = [{"fieldA": "8",
             "fieldB": None,
             "fieldC": "888888888888"},
            {"fieldA": "9",
             "fieldB": None,
             "fieldC": "99999999999"}]

f = io.StringIO()
fieldnames = ['fieldA', 'fieldB', 'fieldC']
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()
k.set_contents_from_string(f.getvalue())

for row in testDict:
    writer.writerow(row)
    k.set_contents_from_string(f.getvalue())

f.close()
This writes 3 lines to the file; however, I'm unable to release memory in order to write a big file. If I add:
f.seek(0)
f.truncate(0)
to the loop, then only the last line of the file is written. Is there any way to release resources without deleting lines from the file?
I did find a solution to my question, which I will post here in case anyone else is interested. I decided to do this in parts of a multipart upload, since you can't stream directly to S3. There is also a package that turns your streaming writes into a multipart upload, which I used: Smart Open.
import smart_open
import io
import csv

testDict = [{"fieldA": "8",
             "fieldB": None,
             "fieldC": "888888888888"},
            {"fieldA": "9",
             "fieldB": None,
             "fieldC": "99999999999"}]

fieldnames = ['fieldA', 'fieldB', 'fieldC']
f = io.StringIO()

with smart_open.smart_open('s3://dev-test/bar/foo.csv', 'wb') as fout:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    fout.write(f.getvalue())
    for row in testDict:
        f.seek(0)
        f.truncate(0)
        writer.writerow(row)
        fout.write(f.getvalue())

f.close()
Here is a complete example using boto3
import boto3
import io

session = boto3.Session(
    aws_access_key_id="...",
    aws_secret_access_key="..."
)
s3 = session.resource("s3")

buff = io.BytesIO()
buff.write("test1\n".encode())
buff.write("test2\n".encode())
s3.Object(bucket, keypath).put(Body=buff.getvalue())
We were trying to upload file contents to s3 when it came through as an InMemoryUploadedFile object in a Django request. We ended up doing the following because we didn't want to save the file locally. Hope it helps:
@action(detail=False, methods=['post'])
def upload_document(self, request):
    document = request.data.get('image').file
    s3.upload_fileobj(document, BUCKET_NAME,
                      DESIRED_NAME_OF_FILE_IN_S3,
                      ExtraArgs={"ServerSideEncryption": "aws:kms"})
According to the docs it's possible:
s3.Object('mybucket', 'hello.txt').put(Body=open('/tmp/hello.txt', 'rb'))
so we can use StringIO in the ordinary way.
Update: the smart_open lib from @inquiring minds' answer is a better solution.
There's an interesting code solution mentioned in a GitHub smart_open issue (#82) that I've been meaning to try out. Copy-pasting here for posterity... looks like boto3 is required:
import csv
import gzip
import io
import boto3

# my_data, bucket_name and key are assumed to be defined elsewhere
csv_data = io.StringIO()  # csv needs a text stream in Python 3
writer = csv.writer(csv_data)
writer.writerows(my_data)

gz_stream = io.BytesIO()
with gzip.GzipFile(fileobj=gz_stream, mode="w") as gz:
    gz.write(csv_data.getvalue().encode())
gz_stream.seek(0)

s3 = boto3.client('s3')
s3.upload_fileobj(gz_stream, bucket_name, key)
This specific example is streaming to a compressed S3 key/file, but it seems like the general approach -- using the boto3 S3 client's upload_fileobj() method in conjunction with a target stream, not a file -- should work.
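For reference, a minimal sketch of that general approach without the gzip layer (the bucket and key names are placeholders):
import io
import boto3

s3 = boto3.client("s3")
buf = io.BytesIO(b"first_name,last_name\nJane,Doe\n")
# upload_fileobj streams a file-like object to S3 (using multipart uploads under the hood for large objects).
s3.upload_fileobj(buf, "my-bucket", "exports/people.csv")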
There's a well-supported library for doing just this:
pip install s3fs
s3fs is really trivial to use:
import s3fs

s3 = s3fs.S3FileSystem(anon=False)
with s3.open('mybucket/new-file', 'wb') as f:
    f.write(2*2**20 * b'a')
    f.write(2*2**20 * b'a')
Incidentally, there's also something built into boto3 (backed by the AWS API) called MultipartUpload.
This isn't factored as a Python stream, which might be an advantage for some people. Instead, you can start an upload and send parts one at a time.
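A rough sketch of that low-level flow with the boto3 client (the bucket and key are placeholders; note that every part except the last must be at least 5 MB):
import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "exports/big-file.csv"

# Start the upload, send parts one at a time, then complete it.
mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts = []
for number, chunk in enumerate([b"a" * 5 * 1024 * 1024, b"tail\n"], start=1):
    resp = s3.upload_part(Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
                          PartNumber=number, Body=chunk)
    parts.append({"PartNumber": number, "ETag": resp["ETag"]})
s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
                             MultipartUpload={"Parts": parts})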
To write a string to an S3 object, use:
s3.Object('my_bucket', 'my_file.txt').put(Body='Hello there')
So convert the stream to a string and you're there.

Saving data in Selenium

I am trying to save output data after running a successful script in Python using Selenium, but I am not able to save the result at the end of my run/script. My code runs fine; the only problem is that I am not able to save the output to a file, which could be .json, .csv, or text. I need serious help on this one.
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import csv
import requests

# saving data in bangkok_vendor.text
def copy_info():
    with open('bangkok_vendor.text', 'a') as wt:
        for x in script3:
            wt.write(x)
        wt.close()
    return

contents = []
filename = 'link_business_filter.csv'

with open(filename, 'rt') as f:
    data = csv.reader(f)
    for row in data:
        links = row[0]
        contents.append(links)

for link in contents:
    url_html = requests.get(link)
    print(link)
    browser = webdriver.Chrome('chromedriver')
    open = browser.get(link)
    source = browser.page_source
    data = bs(source, "html.parser")
    body = data.find('body')
    script = body
    x_path = '//*[@id="react-root"]/section/main/div'
    script2 = browser.find_element_by_xpath(x_path)
    script3 = script2.text
    #script2.send_keys(keys.COMMAND + 't')
    browser.close()
    print(script3)
    copy_info()
Did you try using csv.writer for CSV files? Please check out the following link; hope it helps:
Save results to csv file with Python
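A minimal sketch of that suggestion, assuming the scraped text is collected into a list of rows first (the filename and field names are illustrative):
import csv

# Hypothetical rows collected during the scrape: one (link, text) pair per page.
rows = [("https://example.com/vendor1", "scraped text 1"),
        ("https://example.com/vendor2", "scraped text 2")]

with open("bangkok_vendor.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["link", "text"])  # header row
    writer.writerows(rows)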

Is there a way for the Google Cloud Storage client to point to a 'file object' on Cloud Storage to then be used by lxml?

With the Google Cloud Storage client I could not read a Storage file as an object, as required by lxml.etree.parse. I could read the Cloud Storage file as a blob, but that did not work well with lxml.
I am trying to convert XML files using an XSLT file. I want a Google Cloud Function (in Python 3.7) that will be triggered as soon as the XML file is uploaded to Cloud Storage. I have tried this code with the files stored locally and it works; however, I need a way to get this working with Cloud Storage as well.
----Using local files (Working Code):
import lxml.etree as ET
filename = "C:\\GCP\\Files\\Profile.xml"
xsltfile = "C:\\GCP\\Files\\Transform.xslt"
outpath = "C:\\GCP\\Files\\Output\\Output.json"
dom = ET.parse(filename)
xslt = ET.parse(xsltfile)
transform = ET.XSLT(xslt)
newdom = transform(dom)
xdom = str(newdom)
text_file = open(outpath, "w")
text_file.write(xdom)
text_file.close()
----Using Cloud Storage (not working):
from google.cloud import storage
import lxml.etree as ET
client = storage.Client()
bucket = client.get_bucket('customerfile02')
xmlblob = bucket.blob('testprofile.xml')
inputxml=xmlblob.download_as_string()
xmldom = ET.parse(inputxml)
Error: failed to load external entity
The error is expected, as I am passing an XML string instead of a file object, which is what ET.parse expects.
How can I pass a file object from Cloud Storage to make this work?
The lxml.etree.parse() function interprets a string argument as a filename (or URL). If you want to pass it file contents instead, you need to wrap them in a StringIO or BytesIO (in this case, the latter):
from io import BytesIO
from google.cloud import storage
import lxml.etree as ET
client = storage.Client()
bucket = client.get_bucket('customerfile02')
xmlblob = bucket.blob('testprofile.xml')
inputxml = xmlblob.download_as_string()
xmldom = ET.parse(BytesIO(inputxml))
See the lxml documentation here: https://lxml.de/parsing.html.
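To complete the question's XSLT flow against Cloud Storage, here is a sketch under the same assumptions: the customerfile02 bucket and testprofile.xml from the question, plus placeholder names for the XSLT and output blobs, since the question only shows local paths for those:
from io import BytesIO
from google.cloud import storage
import lxml.etree as ET

client = storage.Client()
bucket = client.get_bucket('customerfile02')

# Parse the input XML and the XSLT straight from Cloud Storage.
dom = ET.parse(BytesIO(bucket.blob('testprofile.xml').download_as_string()))
xslt = ET.parse(BytesIO(bucket.blob('transform.xslt').download_as_string()))  # placeholder blob name

# Apply the transformation and write the result back to the bucket.
newdom = ET.XSLT(xslt)(dom)
bucket.blob('output/output.json').upload_from_string(str(newdom))  # placeholder blob name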

Restructure data loaded from a Dropbox file in Python

I am trying to download data from a CSV file stored in a Dropbox folder. So far I do it like this:
import dropbox
#Get access to my dropbox folder
dbx = dropbox.Dropbox('SOME_ACCESS_TOKEN')
dbx.users_get_current_account()
#Download file
metadata, res = dbx.files_download('/Test.csv')
#Get the file content
data=res.content
print(data)
data is of this form: b'1,2,3,4,5\r\nA,B,C,D,E\r\n1,2,3,4,5\r\nA,B,C,D,E\r\n1,2,3,4,5\r\nA,B,C,D,E\r\n'
Is there an easy way to restructure this into a list of lists?
The solution to the above-mentioned problem is:
import dropbox
#Connect to dropbox folder
dbx = dropbox.Dropbox('SOME_ACCESS_TOKEN')
dbx.users_get_current_account()
#Get metadata
metadata, res = dbx.files_download('/Test.txt')
#Get and decode data
data=res.content.decode('utf-8')
#Restructure data
lines = data.split('\r\n')
lines.pop()
print(lines)
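Note that this yields a list of row strings rather than a list of lists. A small sketch of going one step further with the csv module, reusing the same res.content as above:
import csv
import io

text = res.content.decode('utf-8')
rows = list(csv.reader(io.StringIO(text)))
# rows -> [['1', '2', '3', '4', '5'], ['A', 'B', 'C', 'D', 'E'], ...]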
