AWS Object Lambda using PyPDF2 to send back encrypted PDF

AWS Object Lambda using PyPDF2 to send back encrypted PDF - object

My AWS OBJECT Lambda Function gets an unencrypted PDF via the Object Lambda inputS3Url. I want to use PyPDF2 to convert this to encrypted PDF, and send back via s3.write_get_object_response. How do I do this?
s3_url = object_get_context["inputS3Url"]
url=s3_url
response = requests.get(url)
my_raw_data = response.content
[SAVE ENCRYPTED my_raw_data TO VARIABLE so it can returned via S3.write_get_object_response - HOW?]
s3 = boto3.client('s3')
s3.write_get_object_response(
Body= [WHAT WOULD GO HERE?]
RequestRoute=request_route,
RequestToken=request_token)

The docs got you! Encrypting PDFs and Streaming Data is what you need (at least if I got you right; let me know if you want to achieve something else than getting a password-protected PDF on S3)
Not tested, but something like this
from PyPDF2 import PdfReader, PdfWriter
from io import BytesIO
reader = PdfReader(BytesIO(my_raw_data))
writer = PdfWriter()
# Add all pages to the writer
for page in reader.pages:
writer.add_page(page)
# Add a password to the new PDF
writer.encrypt("my-secret-password")
# Save the new PDF to a file
with BytesIO() as bytes_stream:
writer.write(bytes_stream)
bytes_stream.seek(0)
s3 = boto3.client('s3')
s3.write_get_object_response(
Body=bytes_stream,
RequestRoute=request_route,
RequestToken=request_token
)

Related

How do I open images stored in GCP in Google datalab?

I have been trying to open a image that I stored in the GCP bucket in my datalab notebook. When I use Image.open() it says like "No such file or directory: 'images/00001.jpeg'"
My code is:
nama_bucket = storage.Bucket("sample_bucket")
for obj in nama_bucket.objects():
Image.open(obj.key)
I just need to open the images stored in the bucket and view it. Thanks for the help!

I was able to reproduce the issue and get the same error as you (No such file or directory).
I will describe the workaround I used to solve it. However,there are few issues that I can see in the code snippet provided:
Class IPython.display.Image has no method 'open'.
You will need to wrap the Image constructor in a display() method.
With Storage APIs for Google Cloud Datalab, what resolved the issue for me was using the url parameter instead of the filename.
Here is the solution that worked for me:
import google.datalab.storage as storage
from IPython.display import Image
bucket_name = '<my-bucket-name>'
sample_bucket = storage.Bucket(bucket_name)
for obj in sample_bucket.objects():
display(Image(url='https://storage.googleapis.com/{}/{}'.format(bucket_name, obj.key)))
Let me know if it helps!
EDIT 1:
As you mentioned that you're using the PIL and would like your images to be handled by it, here's the way to achieve that (I have tested it and it worked well for me):
import google.datalab.storage as storage
from PIL import Image
import requests
from io import BytesIO
bucket_name = '<my-bucket-name>'
sample_bucket = storage.Bucket(bucket_name)
for obj in sample_bucket.objects():
url='https://storage.googleapis.com/{}/{}'.format(bucket_name, obj.key)
response = requests.get(url)
img = Image.open(BytesIO(response.content))
print("Filename: {}\nFormat: {}\nSize: {}\nMode: {}".format(obj.key, img.format, img.size, img.mode))
display(img)
Notice that this way you will not need to use IPython.display.Image at all.
EDIT 2:
Indeed, the error cannot identify image file <_io.BytesIO object at 0x7f8f33bdbdb0> is appearing because you have a directory in your bucket. In order to solve this issue it's important to understand how Google Cloud Storage sub-directories work.
Here's how I organized the files in my bucket to replicate your situation:
my-bucket/
img/
test-file-1.png
test-file-2.png
test-file-3.jpeg
test-file-4.png
Even though gsutil achieves the hierarchical file tree illusion by applying a variety of rules, to try to make naming work the way users would expect, in fact, the test-files 1-3 just happen to have '/'s in their names while there's no actual 'img' directory.
You can still still list all images from your bucket. With the structure I mentioned above it can be achieved, for example, by checking the file's extension:
import google.datalab.storage as storage
from PIL import Image
import requests
from io import BytesIO
bucket_name = '<my-bucket-name>'
sample_bucket = storage.Bucket(bucket_name)
for obj in sample_bucket.objects():
# Check that the object is an image
if obj.key[-3:].lower() in ('jpg','png') or obj.key[-4:].lower() in ('jpeg'):
url='https://storage.googleapis.com/{}/{}'.format(bucket_name, obj.key)
response = requests.get(url)
img = Image.open(BytesIO(response.content))
print("Filename: {}\nFormat: {}\nSize: {}\nMode: {}".format(obj.key, img.format, img.size, img.mode))
display(img)
If you need to get only the images "stored in a particular sub-directory" of your bucket, you will also need to check the files by name:
import google.datalab.storage as storage
from PIL import Image
import requests
from io import BytesIO
bucket_name = '<my-bucket-name>'
folder = '<name-of-the-directory>'
sample_bucket = storage.Bucket(bucket_name)
for obj in sample_bucket.objects():
# Check that the object is an image AND that it has the required sub-directory in its name
if (obj.key[-3:].lower() in ('jpg','png') or obj.key[-4:].lower() in ('jpeg')) and folder in obj.key:
url='https://storage.googleapis.com/{}/{}'.format(bucket_name, obj.key)
response = requests.get(url)
img = Image.open(BytesIO(response.content))
print("Filename: {}\nFormat: {}\nSize: {}\nMode: {}".format(obj.key, img.format, img.size, img.mode))
display(img)

File streaming in python [duplicate]

I need to create a CSV and upload it to an S3 bucket. Since I'm creating the file on the fly, it would be better if I could write it directly to S3 bucket as it is being created rather than writing the whole file locally, and then uploading the file at the end.
Is there a way to do this? My project is in Python and I'm fairly new to the language. Here is what I tried so far:
import csv
import csv
import io
import boto
from boto.s3.key import Key
conn = boto.connect_s3()
bucket = conn.get_bucket('dev-vs')
k = Key(bucket)
k.key = 'foo/foobar'
fieldnames = ['first_name', 'last_name']
writer = csv.DictWriter(io.StringIO(), fieldnames=fieldnames)
k.set_contents_from_stream(writer.writeheader())
I received this error: BotoClientError: s3 does not support chunked transfer
UPDATE: I found a way to write directly to S3, but I can't find a way to clear the buffer without actually deleting the lines I already wrote. So, for example:
conn = boto.connect_s3()
bucket = conn.get_bucket('dev-vs')
k = Key(bucket)
k.key = 'foo/foobar'
testDict = [{
"fieldA": "8",
"fieldB": None,
"fieldC": "888888888888"},
{
"fieldA": "9",
"fieldB": None,
"fieldC": "99999999999"}]
f = io.StringIO()
fieldnames = ['fieldA', 'fieldB', 'fieldC']
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()
k.set_contents_from_string(f.getvalue())
for row in testDict:
writer.writerow(row)
k.set_contents_from_string(f.getvalue())
f.close()
Writes 3 lines to the file, however I'm unable to release memory to write a big file. If I add:
f.seek(0)
f.truncate(0)
to the loop, then only the last line of the file is written. Is there any way to release resources without deleting lines from the file?

I did find a solution to my question, which I will post here in case anyone else is interested. I decided to do this as parts in a multipart upload. You can't stream to S3. There is also a package available that changes your streaming file over to a multipart upload which I used: Smart Open.
import smart_open
import io
import csv
testDict = [{
"fieldA": "8",
"fieldB": None,
"fieldC": "888888888888"},
{
"fieldA": "9",
"fieldB": None,
"fieldC": "99999999999"}]
fieldnames = ['fieldA', 'fieldB', 'fieldC']
f = io.StringIO()
with smart_open.smart_open('s3://dev-test/bar/foo.csv', 'wb') as fout:
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()
fout.write(f.getvalue())
for row in testDict:
f.seek(0)
f.truncate(0)
writer.writerow(row)
fout.write(f.getvalue())
f.close()

Here is a complete example using boto3
import boto3
import io
session = boto3.Session(
aws_access_key_id="...",
aws_secret_access_key="..."
)
s3 = session.resource("s3")
buff = io.BytesIO()
buff.write("test1\n".encode())
buff.write("test2\n".encode())
s3.Object(bucket, keypath).put(Body=buff.getvalue())

We were trying to upload file contents to s3 when it came through as an InMemoryUploadedFile object in a Django request. We ended up doing the following because we didn't want to save the file locally. Hope it helps:
#action(detail=False, methods=['post'])
def upload_document(self, request):
document = request.data.get('image').file
s3.upload_fileobj(document, BUCKET_NAME,
DESIRED_NAME_OF_FILE_IN_S3,
ExtraArgs={"ServerSideEncryption": "aws:kms"})

According to docs it's possible
s3.Object('mybucket', 'hello.txt').put(Body=open('/tmp/hello.txt', 'rb'))
so we can use StringIO in ordinary way
Update: smart_open lib from #inquiring minds answer is better solution

There's an interesting code solution mentioned in a GitHub smart_open issue (#82) that I've been meaning to try out. Copy-pasting here for posterity... looks like boto3 is required:
csv_data = io.BytesIO()
writer = csv.writer(csv_data)
writer.writerows(my_data)
gz_stream = io.BytesIO()
with gzip.GzipFile(fileobj=gz_stream, mode="w") as gz:
gz.write(csv_data.getvalue())
gz_stream.seek(0)
s3 = boto3.client('s3')
s3.upload_fileobj(gz_stream, bucket_name, key)
This specific example is streaming to a compressed S3 key/file, but it seems like the general approach -- using the boto3 S3 client's upload_fileobj() method in conjunction with a target stream, not a file -- should work.

There's a well supported library for doing just this:
pip install s3fs
s3fs is really trivial to use:
import s3fs
s3fs.S3FileSystem(anon=False)
with s3.open('mybucket/new-file', 'wb') as f:
f.write(2*2**20 * b'a')
f.write(2*2**20 * b'a')
Incidentally there's also something built into boto3 (backed by the AWS API) called MultiPartUpload.
This isn't factored as a python stream which might be an advantage for some people. Instead you can start an upload and send parts one at a time.

To write a string to an S3 object, use:
s3.Object('my_bucket', 'my_file.txt').put('Hello there')
So convert the stream to string and you're there.

unable to read the data of a CSV file present in S3 bucket

I am trying to access a CSV file present in S3 bucket . Unfortunately nothing seems to work till now. Can anyone please assist me. All I want to do is access data from CSV which in inside S3 bucket and print some of its columns
Below is the code I tried:
'''
bucket = "bucket name"
file_name = "SUBFOLDER1/SUBFOLDER2/.CSV file"
obj = s3_resource.get_object(Bucket= bucket, Key= file_name)
initial_df = pd.read_csv(obj['Body'])
print(initial_df)
''''
Error I get is :
s3.ServiceResource' object has no attribute 'get_object'

There are two ways of using boto3:
Client method
s3_client = boto3.client('s3')
Calls on a client map one-to-one to AWS API calls. They return a JSON blog that you can access like:
for reservation in response['Reservations']:
Resource method
s3_resource = boto3.resource('s3')
This returns Python objects that can be used directly like:
for object in Bucket('name').objects.all():
Now, to answer your question...
You show this line of code:
obj = s3_resource.get_object(Bucket= bucket, Key= file_name)
The get_object() call should be used on a boto3 client, but it appears that you are attempting to use it on a boto3 resource. So, you can either use a client instead of a resource, or change the call to a resource-style call.

Is there a way for google cloud storage client to point to 'file object' on cloud storage to be then used by lxml?

With Google Cloud Storage Client I could not read a Storage file as an object as required by the lxml.etree.parse. I could read the Cloud storage file as a blob, but that did not work well with lxml.
I am trying to convert XML files using an XSLT file. I want to have a Google Cloud Function(in Python3.7) that will be triggered as soon as the XML file is uploaded to Cloud Storage. I have tried this code by storing the files locally and it works. However need a way to get this working with Cloud Storage as well.
----Using local files (Working Code):
import lxml.etree as ET
filename = "C:\\GCP\\Files\\Profile.xml"
xsltfile = "C:\\GCP\\Files\\Transform.xslt"
outpath = "C:\\GCP\\Files\\Output\\Output.json"
dom = ET.parse(filename)
xslt = ET.parse(xsltfile)
transform = ET.XSLT(xslt)
newdom = transform(dom)
xdom = str(newdom)
text_file = open(outpath, "w")
text_file.write(xdom)
text_file.close()
----Using Cloud storage(not working):
from google.cloud import storage
import lxml.etree as ET
client = storage.Client()
bucket = client.get_bucket('customerfile02')
xmlblob = bucket.blob('testprofile.xml')
inputxml=xmlblob.download_as_string()
xmldom = ET.parse(inputxml)
Error: failed to load external entity
The error is expected as I am passing an XML string instead of a File Object as expected by ET.parse
How can I pass a file object from Cloud storage to make this work?

The lxml.etree.parse() function expects a string as a filename. If you want to pass it file contents instead, you need to wrap it in a StringIO or BytesIO (in this case, the latter):
from io import BytesIO
from google.cloud import storage
import lxml.etree as ET
client = storage.Client()
bucket = client.get_bucket('customerfile02')
xmlblob = bucket.blob('testprofile.xml')
inputxml = xmlblob.download_as_string()
xmldom = ET.parse(BytesIO(inputxml))
See the lxml documentation here: https://lxml.de/parsing.html.

How to upload an image with flask and store in couchdb?

A previous question asks how to retrieve at attachment from couchdb and display it in a flask application.
This question asks how to perform the opposite, i.e. how can an image be uploaded using flask and saved as a couchdb attachment.

Take a look at the example from WTF:
from werkzeug.utils import secure_filename
from flask_wtf.file import FileField
class PhotoForm(FlaskForm):
photo = FileField('Your photo')
#app.route('/upload/', methods=('GET', 'POST'))
def upload():
form = PhotoForm()
if form.validate_on_submit():
filename = secure_filename(form.photo.data.filename)
form.photo.data.save('uploads/' + filename)
else:
filename = None
return render_template('upload.html', form=form, filename=filename)
Take a look at the FileField api docs. There you have a stream method giving you access to the uploaded data. Instead of using the save method as in the example you can access the bytes from the stream, base64 encode it and save as an attachment in couchdb, e.g. Using put_attachment. Alternatively, the FileStorage api docs suggest you can use read() to retrieve the data.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

AWS Object Lambda using PyPDF2 to send back encrypted PDF - object

Related

How do I open images stored in GCP in Google datalab?

File streaming in python [duplicate]

unable to read the data of a CSV file present in S3 bucket

Is there a way for google cloud storage client to point to 'file object' on cloud storage to be then used by lxml?

How to upload an image with flask and store in couchdb?

Categories

Resources