Can't read PNG files from S3 in Python 3?

I have a bucket on S3.
I want to be able to connect to it and read the pictures/PDFs into my EC2 machine memory, perform OCR and get needed fields.
Here is what I have done so far but unfortunately it doesn't work.
import cv2
import boto3
import matplotlib
import numpy as np
import pytesseract
from PIL import Image
boto3.setup_default_session(profile_name='default-mfasession')
s3_client = boto3.client('s3')
s3_resource = boto3.resource('s3')
bucket_name = "my_bucket"
key = "my-files/._Screenshot 2020-04-20 at 14.21.20.png"
bucket = s3_resource.Bucket(bucket_name)
object = bucket.Object(key)
response = object.get()
file_stream = response['Body']
im = Image.open(file_stream)
np.array(im)
This returns an error:
UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x7fae33dce110>
I have tried all the answers related to this issue on SO, but nothing helped, including:
matplotlib: ValueError: invalid PNG header
and
PIL cannot identify image file for io.BytesIO object
How can I solve this?

This is what I usually use. Maybe it will work for you as well:
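import io

def image_from_s3(bucket, key):
    # s3_resource is the boto3 resource created in the question.
    # Read the full object body into memory so Pillow gets a seekable buffer.
    bucket = s3_resource.Bucket(bucket)
    image = bucket.Object(key)
    img_data = image.get().get('Body').read()
    return Image.open(io.BytesIO(img_data))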
And in your handler you execute this:
img = image_from_s3(image_bucket, image_key)
If the call succeeds, img is a Pillow Image object.
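One more thing worth checking: the key in the question starts with ._, which is the prefix macOS gives to AppleDouble sidecar files. Such a file holds metadata rather than pixels, so Pillow cannot identify it even though its name ends in .png. Whether that is the cause here is an assumption, not a confirmed diagnosis. The sketch below inspects the first bytes of whatever Body.read() returned; describe_payload is a hypothetical helper name.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"      # first 8 bytes of every PNG file
APPLEDOUBLE_MAGIC = b"\x00\x05\x16\x07"   # first 4 bytes of a macOS "._" sidecar file


def describe_payload(data: bytes) -> str:
    """Classify raw bytes (e.g. from response['Body'].read()) by their magic number."""
    if data.startswith(PNG_SIGNATURE):
        return "png"
    if data.startswith(APPLEDOUBLE_MAGIC):
        return "appledouble"
    return "unknown"


# Hypothetical usage with the question's objects:
#   data = bucket.Object(key).get()["Body"].read()
#   if describe_payload(data) != "png":
#       raise ValueError("%r is not a PNG (%s)" % (key, describe_payload(data)))
```

If this reports appledouble, the real image is probably stored under the same key without the ._ prefix.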

Related

Gzip file compression and boto3

I am a beginner with boto3 and I'd like to compress a file that is in an S3 bucket without downloading it to my laptop. It should be a streaming compression (in AWS Glue). Below are my three attempts. The first would be the best one because, in my opinion, it streams (similar to the gzip.open function).
First wrong attempt (gzip.s3.open does not exist...):
with gzip.s3.open('s3://bucket/attempt.csv','wb') as fo:
    "operations (write a file)"
Second wrong attempt (s3fs gzip compression on pandas dataframe):
import gzip
import boto3
import pandas as pd
from io import BytesIO, TextIOWrapper
s3 = boto3.client('s3', aws_access_key_id='', aws_secret_access_key='')
# read file
source_response_m = s3.get_object(Bucket=bucket,Key='file.csv')
df = pd.read_csv(BytesIO(source_response_m['Body'].read()))
# compress file
buffer = BytesIO()
with gzip.GzipFile(mode='w', fileobj=buffer) as zipped_file:
    df.to_csv(TextIOWrapper(zipped_file, 'utf8'), index=False)
# upload it
s3_resource = boto3.resource('s3',aws_access_key_id='', aws_secret_access_key='')
s3_object = s3_resource.Object(bucket, 'file.csv.gz')
s3_object.put(Body=buffer.getvalue())
Third wrong attempt (Upload Gzip file using Boto3 & https://gist.github.com/tobywf/079b36898d39eeb1824977c6c2f6d51e)
from io import BytesIO
import gzip
import shutil
import boto3
from tempfile import TemporaryFile
s3 = boto3.resource('s3',aws_access_key_id='', aws_secret_access_key='')
bucket = s3.Bucket('bucket')
def upload_gzipped(bucket, key, fp, compressed_fp=None, content_type='text/plain'):
    """Compress and upload the contents from fp to S3.
    If compressed_fp is None, the compression is performed in memory.
    """
    if not compressed_fp:
        compressed_fp = BytesIO()
    with gzip.GzipFile(fileobj=compressed_fp, mode='wb') as gz:
        shutil.copyfileobj(fp, gz)
    compressed_fp.seek(0)
    bucket.upload_fileobj(compressed_fp, key, {'ContentType': content_type, 'ContentEncoding': 'gzip'})
upload_gzipped(bucket,'folder/file.gz.csv', 'file.csv.gz')
Honestly I have no idea how to use the latter attempt. The doc I have found is not very clear and there are no examples.
Do you have any ideas/suggestions to overcome my issue?
Thanks in advance.
Solution
I was able to solve my issue using the link below. Hope it will be useful for you.
https://gist.github.com/veselosky/9427faa38cee75cd8e27
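For readers stuck on the third attempt: the call there passes the string 'file.csv.gz' as fp, but shutil.copyfileobj needs an object with a read() method, so it fails before anything is uploaded. Below is a minimal sketch of a call that should work, assuming the source object is fetched through the boto3 resource API and its Body is read as a binary stream. The name gzip_to_bucket and the key names are hypothetical. The compressed result is still buffered in memory, so this is not a fully streaming solution.

```python
import gzip
import shutil
from io import BytesIO


def gzip_to_bucket(bucket, dest_key, fp, content_type="text/plain"):
    """Gzip the binary stream fp in memory and upload it to bucket under dest_key.

    bucket is expected to behave like a boto3 s3.Bucket (it needs upload_fileobj).
    """
    compressed = BytesIO()
    with gzip.GzipFile(fileobj=compressed, mode="wb") as gz:
        shutil.copyfileobj(fp, gz)  # reads fp in chunks and writes compressed bytes
    compressed.seek(0)  # rewind so the upload starts at the first byte
    bucket.upload_fileobj(
        compressed,
        dest_key,
        ExtraArgs={"ContentType": content_type, "ContentEncoding": "gzip"},
    )


# Hypothetical usage (bucket and key names are placeholders):
#   import boto3
#   bucket = boto3.resource("s3").Bucket("bucket")
#   body = bucket.Object("folder/file.csv").get()["Body"]  # StreamingBody has read()
#   gzip_to_bucket(bucket, "folder/file.csv.gz", body)
```

The key point is that fp is a readable binary object (an open file, a BytesIO, or the S3 response body), never a file name.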

Read a .jpg from RAM

from io import StringIO
from PIL import Image
import requests
response = requests.get(image.url)
# Works fine, but requires a disk write.
with open('tmp.jpg', 'wb') as f:
    f.write(response.content)
img = Image.open('tmp.jpg')
# Fails with `OSError: cannot identify image file <_io.StringIO object at 0x7fb666238a68>`
#file = StringIO(str(response.content))
#img = Image.open(file)
I am trying to run the code from this tutorial, but in Python 3. The commented-out version is the closest I have come to the original idea of "get an image from the network into RAM and work with that". I don't mind using cv2 if that is easier. How do I write this code pythonically and efficiently?
As Mark Setchell said, you likely want BytesIO not StringIO.
from io import BytesIO
from PIL import Image
import requests
response = requests.get(image.url)
file = BytesIO(response.content)
img = Image.open(file)
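Why the StringIO version fails: in Python 3, response.content is bytes, and calling str() on bytes returns its printable representation (b'...') rather than the original data. The short demonstration below uses a made-up byte string in place of a real download.

```python
from io import BytesIO, StringIO

# Stand-in for response.content: the first bytes of a JPEG file (assumed sample data).
content = b"\xff\xd8\xff\xe0\x00\x10JFIF"

# str() on bytes gives the repr text, so the JPEG marker bytes are lost.
as_text = StringIO(str(content)).read()

# BytesIO keeps the payload byte for byte, which is what Image.open needs.
as_bytes = BytesIO(content).read()

print(as_text.startswith("b'"))  # True: the "image" is now the text of a bytes literal
print(as_bytes == content)       # True
```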

Writing figure to Google Cloud Storage instead of local drive

I want to upload the figure which is made with matplotlib to GCS.
Current code:
from tensorflow.gfile import MakeDirs, Open
import numpy as np
import matplotlib.pyplot as plt
import datetime
_LOGDIR = "{date:%Y%m%d-%H%M%S}".format(date=datetime.datetime.now())
_PATH_LOGDIR = 'gs://{0}/logs/{1}'.format('skin_cancer_mnist', _LOGDIR)
MakeDirs(_PATH_LOGDIR)
def saving_figure(path_logdir):
    data = np.arange(0, 21, 2)
    fig = plt.figure(figsize=(20, 10))
    plt.plot(data)
    fig.savefig("{0}/accuracy_loss_graph.png".format(path_logdir))
    plt.close()
saving_figure(_PATH_LOGDIR)
"/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/matplotlib/backends/backend_agg.py", line 512, in print_png
filename_or_obj = open(filename_or_obj, 'wb')
FileNotFoundError: [Errno 2] No such file or directory: 'gs://skin_cancer_mnist/logs/20190116-195604/accuracy_loss_graph.png'
(The directory exists, I checked)
I could change matplotlib's source code to use tf.gfile.Open, but there should be a better option...
Joan's second option didn't work for me, but I found a solution that did:
from google.cloud import storage
import io
def saving_figure(path_logdir):
    data = np.arange(0, 21, 2)
    fig = plt.figure(figsize=(20, 10))
    plt.plot(data)
    fig_to_upload = plt.gcf()
    # Save figure image to a bytes buffer
    buf = io.BytesIO()
    fig_to_upload.savefig(buf, format='png')
    # init GCS client and upload buffer contents
    client = storage.Client()
    bucket = client.get_bucket('skin_cancer_mnist')
    blob = bucket.blob('logs/20190116-195604/accuracy_loss_graph.png')
    blob.upload_from_file(buf, content_type='image/png', rewind=True)
You cannot upload a file to Google Cloud Storage directly with Python's built-in open function (which is what matplotlib.pyplot.savefig uses behind the scenes).
Instead, you should use the Cloud Storage Client Library for Python. Check this documentation for details on how this library is used. It lets you manipulate files and upload/download them to GCS, among other things.
To use the library, install it by running pip install google-cloud-storage and import it with from google.cloud import storage.
Also, since plt.figure returns a figure object and not the actual .png image that you want to upload, you cannot upload it to Google Cloud Storage directly either.
However, you can do either of the following:
Option 1: Save the image locally, and then upload it to Google Cloud Storage:
Using your code:
from google.cloud import storage
def saving_figure(path_logdir):
    data = np.arange(0, 21, 2)
    fig = plt.figure(figsize=(20, 10))
    plt.plot(data)
    fig.savefig("your_local_path/accuracy_loss_graph.png")
    plt.close()
    # init GCS client and upload file
    client = storage.Client()
    bucket = client.get_bucket('skin_cancer_mnist')
    blob = bucket.blob('logs/20190116-195604/accuracy_loss_graph.png') # This defines the path where the file will be stored in the bucket
    blob.upload_from_filename(filename="your_local_path/accuracy_loss_graph.png")
Option 2: Save the image result from the figure to a variable, then upload it to GCS as a string (of bytes):
I have found the following StackOverflow answer that seems to save the figure image into a .png byte string; however, I haven't tried it myself.
Again, based on your code:
from google.cloud import storage
import io
def saving_figure(path_logdir):
    data = np.arange(0, 21, 2)
    fig = plt.figure(figsize=(20, 10))
    plt.plot(data)
    fig_to_upload = plt.gcf()
    # Save figure image to a bytes buffer
    buf = io.BytesIO()
    fig_to_upload.savefig(buf, format='png')
    # Upload the raw PNG bytes; base64-encoding them first would store text, not an image
    image_as_bytes = buf.getvalue()
    # init GCS client and upload buffer contents
    client = storage.Client()
    bucket = client.get_bucket('skin_cancer_mnist')
    blob = bucket.blob('logs/20190116-195604/accuracy_loss_graph.png') # This defines the path where the file will be stored in the bucket
    blob.upload_from_string(image_as_bytes, content_type='image/png')
Edit: Both options assume that the environment you run the script from has the Cloud SDK installed and a Google Cloud account authenticated and activated (if not, check this documentation, which explains how to do it).
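The snippets above hardcode the bucket name and blob path instead of using the path_logdir argument. Here is a minimal sketch of deriving both from a gs:// path with the standard library. The helper split_gs_path is hypothetical and not part of google-cloud-storage.

```python
from urllib.parse import urlparse


def split_gs_path(gs_path):
    """Split 'gs://bucket/some/blob' into ('bucket', 'some/blob')."""
    parsed = urlparse(gs_path)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError("expected a gs://bucket/... path, got %r" % gs_path)
    return parsed.netloc, parsed.path.lstrip("/")


bucket_name, blob_name = split_gs_path(
    "gs://skin_cancer_mnist/logs/20190116-195604/accuracy_loss_graph.png"
)
# These values could then be passed to client.get_bucket(bucket_name)
# and bucket.blob(blob_name) in either option above.
```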

How to load images from Google Cloud Storage with keras.preprocessing

I am writing machine learning code that can be trained locally or in the cloud. I am using keras.preprocessing to load images, which under the hood uses PIL. It works fine for local files, but understandably doesn't understand Google Cloud Storage paths, like "gs://...".
from keras.preprocessing import image
image.load_img("gs://myapp-some-bucket/123.png")
Gives this error:
  File ".../lib/python2.7/site-packages/keras/preprocessing/image.py", line 320, in load_img
    img = pil_image.open(path)
  File ".../lib/python2.7/site-packages/PIL/Image.py", line 2530, in open
    fp = builtins.open(filename, "rb")
IOError: [Errno 2] No such file or directory: 'gs://myapp-some-bucket/123.png'
What is the correct way of doing this? I ultimately need a folder of images to be a single numpy array (images decoded and grayscale).
I found a replacement for keras.preprocessing.image.load_img that understands GCS. I also included code to read the whole folder and turn the images in it into a single numpy array for training...
import os
import numpy as np
import tensorflow as tf
from tensorflow.python.platform import gfile
train_files_dir = "gs://myapp-some-bucket"
filelist = gfile.ListDirectory(train_files_dir)
sess = tf.Session()
with sess.as_default():
    # Decode every PNG in the folder and collect the results in one array (TensorFlow 1.x API)
    x = np.array([np.array(tf.image.decode_png(tf.read_file(os.path.join(train_files_dir, filename))).eval()) for filename in filelist])
Load image:
image_path = 'gs://xxxxxxx.jpg'
image = tf.read_file(image_path)
image = tf.image.decode_jpeg(image)
image_array = sess.run(image)
Save image:
job_dir = 'gs://xxxxxxxx'
image = tf.image.encode_jpeg(image_array)
file_name = 'xxx.jpg'
write_op = tf.write_file(os.path.join(job_dir, file_name), image)
sess.run(write_op)

How to access images with plpython using the pillow libary and saving on postgresql bytea column?

I have a problem accessing a processed image that is stored in the database as bytea. Here is my PL/Python function to resize an image.
CREATE OR REPLACE FUNCTION public.ajustar(randstring bytea)
RETURNS bytea
LANGUAGE plpythonu
AS $function$
from io import BytesIO
import PIL
from PIL import Image
basewidth = 300
mem_file = BytesIO()
mem_file.write(randstring)
img = Image.open(mem_file)
wpercent = (basewidth/float(img.size[0]))
hsize = int((float(img.size[1])*float(wpercent)))
img = img.resize((basewidth,hsize), PIL.Image.ANTIALIAS)
img.close()
return img
$function$;
My problem is that the function should return a bytea, but I get an address instead. How can I get the image rather than the address? The only way it works is by saving the image to a file with img.save('/home/postgres/imagen.jpg'), but I need to put it in an object so I can replace the image in the database.
pruebas=# select encode(ajustar(foto), 'escape') from personal where id=193;
encode
------------------------------------------------------------
<PIL.Image.Image image mode=RGB size=300x347 at 0x1895990>
(1 row)
Thanks in advance
With this function you can process images (JPEG format) in the database.
It is written in PL/Python. If you are using Linux (CentOS 7 in my case), install or upgrade the Pillow library with
$ sudo pip install Pillow
or
$ sudo pip install --upgrade Pillow
This is the function:
CREATE FUNCTION ajustar(randstring bytea) RETURNS bytea
LANGUAGE plpythonu
AS $$
from io import BytesIO
import PIL
from PIL import Image
basewidth = 300
mem_file = BytesIO()
mem_file.write(randstring)
img = Image.open(mem_file)
wpercent = (basewidth/float(img.size[0]))
hsize = int((float(img.size[1])*float(wpercent)))
img = img.resize((basewidth,hsize), PIL.Image.ANTIALIAS)
salida = BytesIO()
img.save(salida, format='JPEG')
hex_data = salida.getvalue()
img.close()
return hex_data
$$;
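The resize arithmetic in the function keeps the aspect ratio by scaling the height with the same factor as the width. Below is a small standalone sketch of that calculation in Python 3. The helper name scaled_size is hypothetical and the 600x694 input is an invented example.

```python
def scaled_size(size, basewidth=300):
    """Return (width, height) scaled so that the width equals basewidth."""
    width, height = size
    wpercent = basewidth / float(width)   # scale factor applied to both dimensions
    hsize = int(height * wpercent)        # truncate to whole pixels, as the function does
    return basewidth, hsize


print(scaled_size((600, 694)))  # (300, 347)
```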
