Gzip file compression and boto3 - python-3.x

I am a beginner with boto3 and I'd like to compress a file that sits in an S3 bucket without downloading it to my local laptop. It is supposed to be a streaming compression (AWS Glue). Below you can find my three attempts. The first one would be the best, because it is, in my opinion, truly streaming (similar to the gzip.open function).
First wrong attempt (gzip.s3.open does not exist...):
with gzip.s3.open('s3://bucket/attempt.csv', 'wb') as fo:
    # operations (write a file)
Second wrong attempt (s3fs gzip compression on a pandas dataframe):
import gzip
import boto3
import pandas as pd
from io import BytesIO, TextIOWrapper

s3 = boto3.client('s3', aws_access_key_id='', aws_secret_access_key='')

# read file
source_response_m = s3.get_object(Bucket=bucket, Key='file.csv')
df = pd.read_csv(BytesIO(source_response_m['Body'].read()))

# compress file
buffer = BytesIO()
with gzip.GzipFile(mode='w', fileobj=buffer) as zipped_file:
    df.to_csv(TextIOWrapper(zipped_file, 'utf8'), index=False)

# upload it
s3_resource = boto3.resource('s3', aws_access_key_id='', aws_secret_access_key='')
s3_object = s3_resource.Object(bucket, 'file.csv.gz')
s3_object.put(Body=buffer.getvalue())
Third wrong attempt (upload a gzipped file using boto3, following https://gist.github.com/tobywf/079b36898d39eeb1824977c6c2f6d51e):
from io import BytesIO
import gzip
import shutil
import boto3
from tempfile import TemporaryFile

s3 = boto3.resource('s3', aws_access_key_id='', aws_secret_access_key='')
bucket = s3.Bucket('bucket')

def upload_gzipped(bucket, key, fp, compressed_fp=None, content_type='text/plain'):
    """Compress and upload the contents from fp to S3.

    If compressed_fp is None, the compression is performed in memory.
    """
    if not compressed_fp:
        compressed_fp = BytesIO()
    with gzip.GzipFile(fileobj=compressed_fp, mode='wb') as gz:
        shutil.copyfileobj(fp, gz)
    compressed_fp.seek(0)
    bucket.upload_fileobj(compressed_fp, key,
                          {'ContentType': content_type, 'ContentEncoding': 'gzip'})

upload_gzipped(bucket, 'folder/file.gz.csv', 'file.csv.gz')
Honestly, I have no idea how to use this last attempt. The documentation I have found is not very clear and there are no examples.
Do you have any ideas/suggestions to overcome my issue?
Thanks in advance.
Solution
I was able to solve my issue using the link below. Hope it will be useful for you.
https://gist.github.com/veselosky/9427faa38cee75cd8e27
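For readers who don't want to follow the link, here is a minimal sketch of the same general idea rather than the gist's exact code: stream the source object's body, gzip it into an in-memory buffer, and upload the result back to S3. The bucket and key names are placeholders.
import gzip
from io import BytesIO

import boto3

s3 = boto3.client('s3')
bucket = 'bucket'            # placeholder bucket name
source_key = 'file.csv'      # placeholder source key
target_key = 'file.csv.gz'   # placeholder target key

# Stream the source object and gzip it chunk by chunk into an in-memory
# buffer, so the uncompressed file is never written to local disk.
body = s3.get_object(Bucket=bucket, Key=source_key)['Body']
buffer = BytesIO()
with gzip.GzipFile(fileobj=buffer, mode='wb') as gz:
    for chunk in body.iter_chunks(chunk_size=1024 * 1024):
        gz.write(chunk)

# Rewind the buffer and upload the compressed bytes back to S3.
buffer.seek(0)
s3.upload_fileobj(
    buffer, bucket, target_key,
    ExtraArgs={'ContentType': 'text/csv', 'ContentEncoding': 'gzip'},
)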

Related

Extracting data from a UCI dataset online using Python if the file is compressed (.zip)

I want to use web scraping to get the data from the file
https://archive.ics.uci.edu/ml/machine-learning-databases/00380/YouTube-Spam-Collection-v1.zip
How can I do that using requests in Python?
You can use this example of how to load the zip file using requests and the built-in zipfile module:
import requests
from io import BytesIO
from zipfile import ZipFile

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00380/YouTube-Spam-Collection-v1.zip"
with ZipFile(BytesIO(requests.get(url).content), "r") as myzip:
    # print content of zip:
    # print(myzip.namelist())

    # print content of one of the files:
    with myzip.open("Youtube01-Psy.csv", "r") as f_in:
        print(f_in.read())
Prints:
b'COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS\n
...
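If the end goal is a DataFrame rather than raw bytes, the file handle returned by myzip.open can be passed straight to pandas. A small sketch, assuming the same archive and member name as above:
import pandas as pd
import requests
from io import BytesIO
from zipfile import ZipFile

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00380/YouTube-Spam-Collection-v1.zip"
with ZipFile(BytesIO(requests.get(url).content), "r") as myzip:
    # pandas accepts any file-like object, including zip members
    with myzip.open("Youtube01-Psy.csv") as f_in:
        df = pd.read_csv(f_in)
print(df.head())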

How to convert a byte stream (binary form) to a CSV file using Python 3.8?

I have to process a .csv file using an AWS Lambda function. I serve the .csv file to the Lambda function through an AWS API Gateway. The API Gateway transforms the .csv file into a base64 string as it is received in the request. Any idea how to convert it back to a .csv file?
I have mentioned my code below for reference.
import os
import sys

CWD = os.path.dirname(os.path.realpath(__file__))
sys.path.insert(0, os.path.join(CWD, "lib"))

import json
import base64
import boto3
import numpy as np
import io
from io import BytesIO
import pandas as pd

def lambda_handler(event, context):
    s3 = boto3.client("s3")
    # retrieving data from the event, which is a base64 string
    get_file_content_from_postman = event["content"]
    # decoding data; here the file content is converted back to binary form
    binary_file = base64.b64decode(get_file_content_from_postman)
Since your binary_file will be bytes, you can just wrap it in BytesIO to treat it as a file for pandas:
df = pd.read_csv(BytesIO(binary_file))
print(df)
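Putting the two pieces together, a minimal handler might look like the sketch below; the bucket name and output key are hypothetical and would need to match your own setup:
import base64
from io import BytesIO

import boto3
import pandas as pd

def lambda_handler(event, context):
    s3 = boto3.client("s3")

    # Decode the base64 payload sent through API Gateway back into raw bytes.
    binary_file = base64.b64decode(event["content"])

    # Parse the bytes as CSV by wrapping them in a file-like object.
    df = pd.read_csv(BytesIO(binary_file))

    # Example follow-up: write the parsed CSV back out to S3
    # ("my-bucket" and "processed/file.csv" are placeholder names).
    s3.put_object(
        Bucket="my-bucket",
        Key="processed/file.csv",
        Body=df.to_csv(index=False).encode("utf-8"),
    )
    return {"statusCode": 200, "body": f"{len(df)} rows processed"}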

Can't read PNG files from S3 in Python 3?

I have a bucket on S3.
I want to be able to connect to it, read the pictures/PDFs into my EC2 machine's memory, perform OCR, and get the needed fields.
Here is what I have done so far, but unfortunately it doesn't work.
import cv2
import boto3
import matplotlib
import numpy as np
import pytesseract
from PIL import Image

boto3.setup_default_session(profile_name='default-mfasession')
s3_client = boto3.client('s3')
s3_resource = boto3.resource('s3')

bucket_name = "my_bucket"
key = "my-files/._Screenshot 2020-04-20 at 14.21.20.png"

bucket = s3_resource.Bucket(bucket_name)
object = bucket.Object(key)
response = object.get()
file_stream = response['Body']
im = Image.open(file_stream)
np.array(im)
It returns this error:
UnidentifiedImageError: cannot identify image file <_io.BytesIO object
at 0x7fae33dce110>
I have tried all the answers related to this issue on SO; nothing helped, including:
matplotlib: ValueError: invalid PNG header
and
PIL cannot identify image file for io.BytesIO object
Please advise how to solve it.
This is what I usually use. Maybe it will work for you as well:
def image_from_s3(bucket, key):
    bucket = s3_resource.Bucket(bucket)
    image = bucket.Object(key)
    img_data = image.get().get('Body').read()
    return Image.open(io.BytesIO(img_data))
And in your handler you execute this:
img = image_from_s3(image_bucket, image_key)
img should be a Pillow image if it executes successfully.
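Since the end goal in the question is OCR, here is a quick usage sketch built on the same helper; the bucket and key are placeholders, and pytesseract is assumed to be installed. (As an aside, keys starting with "._" are usually macOS metadata files rather than real images, which can also trigger this error.)
import io

import boto3
import pytesseract
from PIL import Image

s3_resource = boto3.resource('s3')

def image_from_s3(bucket, key):
    obj = s3_resource.Bucket(bucket).Object(key)
    return Image.open(io.BytesIO(obj.get()['Body'].read()))

# "my_bucket" and the key below are placeholders
img = image_from_s3("my_bucket", "my-files/screenshot.png")
text = pytesseract.image_to_string(img)   # run OCR on the Pillow image
print(text)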

Writing figure to Google Cloud Storage instead of local drive

I want to upload a figure made with matplotlib to GCS.
Current code:
from tensorflow.gfile import MakeDirs, Open
import numpy as np
import matplotlib.pyplot as plt
import datetime

_LOGDIR = "{date:%Y%m%d-%H%M%S}".format(date=datetime.datetime.now())
_PATH_LOGDIR = 'gs://{0}/logs/{1}'.format('skin_cancer_mnist', _LOGDIR)
MakeDirs(_PATH_LOGDIR)

def saving_figure(path_logdir):
    data = np.arange(0, 21, 2)
    fig = plt.figure(figsize=(20, 10))
    plt.plot(data)
    fig.savefig("{0}/accuracy_loss_graph.png".format(path_logdir))
    plt.close()

saving_figure(_PATH_LOGDIR)
This fails with:
"/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/matplotlib/backends/backend_agg.py", line 512, in print_png
filename_or_obj = open(filename_or_obj, 'wb')
FileNotFoundError: [Errno 2] No such file or directory: 'gs://skin_cancer_mnist/logs/20190116-195604/accuracy_loss_graph.png'
(The directory exists, I checked)
I could change the source code of matplotlib to use tf.gfile.Open, but there should be a better option...
Joan's 2nd option didn't work for me; I found a solution that did:
from google.cloud import storage
import io

def saving_figure(path_logdir):
    data = np.arange(0, 21, 2)
    fig = plt.figure(figsize=(20, 10))
    plt.plot(data)
    fig_to_upload = plt.gcf()

    # Save figure image to a bytes buffer
    buf = io.BytesIO()
    fig_to_upload.savefig(buf, format='png')

    # init GCS client and upload buffer contents
    client = storage.Client()
    bucket = client.get_bucket('skin_cancer_mnist')
    blob = bucket.blob('logs/20190116-195604/accuracy_loss_graph.png')
    blob.upload_from_file(buf, content_type='image/png', rewind=True)
You cannot directly upload a file to Google Cloud Storage using the Python open function (which is the one matplotlib.pyplot.savefig uses behind the scenes).
Instead, you should use the Cloud Storage Client Library for Python. Check this documentation for details on how the library is used. It will allow you to manipulate files and upload/download them to GCS, among other things.
You will have to import this library in order to use it; you can install it by running pip install google-cloud-storage and import it as from google.cloud import storage.
Also, since plt.figure is an object and not the actual .png image you want to upload, you cannot upload it directly to Google Cloud Storage either.
However you can do either one of the following:
Option 1: Save the image locally, and then upload it to Google Cloud Storage:
Using your code:
from google.cloud import storage

def saving_figure(path_logdir):
    data = np.arange(0, 21, 2)
    fig = plt.figure(figsize=(20, 10))
    plt.plot(data)
    fig.savefig("your_local_path/accuracy_loss_graph.png")
    plt.close()

    # init GCS client and upload file
    client = storage.Client()
    bucket = client.get_bucket('skin_cancer_mnist')
    # This defines the path where the file will be stored in the bucket
    blob = bucket.blob('logs/20190116-195604/accuracy_loss_graph.png')
    blob.upload_from_filename(filename="your_local_path/accuracy_loss_graph.png")
Option 2: Save the image result from the figure to a variable, then upload it to GCS as a string (of bytes):
I have found the following StackOverflow answer that seems to save the figure image into a .png byte string; however, I haven't tried it myself.
Again, based on your code:
from google.cloud import storage
import io

def saving_figure(path_logdir):
    data = np.arange(0, 21, 2)
    fig = plt.figure(figsize=(20, 10))
    plt.plot(data)
    fig_to_upload = plt.gcf()

    # Save figure image to a bytes buffer
    buf = io.BytesIO()
    fig_to_upload.savefig(buf, format='png')
    buf.seek(0)
    # Upload the raw PNG bytes (base64-encoding them here would corrupt the stored image)
    image_bytes = buf.read()

    # init GCS client and upload buffer contents
    client = storage.Client()
    bucket = client.get_bucket('skin_cancer_mnist')
    # This defines the path where the file will be stored in the bucket
    blob = bucket.blob('logs/20190116-195604/accuracy_loss_graph.png')
    blob.upload_from_string(image_bytes, content_type='image/png')
Edit: Both options assume that the environment you are running the script from has the Cloud SDK installed and a Google Cloud authenticated account activated (if you haven't done this, you can check this documentation, which explains how to do it).
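For completeness, the route the question itself hints at also works without touching matplotlib's source: render the figure into an in-memory buffer and write that buffer through tf.gfile, which understands gs:// paths. A minimal sketch, assuming the TF 1.x gfile import used in the question:
import io

import numpy as np
import matplotlib.pyplot as plt
from tensorflow.gfile import Open  # TF 1.x; tf.io.gfile.GFile in TF 2.x

def saving_figure(path_logdir):
    data = np.arange(0, 21, 2)
    fig = plt.figure(figsize=(20, 10))
    plt.plot(data)

    # Render the figure to an in-memory PNG buffer instead of a local path.
    buf = io.BytesIO()
    fig.savefig(buf, format='png')
    plt.close(fig)

    # tf.gfile handles gs:// URLs, so the buffer can be written straight to GCS.
    with Open("{0}/accuracy_loss_graph.png".format(path_logdir), 'wb') as f:
        f.write(buf.getvalue())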

Compress a CSV file written to a StringIO Buffer in Python3

I'm parsing text from PDF files into rows of ordered char metadata. I need to serialize these files to cloud storage, which is all working fine; however, due to their size I'd also like to gzip them, and I've run into some issues there.
Here is my code:
import io
import csv
import zlib

# This data file is sent over Flask
page_position_data = pdf_parse_page_layouts(data_file)
field_order = ['char', 'position', 'page']
output_buffer = io.StringIO()
writer = csv.DictWriter(output_buffer, field_order)
writer.writeheader()
for page, rows in page_position_data.items():
    for text_char_data_row in rows:
        writer.writerow(text_char_data_row)
stored_format = zlib.compress(output_buffer)
This reads each row into the io.StringIO buffer successfully, but gzip/zlib seem to only work with bytes-like objects such as io.BytesIO, so the last line errors; I cannot write the CSV into a BytesIO buffer because DictWriter/writer error unless io.StringIO() is used.
Thank you for your help!
I figured this out and wanted to show my answer for anyone who runs into this:
The issue is that zlib.compress expects a bytes-like object; this actually means neither StringIO nor BytesIO will work directly, as both of these are "file-like" objects which implement read(), like normal Unix file handles.
All you have to do to fix this is write the CSV to a StringIO(), then get the string from the StringIO() object and encode it into a bytestring; it can then be compressed by zlib.
import io
import csv
import zlib

# This data file is sent over Flask
page_position_data = pdf_parse_page_layouts(data_file)
field_order = ['char', 'position', 'page']
output_buffer = io.StringIO()
writer = csv.DictWriter(output_buffer, field_order)
writer.writeheader()
for page, rows in page_position_data.items():
    for text_char_data_row in rows:
        writer.writerow(text_char_data_row)
encoded = output_buffer.getvalue().encode()
stored_format = zlib.compress(encoded)
I have an alternative answer for anyone interested, which should use less intermediate space; it needs Python 3.3 or later to use the getbuffer() method:
from io import BytesIO, TextIOWrapper
import csv
import zlib

def compress_csv(series):
    byte_buf = BytesIO()
    fp = TextIOWrapper(byte_buf, newline='', encoding='utf-8')
    writer = csv.writer(fp)
    for row in series:
        writer.writerow(row)
    fp.flush()  # make sure buffered text reaches byte_buf before compressing
    compressed = zlib.compress(byte_buf.getbuffer())
    fp.close()
    byte_buf.close()
    return compressed
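A quick usage example for the helper above, round-tripping a few rows through compression; the sample rows are made up for illustration:
import zlib

# Hypothetical sample rows in the same ['char', 'position', 'page'] shape
rows = [
    ['char', 'position', 'page'],
    ['a', 1, 1],
    ['b', 2, 1],
]

compressed = compress_csv(rows)
print(len(compressed), "compressed bytes")

# Decompress and decode to confirm the CSV round-trips intact
print(zlib.decompress(compressed).decode('utf-8'))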
