Good morning,
Yesterday I saved a file from SageMaker conda_python3 to S3 like this:
s3 = boto3.client(
    's3',
    aws_access_key_id='XXXX',
    aws_secret_access_key='XXXX'
)
y = pandas.DataFrame(df.tag_factor,index = df.index)
s3.put_object(Body = y.values.tobytes(), Bucket='xxx', Key='xxx')
Today I am trying to open it with conda_python3 as a pandas.Series or as a numpy.array object, with this code:
s3 = boto3.client(
    's3',
    aws_access_key_id='XXX',
    aws_secret_access_key='XXX'
)
y_bytes = s3.get_object(Bucket='xxx', Key='xxx')
y = numpy.load(io.BytesIO(y_bytes['Body'].read()))
but I am getting this error: OSError: Failed to interpret file <_io.BytesIO object at 0x7fcb0b403258> as a pickle
I tried this:
y = numpy.fromfile(io.BytesIO(y_bytes['Body'].read()))
and I get this error:
UnsupportedOperation: fileno
I tried this:
y = pd.read_csv(io.BytesIO(y_bytes['Body'].read()), sep=" ", header=None)
and I get this error:
EmptyDataError: No columns to parse from file
How can I read this file?
As suggested in a previous comment, you probably want to save your data in a known file format when reading from and writing to S3.
As an example, here is some code that converts a pandas DataFrame to csv, saves it in S3, and reads the file from S3 back into a DataFrame.
import pandas as pd
import boto3
import io
df = pd.DataFrame(...)
csv_buffer = io.StringIO()
df.to_csv(csv_buffer, index=False)
s3 = boto3.client('s3')
bucket = 'mybucket'
key = 'myfile.csv'
s3.put_object(Body=csv_buffer.getvalue(), Bucket=bucket, Key=key)
obj = s3.get_object(Bucket=bucket, Key=key)
df2 = pd.read_csv(io.BytesIO(obj['Body'].read()))
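Alternatively, if you want to keep the data as a raw NumPy array rather than CSV, a minimal sketch (reusing the y DataFrame from the question, with placeholder bucket and key names) is to serialize it with numpy.save, which writes the .npy header that numpy.load expects:
import io
import boto3
import numpy as np

s3 = boto3.client('s3')
bucket = 'mybucket'      # placeholder bucket name
key = 'tag_factor.npy'   # placeholder key

# Write: np.save adds the .npy header that plain .tobytes() lacks
buffer = io.BytesIO()
np.save(buffer, y.values)   # y is the DataFrame built in the question
s3.put_object(Body=buffer.getvalue(), Bucket=bucket, Key=key)

# Read: np.load can now recover the dtype and shape from the header
obj = s3.get_object(Bucket=bucket, Key=key)
arr = np.load(io.BytesIO(obj['Body'].read()))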
I am new to Google Cloud Storage.
In my Python code, I have a couple of DataFrames and I want to store them in a GCS bucket as a single Excel file with multiple sheets.
In a local directory, I am able to do that using ExcelWriter. Here is the code for that:
writer = pd.ExcelWriter(filename)
dataframe1.to_excel(writer, 'sheet1', index=False)
dataframe2.to_excel(writer, 'sheet2', index=False)
writer.save()
I don't want to save a temp file in local directory and then upload it to GCS.
You can skip the use of gcsfs and directly use the ExcelWriter object with the storage client:
import io
import pandas as pd
from google.cloud import storage
storage_client = storage.Client()
bucket = storage_client.bucket('name_of_your_bucket')
blob = bucket.blob('path/to/excel')
with io.BytesIO() as output:
    writer = pd.ExcelWriter(output, engine='xlsxwriter')
    dataframe1.to_excel(writer, sheet_name='sheet1', index=False)
    dataframe2.to_excel(writer, sheet_name='sheet2', index=False)
    writer.save()
    output.seek(0)
    blob.upload_from_file(output, content_type="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet")
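To read the workbook back from GCS, a minimal sketch (assuming the same bucket and blob path as above) is to download the blob's bytes and hand them to pd.read_excel:
import io
import pandas as pd
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.bucket('name_of_your_bucket')
blob = bucket.blob('path/to/excel')

# Download the xlsx bytes and load every sheet into a dict of DataFrames
data = blob.download_as_bytes()
sheets = pd.read_excel(io.BytesIO(data), sheet_name=None)
dataframe1 = sheets['sheet1']
dataframe2 = sheets['sheet2']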
OLD Answer:
You can instantiate your ExcelWriter() with engine='xlsxwriter' and use fs-gcsfs to write the bytes array to an Excel file in your GCS bucket.
In your case you can do the following:
import io
import pandas as pd
from fs_gcsfs import GCSFS
gcsfs = GCSFS(bucket_name='name_of_your_bucket',
              root_path='path/to/excel',
              # set a different root path if you wish to upload multiple files in different locations
              strict=False)
gcsfs.fix_storage()
output = io.BytesIO()
writer = pd.ExcelWriter(output, engine='xlsxwriter')
dataframe1.to_excel(writer, sheet_name='sheet1', index=False)
dataframe2.to_excel(writer, sheet_name='sheet2', index=False)
writer.save()
xlsx_data = output.getvalue()
with gcsfs.open('./excel_file.xlsx', 'wb') as f:
    f.write(xlsx_data)
PS: I had to use strict=False as fs-gcsfs wasn't able to locate the root path (do check the limitations section in the fs-gcsfs documentation).
Source: https://xlsxwriter.readthedocs.io/working_with_pandas.html#saving-the-dataframe-output-to-a-string
I have a bucket on S3.
I want to be able to connect to it and read the pictures/PDFs into my EC2 machine memory, perform OCR and get needed fields.
Here is what I have done so far but unfortunately it doesn't work.
import cv2
import boto3
import matplotlib
import numpy as np
import pytesseract
from PIL import Image
boto3.setup_default_session(profile_name='default-mfasession')
s3_client = boto3.client('s3')
s3_resource = boto3.resource('s3')
bucket_name = "my_bucket"
key = "my-files/._Screenshot 2020-04-20 at 14.21.20.png"
bucket = s3_resource.Bucket(bucket_name)
object = bucket.Object(key)
response = object.get()
file_stream = response['Body']
im = Image.open(file_stream)
np.array(im)
Returns me an error:
UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x7fae33dce110>
I have tried all the answers related to this issue on SO, but nothing helped.
Including:
matplotlib: ValueError: invalid PNG header
and
PIL cannot identify image file for io.BytesIO object
Please advise how to solve it.
This is what I usually use. Maybe it will work for you as well:
def image_from_s3(bucket, key):
    bucket = s3_resource.Bucket(bucket)
    image = bucket.Object(key)
    img_data = image.get().get('Body').read()
    return Image.open(io.BytesIO(img_data))
And in your handler you execute this:
img = image_from_s3(image_bucket, image_key)
img should be a Pillow Image if the call executes successfully.
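From there, running OCR on the returned image is straightforward; here is a sketch (the bucket name and key are placeholders, and pytesseract is assumed to be installed as in the question):
import io
import boto3
import pytesseract
from PIL import Image

s3_resource = boto3.resource('s3')

img = image_from_s3('my_bucket', 'my-files/scan.png')  # placeholder bucket/key
text = pytesseract.image_to_string(img)                # run OCR on the Pillow image
print(text)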
I am new to AWS Glue. I am trying to read a CSV and transform it into a JSON object. As I have seen, the approach would be to read the CSV via a crawler, convert it to a PySpark DataFrame, and then convert that to a JSON object.
So far, I have converted it to a JSON object. Now, how do I write this JSON back to an S3 bucket?
Below is the code
#########################################
### IMPORT LIBRARIES AND SET VARIABLES
#########################################
#Import python modules
from datetime import datetime
#Import pyspark modules
from pyspark.context import SparkContext
import pyspark.sql.functions as f
#Import glue modules
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
import json
import boto3
#Initialize contexts and session
spark_context = SparkContext.getOrCreate()
glue_context = GlueContext(spark_context)
session = glue_context.spark_session
s3_source = boto3.resource('s3')
#Parameters
glue_db = "umesh-db"
glue_tbl = "read"
#########################################
### EXTRACT (READ DATA)
#########################################
#Read movie data to Glue dynamic frame
dynamic_frame_read = glue_context.create_dynamic_frame.from_catalog(database = glue_db, table_name = glue_tbl)
#Convert dynamic frame to data frame to use standard pyspark functions
data_frame = dynamic_frame_read.toDF()
## Show DF data
print("Showing Df data")
data_frame.show()
### Convert the DF to the json
jsonContent = data_frame.toJSON()
jsonValue={}
arrraYObj=[]
for row in jsonContent.collect():
    print("Row data ", row)
    arrraYObj.append(row)
print("Array Obj",arrraYObj)
jsonValue['Employee']=arrraYObj
print("Json object ", jsonValue)
#Log end time
#dt_end = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
#print("Start time:", dt_end)
I'd appreciate it if anyone could help provide the right approach.
Thanks
data_frame.write.format('json').save('s3://bucket/key')
Or directly from dynamic frame
glue_context.write_dynamic_frame.from_options(frame=dynamic_frame_read,
                                               connection_type="s3",
                                               connection_options={"path": "s3://bucket/key"},
                                               format="json")
How can I create a pyspark data frame with 2 JSON files?
file1: this file has complete data
file2: this file has only the schema of file1 data.
file1
{"RESIDENCY":"AUS","EFFDT":"01-01-1900","EFF_STATUS":"A","DESCR":"Australian Resident","DESCRSHORT":"Australian"}
file2
[{"fields":[{"metadata":{},"name":"RESIDENCY","nullable":true,"type":"string"},{"metadata":{},"name":"EFFDT","nullable":true,"type":"string"},{"metadata":{},"name":"EFF_STATUS","nullable":true,"type":"string"},{"metadata":{},"name":"DESCR","nullable":true,"type":"string"},{"metadata":{},"name":"DESCRSHORT","nullable":true,"type":"string"}],"type":"struct"}]
You first have to read the schema file using Python's json.load, then convert it to a DataType using StructType.fromJson.
import json
from pyspark.sql.types import StructType
with open("/path/to/file2.json") as f:
    json_schema = json.load(f)

schema = StructType.fromJson(json_schema[0])
Now just pass that schema to the DataFrame reader:
df = spark.read.schema(schema).json("/path/to/file1.json")
df.show()
#+---------+----------+----------+-------------------+----------+
#|RESIDENCY| EFFDT|EFF_STATUS| DESCR|DESCRSHORT|
#+---------+----------+----------+-------------------+----------+
#| AUS|01-01-1900| A|Australian Resident|Australian|
#+---------+----------+----------+-------------------+----------+
EDIT:
If the file containing the schema is located in GCS, you can use Spark or Hadoop API to get the file content. Here is an example using Spark:
file_content = spark.read.text("/path/to/file2.json").rdd.map(
    lambda r: " ".join([str(elt) for elt in r])
).reduce(
    lambda x, y: "\n".join([x, y])
)
json_schema = json.loads(file_content)
I have found the GCSFS package for accessing files in GCP buckets:
pip install gcsfs
import gcsfs
fs = gcsfs.GCSFileSystem(project='your GCP project name')
with fs.open('path/toread/sample.json', 'rb') as f:
    json_schema = json.load(f)
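Once the schema dict is loaded this way, it can be converted and applied exactly as above (a sketch; the gs:// path to file1 is a placeholder):
from pyspark.sql.types import StructType

schema = StructType.fromJson(json_schema[0])
df = spark.read.schema(schema).json("gs://your_bucket/path/toread/file1.json")  # placeholder path
df.show()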
The code used for inference is:
import json
import pickle
import numpy as np
import boto3
s3 = boto3.resource('s3')
# Function for transferring pickles from s3 to lambda
def download_files_from_s3():
    with open('/tmp/km_model_on_space_data.pkl', 'wb') as f:
        s3.Bucket("ml-model-first-try").download_fileobj("km_model_on_space_data.pkl", f)
    with open('/tmp/tfidf_vectorizer.pkl', 'wb') as f:
        s3.Bucket("ml-model-first-try").download_fileobj("tfidf_vectorizer.pkl", f)
    with open('/tmp/cluster_distances.pkl', 'wb') as f:
        s3.Bucket("ml-model-first-try").download_fileobj("cluster_distances.pkl", f)
    with open('/tmp/space_ids_only.pkl', 'wb') as f:
        s3.Bucket("ml-model-first-try").download_fileobj("space_ids_only.pkl", f)
# Downloading files from s3 to lambda ------------ Comment the next line if data is already downloaded
download_files_from_s3()
# Loading pickles
km = pickle.load(open('/tmp/km_model_on_space_data.pkl','rb'))
tfidf_vectorizer = pickle.load(open('/tmp/tfidf_vectorizer.pkl','rb'))
cluster_distances = pickle.load(open('/tmp/cluster_distances.pkl','rb'))
space_ids = pickle.load(open('/tmp/space_ids_only.pkl','rb'))
# Automatically called every time
def lambda_handler(event, context):
    tfidf = tfidf_vectorizer.transform([event['text']])
    cluster = km.predict(tfidf)
    cluster_arr = cluster_distances[:,cluster[0]]
    cluter_arr_sorted = cluster_arr.argsort()
    recommended_space_ids = []
    for i in cluter_arr_sorted[0:10]:
        recommended_space_ids.append(space_ids[i])
    return {'Cluster_Number' : cluster[0], 'Space_Ids' : recommended_space_ids}
The pickled files are stored in S3 and retrieved into Lambda, where they are used for inference. The zip file uploaded to S3 contains all the necessary libraries, like numpy and sklearn, along with a function.py file whose code is given above; it contains the lambda_handler used for inference.
The error is:
"Unable to import module 'function': dynamic module does not define module export function (PyInit_multiarray)"