Error in joblib.load when reading file from s3 - python-3.x

When I try to read a file from S3 with joblib.load(), I get the error ValueError: embedded null byte.
The files were created by joblib and load successfully from local copies (made before uploading to S3), so the error presumably lies in how the file is stored in or retrieved from S3.
Minimal code:
# Imports (AWS credentials assumed)
import boto3
import joblib  # sklearn.externals.joblib is no longer shipped with recent scikit-learn releases
s3 = boto3.resource('s3')
bucket_str = "my-aws-bucket"
bucket_key = "some-pseudo/folder-set/my-filename.joblib"
# Raises ValueError: embedded null byte
joblib.load(s3.Bucket(bucket_str).Object(bucket_key).get()['Body'].read())

The following code reconstructs a copy of the file in memory before passing it to joblib.load(), which then loads successfully.
from io import BytesIO
import boto3
import joblib  # sklearn.externals.joblib is no longer shipped with recent scikit-learn releases
s3 = boto3.resource('s3')
bucket_str = "my-aws-bucket"
bucket_key = "some-pseudo/folder-set/my-filename.joblib"
with BytesIO() as data:
    s3.Bucket(bucket_str).download_fileobj(bucket_key, data)
    data.seek(0)  # move back to the beginning after writing
    df = joblib.load(data)
The likely cause is that joblib.load() expects a file name or a file-like object. The raw bytes returned by read() are treated as a file name, and because pickled data contains null bytes, opening that "name" fails with ValueError: embedded null byte. Wrapping the data in BytesIO gives joblib.load() a file-like object instead.
P.S. With this method the file never touches the local disk, which helps in some circumstances (e.g. a node with plenty of RAM but very little disk space).
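
The difference between passing raw bytes and passing a file-like object can be reproduced without S3. The sketch below is only an illustration: it uses the standard-library pickle module as a stand-in for joblib, and the payload and function names are made up for the example. It assumes CPython on a POSIX system, where open() rejects a path that contains a null byte.

```python
import io
import pickle

# Hypothetical payload standing in for the contents of a .joblib file.
payload = pickle.dumps({"weights": bytes(8)}, protocol=4)


def open_as_path(data):
    """Treat the bytes as a file name, as a loader expecting a path would."""
    try:
        with open(data, "rb"):
            return "opened"
    except ValueError as exc:
        return str(exc)


def load_from_buffer(data):
    """Wrap the bytes in a file-like object and unpickle from it."""
    with io.BytesIO(data) as buffer:
        return pickle.load(buffer)


path_result = open_as_path(payload)        # error message instead of a file
buffer_result = load_from_buffer(payload)  # the original dictionary
print(path_result)
print(buffer_result)
```

The same reasoning applies to the boto3 case: the object returned by read() is plain bytes, so it has to be wrapped (or downloaded into a buffer) before a loader that accepts file-like objects can use it.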

You can also do it with the s3fs package.
import joblib
import s3fs
fs = s3fs.S3FileSystem()
with fs.open('s3://my-aws-bucket/some-pseudo/folder-set/my-filename.joblib', 'rb') as f:
    df = joblib.load(f)
I guess everybody has their own preference, but I really like s3fs because it makes the code look very familiar to people who haven't worked with S3 before.

Related

How can I write Python script result to S3 instead of local drive?

I'm working on the script below to pull details of AWS EC2 instances.
At the moment it pulls the information I am after and creates a CSV file in the same path as the script. I am expanding it into a Lambda function that pulls the information from all AWS accounts and, instead of creating the CSV on the local machine, pushes the data (in CSV format) to an S3 location.
The challenge I need help with is how to modify the script so that it writes to an S3 bucket instead of the local drive.
import boto3
import csv
from pprint import pprint
session = boto3.Session(profile_name='#ProfileName')
ec2_re=session.resource(service_name="ec2",region_name="ap-southeast-2")
ec2_cli=session.client(service_name="ec2")
fo=open('EC2_Details.csv','w',newline='')
data_obj=csv.writer(fo)
data_obj.writerow(["InstanceID","InstanceType","InstanceName","InstanceLaunchTime","Instance_Private_IP","InstanceState","InstanceTags"])
cnt=1
response = ec2_cli.describe_instances()
for reservation in response["Reservations"]:
    for instance in reservation["Instances"]:
        data_obj.writerow([instance["InstanceId"], instance["InstanceType"], instance["KeyName"], instance["LaunchTime"], instance["PrivateIpAddress"], instance["State"]["Name"], instance["Tags"]])
        cnt+=1
fo.close()
Thank you
The trivial way is to upload the file you created on disk.
There are more advanced ways that avoid writing to disk; you can find them on SO.
import boto3
import csv
from pprint import pprint
# Creating a session with boto3.
session = boto3.Session(profile_name='#ProfileName')
ec2_re=session.resource(service_name="ec2",region_name="ap-southeast-2")
ec2_cli=session.client(service_name="ec2")
fo=open('EC2_Details.csv','w',newline='')
data_obj=csv.writer(fo)
data_obj.writerow(["InstanceID","InstanceType","InstanceName","InstanceLaunchTime","Instance_Private_IP","InstanceState","InstanceTags"])
cnt=1
response = ec2_cli.describe_instances()
for reservation in response["Reservations"]:
    for instance in reservation["Instances"]:
        data_obj.writerow([instance["InstanceId"], instance["InstanceType"], instance["KeyName"], instance["LaunchTime"], instance["PrivateIpAddress"], instance["State"]["Name"], instance["Tags"]])
        cnt+=1
fo.close()
# Creating an S3 resource from the session.
s3 = session.resource('s3')
bucket = s3.Bucket("your-bucket-name")
# upload_file needs both the local file name and the destination key.
bucket.upload_file('EC2_Details.csv', 'EC2_Details.csv')
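
For the variant that avoids the local disk, one possible approach is to build the CSV in memory and send it with put_object. This is a minimal sketch, not the answer author's code: the function names rows_to_csv_bytes and upload_csv are hypothetical, and s3_client is assumed to be something like session.client('s3').

```python
import csv
import io


def rows_to_csv_bytes(header, rows):
    """Serialise the header and rows to CSV in memory and return UTF-8 bytes."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")


def upload_csv(s3_client, bucket_name, key, header, rows):
    """Upload the CSV with put_object and return the number of bytes sent.

    s3_client is assumed to behave like boto3's S3 client.
    """
    body = rows_to_csv_bytes(header, rows)
    s3_client.put_object(Bucket=bucket_name, Key=key, Body=body)
    return len(body)
```

In the script above, the rows would be collected in a list inside the loop over reservations and instances, then passed in a single call such as upload_csv(session.client('s3'), "your-bucket-name", "EC2_Details.csv", header, rows).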

Reading zip files from Amazon S3 using pre-signed url without knowing object key and bucket name

I have a password protected zip file stored in Amazon S3 which I need to read from a python program, extract the csv file from it and read to a dataframe. Initially, I was doing it using the object key and bucket name.
import zipfile
import boto3
import io
import pandas as pd
s3 = boto3.client('s3', aws_access_key_id="<access_key>",
                  aws_secret_access_key="<secret_key>", region_name="<region>")
s3_resource = boto3.resource('s3', aws_access_key_id="<access_key>",
                             aws_secret_access_key="<secret_key>", region_name="<region>")
obj = s3.get_object(Bucket="<bucket_name>", Key="<obj_key>")
with io.BytesIO(obj["Body"].read()) as tf:
    # rewind the file
    tf.seek(0)
    with zipfile.ZipFile(tf, mode='r') as zipf:
        df = pd.read_csv(zipf.open('<file_name.csv>', pwd=b'<password>'), sep='|')
print(df)
But due to some security concerns, I won't be able to do this anymore: I won't have the object key or the bucket name. And since I won't have the key, I won't have file_name.csv either. All I will have is a pre-signed URL. Is it possible to read the zip file using a pre-signed URL? How do I do that?
A pre-signed URL contains all the information required to download the file, so you don't need boto3 for that. Instead, use regular Python tools for downloading files from the internet, with your pre-signed URL as the URL.
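
As a rough sketch of that idea, the standard library alone can fetch the archive and locate the CSV without knowing its name in advance. The helper names download_bytes and read_first_csv are hypothetical, and the sketch assumes the archive holds at least one .csv member and uses the legacy zip encryption that the zipfile module can decrypt (it does not handle AES-encrypted archives).

```python
import io
import urllib.request
import zipfile


def download_bytes(url):
    """Fetch the body behind a URL, for example a pre-signed S3 URL."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def read_first_csv(zip_bytes, password=None):
    """Return (member name, raw bytes) of the first .csv entry in the archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
        if not csv_names:
            raise ValueError("archive contains no .csv member")
        # pwd must be bytes; it is only used for encrypted members.
        with archive.open(csv_names[0], pwd=password) as member:
            return csv_names[0], member.read()
```

With the question's setup this could be used as name, data = read_first_csv(download_bytes(presigned_url), password=b'<password>'), followed by pd.read_csv(io.BytesIO(data), sep='|').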

Read excel sheets from gcp bucket

I am currently trying to read data into my GCP notebook from a shared GCP storage bucket. I am an admin, so restrictions shouldn't apply as far as I know, but I am getting an error before I can even read the data in with pandas. Is this possible? Or am I going about this in the wrong way?
This is the code I have tried:
from google.cloud import storage
from io import BytesIO
import pandas as pd
client = storage.Client()
bucket = "our_data/deid"
blob = storage.blob.Blob("B_ACTIVITY.xlsx",bucket)
content = blob.download_as_string()
df = pd.read_excel(BytesIO(content))
I was hoping for the data to simply be brought in once the bucket was specified, but I get an error "'str' object has no attribute 'path'".
bucket needs to be a bucket object, not just a string.
Try changing that line to
bucket = client.bucket(<BUCKET_NAME>)
Here's a link to the constructor:
https://googleapis.dev/python/storage/latest/client.html#google.cloud.storage.client.Client.bucket
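
Putting that together, a minimal sketch might look like the following. It rests on an assumption the question does not confirm: that "our_data" is the bucket name and "deid" is a folder prefix inside it, so the object name would be "deid/B_ACTIVITY.xlsx". The helper names are hypothetical, and client is expected to behave like storage.Client().

```python
import io


def split_gcs_path(path):
    """Split 'bucket/folder/file' into (bucket name, object name)."""
    bucket_name, _, blob_name = path.partition("/")
    if not bucket_name or not blob_name:
        raise ValueError("expected '<bucket>/<object name>'")
    return bucket_name, blob_name


def download_to_buffer(client, path):
    """Download an object into a BytesIO buffer.

    client is assumed to behave like google.cloud.storage.Client().
    """
    bucket_name, blob_name = split_gcs_path(path)
    blob = client.bucket(bucket_name).blob(blob_name)
    return io.BytesIO(blob.download_as_bytes())
```

Usage would then be along the lines of df = pd.read_excel(download_to_buffer(storage.Client(), "our_data/deid/B_ACTIVITY.xlsx")).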

AudioSegment and BytesIO module gives "FileNotFoundError"

I am trying to fetch a .wav file from Amazon S3 and modify it using the AudioSegment library. To fetch the .wav file from S3, I use boto3 and the io module. For the audio operations, I use the AudioSegment module.
When I fetch the file from S3 into a BytesIO and pass it to AudioSegment, I get a "System can not find the file specified" error. Below is my code:
import boto3
from pydub import AudioSegment
import io
client = boto3.client('s3')
obj = client.get_object(Bucket='<BucketName>', Key='<FileName>')
data = io.BytesIO(obj['Body'].read())
sound1 = AudioSegment.from_file(data)
The error is raised at AudioSegment.from_file(data):
System can not find the file specified
Try specifying the format argument for AudioSegment so that it matches the file type. For example, for a .wav file:
sound1 = AudioSegment.from_file(data, format='wav')
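
If the objects are not always the same type, one option is to derive the format from the object key rather than hard-coding it. The helper below is a hypothetical sketch and assumes the key's extension reflects the actual encoding of the file, which S3 does not guarantee.

```python
import os


def format_from_key(key, default="wav"):
    """Return the key's extension, lower-cased and without the dot."""
    extension = os.path.splitext(key)[1].lstrip(".").lower()
    return extension or default
```

It could then be used as AudioSegment.from_file(data, format=format_from_key('<FileName>')).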

Using boto to delete all buckets

I've been tasked with creating a script to delete all the current S3 buckets and create some new ones. This is something that they want to do on an ongoing basis. So far I have all the preliminaries:
from __future__ import print_function  # __future__ imports must come first
import boto
from boto.s3.key import Key
import boto.s3.connection
conn = boto.s3.connect_to_region('us-east-1',
    aws_access_key_id='my_access_key', aws_secret_access_key='my_secret_key')
ls = conn.get_all_buckets()
print(*ls,sep='\n')
This gives me a list of all the current buckets. Now if I want to remove the buckets my understanding is that they have to be emptied first, using a method something like:
db = conn.get_bucket('bucket_name')
for key in db.list():
    key.delete()
And then I could do:
conn.delete_bucket('bucket_name')
I want to set it up such that it pulls each bucket name from 'ls', but I'm not sure how to go about this. I tried this:
for i in ls:
    db = conn.get_bucket('i')
    for key in db.list():
        key.delete()
But I get an error "S3ResponseError: 400 Bad Request". I'm getting a sneaking suspicion that it's not pulling the separate elements from the list. Do I maybe have to get data frames involved? As far as I know, boto doesn't have an option to just nuke all the folders outright.
I'd recommend using boto3.
The following should do the trick, though it's untested (I don't want to delete all my buckets :))
import boto3
client = boto3.client('s3')
s3 = boto3.resource('s3')
buckets = client.list_buckets()
for bucket in buckets['Buckets']:
    s3_bucket = s3.Bucket(bucket['Name'])
    s3_bucket.objects.all().delete()
    s3_bucket.delete()
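
Because this operation is destructive, it may be worth wrapping it in a function with a dry-run switch. The following is a hedged sketch, untested against a real account: the function name and the dry_run parameter are assumptions for illustration, and s3_resource is expected to behave like boto3.resource('s3'). Note also that conn.get_bucket('i') in the question passes the literal string 'i' rather than the loop variable.

```python
def delete_all_buckets(s3_resource, dry_run=True):
    """Empty and delete every bucket visible to s3_resource.

    Returns the names of the buckets that were (or, in a dry run, would be)
    deleted. With dry_run=True nothing is modified.
    """
    names = []
    for bucket in s3_resource.buckets.all():
        names.append(bucket.name)
        if dry_run:
            continue
        bucket.objects.all().delete()    # current objects
        bucket.object_versions.delete()  # old versions and delete markers, if any
        bucket.delete()                  # only succeeds once the bucket is empty
    return names
```

Calling delete_all_buckets(boto3.resource('s3')) first lists what would be removed; passing dry_run=False performs the deletion.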

Resources