Read excel sheets from gcp bucket - gcp-ai-platform-notebook

I am currently trying to read data into my GCP AI Platform notebook from a shared GCP storage bucket. I am an admin, so restrictions shouldn't apply as far as I know, but I get an error before I can even read the data in with pandas. Is this possible, or am I going about this in the wrong way?
This is the code I have tried:
from google.cloud import storage
from io import BytesIO
import pandas as pd
client = storage.Client()
bucket = "our_data/deid"
blob = storage.blob.Blob("B_ACTIVITY.xlsx",bucket)
content = blob.download_as_string()
df = pd.read_excel(BytesIO(content))
I was hoping for the data to simply be brought in once the bucket was specified, but I get an error "'str' object has no attribute 'path'".

bucket needs to be a Bucket object, not just a string.
Try changing that line to
bucket = client.bucket("<BUCKET_NAME>")
Here's a link to the constructor:
https://googleapis.dev/python/storage/latest/client.html#google.cloud.storage.client.Client.bucket
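Putting it together, a minimal sketch of the corrected flow, assuming the bucket itself is named our_data and deid/ is a folder (object prefix) inside it rather than part of the bucket name:
from io import BytesIO
import pandas as pd
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("our_data")          # a Bucket object, not a plain string
blob = bucket.blob("deid/B_ACTIVITY.xlsx")  # object path inside the bucket
content = blob.download_as_bytes()          # download_as_string() on older library versions
df = pd.read_excel(BytesIO(content))        # reading .xlsx requires openpyxl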

Related

How can I write Python script result to S3 instead of local drive?

I'm working on the script below to pull information about AWS EC2 instances.
What it does now is pull the information I am after and create a CSV file in the same path as the script. I am working on expanding this into a Lambda function that pulls the information from all AWS accounts and, instead of creating the CSV on the local machine, pushes the data (CSV) to an S3 location.
The challenge I need help with: how do I modify the script so it writes to an S3 bucket instead of the local drive?
import boto3
import csv
from pprint import pprint

session = boto3.Session(profile_name='#ProfileName')
ec2_re = session.resource(service_name="ec2", region_name="ap-southeast-2")
ec2_cli = session.client(service_name="ec2")

fo = open('EC2_Details.csv', 'w', newline='')
data_obj = csv.writer(fo)
data_obj.writerow(["InstanceID", "InstanceType", "InstanceName", "InstanceLaunchTime", "Instance_Private_IP", "InstanceState", "InstanceTags"])

cnt = 1
response = ec2_cli.describe_instances()
for reservation in response["Reservations"]:
    for instance in reservation["Instances"]:
        data_obj.writerow([instance["InstanceId"], instance["InstanceType"], instance["KeyName"], instance["LaunchTime"], instance["PrivateIpAddress"], instance["State"]["Name"], instance["Tags"]])
        cnt += 1
fo.close()
Thank you
The trivial way is to upload the file you created on disk.
There are more advanced approaches that avoid writing to disk at all; you can find them on SO, and a sketch of one such in-memory approach follows the code below.
import boto3
import csv
from pprint import pprint

# Creating a session with Boto3.
session = boto3.Session(profile_name='#ProfileName')
ec2_re = session.resource(service_name="ec2", region_name="ap-southeast-2")
ec2_cli = session.client(service_name="ec2")

fo = open('EC2_Details.csv', 'w', newline='')
data_obj = csv.writer(fo)
data_obj.writerow(["InstanceID", "InstanceType", "InstanceName", "InstanceLaunchTime", "Instance_Private_IP", "InstanceState", "InstanceTags"])

cnt = 1
response = ec2_cli.describe_instances()
for reservation in response["Reservations"]:
    for instance in reservation["Instances"]:
        data_obj.writerow([instance["InstanceId"], instance["InstanceType"], instance["KeyName"], instance["LaunchTime"], instance["PrivateIpAddress"], instance["State"]["Name"], instance["Tags"]])
        cnt += 1
fo.close()

# Creating an S3 resource from the session and uploading the local file.
s3 = session.resource('s3')
bucket = s3.Bucket("your-bucket-name")
bucket.upload_file('EC2_Details.csv', 'EC2_Details.csv')  # upload_file(Filename, Key)
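For the in-memory route mentioned above, a rough sketch, assuming the same profile and bucket name as before; it builds the CSV in a StringIO buffer and uploads it with put_object instead of touching the local disk:
import io
import csv
import boto3

session = boto3.Session(profile_name='#ProfileName')
ec2_cli = session.client(service_name="ec2", region_name="ap-southeast-2")
s3 = session.client('s3')

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["InstanceID", "InstanceType", "InstanceName", "InstanceLaunchTime", "Instance_Private_IP", "InstanceState", "InstanceTags"])

for reservation in ec2_cli.describe_instances()["Reservations"]:
    for instance in reservation["Instances"]:
        writer.writerow([instance["InstanceId"], instance["InstanceType"], instance["KeyName"], instance["LaunchTime"], instance["PrivateIpAddress"], instance["State"]["Name"], instance["Tags"]])

# Upload the buffer contents directly; no file is written locally.
s3.put_object(Bucket="your-bucket-name", Key="EC2_Details.csv", Body=buf.getvalue().encode("utf-8"))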

Reading zip files from Amazon S3 using pre-signed url without knowing object key and bucket name

I have a password-protected zip file stored in Amazon S3 which I need to read from a Python program, extract the csv file from it, and read it into a dataframe. Initially, I was doing this using the object key and bucket name.
import zipfile
import boto3
import io
import pandas as pd
s3 = boto3.client('s3', aws_access_key_id="<access_key>",
                  aws_secret_access_key="<secret_key>", region_name="<region>")
s3_resource = boto3.resource('s3', aws_access_key_id="<access_key>",
                             aws_secret_access_key="<secret_key>", region_name="<region>")
obj = s3.get_object(Bucket="<bucket_name>", Key="<obj_key>")
with io.BytesIO(obj["Body"].read()) as tf:
    # rewind the file
    tf.seek(0)
    with zipfile.ZipFile(tf, mode='r') as zipf:
        df = pd.read_csv(zipf.open('<file_name.csv>', pwd=b'<password>'), sep='|')
        print(df)
But due to some security concerns, I won't be able to do this anymore. That is, I won't have the object key or the bucket name, and since I won't have the key, I won't have the file_name.csv name either. All I will have is a pre-signed URL. Is it possible to read the zip file using a pre-signed URL? How do I do that?
A pre-signed URL contains all the information you need to download the file, and you don't need boto3 for that. Instead, use regular Python tools for downloading files from the internet, with the pre-signed URL as the URL.
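A minimal sketch of that approach, assuming the requests library is available and the pre-signed URL points at the same password-protected zip; since no key or filename is known up front, the csv name is discovered from the zip listing rather than hard-coded:
import io
import zipfile
import requests
import pandas as pd

presigned_url = "<pre_signed_url>"
resp = requests.get(presigned_url)
resp.raise_for_status()

with io.BytesIO(resp.content) as tf:
    with zipfile.ZipFile(tf, mode='r') as zipf:
        # Take the first .csv member of the archive.
        csv_name = next(n for n in zipf.namelist() if n.endswith('.csv'))
        df = pd.read_csv(zipf.open(csv_name, pwd=b'<password>'), sep='|')
print(df)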

Error in joblib.load when reading file from s3

When trying to read a file from S3 with joblib.load() I get the error ValueError: embedded null byte.
The files were created by joblib and can be successfully loaded from local copies (that were made locally before uploading to s3), so the error is presumably in storage and retrieval protocols from S3.
Min code:
####Imports (AWS credentials assumed)
import boto3
from sklearn.externals import joblib
s3 = boto3.resource('s3')
bucket_str = "my-aws-bucket"
bucket_key = "some-pseudo/folder-set/my-filename.joblib"
joblib.load(s3.Bucket(bucket_str).Object(bucket_key).get()['Body'].read())
The following code reconstructs a local copy of the file in memory before feeding into joblib.load(), enabling a successful load.
from io import BytesIO
import boto3
from sklearn.externals import joblib
s3 = boto3.resource('s3')
bucket_str = "my-aws-bucket"
bucket_key = "some-pseudo/folder-set/my-filename.joblib"
with BytesIO() as data:
    s3.Bucket(bucket_str).download_fileobj(bucket_key, data)
    data.seek(0)  # move back to the beginning after writing
    df = joblib.load(data)
I assume, but am not certain, that something in how boto3 chunks files for download creates a null byte that breaks joblib, and BytesIO fixes this before letting joblib.load() see the datastream.
PS: with this method the file never touches the local disk, which is helpful in some circumstances (e.g. a node with lots of RAM but tiny disk space).
You can do it like this using the s3fs package.
import s3fs
fs = s3fs.S3FileSystem()
with fs.open('s3://my-aws-bucket/some-pseudo/folder-set/my-filename.joblib', 'rb') as f:
    df = joblib.load(f)
I guess everybody has their own preference but I really like s3fs because it makes the code look very familiar to people who haven't worked with s3 before.

Importing multiple files from Google Cloud Bucket to Datalab instance

I have a bucket set up on Google Cloud containing a few hundred json files and am trying to work with them in a datalab instance running python 3.
So, I can easily see them as objects using
gcs list --objects gs://<BUCKET_NAME>
Further, I can read in an individual file/object using
import google.datalab.storage as storage
import pandas as pd
from io import BytesIO
myBucket = storage.Bucket('<BUCKET_NAME>')
data_csv = myBucket.object('<FILE_NAME.json>')
uri = data_csv.uri
%gcs read --object $uri --variable data
df = pd.read_csv(BytesIO(data))
df.head()
(FYI, I understand that my example is reading a json as a csv, but let's ignore that; I'll cross that bridge on my own.)
What I can't figure out is how to loop through the bucket and pull all of the json files into pandas. How do I do that? Is that the way I should be thinking about this, or is there a way to call the files in the bucket from pandas directly (since they're already treated as objects)?
As an extra bit: what if a file is saved as .json but isn't actually that structure? How can I handle that?
Essentially, I guess, I'm looking for the functionality of the blob package, but using cloud buckets + Datalab.
Any help is greatly appreciated.
This can be done using Bucket.objects which returns an iterator with all matching files. Specify a prefix or leave it empty to match all files in the bucket. I did an example with two files countries1.csv and countries2.csv:
$ cat countries1.csv
id,country
1,sweden
2,spain
$ cat countries2.csv
id,country
3,italy
4,france
And used the following Datalab snippet:
import google.datalab.storage as storage
import pandas as pd
from io import BytesIO
myBucket = storage.Bucket('BUCKET_NAME')
object_list = myBucket.objects(prefix='countries')
df_list = []
for object in object_list:
    %gcs read --object $object.uri --variable data
    df_list.append(pd.read_csv(BytesIO(data)))
concatenated_df = pd.concat(df_list, ignore_index=True)
concatenated_df.head()
which will output the combined csv:
id country
0 1 sweden
1 2 spain
2 3 italy
3 4 france
Note that I combined all the csv files into a single Pandas dataframe with this approach, but you might want to load them into separate ones depending on the use case. If you want to retrieve all files in the bucket, just use this instead:
object_list = myBucket.objects()
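Outside Datalab, the same loop can be written with the plain google-cloud-storage client; a sketch assuming the bucket is named BUCKET_NAME and the files share the countries prefix:
from io import BytesIO
import pandas as pd
from google.cloud import storage

client = storage.Client()
df_list = []
# Omitting prefix would list every object in the bucket.
for blob in client.list_blobs('BUCKET_NAME', prefix='countries'):
    df_list.append(pd.read_csv(BytesIO(blob.download_as_bytes())))  # download_as_string() on older versions
concatenated_df = pd.concat(df_list, ignore_index=True)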

Using boto to delete all buckets

I've been tasked with creating a script to delete all the current S3 buckets and create some new ones. This is something that they want to do on an ongoing basis. So far I have all the preliminaries:
from __future__ import print_function
import boto
from boto.s3.key import Key
import boto.s3.connection

conn = boto.s3.connect_to_region('us-east-1',
    aws_access_key_id='my_access_key', aws_secret_access_key='my_secret_key')
ls = conn.get_all_buckets()
print(*ls,sep='\n')
This gives me a list of all the current buckets. Now if I want to remove the buckets my understanding is that they have to be emptied first, using a method something like:
db = conn.get_bucket('bucket_name')
for key in db.list():
    key.delete()
And then I could do:
conn.delete_bucket('bucket_name')
I want to set it up such that it pulls each bucket name from 'ls', but I'm not sure how to go about this. I tried this:
for i in ls:
    db = conn.get_bucket('i')
    for key in db.list():
        key.delete()
But I get an error "S3ResponseError: 400 Bad Request". I'm getting a sneaking suspicion that it's not pulling the separate elements from the list. Do I maybe have to get data frames involved? As far as I know, boto doesn't have an option to just nuke all the folders outright.
I'd recommend using boto3.
The following should do the trick, though it's untested (I don't want to delete all my buckets :))
import boto3

client = boto3.client('s3')
s3 = boto3.resource('s3')

buckets = client.list_buckets()
for bucket in buckets['Buckets']:
    s3_bucket = s3.Bucket(bucket['Name'])
    s3_bucket.objects.all().delete()  # a bucket must be empty before it can be deleted
    s3_bucket.delete()
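For completeness, the immediate bug in the original loop is the quoted 'i': conn.get_bucket('i') looks up a bucket literally named i. Sticking with legacy boto, a sketch of the fixed loop, assuming every bucket returned really should be emptied and deleted:
import boto

conn = boto.s3.connect_to_region('us-east-1',
    aws_access_key_id='my_access_key', aws_secret_access_key='my_secret_key')

for bucket in conn.get_all_buckets():  # Bucket objects, not name strings
    for key in bucket.list():          # empty the bucket first
        key.delete()
    conn.delete_bucket(bucket.name)    # then delete the bucket itself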
