How to create a file in AWS S3 using Python boto3

I want to create an empty file in AWS S3 using Python.
I'm using boto3.
Apart from the put method, is there any other way to create files in S3?

Assuming that you genuinely want a zero-byte file, you can do it as follows:
import boto3
s3 = boto3.client('s3')
s3.put_object(
    Bucket='mybucket',
    Key='myemptyfile'
)
Note the lack of a Body parameter, resulting in an empty file.

You can also use the upload_file() method, which uploads an existing local file:
s3_resource = boto3.resource('s3')
s3_resource.Bucket(bucket_name).upload_file(Filename="file_name", Key="key")
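As one possible alternative to put_object for the zero-byte case, an empty in-memory stream can be passed to upload_fileobj. This is a minimal sketch: the helper name create_empty_object and its arguments are illustrative, not part of boto3, and s3_client is assumed to be a client created with boto3.client('s3').

```python
import io


def create_empty_object(s3_client, bucket, key):
    """Create a zero-byte object by uploading an empty in-memory stream.

    `s3_client` is assumed to be a boto3 S3 client (boto3.client("s3")).
    The helper name and its arguments are illustrative only.
    """
    # upload_fileobj reads from any binary file-like object, so an empty
    # BytesIO should yield an object with no content.
    s3_client.upload_fileobj(io.BytesIO(b""), bucket, key)
```

A call would look like create_empty_object(boto3.client('s3'), 'mybucket', 'myemptyfile'), with both names being placeholders.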

Related

Reading zip files from Amazon S3 using pre-signed url without knowing object key and bucket name

I have a password protected zip file stored in Amazon S3 which I need to read from a python program, extract the csv file from it and read to a dataframe. Initially, I was doing it using the object key and bucket name.
import zipfile
import boto3
import io
import pandas as pd
s3 = boto3.client('s3', aws_access_key_id="<access_key>",
                  aws_secret_access_key="<secret_key>", region_name="<region>")
s3_resource = boto3.resource('s3', aws_access_key_id="<access_key>",
                             aws_secret_access_key="<secret_key>", region_name="<region>")
obj = s3.get_object(Bucket="<bucket_name>", Key="<obj_key>")
with io.BytesIO(obj["Body"].read()) as tf:
    # rewind the file
    tf.seek(0)
    with zipfile.ZipFile(tf, mode='r') as zipf:
        df = pd.read_csv(zipf.open('<file_name.csv>', pwd=b'<password>'), sep='|')
print(df)
However, due to security concerns, I will no longer be able to do this: I won't have the object key or the bucket name. And since I won't have the key, I won't have the name file_name.csv either. All I will have is a pre-signed URL. Is it possible to read the zip file using a pre-signed URL? How do I do that?
A pre-signed URL contains all the information required to download the file, so you don't need boto3 for this. Instead, use regular Python tools for downloading files from the internet (for example urllib.request or requests), passing your pre-signed URL as the url.
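As a rough sketch of that approach, the snippet below downloads the archive with the standard library and uses namelist() to discover the CSV member, since its name is not known in advance. The function names download and first_csv_from_zip are illustrative. Note that the standard zipfile module only decrypts the legacy ZIP encryption scheme, so an AES-encrypted archive would need a third-party library.

```python
import io
import urllib.request
import zipfile


def download(url):
    """Fetch the bytes behind a URL; a pre-signed URL needs no AWS credentials."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def first_csv_from_zip(zip_bytes, password=None):
    """Return (member_name, raw_bytes) for the first .csv member of a zip archive.

    `password` is a bytes object or None. Both function names are illustrative.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # namelist() reveals the member names, so the CSV name need not be known.
        csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
        if not csv_names:
            raise ValueError("no CSV member found in the archive")
        name = csv_names[0]
        return name, archive.read(name, pwd=password)
```

With a real URL, something like name, raw = first_csv_from_zip(download(presigned_url), password=b'<password>') followed by pd.read_csv(io.BytesIO(raw), sep='|') should produce the same dataframe as in the question.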

Download from AWS S3 bucket using boto3 - incorrect timestamp format

I'm using the boto3 library to retrieve a couple of csvs from an S3 bucket:
# Scan s3 verified folder for files
s3 = boto3.client('s3', aws_access_key_id=aws_access_key_id, aws_secret_access_key=aws_secret_access_key)
response = s3.list_objects(Bucket=self.bucket, Prefix='UK_entities/Verified_Matches/')
# Ignore first file entry in dict as is just the folder name. Returns a list of files
files = response['Contents'][1:]
# For any files in /s3/verified/ - download them to local /verified_matches/
for i in range(len(files)):
    # assumes filepath is the local target directory
    s3.download_file(self.bucket, files[i]['Key'], os.path.join(filepath, os.path.basename(files[i]['Key'])))
The file that gets downloaded has a column match_date which is just a timestamp, and has a value for example
03:44.7
which isn't correct. When I manually download the csv from the bucket, the same value is shown correctly as
2019-08-24 01:03:44.732999
Can anyone highlight what is happening here and point me in the direction of how I might specify how to handle the retrieval of timestamps?
I solved this by specifying the exact format I required before uploading to the S3 bucket. Although the manually downloaded file showed the correct format, the copy retrieved through boto3 appeared to have its format inferred somewhere along the way, so I now write the timestamp as an explicit ISO 8601 string:
from dateutil.tz import gettz
import datetime as dt
# clust_df['match_date'] = pd.to_datetime('today') --> old version
df['match_date'] = dt.datetime.now(gettz()).isoformat()
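To illustrate why an explicit ISO 8601 string is a robust choice, here is a small standard-library sketch. It uses the timestamp quoted in the question and assumes, purely for the example, that it is in UTC.

```python
from datetime import datetime, timezone

# A fixed example instant (the value quoted in the question, assumed to be UTC).
stamp = datetime(2019, 8, 24, 1, 3, 44, 732999, tzinfo=timezone.utc)

# isoformat() writes date, time, microseconds and UTC offset explicitly.
text = stamp.isoformat()
print(text)  # 2019-08-24T01:03:44.732999+00:00

# The string parses back to the identical value (Python 3.7+).
assert datetime.fromisoformat(text) == stamp
```

Because every component is spelled out in the string, a consumer of the CSV does not have to guess the format.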

How to write parquet file from pandas dataframe in S3 in python

I have a pandas dataframe that I want to write to a parquet file in S3.
I need sample code for this. I tried to google it, but I could not find a working example.
For reference, the following code works for me:
s3_url = 's3://bucket/folder/bucket.parquet.gzip'
df.to_parquet(s3_url, compression='gzip')
To use to_parquet, you need pyarrow or fastparquet installed. Also, make sure your config and credentials files in the .aws folder contain the correct information.
Edit: Additionally, s3fs is needed; see https://stackoverflow.com/a/54006942/1862909
The function below writes the output to an in-memory buffer and then uploads buffer.getvalue() to S3, without any need to save the file locally.
Also, since you're creating an S3 client, you can build its credentials from AWS keys stored locally, in an Airflow connection, or in AWS Secrets Manager.
from io import BytesIO, StringIO
def dataframe_to_s3(s3_client, input_dataframe, bucket_name, filepath, format):
    if format == 'parquet':
        out_buffer = BytesIO()
        input_dataframe.to_parquet(out_buffer, index=False)
    elif format == 'csv':
        out_buffer = StringIO()
        # CSV output must come from to_csv, not to_parquet
        input_dataframe.to_csv(out_buffer, index=False)
    else:
        raise ValueError("format must be 'parquet' or 'csv'")
    s3_client.put_object(Bucket=bucket_name, Key=filepath, Body=out_buffer.getvalue())
Here s3_client is simply a boto3 client object. Hope this helps!
courtesy- https://stackoverflow.com/a/40615630/12036254
First, ensure that you have pyarrow or fastparquet installed alongside pandas.
Then install boto3 and the AWS CLI. Use the AWS CLI to set up the config and credentials files, located in the .aws folder.
Here is a simple script using pyarrow and boto3 to create a temporary parquet file and then send it to AWS S3.
Sample code:
import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
def main():
    # dummy dataframe
    data = {0: {"data1": "value1"}}
    df = pd.DataFrame.from_dict(data, orient='index')
    write_pandas_parquet_to_s3(
        df, "bucket", "folder/test/file.parquet", ".tmp/file.parquet")
def write_pandas_parquet_to_s3(df, bucketName, keyName, fileName):
    # write the dataframe to a local parquet file
    table = pa.Table.from_pandas(df)
    pq.write_table(table, fileName)
    # upload to s3; parquet is binary, so open the file in 'rb' mode
    s3 = boto3.client("s3")
    with open(fileName, 'rb') as f:
        object_data = f.read()
    s3.put_object(Body=object_data, Bucket=bucketName, Key=keyName)
For Python 3.6+, AWS has a library called aws-data-wrangler that helps with the integration between pandas, S3, and Parquet.
To install it:
pip install awswrangler
To write your pandas dataframe as a parquet file to S3:
import awswrangler as wr
wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/key/my-file.parquet"
)

Using boto to delete all buckets

I've been tasked with creating a script to delete all the current S3 buckets and create some new ones. This is something that they want to do on an ongoing basis. So far I have all the preliminaries:
from __future__ import print_function  # must be the first statement in the file
import boto
from boto.s3.key import Key
import boto.s3.connection
conn = boto.s3.connect_to_region('us-east-1',
    aws_access_key_id='my_access_key', aws_secret_access_key='my_secret_key')
ls = conn.get_all_buckets()
print(*ls,sep='\n')
This gives me a list of all the current buckets. Now if I want to remove the buckets my understanding is that they have to be emptied first, using a method something like:
db = conn.get_bucket('bucket_name')
for key in db.list():
    key.delete()
And then I could do:
conn.delete_bucket('bucket_name')
I want to set it up such that it pulls each bucket name from 'ls', but I'm not sure how to go about this. I tried this:
for i in ls:
    db = conn.get_bucket('i')
    for key in db.list():
        key.delete()
But I get an error "S3ResponseError: 400 Bad Request". I'm getting a sneaking suspicion that it's not pulling the separate elements from the list. Do I maybe have to get data frames involved? As far as I know, boto doesn't have an option to just nuke all the folders outright.
I'd recommend using boto3.
The following should do the trick, though it's untested (I don't want to delete all my buckets :))
import boto3
client = boto3.client('s3')
s3 = boto3.resource('s3')
buckets = client.list_buckets()
for bucket in buckets['Buckets']:
    s3_bucket = s3.Bucket(bucket['Name'])
    s3_bucket.objects.all().delete()
    s3_bucket.delete()
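As for the error in the question, the 400 response most likely arises because conn.get_bucket('i') passes the literal string 'i' rather than the loop variable; in boto 2 each element of ls is already a bucket object, so i.name (or i itself) would be used instead.

For the boto3 route, the sketch below wraps the same idea in a function with a dry-run switch, and deletes object versions so that versioned buckets can also be removed. The name delete_all_buckets and the dry_run flag are illustrative; s3_resource is assumed to be boto3.resource('s3'). Like the answer above, it has not been run against a real account.

```python
def delete_all_buckets(s3_resource, dry_run=True):
    """Empty and delete every bucket visible to `s3_resource`.

    `s3_resource` is assumed to be boto3.resource("s3"). The helper name and
    the `dry_run` flag are illustrative. Returns the bucket names processed.
    """
    names = []
    for bucket in s3_resource.buckets.all():
        names.append(bucket.name)
        if dry_run:
            continue  # only report what would be deleted
        # Delete every object version and delete marker; on an unversioned
        # bucket this is expected to remove the plain objects.
        bucket.object_versions.all().delete()
        # A bucket can only be deleted once it is empty.
        bucket.delete()
    return names
```

Calling it first with dry_run=True and checking the returned names before passing dry_run=False seems a sensible precaution for a script that runs repeatedly.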

Specify AWS profile name to use when uploading Pandas dataframe to S3

I would like to upload a pandas dataframe directly to S3 by specifying an S3 URL. I have a multi-profile AWS environment, and I would like to specify the name of the profile to use for this upload.
Since it is not possible to specify the region in the S3 URL, I would like to know if there is any other way I could specify the (non-default) region in the code.
I could not find any such option in the s3fs library, which is used internally by pandas for uploading to S3.
Note that I do not want to use environment variables, or modify the default configuration in the AWS credentials files.
import pandas as pd
data = [1, 2, 3]
df = pd.DataFrame({'value': data})
# I would like to specify non-default profile to use here
s3_url = 's3://my_bucket/path/to/file.parquet'
df.to_parquet(s3_url)
Use a session:
import boto3
session = boto3.Session(profile_name='dev')
s3_client = session.client('s3')
Save the DataFrame to a parquet file:
df.to_parquet(parquet_pandas_file)
Upload the file to S3:
with open(parquet_pandas_file, 'rb') as s3_source_data:
    s3_client.upload_fileobj(s3_source_data, 'bucket_name', 'bucket_key_name')
Use the code below to set the profile name when using s3fs:
fs = s3fs.S3FileSystem(profile_name='<profile name>')
with fs.open('s3://bucketname/root1/file.csv', 'w') as f:
    df.to_csv(f)

Resources