How to write a parquet file from a pandas dataframe to S3 in Python - python-3.x

I have a pandas dataframe and I want to write it to a parquet file in S3.
I need sample code for this. I tried to google it, but could not find a working sample.

For your reference, the following code works:
s3_url = 's3://bucket/folder/bucket.parquet.gzip'
df.to_parquet(s3_url, compression='gzip')
In order to use to_parquet, you need pyarrow or fastparquet to be installed. Also, make sure you have the correct information in your config and credentials files, located in the .aws folder.
Edit: additionally, s3fs is needed; see https://stackoverflow.com/a/54006942/1862909
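If you prefer not to rely on the .aws files, newer pandas versions (1.2+) accept a storage_options dictionary that is forwarded to s3fs. This is a minimal sketch with placeholder credentials, not part of the original answer:
import pandas as pd

df = pd.DataFrame({"col": [1, 2, 3]})  # placeholder dataframe
s3_url = 's3://bucket/folder/bucket.parquet.gzip'

# storage_options is passed straight through to s3fs; the key/secret values are placeholders
df.to_parquet(s3_url, compression='gzip',
              storage_options={"key": "<access_key>", "secret": "<secret_key>"})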

The function below renders the parquet output into an in-memory buffer and then writes buffer.getvalue() to S3, without any need to save the parquet file locally.
Also, since you're creating an S3 client, you can build credentials from AWS keys that are stored locally, in an Airflow connection, or in AWS Secrets Manager.
from io import BytesIO, StringIO

def dataframe_to_s3(s3_client, input_dataframe, bucket_name, filepath, format):
    if format == 'parquet':
        out_buffer = BytesIO()
        input_dataframe.to_parquet(out_buffer, index=False)
    elif format == 'csv':
        out_buffer = StringIO()
        input_dataframe.to_csv(out_buffer, index=False)
    s3_client.put_object(Bucket=bucket_name, Key=filepath, Body=out_buffer.getvalue())
s3_client here is nothing but a boto3 client object. Hope this helps!
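For illustration, a hypothetical call to the function above might look like this (bucket and key names are placeholders):
import boto3
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
s3_client = boto3.client("s3")
dataframe_to_s3(s3_client, df, "my-bucket", "folder/data.parquet", "parquet")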
courtesy- https://stackoverflow.com/a/40615630/12036254

First ensure that you have pyarrow or fastparquet installed with pandas.
Then install boto3 and the AWS CLI. Use the AWS CLI to set up the config and credentials files located in the .aws folder.
Here is a simple script using pyarrow and boto3 to create a temporary parquet file and then send it to AWS S3.
Sample code excluding imports:
def main():
    data = {0: {"data1": "value1"}}
    df = pd.DataFrame.from_dict(data, orient='index')
    write_pandas_parquet_to_s3(
        df, "bucket", "folder/test/file.parquet", ".tmp/file.parquet")

def write_pandas_parquet_to_s3(df, bucketName, keyName, fileName):
    # convert the dataframe to an Arrow table and write it to a local parquet file
    table = pa.Table.from_pandas(df)
    pq.write_table(table, fileName)
    # upload to s3 (note the binary mode, since parquet is a binary format)
    s3 = boto3.client("s3")
    with open(fileName, "rb") as f:
        object_data = f.read()
        s3.put_object(Body=object_data, Bucket=bucketName, Key=keyName)
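If you would rather skip the temporary file on disk, pyarrow can also write the table into an in-memory buffer that is then uploaded directly. This is a sketch under the same imports (pa, pq, boto3), with placeholder bucket and key names:
import io

def write_pandas_parquet_to_s3_in_memory(df, bucketName, keyName):
    # serialize the dataframe to parquet bytes in memory
    table = pa.Table.from_pandas(df)
    buffer = io.BytesIO()
    pq.write_table(table, buffer)
    buffer.seek(0)
    # stream the buffer straight to S3
    s3 = boto3.client("s3")
    s3.upload_fileobj(buffer, bucketName, keyName)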

For Python 3.6+, AWS has a library called AWS Data Wrangler (awswrangler) that helps with the integration between pandas, S3 and Parquet.
To install it, run:
pip install awswrangler
If you want to write your pandas dataframe as a parquet file to S3, do:
import awswrangler as wr
wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/key/my-file.parquet"
)
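Reading the file back is symmetric; a minimal sketch (the path is a placeholder):
import awswrangler as wr

# returns a pandas dataframe loaded from the parquet data at the given S3 path
df = wr.s3.read_parquet(path="s3://my-bucket/key/my-file.parquet")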

Related

How can I write Python script result to S3 instead of local drive?

I'm working on the script below to pull details about AWS EC2 instances.
What it does now is pull the information I'm after and create a CSV file in the same path as the script. I'm working on expanding this into a Lambda function that pulls the information from all AWS accounts and, instead of creating the CSV on the local machine, pushes the data (in CSV format) to an S3 location.
The challenge ahead, which I need help with, is how to modify the script to write to an S3 bucket instead of the local drive.
import boto3
import csv
from pprint import pprint

session = boto3.Session(profile_name='#ProfileName')
ec2_re=session.resource(service_name="ec2",region_name="ap-southeast-2")
ec2_cli=session.client(service_name="ec2")
fo=open('EC2_Details.csv','w',newline='')
data_obj=csv.writer(fo)
data_obj.writerow(["InstanceID","InstanceType","InstanceName","InstanceLunchTime","Instance_Private_IP","InstanceState","InstanceTags"])
cnt=1
response = ec2_cli.describe_instances()
for reservation in response["Reservations"]:
    for instance in reservation["Instances"]:
        data_obj.writerow([instance["InstanceId"], instance["InstanceType"], instance["KeyName"], instance["LaunchTime"], instance["PrivateIpAddress"], instance["State"]["Name"], instance["Tags"]])
        cnt+=1
fo.close()
Thank you
The trivial way is to upload the file you created on disk. There are more advanced approaches that avoid writing to disk; you can find them on SO (one in-memory sketch follows the code below).
import boto3
import csv
from pprint import pprint

# Creating a session with Boto3 using a named profile.
session = boto3.Session(profile_name='#ProfileName')
ec2_re=session.resource(service_name="ec2",region_name="ap-southeast-2")
ec2_cli=session.client(service_name="ec2")
fo=open('EC2_Details.csv','w',newline='')
data_obj=csv.writer(fo)
data_obj.writerow(["InstanceID","InstanceType","InstanceName","InstanceLunchTime","Instance_Private_IP","InstanceState","InstanceTags"])
cnt=1
response = ec2_cli.describe_instances()
for reservation in response["Reservations"]:
    for instance in reservation["Instances"]:
        data_obj.writerow([instance["InstanceId"], instance["InstanceType"], instance["KeyName"], instance["LaunchTime"], instance["PrivateIpAddress"], instance["State"]["Name"], instance["Tags"]])
        cnt+=1
fo.close()
# Creating an S3 resource from the session and uploading the CSV (upload_file needs both a filename and a key).
s3 = session.resource('s3')
bucket = s3.Bucket("your-bucket-name")
bucket.upload_file(Filename='EC2_Details.csv', Key='EC2_Details.csv')
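Since the question mentions moving this into a Lambda function, where writing to the local filesystem is best avoided, one alternative (a sketch, not the answer above) is to build the CSV in an in-memory buffer and push it with put_object; the bucket name, key and rows below are placeholders:
import io
import csv
import boto3

# inside Lambda, credentials normally come from the execution role, so no profile is needed
s3_client = boto3.client('s3')

# write the CSV rows into an in-memory text buffer instead of a local file
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["InstanceID", "InstanceType"])
writer.writerow(["i-0123456789abcdef0", "t3.micro"])  # example row only

# upload the buffer contents directly, no local file involved
s3_client.put_object(Bucket="your-bucket-name", Key="EC2_Details.csv", Body=buffer.getvalue())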

Reading zip files from Amazon S3 using pre-signed url without knowing object key and bucket name

I have a password-protected zip file stored in Amazon S3 which I need to read from a Python program, extract the CSV file from it, and read it into a dataframe. Initially, I was doing it using the object key and bucket name.
import zipfile
import boto3
import io
import pandas as pd
s3 = boto3.client('s3', aws_access_key_id="<acces_key>",
                  aws_secret_access_key="<secret_key>", region_name="<region>")
s3_resource = boto3.resource('s3', aws_access_key_id="<acces_key>",
                             aws_secret_access_key="<secret_key>", region_name="<region>")
obj = s3.get_object(Bucket="<bucket_name>", Key="<obj_key>")
with io.BytesIO(obj["Body"].read()) as tf:
    # rewind the file
    tf.seek(0)
    with zipfile.ZipFile(tf, mode='r') as zipf:
        df = pd.read_csv(zipf.open('<file_name.csv>', pwd=b'<password>'), sep='|')
        print(df)
But due to some security concerns, I won't be able to do this anymore. That is, I won't have the object key or bucket name, and since I won't have the key, I won't have file_name.csv either. All I will have is a pre-signed URL. Is it possible to read the zip file using a pre-signed URL? How do I do that?
A pre-signed URL contains all the information you need to download the file, and you don't need boto3 for that. Instead, use regular Python tools to download a file from the internet, where the URL is your pre-signed URL.
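For example, a minimal sketch using the requests library; the URL and password placeholders and the separator mirror the question, and namelist() is used because the inner file name is unknown:
import io
import zipfile
import requests
import pandas as pd

presigned_url = "<your pre-signed URL>"  # placeholder

# plain HTTP download; no boto3 or AWS credentials are needed with a pre-signed URL
resp = requests.get(presigned_url)
resp.raise_for_status()

with zipfile.ZipFile(io.BytesIO(resp.content), mode='r') as zipf:
    csv_name = zipf.namelist()[0]  # discover the CSV name inside the archive
    df = pd.read_csv(zipf.open(csv_name, pwd=b'<password>'), sep='|')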

How to create a file in AWS S3 using python boto3

I want to create an empty file in AWS S3 using Python. I'm using boto3.
Apart from the put method, is there any other way to create files in S3?
Assuming that you genuinely want a zero-byte file, you can do it as follows:
import boto3
s3 = boto3.client('s3')
s3.put_object(
    Bucket='mybucket',
    Key='myemptyfile'
)
Note the lack of a Body parameter, resulting in an empty file.
You can use the upload_file() method:
s3_resource.Bucket(bucket_name).upload_file(Filename="file_name", Key="key")
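Another option, shown here only as a sketch, is the resource-level put with an empty body, which likewise creates a zero-byte object (bucket and key are placeholders):
import boto3

s3_resource = boto3.resource('s3')
# an empty Body yields a zero-byte object, equivalent to the client call above
s3_resource.Object('mybucket', 'myemptyfile').put(Body=b'')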

Download from AWS S3 bucket using boto3 - incorrect timestamp format

I'm using the boto3 library to retrieve a couple of csvs from an S3 bucket:
# Scan s3 verified folder for files
s3 = boto3.client('s3', aws_access_key_id=aws_access_key_id, aws_secret_access_key=aws_secret_access_key)
response = s3.list_objects(Bucket=self.bucket, Prefix='UK_entities/Verified_Matches/')
# Ignore first file entry in dict as is just the folder name. Returns a list of files
files = response['Contents'][1:]
# For any files in /s3/verified/ - download them to local /verified_matches/
for i in range(len(files)):
    s3.download_file(self.bucket, files[i]['Key'], os.path.join(filepath, os.path.basename(files[i]['Key'])))
The file that gets downloaded has a column match_date which is just a timestamp, and has a value for example
03:44.7
which isn't correct. When I manually download the csv from the bucket, the same value is shown correctly as
2019-08-24 01:03:44.732999
Can anyone explain what is happening here and point me in the direction of how I might control the way timestamps are handled on retrieval?
I solved this by specifying the exact format I required prior to uploading to the S3 bucket. Even though the format was correct when I downloaded the file from S3 manually, the value came out differently when retrieved through boto3, so pinning the format at write time avoided the issue.
from dateutil.tz import gettz
import datetime as dt
# clust_df['match_date'] = pd.to_datetime('today') --> old version
df['match_date'] = dt.datetime.now(gettz()).isoformat()
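An equivalent way to pin the format on an existing datetime column, rather than at creation time, is to serialize it to an explicit string before writing the CSV; a sketch assuming the column already holds datetimes:
import pandas as pd

# render the datetimes as fixed-format strings so the CSV keeps the full precision
df['match_date'] = pd.to_datetime(df['match_date']).dt.strftime('%Y-%m-%d %H:%M:%S.%f')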

Specify AWS profile name to use when uploading Pandas dataframe to S3

I would like to upload a pandas dataframe directly to S3 by specifying an s3 URL. I have a multi-profile AWS environment, and I would like to specify the name of the profile to use for this upload.
Since it is not possible to specify the profile in the s3 URL, I would like to know if there is any other way I could specify the (non-default) profile in the code.
I could not find any such option in the s3fs library, which pandas uses internally for uploading to S3.
Note that I do not want to use environment variables, or modify the default configuration in the AWS credentials files.
import pandas as pd

data = [1, 2, 3]
df = pd.DataFrame(data)
# I would like to specify a non-default profile to use here
s3_url = 's3://my_bucket/path/to/file.parquet'
df.to_parquet(s3_url)
Use a session:
session = boto3.Session(profile_name='dev')
s3_client = session.client('s3')
Save the dataframe to a local parquet file:
df.to_parquet(parquet_pandas_file)  # parquet_pandas_file is a placeholder path
Then upload the file to S3:
with open(parquet_pandas_file, 'rb') as s3_source_data:
    s3_client.upload_fileobj(s3_source_data, 'bucket_name', 'bucket_key_name')
Use the code below to set the profile name when using s3fs directly:
fs = s3fs.S3FileSystem(profile_name='<profile name>')
with fs.open('s3://bucketname/root1/file.csv', 'w') as f:
    df.to_csv(f)
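Note that newer s3fs releases expose this parameter as profile rather than profile_name, so check which name your installed version accepts. For the parquet case in the question, recent pandas versions (1.2+) also let you pass the same option through storage_options without constructing the filesystem yourself; a sketch with a placeholder profile name:
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
s3_url = 's3://my_bucket/path/to/file.parquet'

# storage_options is handed to s3fs; swap "profile" for "profile_name" if your s3fs version requires it
df.to_parquet(s3_url, storage_options={"profile": "dev"})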
