Download from AWS S3 bucket using boto3 - incorrect timestamp format - python-3.x

I'm using the boto3 library to retrieve a couple of csvs from an S3 bucket:
# Scan s3 verified folder for files
s3 = boto3.client('s3', aws_access_key_id=aws_access_key_id, aws_secret_access_key=aws_secret_access_key)
response = s3.list_objects(Bucket=self.bucket, Prefix='UK_entities/Verified_Matches/')
# Ignore first file entry in dict as is just the folder name. Returns a list of files
files = response['Contents'][1:]
# For any files in /s3/verified/ - download them to local /verified_matches/
for obj in files:
    # Save each object under its base name inside the local target folder
    s3.download_file(self.bucket, obj['Key'], os.path.join(filepath, os.path.basename(obj['Key'])))
The downloaded file has a column match_date, which is just a timestamp, and it shows a value such as
03:44.7
which isn't correct. When I manually download the CSV from the bucket, the same value is shown correctly as
2019-08-24 01:03:44.732999
Can anyone explain what is happening here and point me towards how to control the way timestamps are handled on retrieval?

I solved this by writing the timestamp in the exact format I required before uploading to the S3 bucket. Although the manually downloaded file showed the correct format, the copy retrieved through boto3 did not, so it appeared that the format was being inferred somewhere along the way. Storing an explicit ISO 8601 string avoids that:
from dateutil.tz import gettz
import datetime as dt
# clust_df['match_date'] = pd.to_datetime('today') --> old version
df['match_date'] = dt.datetime.now(gettz()).isoformat()
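A minimal, standard-library sketch of the same idea: store the timestamp as an explicit ISO 8601 string so the full value survives a round trip through a CSV file. It uses `datetime.timezone.utc` instead of `dateutil` (an assumption made only to keep the example dependency-free), and the helper name `timestamp_row` is hypothetical; the column name `match_date` comes from the question.

```python
import csv
import datetime as dt
import io


def timestamp_row(now=None):
    """Return a CSV row dict whose match_date is an ISO 8601 string."""
    now = now or dt.datetime.now(dt.timezone.utc)
    return {"match_date": now.isoformat()}


# Write one row to an in-memory CSV, then read it back unchanged.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["match_date"])
writer.writeheader()
fixed = dt.datetime(2019, 8, 24, 1, 3, 44, 732999, tzinfo=dt.timezone.utc)
writer.writerow(timestamp_row(fixed))

buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows[0]["match_date"])  # 2019-08-24T01:03:44.732999+00:00
```

Because the value is stored as plain text, every reader sees the same characters; how a spreadsheet chooses to display them is a separate matter.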

Related

Python Get MIME of s3 object on Lambda

I have a Lambda that triggers on S3 PutObject. Before proceeding, the Lambda needs to check whether the file is actually a video file (MP4 in my case). The file extension is not helpful because it can be faked, so I have tried checking the MIME type with the filetype library, which works on my local machine.
I don't want to download large files from S3, just a small portion saved locally to check whether it is an MP4.
So far I have tried this (on my local machine):
import boto3
import filetype
from time import sleep
REGION = 'ap-southeast-1'
tmp_path = "path/src/my_file.mp4"
start_byte = 0
end_byte = 9000
s3 = boto3.client('s3', region_name=REGION)
resp = s3.get_object(
    Bucket="test",
    Key="MVI_1494.MP4",
    Range='bytes={}-{}'.format(start_byte, end_byte)
)
# the file
object_content = resp['Body'].read()
print(type(object_content))
with open(tmp_path, "wb") as binary_file:
    # Write bytes to file
    binary_file.write(object_content)
sleep(5)
kind = filetype.guess_mime(tmp_path)
print(kind)
But this always returns None as the MIME type. I think I am not saving the binary file properly; any help would really save my day.
TL;DR: download a small portion of a large file from S3 -> save it in temporary storage -> get the MIME type.
Boto3 has a function S3.Client.head_object:
The HEAD action retrieves metadata from an object without returning
the object itself. This action is useful if you're only interested in
an object's metadata. To use HEAD, you must have READ access to the
object.
You can call this method to get the metadata associated with an object in an S3 bucket.
metadata = s3client.head_object(Bucket='MyBucketName', Key='MyS3ItemKey')
This metadata includes a ContentType property, you can use this property to check the object type.
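As a rough sketch of that lookup, the helper below reads `ContentType` from the `head_object` response. The function name `get_content_type` and the `FakeS3Client` class are hypothetical; the fake only stands in for `boto3.client('s3')` so the snippet runs without AWS access.

```python
def get_content_type(s3_client, bucket, key):
    """Return the ContentType that S3 reports for an object, or None if absent."""
    # head_object fetches metadata only; the object body is not transferred.
    response = s3_client.head_object(Bucket=bucket, Key=key)
    return response.get("ContentType")


class FakeS3Client:
    """Hypothetical stand-in for boto3.client('s3'), used only for this demo."""

    def head_object(self, Bucket, Key):
        return {"ContentType": "video/mp4", "ContentLength": 1024}


print(get_content_type(FakeS3Client(), "MyBucketName", "MyS3ItemKey"))  # video/mp4
```

With a real client, pass `boto3.client('s3')` as the first argument. The value is whatever the uploader declared, so it describes the claimed type rather than the verified content.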
OR
If you can't trust this ContentType because it can be faked, you can save the object's MIME type in DynamoDB when uploading it and read the type from there whenever you need it.
OR
You can create a Lambda that is triggered on upload and download the object inside it, since Lambda has around 512 MB of ephemeral storage. You can determine the content type there and update it, because you can set metadata when you upload the object and edit it later as your needs change.
You don't need to save the file to disk for the filetype library.
The guess_mime function also accepts the bytes datatype, so you can pass it the bytes read from the response body directly.
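To illustrate working on the bytes directly, here is a standard-library sketch that checks the first few bytes for the `ftyp` marker used by MP4 and related ISO base media files. This is a hand-rolled heuristic, not the filetype library's own logic, and the names `looks_like_mp4` and `fetch_header` are hypothetical. Other formats in the same family (MOV, M4A, HEIC) carry the same marker, so treat a match as "probably a media container" rather than proof of MP4.

```python
def looks_like_mp4(header: bytes) -> bool:
    """Heuristic: ISO base media files carry b'ftyp' at byte offsets 4-8."""
    return len(header) >= 8 and header[4:8] == b"ftyp"


def fetch_header(s3_client, bucket, key, length=32):
    """Read only the first `length` bytes of an object via a Range request."""
    response = s3_client.get_object(
        Bucket=bucket, Key=key, Range=f"bytes=0-{length - 1}"
    )
    return response["Body"].read()


# Demo on hand-built headers, so no AWS access is needed.
sample_mp4 = b"\x00\x00\x00\x18ftypmp42\x00\x00\x00\x00mp42isom"
print(looks_like_mp4(sample_mp4))    # True
print(looks_like_mp4(b"%PDF-1.7\n"))  # False
```

In a Lambda, `looks_like_mp4(fetch_header(s3, bucket, key))` would inspect the object without writing anything to temporary storage.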

Reading zip files from Amazon S3 using pre-signed url without knowing object key and bucket name

I have a password protected zip file stored in Amazon S3 which I need to read from a python program, extract the csv file from it and read to a dataframe. Initially, I was doing it using the object key and bucket name.
import zipfile
import boto3
import io
import pandas as pd
s3 = boto3.client('s3', aws_access_key_id="<access_key>",
                  aws_secret_access_key="<secret_key>", region_name="<region>")
s3_resource = boto3.resource('s3', aws_access_key_id="<access_key>",
                             aws_secret_access_key="<secret_key>", region_name="<region>")
obj = s3.get_object(Bucket="<bucket_name>", Key="<obj_key>")
with io.BytesIO(obj["Body"].read()) as tf:
    # rewind the file
    tf.seek(0)
    with zipfile.ZipFile(tf, mode='r') as zipf:
        df = pd.read_csv(zipf.open('<file_name.csv>', pwd=b'<password>'), sep='|')
        print(df)
But due to some security concerns, I won't be able to do this anymore. That is, I won't have the object key or the bucket name, and without the key I won't have the file_name.csv either. All I will have is a pre-signed URL. Is it possible to read the zip file using a pre-signed URL? How do I do that?
A pre-signed URL contains all the information required to download the file, so you don't need boto3 for this. Instead, use regular Python tools for downloading files from the internet (such as urllib or requests), with your pre-signed URL as the url.
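A sketch of that approach using only the standard library. The function names `download_bytes` and `read_first_csv` are hypothetical. Since the CSV name is not known in advance, the archive's member list is inspected instead. Note that the standard `zipfile` module decrypts only the legacy zip encryption, so an AES-encrypted archive would need a different library.

```python
import io
import urllib.request
import zipfile


def download_bytes(url):
    """Fetch a URL with a plain HTTP GET; a pre-signed URL needs no AWS credentials."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def read_first_csv(zip_bytes, password=None):
    """Return (member_name, raw_bytes) for the first .csv entry in a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # namelist() reveals the member names, so the CSV name need not be known.
        for name in archive.namelist():
            if name.lower().endswith(".csv"):
                return name, archive.read(name, pwd=password)
    raise ValueError("no .csv member found in archive")


# Demo on an in-memory, unencrypted archive instead of a real download.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("report.csv", "a|b\n1|2\n")
name, raw = read_first_csv(demo.getvalue())
print(name, raw)
```

With a real URL this becomes `name, raw = read_first_csv(download_bytes(presigned_url), password=b'<password>')`, followed by `pd.read_csv(io.BytesIO(raw), sep='|')` to get the dataframe.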

Upload images to cloud and then paste the respective link to a respective dataframe

I have PDFs containing tables and image diagrams related to the content of those tables.
Each table and its image are on a single page.
I've extracted the tables using the Camelot library and the images using the Fitz library, both in Python.
Now I want to upload those images (.png) to any suitable cloud service and put the web link of each image into the dataframe of its table.
Please help.
This is how a single page of the PDF looks.
If a public cloud is an option, you can store the images in S3 using boto3 (a Python library).
Sample code to store an image in an AWS S3 bucket:
import boto3
s3 = boto3.client('s3')
bucket = 'your-bucket-name'
file_name = 'location-of-your-image'
key_name = 'name-of-image-in-s3'
s3.upload_file(file_name, bucket, key_name)
To obtain the uploaded file's URL, you can construct it from the bucket name, the bucket's region (region below) and the object key:
s3_url = f"https://{bucket}.s3.{region}.amazonaws.com/{key_name}"
Then store s3_url in the dataframe.
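A small sketch of building those links for several images. The helper name `s3_object_url`, the region `eu-west-1` and the example keys are assumptions for illustration. The key is percent-encoded so that spaces and similar characters yield a valid URL.

```python
from urllib.parse import quote


def s3_object_url(bucket, region, key):
    """Build the virtual-hosted-style URL for an object key."""
    # quote() keeps "/" intact and percent-encodes characters such as spaces.
    return f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key)}"


image_keys = ["images/page 1.png", "images/page_2.png"]
urls = [s3_object_url("your-bucket-name", "eu-west-1", k) for k in image_keys]
print(urls)
```

The list can then be assigned as a column, for example `df['image_url'] = urls`. Such a link opens in a browser only if the object is publicly readable.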

how to create file in aws S3 using python boto3

I want to create an empty file in AWS s3 using python.
I'm using boto3 and python.
I want to know apart from the put method is there any way to create files in s3?
Assuming that you genuinely want a zero-byte file, you can do it as follows:
import boto3
s3 = boto3.client('s3')
s3.put_object(
    Bucket='mybucket',
    Key='myemptyfile'
)
Note the lack of a Body parameter, resulting in an empty file.
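You can also use the upload_file() method of a bucket resource, which uploads a local file (an empty one in this case):
s3_resource = boto3.resource('s3')
s3_resource.Bucket(bucket_name).upload_file(Filename="file_name", Key="key")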
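Another option that avoids both put_object and a local file is `upload_fileobj` with an empty in-memory stream. The sketch below takes the client as a parameter; `create_empty_object` and `RecordingClient` are hypothetical names, and the recording class only stands in for `boto3.client('s3')` so the snippet runs without AWS access.

```python
import io


def create_empty_object(s3_client, bucket, key):
    """Upload a zero-length in-memory stream, so no local file is needed."""
    s3_client.upload_fileobj(io.BytesIO(b""), bucket, key)


class RecordingClient:
    """Hypothetical stand-in for boto3.client('s3') that records uploads."""

    def __init__(self):
        self.stored = {}

    def upload_fileobj(self, Fileobj, Bucket, Key):
        self.stored[(Bucket, Key)] = Fileobj.read()


client = RecordingClient()
create_empty_object(client, "mybucket", "myemptyfile")
print(client.stored)  # {('mybucket', 'myemptyfile'): b''}
```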

How to write parquet file from pandas dataframe in S3 in python

I have a pandas dataframe that I want to write to a parquet file in S3.
I need sample code for this. I tried searching for it, but could not find a working example.
For reference, the following code works for me:
s3_url = 's3://bucket/folder/bucket.parquet.gzip'
df.to_parquet(s3_url, compression='gzip')
To use to_parquet, you need pyarrow or fastparquet installed. Also make sure the config and credentials files in your .aws folder contain the correct information.
Edit: Additionally, s3fs is needed. see https://stackoverflow.com/a/54006942/1862909
The function below writes the output to an in-memory buffer and then uploads buffer.getvalue() to S3, without any need to save the file locally.
Also, since you're creating an S3 client, you can build its credentials from AWS keys stored locally, in an Airflow connection, or in AWS Secrets Manager.
from io import BytesIO, StringIO
def dataframe_to_s3(s3_client, input_dataframe, bucket_name, filepath, format):
    if format == 'parquet':
        out_buffer = BytesIO()
        input_dataframe.to_parquet(out_buffer, index=False)
    elif format == 'csv':
        # CSV is text, so it is written with to_csv into a StringIO buffer
        out_buffer = StringIO()
        input_dataframe.to_csv(out_buffer, index=False)
    else:
        raise ValueError("format must be 'parquet' or 'csv'")
    s3_client.put_object(Bucket=bucket_name, Key=filepath, Body=out_buffer.getvalue())
s3_client is simply a boto3 client object. Hope this helps!
Courtesy: https://stackoverflow.com/a/40615630/12036254
First ensure that pyarrow or fastparquet is installed alongside pandas.
Then install boto3 and the AWS CLI. Use the AWS CLI to set up the config and credentials files in the .aws folder.
Here is a simple script that uses pyarrow and boto3 to create a temporary parquet file and then send it to AWS S3.
Sample code:
import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
def main():
    # dummy dataframe
    data = {0: {"data1": "value1"}}
    df = pd.DataFrame.from_dict(data, orient='index')
    write_pandas_parquet_to_s3(
        df, "bucket", "folder/test/file.parquet", ".tmp/file.parquet")
def write_pandas_parquet_to_s3(df, bucketName, keyName, fileName):
    # write the dataframe to a temporary local parquet file
    table = pa.Table.from_pandas(df)
    pq.write_table(table, fileName)
    # upload to s3; parquet is binary, so the file is opened in "rb" mode
    s3 = boto3.client("s3")
    with open(fileName, "rb") as f:
        object_data = f.read()
    s3.put_object(Body=object_data, Bucket=bucketName, Key=keyName)
For python 3.6+, AWS has a library called aws-data-wrangler that helps with the integration between Pandas/S3/Parquet
To install it, run:
pip install awswrangler
To write your pandas dataframe as a parquet file to S3:
import awswrangler as wr
wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/key/my-file.parquet"
)
