Python Get MIME of s3 object on Lambda - python-3.x

I have a Lambda that triggers on s3 PutObject. Before proceeding, the Lambda needs to check whether the file is actually a video file (mp4 in my case). The file extension is not helpful because it can be faked. So I have tried checking the MIME type using the filetype library, which works on my local machine.
I don't want to download large files from s3 in full, just some portion, and save it locally to check whether it's mp4 or not.
So far I have tried this (on my local machine):
import boto3
import filetype
from time import sleep

REGION = 'ap-southeast-1'
tmp_path = "path/src/my_file.mp4"
start_byte = 0
end_byte = 9000

s3 = boto3.client('s3', region_name=REGION)
resp = s3.get_object(
    Bucket="test",
    Key="MVI_1494.MP4",
    Range='bytes={}-{}'.format(start_byte, end_byte)
)

# the file
object_content = resp['Body'].read()
print(type(object_content))

with open(tmp_path, "wb") as binary_file:
    # Write bytes to file
    binary_file.write(object_content)

sleep(5)
kind = filetype.guess_mime(tmp_path)
print(kind)
But this always returns None as the MIME type. I think I am not saving the binary file properly; any help would really save my day.
TL;DR: Download a small portion of a large file from s3 -> save it in tmp storage -> get the MIME type.

Boto3 has a function S3.Client.head_object:
The HEAD action retrieves metadata from an object without returning the object itself. This action is useful if you're only interested in an object's metadata. To use HEAD, you must have READ access to the object.
You can call this method to get the metadata associated with an S3 bucket item.
metadata = s3client.head_object(Bucket='MyBucketName', Key='MyS3ItemKey')
This metadata includes a ContentType property, which you can use to check the object type.
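For example, a minimal check along these lines (reusing the placeholder bucket and key names above):
import boto3

s3client = boto3.client('s3')
metadata = s3client.head_object(Bucket='MyBucketName', Key='MyS3ItemKey')
if metadata.get('ContentType') == 'video/mp4':
    print('S3 reports this object as MP4')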
OR
If you can't trust this ContentType because it can be faked, you can simply save the object's MIME type in DynamoDB while uploading it and read the type from there whenever you want.
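A minimal sketch of that idea, assuming a hypothetical DynamoDB table named 'object-mime-types' keyed by the object key:
import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('object-mime-types')  # hypothetical table name

# At upload time, record the detected MIME type alongside the object key.
table.put_item(Item={'object_key': 'MVI_1494.MP4', 'mime_type': 'video/mp4'})

# Later, look it up instead of trusting the S3 ContentType.
item = table.get_item(Key={'object_key': 'MVI_1494.MP4'}).get('Item')
print(item['mime_type'] if item else 'unknown')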
OR
You can simply create a Lambda that gets triggered on upload and download the object there, since the Lambda has around 512MB of ephemeral storage. You can determine the content type in the Lambda and update it; you can also set metadata when you upload the object and edit it later as your needs change.
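A rough sketch of that flow, assuming the filetype library is packaged with the function; one way to update the stored ContentType is to copy the object onto itself with replaced metadata (bucket and key come from the trigger event):
import boto3
import filetype

s3_client = boto3.client('s3')

def lambda_handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']

        # Download into the Lambda's ephemeral storage and sniff the real type.
        local_path = '/tmp/object'
        s3_client.download_file(bucket, key, local_path)
        mime = filetype.guess_mime(local_path) or 'application/octet-stream'

        # Copy the object onto itself to replace its stored ContentType.
        s3_client.copy_object(
            Bucket=bucket,
            Key=key,
            CopySource={'Bucket': bucket, 'Key': key},
            ContentType=mime,
            MetadataDirective='REPLACE',
        )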

You don't need to save the file to disk for the filetype library.
The guess_mime function accepts bytes as well.
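A minimal sketch combining this with the ranged get_object from the question (same bucket, key, and byte range as above):
import boto3
import filetype

s3 = boto3.client('s3', region_name='ap-southeast-1')
resp = s3.get_object(
    Bucket="test",
    Key="MVI_1494.MP4",
    Range='bytes=0-9000',
)
object_content = resp['Body'].read()

# filetype inspects the magic bytes directly, no temp file needed.
print(filetype.guess_mime(object_content))  # e.g. 'video/mp4', or None if unknown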

Related

compress .txt file on s3 location to .gz file

I need to compress a .txt file to .gz which is in an S3 location and then upload it to a different S3 bucket. I have written the following code but it's not working as expected:
def upload_gzipped(bucket, key, fp, compressed_fp=None, content_type='text/plain'):
    with gzip.GzipFile(fileobj=compressed_fp, mode='wb') as gz:
        shutil.copyfileobj(fp, gz)
    compressed_fp.seek(0)
    print(compressed_fp)
    bucket.upload_fileobj(
        compressed_fp,
        key,
        {'ContentType': content_type, 'ContentEncoding': 'gzip'})

source_bucket = event['Records'][0]['s3']['bucket']['name']
file_key_name = event['Records'][0]['s3']['object']['key']
response = s3.get_object(Bucket=source_bucket, Key=file_key_name)
original = BytesIO(response['Body'].read())
original.seek(0)

upload_gzipped(source_bucket, file_key_name, original)
Can someone please help here, or suggest any other approach to gzip the file in the S3 location?
It would appear that you are writing an AWS Lambda function.
A simpler program flow would probably be (a minimal sketch follows the steps below):
1. Download the file to /tmp/ using s3_client.download_file()
2. Gzip the file
3. Upload the file to S3 using s3_client.upload_file()
4. Delete the files in /tmp/
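A minimal sketch of that flow, assuming a hypothetical destination bucket name; the source bucket and key come from the trigger event:
import gzip
import os
import shutil

import boto3

s3_client = boto3.client('s3')
DESTINATION_BUCKET = 'my-destination-bucket'  # hypothetical name

def lambda_handler(event, context):
    for record in event['Records']:
        source_bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']

        local_path = '/tmp/' + os.path.basename(key)
        gzipped_path = local_path + '.gz'

        # 1. Download the object to /tmp/
        s3_client.download_file(source_bucket, key, local_path)

        # 2. Gzip the file
        with open(local_path, 'rb') as f_in, gzip.open(gzipped_path, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)

        # 3. Upload the gzipped file
        s3_client.upload_file(
            gzipped_path, DESTINATION_BUCKET, key + '.gz',
            ExtraArgs={'ContentType': 'text/plain', 'ContentEncoding': 'gzip'})

        # 4. Clean up /tmp/
        os.remove(local_path)
        os.remove(gzipped_path)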
Also, please note that the AWS Lambda function might be invoked with multiple objects being passed via the event. However, your code is currently only processing the first record with event['Records'][0]. The program should loop through these records like this:
for record in event['Records']:
    source_bucket = record['s3']['bucket']['name']
    file_key_name = record['s3']['object']['key']
    ...
Instead of writing the file into your /tmp folder, it might be better to read it into a buffer, since the /tmp folder has limited space.
buffer = BytesIO(file.get()["Body"].read())
For gzipping you can simply use something like this:
gzipped_content = gzip.compress(f_in.read())
destinationbucket.upload_fileobj(io.BytesIO(gzipped_content),
                                 final_file_path,
                                 ExtraArgs={"ContentType": "text/plain"})
There's a similar tutorial here for a Lambda function: https://medium.com/p/f7bccf0099c9

Uploading a file from memory to S3 with Boto3

This question has been asked many times, but my case is ever so slightly different. I'm trying to create a lambda that makes an .html file and uploads it to S3. It works when the file has been created on disk; then I can upload it like so:
boto3.client('s3').upload_file('index.html', bucket_name, 'folder/index.html')
So now I have to create the file in memory. For this I first tried StringIO(), but then .upload_file throws an error:
boto3.client('s3').upload_file(temp_file, bucket_name, 'folder/index.html')
ValueError: Filename must be a string.
So I tried using .upload_fileobj(), but then I get the error TypeError: a bytes-like object is required, not 'str'.
So I tried using BytesIO(), which wants me to convert the str to bytes first, so I did:
temp_file = BytesIO()
temp_file.write(index_top.encode('utf-8'))
print(temp_file.getvalue())
boto3.client('s3').upload_file(temp_file, bucket_name, 'folder/index.html')
But now it just uploads an empty file, despite the .getvalue() clearly showing that it does have content in there.
What am I doing wrong?
If you wish to create an object in Amazon S3 from memory, use put_object():
import boto3
s3_client = boto3.client('s3')
html = "<h2>Hello World</h2>"
s3_client.put_object(Body=html, Bucket='my-bucket', Key='foo.html', ContentType='text/html')
But now it just uploads an empty file, despite the .getvalue() clearly showing that it does have content in there.
When you finish writing to a file buffer, the position stays at the end. When you upload a buffer, it starts from the position it is currently in. Since you're at the end, you get no data. To fix this, you just need to add a seek(0) to reset the buffer back to the beginning after you finish writing to it. Your code would look like this:
temp_file = BytesIO()
temp_file.write(index_top.encode('utf-8'))
temp_file.seek(0)
print(temp_file.getvalue())
# upload_fileobj() accepts a file-like object; upload_file() expects a filename string
boto3.client('s3').upload_fileobj(temp_file, bucket_name, 'folder/index.html')

How do we send a file (accepted as part of a multipart request) to MinIO object storage in Python without saving the file to local storage?

I am trying to write an API in Python (Falcon) to accept a file from a multipart-form parameter and put the file in MinIO object storage. The problem is that I want to send the file to MinIO without saving it in any temp location.
The MinIO Python client has a function we can use to send the file:
`put_object(bucket_name, object_name, data, length)`
where data is the file data and length is the total length of the object.
For more explanation: https://docs.min.io/docs/python-client-api-reference.html#put_object
I am facing a problem producing the values for the "data" and "length" arguments of the put_object function.
The type of the file accepted in the API class is falcon_multipart.parser.Parser, which cannot be sent to MinIO.
I can make it work if I write the file to a temp location, then read it from there and send it.
Can anyone help me find a solution to this?
I tried reading the file data from the Parser object and converting it to io.BytesIO, but it did not work.
def on_post(self, req, resp):
    file = req.get_param('file')
    file_data = file.file.read()
    file_data = io.BytesIO(file_data)
    bucket_name = req.get_param('bucket_name')
    self.upload_file_to_minio(bucket_name, file, file_data)

def upload_file_to_minio(self, bucket_name, file, file_data):
    minioClient = Minio("localhost:9000", access_key='minio', secret_key='minio', secure=False)
    try:
        file_stat = sys.getsizeof(file_data)
        # file_stat = file_data.getbuffer().nbytes
        minioClient.put_object(bucket_name, "SampleFile", file, file_stat)
    except ResponseError as err:
        print(err)
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.6/site-packages/minio/helpers.py", line 382, in is_non_empty_string
    if not input_string.strip():
AttributeError: 'NoneType' object has no attribute 'strip'
A very late answer to your question. As of Falcon 3.0, this should be possible leveraging the framework's native multipart/form-data support.
There is an example how to perform the same task to AWS S3: How can I save POSTed files (from a multipart form) directly to AWS S3?
However, as I understand it, MinIO requires either the total length, which is unknown here, or alternatively requires you to wrap the upload as a multipart upload. That should be doable by reading reasonably large (e.g., 8 MiB or similar) chunks into memory and uploading them as multipart upload parts without storing anything on disk.
IIRC Boto3's transfer manager does something like that under the hood for you too.
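A rough sketch of the in-memory approach with Falcon 3.x's native multipart support; the bucket name and form field name are placeholders, and the whole part is buffered in memory so the total length is known:
import io

import falcon
from minio import Minio

client = Minio("localhost:9000", access_key='minio', secret_key='minio', secure=False)

class UploadResource:
    def on_post(self, req, resp):
        form = req.get_media()  # Falcon 3.x parses multipart/form-data natively
        for part in form:
            if part.name == 'file':
                data = part.stream.read()  # read the part fully so the length is known
                client.put_object(
                    'my-bucket',             # placeholder bucket name
                    part.secure_filename,    # object name taken from the uploaded filename
                    io.BytesIO(data),
                    length=len(data),
                )
        resp.status = falcon.HTTP_201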

How to pull only certain csv's and concat the data from s3?

I have a bucket with various files. I am only interested in pulling files that begin with the word 'member' and storing each member file in a list to be concatenated into a dataframe.
Currently I am pulling data like this:
import boto3
import pandas as pd

s3 = boto3.resource('s3')
my_bucket = s3.Bucket('my-bucket')

obj = s3.Object('my-bucket', 'member')
file_content = obj.get()['Body'].read().decode('utf-8')
df = pd.read_csv(file_content)
However, this only pulls the 'member' file. I have member files with names like 'member_1229013', 'member_2321903', etc.
How can I read in all the 'member' files and save the data in a list so I can concatenate it later? All column names are the same in all the csv's.
You can only download/access one object per API call.
I normally recommend downloading the objects to a local directory, and then accessing them as normal local files. Here is an example of how to download an object from Amazon S3:
import boto3
s3 = boto3.client('s3')
s3.download_file('mybucket', 'hello.txt', '/tmp/hello.txt')
See: download_file() documentation
If you want to read multiple files, you will first need to obtain a listing of the files (eg with list_objects_v2()), and then access each object individually.
One tip for boto3... There are two ways to make calls: via a Resource (eg using s3.Object() or s3.Bucket()) or via a Client, which passes everything as parameters.
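A minimal sketch of that approach, assuming the bucket is named 'my-bucket' and every key starting with 'member' is a CSV with the same columns:
import io

import boto3
import pandas as pd

s3 = boto3.client('s3')

frames = []
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='my-bucket', Prefix='member'):
    for obj in page.get('Contents', []):
        body = s3.get_object(Bucket='my-bucket', Key=obj['Key'])['Body'].read()
        frames.append(pd.read_csv(io.BytesIO(body)))

df = pd.concat(frames, ignore_index=True)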

saving an image to bytes and uploading to boto3 returning content-MD5 mismatch

I'm trying to pull an image from s3, quantize it/manipulate it, and then store it back into s3 without saving anything to disk (entirely in-memory). I was able to do it once, but upon returning to the code and trying it again it did not work. The code is as follows:
import boto3
import io
from PIL import Image

client = boto3.client('s3', aws_access_key_id='',
                      aws_secret_access_key='')

cur_image = client.get_object(Bucket='mybucket', Key='2016-03-19 19.15.40.jpg')['Body'].read()

loaded_image = Image.open(io.BytesIO(cur_image))
quantized_image = loaded_image.quantize(colors=50)

saved_quantized_image = io.BytesIO()
quantized_image.save(saved_quantized_image, 'PNG')

client.put_object(ACL='public-read', Body=saved_quantized_image, Key='testimage.png', Bucket='mybucket')
The error I received is:
botocore.exceptions.ClientError: An error occurred (BadDigest) when calling the PutObject operation: The Content-MD5 you specified did not match what we received.
It works fine if I just pull an image, and then put it right back without manipulating it. I'm not quite sure what's going on here.
I had this same problem, and the solution was to seek to the beginning of the saved in-memory file:
out_img = BytesIO()
image.save(out_img, img_type)
out_img.seek(0)  # Without this line it fails
self.bucket.put_object(Bucket=self.bucket_name,
                       Key=key,
                       Body=out_img)
The file may need to be saved and reloaded before you send it off to S3. The file pointer also needs to be seeked back to position 0.
My problem was sending a file after reading out the first few bytes of it. Opening a file cleanly did the trick.
I found this question getting the same error trying to upload files -- two scripts clashed, one creating the file, the other uploading it. My answer was to create the file under a ".filename" name and rename it once writing was finished:
os.rename(filename, filename.replace(".filename", "filename"))
The upload script then needs to ignore . files. This ensured the file was done being created.
To anyone else facing similar errors: this usually happens when the content of the file gets modified during the upload, possibly because the file is being modified by another process or thread.
A classic example would be two scripts modifying the same file at the same time, which throws the bad digest due to the change in MD5 content. In the example below, the data file is being uploaded to s3; if another process overwrites it while the upload is in progress, you will end up with this exception:
random_uuid=$(uuidgen)
cat data
aws s3api put-object --acl bucket-owner-full-control --bucket $s3_bucket --key $random_uuid --body data
