compress .txt file on s3 location to .gz file - python-3.x

I need to compress a .txt file that is in an S3 location to .gz and then upload it to a different S3 bucket. I have written the following code, but it's not working as expected:
def upload_gzipped(bucket, key, fp, compressed_fp=None, content_type='text/plain'):
    with gzip.GzipFile(fileobj=compressed_fp, mode='wb') as gz:
        shutil.copyfileobj(fp, gz)
    compressed_fp.seek(0)
    print(compressed_fp)
    bucket.upload_fileobj(
        compressed_fp,
        key,
        {'ContentType': content_type, 'ContentEncoding': 'gzip'})

source_bucket = event['Records'][0]['s3']['bucket']['name']
file_key_name = event['Records'][0]['s3']['object']['key']
response = s3.get_object(Bucket=source_bucket, Key=file_key_name)
original = BytesIO(response['Body'].read())
original.seek(0)
upload_gzipped(source_bucket, file_key_name, original)
Can someone please help here, or suggest any other approach to gzip the file in the S3 location?

It would appear that you are writing an AWS Lambda function.
A simpler program flow would probably be:
Download the file to /tmp/ using s3_client.download_file()
Gzip the file
Upload the file to S3 using s3_client.upload_file()
Delete the files in /tmp/
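A minimal sketch of that flow, assuming it runs as the Lambda handler (the destination bucket name is illustrative and error handling is omitted):

import gzip
import os
import shutil
import boto3

s3_client = boto3.client('s3')

def lambda_handler(event, context):
    # Handles a single record; see the note below about looping over all of them
    source_bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']
    local_path = '/tmp/' + os.path.basename(key)
    gzipped_path = local_path + '.gz'

    # 1. Download the object to /tmp/
    s3_client.download_file(source_bucket, key, local_path)

    # 2. Gzip the file on disk
    with open(local_path, 'rb') as f_in, gzip.open(gzipped_path, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

    # 3. Upload the compressed copy to a different bucket
    s3_client.upload_file(
        gzipped_path, 'my-destination-bucket', key + '.gz',
        ExtraArgs={'ContentType': 'text/plain', 'ContentEncoding': 'gzip'})

    # 4. Delete the files in /tmp/ so repeated invocations don't fill it up
    os.remove(local_path)
    os.remove(gzipped_path)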
Also, please note that the AWS Lambda function might be invoked with multiple objects being passed via the event. However, your code is currently only processing the first record with event['Records'][0]. The program should loop through these records like this:
for record in event['Records']:
    source_bucket = record['s3']['bucket']['name']
    file_key_name = record['s3']['object']['key']
    ...

Instead of writing the file into your /tmp folder, it might be better to read it into a buffer, since the /tmp folder has limited storage.
buffer = BytesIO(file.get()["Body"].read())
For gzipping you can simply use something like this:
gzipped_content = gzip.compress(f_in.read())
destinationbucket.upload_fileobj(
    io.BytesIO(gzipped_content),
    final_file_path,
    ExtraArgs={"ContentType": "text/plain"}
)
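Tying those snippets together, a minimal end-to-end sketch (the bucket names and key are illustrative):

import gzip
import io
import boto3

s3 = boto3.resource('s3')

# Read the source object into an in-memory buffer
file = s3.Object('my-source-bucket', 'folder/file.txt')
f_in = io.BytesIO(file.get()["Body"].read())

# Compress the whole buffer in one call
gzipped_content = gzip.compress(f_in.read())

# Upload the compressed copy to the destination bucket
destinationbucket = s3.Bucket('my-destination-bucket')
final_file_path = 'folder/file.txt.gz'
destinationbucket.upload_fileobj(
    io.BytesIO(gzipped_content),
    final_file_path,
    ExtraArgs={"ContentType": "text/plain"}
)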
There's a similar tutorial here for a Lambda function: https://medium.com/p/f7bccf0099c9

Related

Python Get MIME of s3 object on Lambda

I have a Lambda that triggers upon S3 PutObject. Before proceeding, the Lambda needs to check whether the file is actually a video file or not (mp4 in my case). The file extension is not helpful because it can be faked. So I have tried checking the MIME type using FileType, which works on my local machine.
I don't want to download large files from S3, just some portion, and save it on the local machine to check whether it's mp4 or not.
So far I have tried this (on my local machine):
import boto3
import filetype
from time import sleep

REGION = 'ap-southeast-1'
tmp_path = "path/src/my_file.mp4"
start_byte = 0
end_byte = 9000

s3 = boto3.client('s3', region_name=REGION)
resp = s3.get_object(
    Bucket="test",
    Key="MVI_1494.MP4",
    Range='bytes={}-{}'.format(start_byte, end_byte)
)

# the file
object_content = resp['Body'].read()
print(type(object_content))

with open(tmp_path, "wb") as binary_file:
    # Write bytes to file
    binary_file.write(object_content)

sleep(5)
kind = filetype.guess_mime(tmp_path)
print(kind)
But this always returns None as the MIME type. I think I am not saving the binary file properly; any help would really save my day.
TL;DR: Download a small portion of a large file from S3 -> save it in tmp storage -> get the MIME type.
Boto3 has a function S3.Client.head_object:
The HEAD action retrieves metadata from an object without returning the object itself. This action is useful if you're only interested in an object's metadata. To use HEAD, you must have READ access to the object.
You can call this method to get the metadata associated with an item in an S3 bucket.
metadata = s3client.head_object(Bucket='MyBucketName', Key='MyS3ItemKey')
This metadata includes a ContentType property; you can use this property to check the object type.
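For example (a sketch; the bucket and key are illustrative):

import boto3

s3client = boto3.client('s3')
metadata = s3client.head_object(Bucket='MyBucketName', Key='MyS3ItemKey')

# The stored Content-Type comes back alongside the other metadata
if metadata['ContentType'] == 'video/mp4':
    print('object claims to be an mp4')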
OR
If you can't trust this ContentType, as it can be faked, you can simply save the object's MIME type in DynamoDB while uploading it. You can read the type from there whenever you want.
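A hypothetical sketch of that approach (the table and attribute names are illustrative):

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('object-mime-types')  # illustrative table name

# Record the MIME type at upload time, keyed by the S3 object key
table.put_item(Item={'s3_key': 'MyS3ItemKey', 'mime_type': 'video/mp4'})

# Later, look it up instead of trusting the object's ContentType
item = table.get_item(Key={'s3_key': 'MyS3ItemKey'}).get('Item')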
OR
You can simply create a Lambda that will get triggered, you can download the object in the Lambda as it has around 512MB as ephemeral storage. You can determine the content type there and update it, as you can also set some metadata when you upload the object and later edit it as your needs change.
You don't need to save the file to disk for the filetype library. The guess_mime function accepts the bytes datatype as well.
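For example, reusing the ranged get_object from the question (a sketch; the bucket, key, and region are taken from the code above):

import boto3
import filetype

s3 = boto3.client('s3', region_name='ap-southeast-1')
resp = s3.get_object(
    Bucket="test",
    Key="MVI_1494.MP4",
    Range='bytes=0-9000'
)

# guess_mime works directly on bytes, so no temp file is needed
head = resp['Body'].read()
print(filetype.guess_mime(head))  # e.g. 'video/mp4', or None if not recognized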

Use checksum to verify integrity of uploaded and downloaded files from AWS S3 via Django

Using Django, I am trying to upload multiple files to AWS S3. The file size may vary from 500 MB to 2 GB. I need to check the integrity of both uploaded and downloaded files.
I have seen that using the PUT operation I can upload a single object up to 5 GB in size. It also provides a 'ContentMD5' option to verify the file. My questions are:
Should I use the PUT option if I upload a file larger than 1 GB? Because generating the MD5 checksum of such a file may exceed system memory. How can I solve this issue? Or is there any better solution available for this task?
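For reference, hashlib can feed the file through in fixed-size chunks so the whole thing never sits in memory at once; a minimal sketch (S3's ContentMD5 expects the base64-encoded digest):

import base64
import hashlib

def content_md5(path, chunk_size=8 * 1024 * 1024):
    """Compute the base64-encoded MD5 of a file in fixed-size chunks."""
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5.update(chunk)
    return base64.b64encode(md5.digest()).decode()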
To download the file with a checksum, AWS has the get_object() function. My questions are:
Is it okay to use this function to download multiple files?
How can I use it to download multiple files with checksums from S3? I looked for some examples but there is nothing much. Right now I am using the following code to download and serve multiple files as a zip in Django. I want to serve the files as a zip, with a checksum to validate the transfer.
s3 = boto3.resource('s3', aws_access_key_id=base.AWS_ACCESS_KEY_ID,
                    aws_secret_access_key=base.AWS_SECRET_ACCESS_KEY)
bucket = s3.Bucket(base.AWS_STORAGE_BUCKET_NAME)
s3_file_path = bucket.objects.filter(Prefix='media/{}/'.format(url.split('/')[-1]))

# set up zip folder
zip_subdir = url.split('/')[-1]
zip_filename = zip_subdir + ".zip"
byte_stream = BytesIO()
zf = ZipFile(byte_stream, "w")

for path in s3_file_path:
    s3_url = "https://%s.s3.%s.amazonaws.com/%s" % (
        base.AWS_STORAGE_BUCKET_NAME, base.AWS_S3_REGION_NAME, path.key)
    file_response = requests.get(s3_url)
    if file_response.status_code == 200:
        try:
            tmp = tempfile.NamedTemporaryFile()
            print(tmp.name)
            tmp.name = path.key.split('/')[-1]
            f1 = open(tmp.name, 'wb')
            f1.write(file_response.content)
            f1.close()
            zip_path = os.path.join('/'.join(path.key.split('/')[1:-1]), tmp.name)
            zf.write(tmp.name, zip_path)
        finally:
            os.remove(tmp.name)
zf.close()

response = HttpResponse(byte_stream.getvalue(), content_type="application/x-zip-compressed")
response['Content-Disposition'] = 'attachment; filename=%s' % zip_filename
I am learning about AWS S3 and this is the first time I am using it. I would appreciate any kind of suggestion regarding this problem.

Uploading a file from memory to S3 with Boto3

This question has been asked many times, but my case is ever so slightly different. I'm trying to create a Lambda that makes an .html file and uploads it to S3. It works when the file is created on disk; then I can upload it like so:
boto3.client('s3').upload_file('index.html', bucket_name, 'folder/index.html')
So now I have to create the file in memory. For this I first tried StringIO(). However, .upload_file then throws an error.
boto3.client('s3').upload_file(temp_file, bucket_name, 'folder/index.html')
ValueError: Filename must be a string.
So I tried using .upload_fileobj(), but then I get the error TypeError: a bytes-like object is required, not 'str'.
So I tried using BytesIO(), which wants me to convert the str to bytes first, so I did:
temp_file = BytesIO()
temp_file.write(index_top.encode('utf-8'))
print(temp_file.getvalue())
boto3.client('s3').upload_file(temp_file, bucket_name, 'folder/index.html')
But now it just uploads an empty file, despite the .getvalue() clearly showing that it does have content in there.
What am I doing wrong?
If you wish to create an object in Amazon S3 from memory, use put_object():
import boto3
s3_client = boto3.client('s3')
html = "<h2>Hello World</h2>"
s3_client.put_object(Body=html, Bucket='my-bucket', Key='foo.html', ContentType='text/html')
But now it just uploads an empty file, despite the .getvalue() clearly showing that it does have content in there.
When you finish writing to a file buffer, the position stays at the end. When you upload a buffer, it starts from the position it is currently in. Since you're at the end, you get no data. To fix this, you just need to add a seek(0) to reset the buffer back to the beginning after you finish writing to it. You also need upload_fileobj() rather than upload_file(), since, as your error showed, upload_file() expects a filename string. Your code would look like this:
temp_file = BytesIO()
temp_file.write(index_top.encode('utf-8'))
temp_file.seek(0)  # rewind so the upload starts from the beginning
print(temp_file.getvalue())
boto3.client('s3').upload_fileobj(temp_file, bucket_name, 'folder/index.html')

Uploading data from a lambda job to s3 is very slow

I've implemented an AWS Lambda using the Serverless framework to receive S3 ObjectCreated events and uncompress tar.gz files. I'm noticing that copying the extracted files to S3 takes a long time and times out. The .tar.gz file is ~18 MB in size, and the number of files in the compressed archive is ~12000. I've tried using a ThreadPoolExecutor with a 500 s timeout. Any suggestions on how I can work around this issue?
The lambda code implemented in python:
https://gist.github.com/arjunurs/7848137321148d9625891ecc1e3a9455
In the gist that you have shared, there are a number of changes I would suggest.
I suggest avoiding reading the extracted tar file into memory, since you can stream its contents directly to the S3 bucket:
def extract(filename):
    # tardata, bucket, path and tarname come from the surrounding code in the gist
    upload_status = 'success'
    try:
        s3.upload_fileobj(
            tardata.extractfile(filename),
            bucket,
            os.path.join(path, tarname, filename)
        )
    except Exception:
        logger.error(
            'Failed to upload %s in tarfile %s',
            filename, tarname, exc_info=True)
        upload_status = 'fail'
    finally:
        return filename, upload_status
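The calling code can then fan these uploads out across a thread pool, which is usually where the speedup comes from; a hypothetical sketch (tardata and the extract helper above are assumed to be in scope):

from concurrent.futures import ThreadPoolExecutor

# Upload the tar members concurrently instead of one at a time
with ThreadPoolExecutor(max_workers=16) as executor:
    for name, status in executor.map(extract, tardata.getnames()):
        print(name, status)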

Getting a data stream from a zipped file sitting in a S3 bucket using boto3 lib and AWS Lambda

I am trying to create a serverless processor for my cron job. In this job I receive a zipped file in my S3 bucket from one of my clients. The file is around 50 MB in size, but once you unzip it, it becomes 1.5 GB, and there's a hard limit of 500 MB on the space available on AWS Lambda, due to which I cannot download this file from the S3 bucket and unzip it on my Lambda. I was successfully able to unzip my file and stream the content line by line from S3 using funzip in a unix script:
for x in $files ; do echo -n "$x: " ; timeout 5 aws s3 cp $monkeydir/$x - | funzip ; done
My Bucket Name: MonkeyBusiness
Key: /Daily/Business/Banana/{current-date}
Object: banana.zip
But now, since I am trying to achieve the same output using boto3, how can I stream the zipped content to system I/O, unzip the stream, save the content in separate files of 10000 lines each, and upload the chunked files back to S3?
Need guidance as I am pretty new to AWS and boto3.
Please let me know if you need more details about the job.
The suggested solution given below is not applicable here, because the zlib documentation clearly states that said library is compatible with the gzip file format, and my question is about the zip file format.
import zlib

def stream_gzip_decompress(stream):
    dec = zlib.decompressobj(32 + zlib.MAX_WBITS)  # offset 32 to skip the header
    for chunk in stream:
        rv = dec.decompress(chunk)
        if rv:
            yield rv
So I used BytesIO to read the compressed file into a buffer object, then I used zipfile to open the decompressed stream as uncompressed data, and I was able to get the data line by line.
import io
import zipfile
import boto3
import sys

s3 = boto3.resource('s3', 'us-east-1')

def stream_zip_file():
    obj = s3.Object(
        bucket_name='MonkeyBusiness',
        key='/Daily/Business/Banana/{current-date}/banana.zip'
    )
    buffer = io.BytesIO(obj.get()["Body"].read())
    print(buffer)
    z = zipfile.ZipFile(buffer)
    foo2 = z.open(z.infolist()[0])
    print(sys.getsizeof(foo2))
    line_counter = 0
    for _ in foo2:
        line_counter += 1
    print(line_counter)
    z.close()

if __name__ == '__main__':
    stream_zip_file()
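To get from there to the goal stated above (files of 10000 lines each, re-uploaded to S3), a hypothetical continuation would replace the line-counting loop, since the stream can only be read once; the output key prefix is illustrative:

# Inside stream_zip_file, instead of the counting loop:
chunk, part = [], 0
for line in foo2:
    chunk.append(line)
    if len(chunk) == 10000:
        s3.Object('MonkeyBusiness', 'chunks/part-%05d.txt' % part).put(Body=b''.join(chunk))
        chunk, part = [], part + 1
if chunk:  # upload whatever is left over
    s3.Object('MonkeyBusiness', 'chunks/part-%05d.txt' % part).put(Body=b''.join(chunk))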
This is not the exact answer, but you can try this out.
First, please adapt the answer mentioned above about gzipping a file with limited memory; that method allows one to stream the file chunk by chunk. boto3's S3 put_object() and upload_fileobj also seem to allow streaming.
You need to mix and adapt the above-mentioned code with the following decompression.
s3_client = boto3.client('s3')

stream = io.BytesIO()  # cStringIO is Python 2 only; io.BytesIO works on Python 3
stream.write(s3_data)
stream.seek(0)
blocksize = 1 << 16  # 64 KB
with gzip.GzipFile(fileobj=stream) as decompressor:
    while True:
        block = decompressor.read(blocksize)
        if not block:
            break
        # upload_fileobj expects a file-like object, not raw bytes
        s3_client.upload_fileobj(io.BytesIO(block), "bucket", "key")
I cannot guarantee the above code will work; it is just to give you the idea of decompressing the file and re-uploading it in chunks. You might even need to pipeline the decompressed data into a BytesIO and pipe that to upload_fileobj. There is a lot of testing to do.
If you don't need to decompress the file ASAP, my suggestion is to use a Lambda to put the file into an SQS queue. When there are "enough" files, trigger a spot instance (which will be pretty cheap) that reads the queue and processes the files.
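A minimal sketch of the queueing side (the queue URL is illustrative; the spot-instance consumer is not shown):

import json
import boto3

sqs = boto3.client('sqs')

def lambda_handler(event, context):
    # Enqueue a pointer to the object instead of processing it immediately
    for record in event['Records']:
        sqs.send_message(
            QueueUrl='https://sqs.us-east-1.amazonaws.com/123456789012/zip-jobs',
            MessageBody=json.dumps({
                'bucket': record['s3']['bucket']['name'],
                'key': record['s3']['object']['key'],
            }))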
