Uploading data from a lambda job to s3 is very slow - python-3.x

I’ve implemented an AWS Lambda using the Serverless framework to receive S3 ObjectCreated events and uncompress tar.gz files. I’m noticing that copying the extracted files to S3 takes a long time and times out. The .tar.gz file is ~18 MB in size and the compressed file contains ~12,000 files. I’ve tried using a ThreadPoolExecutor with a 500-second timeout. Any suggestions on how I can work around this issue?
The Lambda code, implemented in Python:
https://gist.github.com/arjunurs/7848137321148d9625891ecc1e3a9455

In the gist that you have shared, there are a number of changes I would suggest.
In particular, avoid reading the extracted tar file members into memory; you can stream their contents directly to the S3 bucket:
def extract(filename):
    upload_status = 'success'
    try:
        s3.upload_fileobj(
            tardata.extractfile(filename),
            bucket,
            os.path.join(path, tarname, filename)
        )
    except Exception:
        logger.error(
            'Failed to upload %s in tarfile %s',
            filename, tarname, exc_info=True)
        upload_status = 'fail'
    finally:
        return filename, upload_status
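For context, here is a minimal, self-contained sketch of that streaming pattern (the bucket, key and destination prefix are placeholders, and the question's gist presumably drives extract() from its ThreadPoolExecutor rather than a plain loop):

import io
import os
import tarfile
import boto3

s3 = boto3.client('s3')

def upload_members(src_bucket, key, dest_prefix):
    """Stream every member of an S3-hosted tar.gz straight back to S3."""
    # The ~18 MB archive itself fits comfortably in memory; the extracted
    # members never have to, because upload_fileobj() reads each one
    # directly from the open tar.
    buf = io.BytesIO(s3.get_object(Bucket=src_bucket, Key=key)['Body'].read())
    with tarfile.open(fileobj=buf, mode='r:gz') as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            s3.upload_fileobj(tar.extractfile(member), src_bucket,
                              os.path.join(dest_prefix, member.name))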

Related

compress .txt file on s3 location to .gz file

I need to compress a .txt file to .gz, which is in an S3 location, and then upload it to a different S3 bucket. I have written the following code, but it's not working as expected:
def upload_gzipped(bucket, key, fp, compressed_fp=None, content_type='text/plain'):
    with gzip.GzipFile(fileobj=compressed_fp, mode='wb') as gz:
        shutil.copyfileobj(fp, gz)
    compressed_fp.seek(0)
    print(compressed_fp)
    bucket.upload_fileobj(
        compressed_fp,
        key,
        {'ContentType': content_type, 'ContentEncoding': 'gzip'})

source_bucket = event['Records'][0]['s3']['bucket']['name']
file_key_name = event['Records'][0]['s3']['object']['key']
response = s3.get_object(Bucket=source_bucket, Key=file_key_name)
original = BytesIO(response['Body'].read())
original.seek(0)
upload_gzipped(source_bucket, file_key_name, original)
Can someone please help here, or suggest any other approach to gzip the file in the S3 location?
It would appear that you are writing an AWS Lambda function.
A simpler program flow would probably be:
Download the file to /tmp/ using s3_client.download_file()
Gzip the file
Upload the file to S3 using s3_client.upload_file()
Delete the files in /tmp/
Also, please note that the AWS Lambda function might be invoked with multiple objects being passed via the event. However, your code is currently only processing the first record with event['Records'][0]. The program should loop through these records like this:
for record in event['Records']:
    source_bucket = record['s3']['bucket']['name']
    file_key_name = record['s3']['object']['key']
    ...
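Putting both points together, a minimal sketch of that flow (the destination bucket name below is a placeholder, not something from the question):

import gzip
import os
import shutil
import boto3

s3_client = boto3.client('s3')

def handler(event, context):
    for record in event['Records']:
        source_bucket = record['s3']['bucket']['name']
        file_key_name = record['s3']['object']['key']
        local_path = '/tmp/' + os.path.basename(file_key_name)
        gz_path = local_path + '.gz'

        # 1. Download the file to /tmp/
        s3_client.download_file(source_bucket, file_key_name, local_path)

        # 2. Gzip the file
        with open(local_path, 'rb') as f_in, gzip.open(gz_path, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)

        # 3. Upload the compressed copy
        s3_client.upload_file(
            gz_path, 'my-destination-bucket', file_key_name + '.gz',
            ExtraArgs={'ContentType': 'text/plain', 'ContentEncoding': 'gzip'})

        # 4. Delete the files in /tmp/ so repeated invocations don't fill it
        os.remove(local_path)
        os.remove(gz_path)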
Instead of writing the file into your /tmp folder, it might be better to read it into a buffer, since the /tmp folder has limited storage space.
buffer = BytesIO(file.get()["Body"].read())
For gzipping you can simply use something like this:
gzipped_content = gzip.compress(f_in.read())
destinationbucket.upload_fileobj(
    io.BytesIO(gzipped_content),
    final_file_path,
    ExtraArgs={"ContentType": "text/plain"}
)
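The same idea end to end, as a hedged sketch (again, the destination bucket name is a placeholder):

import gzip
import io
import boto3

s3 = boto3.resource('s3')

def handler(event, context):
    for record in event['Records']:
        src = s3.Object(record['s3']['bucket']['name'],
                        record['s3']['object']['key'])
        # Compress entirely in memory, so nothing is written to /tmp at all.
        gzipped_content = gzip.compress(src.get()['Body'].read())
        s3.Bucket('my-destination-bucket').upload_fileobj(
            io.BytesIO(gzipped_content),
            src.key + '.gz',
            ExtraArgs={'ContentType': 'text/plain', 'ContentEncoding': 'gzip'})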
There's a similar tutorial here for a Lambda function: https://medium.com/p/f7bccf0099c9

Use checksum to verify integrity of uploaded and downloaded files from AWS S3 via Django

Using Django I am trying to upload multiple files to AWS S3. The file sizes may vary from 500 MB to 2 GB. I need to check the integrity of both uploaded and downloaded files.
I have seen that using the PUT operation I can upload a single object up to 5 GB in size. It also provides a 'ContentMD5' option to verify the file. My questions are:
Should I use the PUT option if I upload a file larger than 1 GB? Generating the MD5 checksum of such a file may exceed system memory, so how can I solve that issue? Or is there a better solution available for this task?
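For reference, an MD5 checksum can be computed incrementally in fixed-size chunks rather than by reading the whole file, so even multi-GB files never need to fit in memory. A minimal sketch (the 8 MiB chunk size is an arbitrary choice; the Content-MD5 header expects the base64-encoded binary digest):

import base64
import hashlib

def md5_content_md5(path, chunk_size=8 * 1024 * 1024):
    """Compute the Content-MD5 value for a local file in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode('ascii')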
To download a file with a checksum, AWS has the get_object() function. My questions are:
Is it okay to use this function to download multiple files?
How can I use it to download multiple files with checksums from S3? I looked for examples but there is not much out there. I am currently using the following code to download and serve multiple files as a zip in Django, and I want to serve the files as a zip with checksums to validate the transfer.
s3 = boto3.resource('s3', aws_access_key_id=base.AWS_ACCESS_KEY_ID,
                    aws_secret_access_key=base.AWS_SECRET_ACCESS_KEY)
bucket = s3.Bucket(base.AWS_STORAGE_BUCKET_NAME)
s3_file_path = bucket.objects.filter(Prefix='media/{}/'.format(url.split('/')[-1]))

# set up zip folder
zip_subdir = url.split('/')[-1]
zip_filename = zip_subdir + ".zip"

byte_stream = BytesIO()
zf = ZipFile(byte_stream, "w")
for path in s3_file_path:
    s3_url = f"https://%s.s3.%s.amazonaws.com/%s" % (base.AWS_STORAGE_BUCKET_NAME, base.AWS_S3_REGION_NAME, path.key)
    file_response = requests.get(s3_url)
    if file_response.status_code == 200:
        try:
            tmp = tempfile.NamedTemporaryFile()
            print(tmp.name)
            tmp.name = path.key.split('/')[-1]
            f1 = open(tmp.name, 'wb')
            f1.write(file_response.content)
            f1.close()
            zip_path = os.path.join('/'.join(path.key.split('/')[1:-1]), tmp.name)
            zf.write(tmp.name, zip_path)
        finally:
            os.remove(tmp.name)
zf.close()

response = HttpResponse(byte_stream.getvalue(), content_type="application/x-zip-compressed")
response['Content-Disposition'] = 'attachment; filename=%s' % zip_filename
I am learning about AWS S3 and this is the first time I am using it. I would appreciate any kind of suggestion regarding this problem.

How do we send a file (accepted as part of a multipart request) to MinIO object storage in Python without saving the file to local storage?

I am trying to write an API in Python (Falcon) that accepts a file from a multipart-form parameter and puts the file in MinIO object storage. The problem is that I want to send the file to MinIO without saving it to any temp location.
The MinIO Python client has a function with which we can send the file:
`put_object(bucket_name, object_name, data, length)`
where data is the file data and length is the total length of the object.
For more explanation: https://docs.min.io/docs/python-client-api-reference.html#put_object
I am having trouble obtaining the values for the "data" and "length" arguments of the put_object function.
The type of the file accepted in the API class is falcon_multipart.parser.Parser, which cannot be sent to MinIO directly.
I can make it work if I write the file to a temp location, then read it from there and send it.
Can anyone help me find a solution to this?
I tried reading the file data from the Parser object and converting it to io.BytesIO, but it did not work.
def on_post(self, req, resp):
    file = req.get_param('file')
    file_data = file.file.read()
    file_data = io.BytesIO(file_data)
    bucket_name = req.get_param('bucket_name')
    self.upload_file_to_minio(bucket_name, file, file_data)

def upload_file_to_minio(self, bucket_name, file, file_data):
    minioClient = Minio("localhost:9000", access_key='minio', secret_key='minio', secure=False)
    try:
        file_stat = sys.getsizeof(file_data)
        # file_stat = file_data.getbuffer().nbytes
        minioClient.put_object(bucket_name, "SampleFile", file, file_stat)
    except ResponseError as err:
        print(err)
Traceback (most recent call last):
File "/home/user/.local/lib/python3.6/site-packages/minio/helpers.py", line 382, in is_non_empty_string
if not input_string.strip():
AttributeError: 'NoneType' object has no attribute 'strip'
A very late answer to your question. As of Falcon 3.0, this should be possible by leveraging the framework's native multipart/form-data support.
There is an example of how to perform the same task with AWS S3: How can I save POSTed files (from a multipart form) directly to AWS S3?
However, as I understand it, MinIO requires either the total length, which is unknown here, or alternatively it requires you to wrap the upload as a multipart upload. That should be doable by reading reasonably large (e.g., 8 MiB or similar) chunks into memory and uploading them as multipart upload parts without storing anything on disk.
IIRC, Boto3's transfer manager does something like that under the hood for you too.
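As a hedged sketch of that idea: newer minio-py releases accept length=-1 together with a part_size, in which case the client itself chunks the stream into a multipart upload, so nothing needs to be written to disk (check the put_object reference for the version you have installed):

from minio import Minio

client = Minio('localhost:9000', access_key='minio',
               secret_key='minio', secure=False)

def upload_stream(bucket_name, object_name, file_like):
    # length=-1 signals an unknown size; part_size (here 10 MiB) controls
    # how much is buffered per multipart part.
    client.put_object(bucket_name, object_name, file_like,
                      length=-1, part_size=10 * 1024 * 1024)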

How to read csv with pandas in tmp directory in aws lambda

I am writing a Lambda to read some data from a csv into a dataframe, manipulate said data, then convert it back to a csv and make an API call with the new csv, all in a Python Lambda.
I am running into an issue with the pandas.read_csv command. It ends my Lambda's execution with no errors.
os.chdir('/tmp')
for root, dirs, files in os.walk('/tmp', topdown=True):
    for name in files:
        if '.csv' in name:
            testdic[name] = root
            print(os.path.isfile('/tmp/' + name))
            print(os.path.isfile(name))
            df = pd.read_csv(name)
            df = pd.read_csv('/tmp/' + name)
Both os.path.isfile calls return True, and I have tried both versions of read_csv; neither works, and the Lambda ends prematurely without an error.
I have confirmed the csv is downloaded into the Lambda's /tmp directory; I can read and print rows of the csv in /tmp. However, when I run df = pd.read_csv('/tmp/file.csv'), or change my directory to /tmp and run df = pd.read_csv('file.csv'), the Lambda ends with no error and does not pass that point in the code. I am using pandas 0.23.4 as that is what I need to use, and the code works locally. Any suggestions would be helpful.
Expected results should be the csv being read into a dataframe so I can manipulate it.
FIXED: Could not just use '/tmp/' + filename. Had to use os.path.join(root, filename), and also had to increase the timeout of my Lambda due to the file size.
os.path.join - works for different platforms
Use
file_path = os.path.join(root, name)
and then
pd.read_csv(file_path)
NOTE: Increase the AWS Lambda timeout, as suggested in the comments by @Gabe Maurer
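Putting the fix into the question's loop, as a small sketch (the testdic bookkeeping is replaced here by a plain dict of dataframes):

import os
import pandas as pd

dataframes = {}
for root, dirs, files in os.walk('/tmp', topdown=True):
    for name in files:
        if name.endswith('.csv'):
            # Join the directory that actually contains the file with its name,
            # instead of assuming everything sits directly under /tmp/.
            file_path = os.path.join(root, name)
            dataframes[name] = pd.read_csv(file_path)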

Getting a data stream from a zipped file sitting in a S3 bucket using boto3 lib and AWS Lambda

I am trying to create a serverless processor for my cron job. In this job I receive a zipped file in my S3 bucket from one of my clients. The file is around 50 MB in size, but once you unzip it, it becomes 1.5 GB, and there is a hard 500 MB limit on the space available on AWS Lambda, so I cannot download this file from the S3 bucket and unzip it on my Lambda. I was able to successfully unzip the file and stream its content line by line from S3 using funzip in a unix script:
for x in $files ; do echo -n "$x: " ; timeout 5 aws s3 cp $monkeydir/$x - | funzip ; done
My Bucket Name:MonkeyBusiness
Key:/Daily/Business/Banana/{current-date}
Object:banana.zip
Now that I am trying to achieve the same output using boto3: how can I stream the zipped content, unzip the stream, save the content in separate files of 10,000 lines each, and upload the chunked files back to S3?
Need guidance as I am pretty new to AWS and boto3.
Please let me know if you need more details about the job.
The suggested solution given below is not applicable here, because the zlib documentation clearly states that the library supports the gzip file format, while my question is about the zip file format.
import zlib

def stream_gzip_decompress(stream):
    dec = zlib.decompressobj(32 + zlib.MAX_WBITS)  # offset 32 to skip the header
    for chunk in stream:
        rv = dec.decompress(chunk)
        if rv:
            yield rv
So I used BytesIO to read the compressed file into a buffer object, then used zipfile to open the buffer as a zip archive, and I was able to read the uncompressed data line by line.
import io
import zipfile
import boto3
import sys

s3 = boto3.resource('s3', 'us-east-1')

def stream_zip_file():
    count = 0
    obj = s3.Object(
        bucket_name='MonkeyBusiness',
        key='/Daily/Business/Banana/{current-date}/banana.zip'
    )
    buffer = io.BytesIO(obj.get()["Body"].read())
    print(buffer)
    z = zipfile.ZipFile(buffer)
    foo2 = z.open(z.infolist()[0])
    print(sys.getsizeof(foo2))
    line_counter = 0
    for _ in foo2:
        line_counter += 1
    print(line_counter)
    z.close()

if __name__ == '__main__':
    stream_zip_file()
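To get from counting lines to the goal stated in the question (separate files of 10,000 lines each, re-uploaded to S3), one hedged sketch is to buffer lines and flush every 10,000; the bucket and key prefix below are placeholders, and foo2 from the code above could be passed in as the line iterator:

import io
import boto3

def upload_in_chunks(line_iter, bucket, key_prefix, lines_per_file=10000):
    """Group an iterator of (bytes) lines into S3 objects of lines_per_file lines."""
    s3_client = boto3.client('s3')
    part, chunk = 0, io.BytesIO()
    for count, line in enumerate(line_iter, start=1):
        chunk.write(line)
        if count % lines_per_file == 0:
            chunk.seek(0)
            s3_client.put_object(Bucket=bucket,
                                 Key='%s/part-%05d.txt' % (key_prefix, part),
                                 Body=chunk)
            part, chunk = part + 1, io.BytesIO()
    if chunk.tell():  # flush the final, partially filled chunk
        chunk.seek(0)
        s3_client.put_object(Bucket=bucket,
                             Key='%s/part-%05d.txt' % (key_prefix, part),
                             Body=chunk)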
This is not an exact answer, but you can try this out.
First, adapt the answer about streaming a gzip file with limited memory; that method lets you stream the file chunk by chunk. boto3's S3 put_object() and upload_fileobj() seem to allow streaming as well.
You need to mix and adapt the above-mentioned code with the following decompression:
stream = io.BytesIO()  # io.BytesIO instead of cStringIO, which is Python 2 only
stream.write(s3_data)
stream.seek(0)
blocksize = 1 << 16  # 64 KB
s3_client = boto3.client('s3')
with gzip.GzipFile(fileobj=stream) as decompressor:
    while True:
        chunk = decompressor.read(blocksize)
        if not chunk:
            break
        # each decompressed block is uploaded as its own object here;
        # vary the key per chunk so the blocks don't overwrite each other
        s3_client.upload_fileobj(io.BytesIO(chunk), "bucket", "key")
I cannot guarantee that the above code will work; it just gives you the idea of decompressing the file and re-uploading it in chunks. You might even need to pipeline the decompressed data into a BytesIO and pipeline that to upload_fileobj. There will be a lot of testing.
If you don't need to decompress the file ASAP, my suggestion is to use Lambda to put the file reference into an SQS queue. When there are "enough" files, trigger a Spot instance (which will be pretty cheap) that reads the queue and processes the files.
