Upload files to the /tmp directory in Lambda - python-3.x

I have a Lambda function that triggers when an S3 upload happens. It downloads the object to /tmp and then sends it to GCP Storage. The issue is that the log files can be up to 900 MB, so there is not enough space in the /tmp storage of the Lambda function. Is there a way around this?
I tried sending to memory, but I believe the memory is read-only.
There is also talk about mounting EFS, but I'm not sure this will work.
# retrieve bucket name and file_key from the S3 event
logger.info(event)
s3_bucket_name = event['Records'][0]['s3']['bucket']['name']
file_key = event['Records'][0]['s3']['object']['key']
logger.info('Reading {} from {}'.format(file_key, s3_bucket_name))
logger.info(s3_bucket_name)
logger.info(file_key)
# s3 download file
s3.download_file(s3_bucket_name, file_key, '/tmp/{}'.format(file_key))
# upload to google bucket
bucket = google_storage.get_bucket(google_bucket_name)
blob = bucket.blob(file_key)
blob.upload_from_filename('/tmp/{}'.format(file_key))
This is the error from the CloudWatch logs for the Lambda function:
[ERROR] OSError: [Errno 28] No space left on device
Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 30, in lambda_handler
    s3.download_file(s3_bucket_name, file_key, '/tmp/

You can upload the file from /tmp to your Google Cloud Storage bucket like this:
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket("YOUR_BUCKET_NAME")
blob = bucket.blob("file/path.csv")  # file path on your GCS bucket
blob.upload_from_filename("/tmp/path.csv")  # local tmp file
I hope that helps.
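If /tmp is still too small for the 900 MB log files, another option is to stream the object from S3 straight into GCS so it never touches local disk. This is only a rough sketch (the variable names follow the question, and the client setup is assumed): boto3's get_object returns a file-like streaming body that blob.upload_from_file can read in chunks.
import boto3
from google.cloud import storage

s3 = boto3.client('s3')
google_storage = storage.Client()

def copy_s3_object_to_gcs(s3_bucket_name, file_key, google_bucket_name):
    # Stream the S3 object body directly into the GCS blob; nothing is written
    # to /tmp, so the Lambda ephemeral storage limit no longer matters.
    response = s3.get_object(Bucket=s3_bucket_name, Key=file_key)
    bucket = google_storage.get_bucket(google_bucket_name)
    blob = bucket.blob(file_key)
    blob.upload_from_file(response['Body'], size=response['ContentLength'])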

Related

Writing json to AWS S3 from AWS Lambda

I am trying to write a response to AWS S3 as a new file each time.
Below is the code I am using:
s3 = boto3.resource('s3', region_name=region_name)
s3_obj = s3.Object(s3_bucket, f'/{folder}/{file_name}.json')
resp_ = s3_obj.put(Body=json.dumps(response_json).encode('UTF-8'))
I can see that I get a 200 response and the file in the directory as well, but it also produces the exception below:
[DEBUG] 2020-10-13T08:29:10.828Z. Event needs-retry.s3.PutObject: calling handler <bound method S3RegionRedirector.redirect_from_error of <botocore.utils.S3RegionRedirector object at 0x7f2cf2fdfe123>>
My code throws a 500 exception even though it works. I have other business logic as part of the Lambda, and things work just fine since the write-to-S3 operation is the last step. Any help would be appreciated.
The Key (filename) of an Amazon S3 object should not start with a slash (/).
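Applied to the snippet above, a minimal corrected sketch (reusing the question's variables, which are assumed to be defined elsewhere):
import json
import boto3

s3 = boto3.resource('s3', region_name=region_name)
s3_obj = s3.Object(s3_bucket, f'{folder}/{file_name}.json')  # no leading slash in the key
resp_ = s3_obj.put(Body=json.dumps(response_json).encode('UTF-8'))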

How do I use a cloud function to unzip a large file in cloud storage?

I have a cloud function which is triggered when a zip is uploaded to Cloud Storage and is supposed to unpack it. However, the function runs out of memory, presumably because the unzipped file is too large (~2.2 GB).
I was wondering what my options are for dealing with this problem. I read that it's possible to stream large files into Cloud Storage, but I don't know how to do this from a cloud function or while unzipping. Any help would be appreciated.
Here is the code of the cloud function so far:
import io
from zipfile import ZipFile, is_zipfile
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket("bucket-name")
destination_blob_filename = "large_file.zip"
blob = bucket.blob(destination_blob_filename)
zipbytes = io.BytesIO(blob.download_as_string())
if is_zipfile(zipbytes):
    with ZipFile(zipbytes, 'r') as myzip:
        for contentfilename in myzip.namelist():
            contentfile = myzip.read(contentfilename)
            blob = bucket.blob(contentfilename)
            blob.upload_from_string(contentfile)
Your target process is risky:
- If you stream the file without unzipping it completely, you can't validate the checksum of the zip.
- If you stream data into GCS, file integrity is not guaranteed.
Thus, you have two "successful" operations without any checksum validation!
Until Cloud Functions or Cloud Run offer more memory, you can use a Dataflow template to unzip your files.
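If you accept the checksum caveat above and still want to stream from a cloud function, here is a minimal sketch (not a guaranteed solution; the bucket and object names are placeholders, and Blob.open requires a recent google-cloud-storage release). It reads the archive through a seekable blob reader and uploads each member as a stream, so the uncompressed content never has to fit in memory at once:
import zipfile
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket("bucket-name")
zip_blob = bucket.blob("large_file.zip")

# Blob.open("rb") returns a seekable, file-like reader, so ZipFile fetches
# only the ranges it needs instead of downloading the whole archive first.
with zip_blob.open("rb") as zip_stream:
    with zipfile.ZipFile(zip_stream) as archive:
        for member_name in archive.namelist():
            with archive.open(member_name) as member_stream:
                # upload_from_file reads the member in chunks rather than
                # loading the whole uncompressed file into memory.
                bucket.blob(member_name).upload_from_file(member_stream)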

Cloud Storage python client fails to retrieve bucket

I am trying to use the Python client library to write blobs to Cloud Storage. The VM I'm using has read/write permissions for Storage and I'm able to access the bucket via gsutil, however Python gives me the following error:
>>> from google.cloud import storage
>>> storage_client = storage.Client()
>>> bucket = storage_client.get_bucket("gs://<bucket name>")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/google/cloud/storage/client.py", line 225, in get_bucket
    bucket.reload(client=self)
  File "/usr/local/lib/python3.5/dist-packages/google/cloud/storage/_helpers.py", line 108, in reload
    _target_object=self)
  File "/usr/local/lib/python3.5/dist-packages/google/cloud/_http.py", line 293, in api_request
    raise exceptions.from_http_response(response)
google.api_core.exceptions.NotFound: 404 GET https://www.googleapis.com/storage/v1/b/gs://<bucket name>?projection=noAcl: Not Found
Phix is right. You only need to specify the bucket name, without the 'gs://' prefix. The underlying API endpoints (e.g. https://www.googleapis.com/storage/v1/b/bucket) take just the bucket name. See the Python Google Cloud Storage client library documentation for more details and an example of how to use it.
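For example (the bucket name is a placeholder):
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket("<bucket name>")  # bucket name only, no gs:// prefix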

Boto3 ListObject Forbidden for Admin User

I have been trying to write a small script to download all the content of an S3 folder to the Lambda /tmp directory. To do this I need to list all objects in a specific bucket. Unfortunately I keep getting the following error:
An error occurred (403) when calling the HeadObject operation: Forbidden
Here is how I try to download all the files from a folder:
#initialize S3
try:
    s3 = boto3.resource('s3',
                        aws_access_key_id=os.getenv('S3USERACCESSKEY'),
                        aws_secret_access_key=os.getenv('S3USERSECRETKEY')
                        )
    s3_client = boto3.client('s3',
                             aws_access_key_id=os.getenv('S3USERACCESSKEY'),
                             aws_secret_access_key=os.getenv('S3USERSECRETKEY')
                             )
except Exception as e:
    logger.error("Could not connect to s3 bucket: " + str(e))

#Function to download whole folders from s3
for s3_key in s3_client.list_objects(Bucket=os.getenv('S3BUCKETNAME'))['Contents']:
    s3_object = s3_key['Key']
    if not s3_object.endswith("/"):
        s3_client.download_file('bucket', s3_object, s3_object)
    else:
        import os
        if not os.path.exists(s3_object):
            os.makedirs(s3_object)
The access keys above have full admin rights.
EDIT
Still no success after removing my manual keys; here are the rights I attached to the Lambda function:
Here is the actual error from CloudWatch:
The code now looks like so:
#initialize S3
try:
    s3 = boto3.resource('s3')
    s3_client = boto3.client('s3')
except Exception as e:
    [....]
It seems like "Forbidden" might be an issue other than permissions, but I can't find any documentation on it.
Make sure the access key belongs to a user with an IAM role that has rights to access the S3 bucket.
If you run from Lambda, there's no need to use an access key at all; just attach the IAM role to the Lambda function:
https://docs.aws.amazon.com/lambda/latest/dg/accessing-resources.html
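For reference, a minimal sketch of the download loop relying only on the execution role (it reuses the S3BUCKETNAME environment variable from the question; writing under /tmp is an assumption):
import os
import boto3

s3_client = boto3.client('s3')  # credentials come from the Lambda execution role

bucket_name = os.getenv('S3BUCKETNAME')
for s3_key in s3_client.list_objects_v2(Bucket=bucket_name).get('Contents', []):
    s3_object = s3_key['Key']
    if not s3_object.endswith("/"):
        local_path = os.path.join('/tmp', s3_object)
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        s3_client.download_file(bucket_name, s3_object, local_path)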
Did you import boto?
Try to execute only this:
UPDATE
import boto3

s3 = boto3.resource('s3')
for bucket in s3.buckets.all():
    print(bucket.name)

Increase Azure blob block upload limit from 32 MB

I am trying to upload contents to an Azure blob and the size is over 32 MB. The C# code snippet is below:
CloudBlockBlob blob = _blobContainer.GetBlockBlobReference(blobName);
blob.UploadFromByteArray(contents, 0, contents.Length, AccessCondition.GenerateIfNotExistsCondition(), options:writeOptions);
Every time the blob is over 32 MB, the above raises an exception:
Exception thrown: 'Microsoft.WindowsAzure.Storage.StorageException' in Microsoft.WindowsAzure.Storage.dll
Additional information: The remote server returned an error: (404) Not Found.
As per the documentation:
When a block blob upload is larger than the value in this property, storage clients break the file into blocks.
Should there be a separate line of code to enable this?
Storage clients default to a 32 MB maximum single block upload. When a block blob upload is larger than the value of the SingleBlobUploadThresholdInBytes property, storage clients break the file into blocks.
As Tamra said, the storage client handles the work of breaking the file into blocks. Here are my tests to give you a better understanding of it.
Code Sample
CloudBlockBlob blob = container.GetBlockBlobReference(blobName);
var writeOptions = new BlobRequestOptions()
{
    SingleBlobUploadThresholdInBytes = 50 * 1024 * 1024, // 64 MB maximum, 32 MB by default
};
blob.UploadFromByteArray(contents, 0, contents.Length, AccessCondition.GenerateIfNotExistsCondition(), options: writeOptions);
Scenario
If you are writing a block blob that is no more than the SingleBlobUploadThresholdInBytes property in size, you can upload it in its entirety with a single write operation.
You can see this by capturing the network traffic with Fiddler when you invoke the UploadFromByteArray method.
When a block blob upload is larger than the value of the SingleBlobUploadThresholdInBytes property, storage clients break the file into blocks automatically.
I uploaded a blob whose size is nearly 90 MB, and you can see the difference as follows:
In the capture, you can see that the storage client breaks the file into 4 MB blocks and uploads the blocks simultaneously.
Every time the blob is over 32MB, the above raises an exception
You could try to set the SingleBlobUploadThresholdInBytes property, or capture the network traffic when you invoke the UploadFromByteArray method to find the detailed error.
