I'm trying to upload a 160 GB file from EC2 to S3 using
s3cmd put --continue-put FILE s3://bucket/FILE
but every time the upload is interrupted with the message:
FILE -> s3://bucket/FILE [part 10001 of 10538, 15MB] 8192 of 15728640 0% in 1s 6.01 kB/s failed
ERROR: Upload of 'FILE' part 10001 failed. Aborting multipart upload.
ERROR: Upload of 'FILE' failed too many times. Skipping that file.
The target bucket does exist.
What is the reason for this issue?
Are there any other ways to upload the file?
Thanks.
S3 allows at most 10,000 parts per multipart upload, so the upload fails on part 10001. Using larger parts, so the whole file fits in at most 10,000 parts, should solve the issue; with s3cmd you can raise the part size with the --multipart-chunk-size-mb option (the default is 15 MB, which for a 160 GB file needs more than 10,000 parts).
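If switching tools is an option, a rough boto3 sketch of the same upload with a larger part size (the bucket name and key here are placeholders) would be:

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')

# 32 MB parts: a 160 GB file then needs about 5,120 parts,
# comfortably under the 10,000-part limit.
config = TransferConfig(multipart_threshold=32 * 1024 * 1024,
                        multipart_chunksize=32 * 1024 * 1024)

# upload_file performs the multipart upload and retries failed parts.
s3.upload_file('FILE', 'bucket', 'FILE', Config=config)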
"huge"---is it 10s or 100s of GBs? s3 limits the object size to 5GB and uploading may fail if it exceeds the size limitation.
Scenario:
Using AWS Lambda (Node.js), I want to process large files (> 1 GB) from S3.
The /tmp filesystem limit of 512 MB means that I can't copy the S3 input there.
I can certainly increase the Lambda memory, in order to read the files in.
Do I pass the in-memory buffer to ffmpeg? (In Node.js, how?)
Or should I just create an EFS mount point and use that as the transcoding scratchpad?
You can just use the HTTP(S) protocol as input for ffmpeg.
Lambda has a 10 GB memory limit, and data transfer speed from S3 was around 300 MB per second the last time I tested. So if your videos are at most about 1 GB and you're not doing memory-intensive transformations, this approach should work fine:
ffmpeg -i "https://public-qk.s3.ap-southeast-1.amazonaws.com/sample.mp4" -ss 00:00:10 -vframes 1 -f image2 "image%03d.jpg"
ffmpeg works on files, so maybe an alternative would be to set up a Unix pipe and then read that pipe with ffmpeg, constantly feeding it with the S3 stream (see the sketch after this answer).
But maybe you'd want to consider running this as an ECS task instead; you wouldn't have the time constraint, nor the same storage constraint. Cold start using Fargate would be 1-2 minutes though, which maybe isn't acceptable?
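To illustrate the pipe idea, here is a rough sketch in Python with boto3 (the bucket and key are placeholders; the same pattern applies in Node.js by piping an S3 read stream into the child process's stdin):

import subprocess
import boto3

s3 = boto3.client('s3')

# Stream the object straight into ffmpeg's stdin instead of staging it in /tmp.
# The container must be streamable (e.g. MPEG-TS, or an mp4 with the moov atom
# at the front), since ffmpeg cannot seek in a pipe.
body = s3.get_object(Bucket='my-input-bucket', Key='input.mp4')['Body']

proc = subprocess.Popen(
    ['ffmpeg', '-i', 'pipe:0', '-ss', '00:00:10', '-vframes', '1',
     '-f', 'image2', '/tmp/image%03d.jpg'],
    stdin=subprocess.PIPE,
)

try:
    for chunk in body.iter_chunks(chunk_size=1024 * 1024):
        proc.stdin.write(chunk)
    proc.stdin.close()
except BrokenPipeError:
    pass  # ffmpeg exited once it had read enough of the stream
proc.wait()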
Lambda now supports up to 10 GB of ephemeral storage:
https://aws.amazon.com/blogs/aws/aws-lambda-now-supports-up-to-10-gb-ephemeral-storage/
Update with the CLI:
$ aws lambda update-function-configuration --function-name PDFGenerator --ephemeral-storage '{"Size": 10240}'
I am transferring around 150 files, each 1 GB, to S3 using the aws s3 cp command in a loop, which takes around 20 sec/file, so about 50 minutes in total. If I put all the files in a directory and copy the folder with --recursive, which is multithreaded, it takes up to 40 minutes. I tried changing the S3 config by setting the concurrent requests to 20 and increasing the bandwidth, but the time is almost the same. What is the best way to reduce the time?
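For comparison, a rough sketch of driving the parallelism yourself from Python with boto3 and a thread pool (the directory and bucket name are placeholders), similar in spirit to what the --recursive copy does:

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import boto3

s3 = boto3.client('s3')  # boto3 clients are safe to share across threads
files = list(Path('/data/to-upload').glob('*'))

def upload(path):
    # Each call is itself a managed (multipart) transfer.
    s3.upload_file(str(path), 'my-bucket', path.name)
    return path.name

# Keep several files in flight at once instead of one at a time in a shell loop.
with ThreadPoolExecutor(max_workers=10) as pool:
    for name in pool.map(upload, files):
        print('uploaded', name)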
I have an AWS Lambda function (written in Python 3.7) that is triggered when a specific JSON file is uploaded from a server to an S3 bucket. Currently I have the trigger set on the Lambda for a PUT request with the specific suffix of the file.
The issue is that the Lambda function runs twice every time the JSON file is uploaded once to the S3 bucket. I confirmed via CloudWatch that each additional run is roughly 10 seconds to 1 minute after the previous one and that each run has a unique request ID.
To troubleshoot, I confirmed that the JSON input comes from one bucket and the outputs are written to a completely separate bucket. I silenced all warnings coming from pandas and do not see any errors from the code popping up in CloudWatch. I have also changed the retry attempts from 2 to 0.
The function also has the following metrics when it runs, with the timeout set at 40 seconds and the memory size set to 1920 MB. There should be enough time and memory for the function to use:
Duration: 1216.03 ms Billed Duration: 1300 ms Memory Size: 1920 MB Max Memory Used: 164 MB
I am at a loss as to what I am doing wrong.
How can I force AWS Lambda to display the issues or errors that are causing the function to run multiple times, whether they are in my Python code or wherever else the issue is occurring?
The issue was that my code was throwing an error, but for some reason CloudWatch was not showing it. (An unhandled error in an asynchronous S3-triggered invocation makes Lambda retry the function, which is what produces the extra run.)
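For anyone hitting the same thing, a minimal sketch of wrapping the handler so the traceback always lands in CloudWatch before the exception propagates (process_file is a hypothetical stand-in for the real pandas/S3 work):

import logging
import traceback

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def process_file(event):
    # Hypothetical placeholder for the real processing code.
    return {'status': 'ok'}

def lambda_handler(event, context):
    try:
        return process_file(event)
    except Exception:
        # Log the full traceback so the failure is visible in CloudWatch,
        # then re-raise so the invocation is still marked as failed.
        logger.error(traceback.format_exc())
        raise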
I am uploading a 60 GB file using Python and azure-storage. I get a timeout error (read timeout=65) more often than not:
HTTPSConnectionPool(host='myaccount.blob...', port=443): Read timed out. (read timeout=65)
The code:
bs = BlobService(account_name=storage_account, account_key=account_key)
bs.put_block_blob_from_path(
    container_name=my_container,
    blob_name=azure_blobname,
    file_path=localpath,
    x_ms_blob_content_type="text/plain",
    max_connections=5
)
Is there something I can do to increase the timeout or otherwise fix this issue? put_block_blob_from_path() doesn't seem to have a timeout parameter.
I am using an older version of azure-storage (0.20.0). That's so we don't have to rewrite our code (put_block_blob_from_path no longer exists) and so we avoid the inevitable downtime as we install the new version, switch the code over, and deal with whatever crap is related to installing the new version over the old version. Is this timeout an issue that has been solved in newer versions?
There are a few things you could try:
Increase the timeout: the BlobService constructor has a timeout parameter, whose default value is 65 seconds. You can try increasing that. I believe the max timeout value you can specify is 90 seconds. So your code would be:
bs = BlobService(account_name=storage_account, account_key=account_key, timeout=90)
Reduce "max_connections": max_connections property defines the maximum number of parallel threads in which upload is happening. Since you're uploading a 60GB file, SDK automatically splits that file in 4MB chunks and upload 5 chunks (based on your current value) in parallel. You can try by reducing this value and see if that gets rid of the timeout error you're receiving.
Manually implement put_block and put_block_list: By default the SDK splits the blob in 4MB chunks and upload these chunks. You can try by using put_block and put_block_list methods in the SDK where you're reducing the chunk size from 4MB to a smaller value (say 512KB or 1MB).
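A rough sketch of that last option, assuming the legacy azure-storage 0.20.0 signatures put_block(container, blob, data, block_id) and put_block_list(container, blob, block_ids), and reusing the variables from the question:

import base64

# 2 MB blocks: a block blob can hold at most 50,000 blocks, so for a 60 GB
# file the blocks cannot be much smaller than about 1.3 MB.
chunk_size = 2 * 1024 * 1024
block_ids = []

with open(localpath, 'rb') as f:
    index = 0
    while True:
        data = f.read(chunk_size)
        if not data:
            break
        # Block IDs must be base64-encoded and all the same length.
        block_id = base64.b64encode('{0:08d}'.format(index).encode()).decode()
        bs.put_block(my_container, azure_blobname, data, block_id)
        block_ids.append(block_id)
        index += 1

# Commit the uploaded blocks as the final blob.
bs.put_block_list(my_container, azure_blobname, block_ids)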
I am uploading some files to Amazon S3 through the aws-sdk library for Node.js. With image files, the object ends up much bigger on S3 than the body.length printed in Node.js.
E.g. I've got a file with a body.length of 7050103, but S3 Browser shows that it is:
Size: 8,38 MB (8789522 bytes)
I know that there is some metadata involved, but what metadata could take more than 1 MB?
What is the source of such a big difference? Is there a way to find out what size the file will be on S3 before sending it?
I uploaded the file via the S3 console and in that case there was actually no difference in size. I found out that the problem was in using the lwip library for rotating the image. I had a bug: I rotated even if the angle was 0, so I was rotating by 0 degrees. After such a rotation the image was bigger. I think the JPEG compression may be at a different quality or something.