I'm trying to save money on EBS snapshots, so the idea is to take manual copies of the file systems (using dd), store them in S3, and lifecycle them to IA and Glacier.
The following works fine for smaller volumes (tested with 1GB), but on a larger one (~800GB) everything slows to a crawl after around 40GB and never finishes:
sudo dd if=/dev/sdb bs=64M status=progress | aws s3 cp - s3://my-bucket/sdb_backup.img --sse AES256 --storage-class STANDARD_IA
I'm running this from an m4.4xlarge instance (16 vCPU, 64GB RAM).
I'm not exactly sure why it slows to a crawl, or whether this is even the best way to solve the problem (manually storing file system images in the S3 Infrequent Access storage class).
Any thoughts?
Thanks!!
It is not a good idea, because EBS snapshots are incremental: after the first full copy, each snapshot stores only the changed blocks, so from the second backup onwards your hand-made full images will cost more than snapshots would.
If you still want to go this way, then consider a multipart upload (parts of up to 5GB each).
You can also use something like goofys to mount the bucket and redirect output to S3. I've personally tested it with files up to 1TB.
First, consider multipart upload for large sizes.
Second, compress the stream before uploading:
dd if=/dev/sdX | gzip -c | aws s3 cp - s3://bucket-name/desired_image_name.img
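One note on the original command: when the CLI reads from stdin it cannot know the final object size, so for very large streams it helps to pass --expected-size to aws s3 cp so that it can pick a part size that stays under S3's 10,000-part limit. If you would rather drive the multipart upload from the SDK instead, here is a minimal sketch using the AWS SDK for JavaScript v3 (@aws-sdk/lib-storage); the bucket and key are taken from the question, everything else is illustrative:

// upload.ts (run as an ES module): sudo dd if=/dev/sdb bs=64M status=progress | node upload.js
import { S3Client } from "@aws-sdk/client-s3";
import { Upload } from "@aws-sdk/lib-storage";

const upload = new Upload({
  client: new S3Client({}),
  params: {
    Bucket: "my-bucket",
    Key: "sdb_backup.img",
    Body: process.stdin,              // stream straight from dd
    ServerSideEncryption: "AES256",
    StorageClass: "STANDARD_IA",
  },
  // S3 allows at most 10,000 parts per object, so the part size must be at
  // least objectSize / 10,000 (about 80 MB for an 800 GB image).
  partSize: 100 * 1024 * 1024,
  queueSize: 4,                       // number of parts uploaded in parallel
});

upload.on("httpUploadProgress", (p) => console.error(`uploaded ${p.loaded ?? 0} bytes`));
await upload.done();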
If you wish to copy files to Amazon S3, the easiest method is to use the AWS Command-Line Interface (CLI):
aws s3 sync dir s3://my-bucket/dir
As an alternative to Standard-Infrequent Access, you could create a lifecycle policy on the S3 bucket to move files to Glacier. (This is worthwhile for long-term storage, but not for the short-term due to higher request charges.)
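If you prefer to script that lifecycle rule rather than create it in the console, here is a minimal sketch using the AWS SDK for JavaScript v3; the bucket name, prefix, and 90-day transition are illustrative assumptions:

import { S3Client, PutBucketLifecycleConfigurationCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

// Example rule: move everything under backups/ to Glacier after 90 days.
await s3.send(new PutBucketLifecycleConfigurationCommand({
  Bucket: "my-bucket",
  LifecycleConfiguration: {
    Rules: [
      {
        ID: "archive-backups",
        Status: "Enabled",
        Filter: { Prefix: "backups/" },
        Transitions: [{ Days: 90, StorageClass: "GLACIER" }],
      },
    ],
  },
}));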
I need to save files I get from S3 into a Lambda function's file system, and I wanted to know if I can do that simply using fs.writeFileSync?
Or do I still have to use the context function as described here:
How to Write and Read files to Lambda-AWS with Node.js
(I tried to find newer examples, but could not.)
What is the recommended method?
Please advise.
Yes, you can use the typical fs functions to read/write from local disk, but be aware that writing is limited to the /tmp directory and the default max diskspace available to your Lambda function in that location is 512 MB. Also note that files written there may persist to the next (warm) Lambda invocation.
If you want to simply download an object from S3 to the local disk (assuming it will fit in the available diskspace) then you can combine AWS SDK methods and Node.js streaming to stream the content to disk.
Also, it's worth noting that, depending on your app, you may be able to process the entire S3 object in RAM via streaming, without any need to actually persist to disk. This is helpful if your object size is over 512MB.
Update: as of March 2022, you can now configure Lambda functions with more ephemeral storage in /tmp, up to 10 GB. You get 512 MB included with your Lambda function invocation and are charged for the additional configured storage above 512 MB.
If you need to persist very large files, consider using Elastic File System.
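As a rough sketch of the streaming approach described above (AWS SDK for JavaScript v3; the bucket, key, and file name are hypothetical):

import { createWriteStream } from "node:fs";
import { Readable } from "node:stream";
import { pipeline } from "node:stream/promises";
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

export const handler = async () => {
  const { Body } = await s3.send(
    new GetObjectCommand({ Bucket: "my-bucket", Key: "big-object.bin" })
  );
  // Stream the object body to ephemeral storage instead of buffering it all in memory.
  await pipeline(Body as Readable, createWriteStream("/tmp/big-object.bin"));
  return "saved to /tmp/big-object.bin";
};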
Lambda does not give you persistent local file system access; it is meant to be an ephemeral environment. You do get access to the /tmp folder, but only up to a maximum of 512MB (by default). If you want storage alongside your function, you will need to use AWS S3 or AWS EFS.
Here's an article from AWS explaining this.
Here's the docs on adding storage to Lambda.
I am working on object detection for a school project. To train my CNN model I am using a google cloud server because I do not own a strong enough GPU to train it locally.
The training data consists of images (.jpg files) and annotations (.txt files) and is spread over around 20 folders, because the data comes from different sources. I do not want to mix pictures from different sources, so I want to keep this directory structure.
My current issue is that I could not find a fast way of uploading them to my google cloud server.
My workaround was to upload those image folders as .zip files to Google Drive, then download and unzip them on the cloud server. This process takes way too much time, because I have to upload many folders and Google Drive does not have a good API for downloading folders to Linux.
On my local computer, I am using Windows 10 and my cloud server runs Debian.
Therefore, I'd be really grateful if you know a fast and easy way to either upload my images directly to the server or at least to upload my zipped folders.
Couldn't you just create a loop that looks for .jpg files and scp/sftp them directly to your server as soon as they appear? On Windows, you can achieve this using WSL.
(Sorry, this may not be your final answer, but I don't have the reputation to ask this as a comment.)
I would upload them to a Google Cloud Storage bucket using gsutil with multithreading. This means that multiple files are copied at once, so the only limitation here is your internet speed. Gsutil installers for Windows and Linux are found here. Example command:
gsutil -m cp -r dir gs://my-bucket
Then on the VM you do exactly the opposite:
gsutil -m cp -r gs://my-bucket dir
This is very fast, and you only pay a small amount for the storage, which is cheap and may even fall within the GCP free tier.
Note: make sure you have write permissions on the storage bucket and the default compute service account (i.e. the VM service account) has read permissions on the storage bucket.
The best stack for this use case is gsutil plus a Cloud Storage bucket.
Copy the zip files to a Cloud Storage bucket and set up a cron job on the VM to sync them down (for example with gsutil rsync -r gs://my-bucket /path/on/vm).
Make use of gsutil
https://cloud.google.com/storage/docs/gsutil
I am trying to sync files from a local source to an S3 bucket. I upload each file to the bucket after calculating its MD5 checksum and putting the checksum in the file's metadata. To avoid duplicate uploads, I also check the files that are already at the destination: I build the upload list from the files that do not match on both name and MD5. Fetching the metadata for the S3 files, computing MD5 for the local files on the fly, and matching them is taking a lot of time, as I have around 200,000 to 500,000 files to match.
Is there a better way to achieve this, either with multithreading or anything else? I don't have much idea how to do this in a multithreaded environment, as I eventually need one list with multiple threads doing the processing and adding to that same list. Any code sample or help is much appreciated.
This Windows job application is written in C#, using .NET 4.6.1 framework.
You could use the AWS Command-Line Interface (CLI), which has an aws s3 sync command that does something very similar to what you describe. However, with several hundred thousand files, it is going to be slow on the matching, too.
Or, you could use Amazon S3 Inventory to obtain a daily listing of the objects in the S3 bucket (including the ETag, which equals the MD5 checksum for single-part, non-KMS uploads) and then compare your files against that.
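If you take the Inventory route, the comparison becomes an offline join between the inventory report and your local hashes, which is easy to batch or parallelize. Your job is C#, but here is a rough sketch of the pattern in TypeScript; the directory, CSV path, and key,etag column layout are assumptions about your setup:

import { createHash } from "node:crypto";
import { createReadStream, readFileSync, readdirSync } from "node:fs";
import { join } from "node:path";

const localDir = "C:/data/to-sync";           // hypothetical local source directory
const inventoryCsv = "C:/data/inventory.csv"; // hypothetical inventory extract with key,etag per line

// Build a key -> etag map from the inventory report.
const remote = new Map<string, string>();
for (const line of readFileSync(inventoryCsv, "utf8").trim().split("\n")) {
  const [key, etag] = line.split(",").map((s) => s.replace(/"/g, ""));
  remote.set(key, etag);
}

// Stream-hash a local file (the ETag equals the MD5 only for single-part, non-KMS uploads).
async function md5(path: string): Promise<string> {
  const hash = createHash("md5");
  for await (const chunk of createReadStream(path)) hash.update(chunk as Buffer);
  return hash.digest("hex");
}

const toUpload: string[] = [];
for (const name of readdirSync(localDir)) {
  if (remote.get(name) !== await md5(join(localDir, name))) toUpload.push(name); // new or changed
}
console.error(`${toUpload.length} files need uploading`);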
In my AWS S3 bucket I have thousands of MP3 files, and I want to modify the ID3 tags for those files. Please suggest the best way.
Sorry to give you the bad news, but the only way to do this is to download the files one by one, update the ID3 tags, and upload them back to the S3 bucket (see the sketch at the end of this answer). You cannot edit files in place, because AWS S3 is object storage: it keeps data as key/value pairs, where the key is the folder/filename and the value is the file content. It is not suitable for use as a file system, a database, etc.
If you do it this way, one warning: check whether versioning is on or off for your bucket. It is sometimes nice to have versioning handled automatically by S3, but remember that each version adds to the storage space you are paying for.
If you want to edit/modify your files every now and then, you can use AWS EBS or EFS instead. EBS is block storage and EFS is a network file system; you can attach or mount them to EC2 instances and then edit your files there. The main difference is that EFS can be mounted by multiple EC2 instances at the same time and share files between them, while an EBS volume is attached to a single instance.
One more thing about EBS and EFS, though: to reach your files you need to attach or mount them on an EC2 instance. You cannot reach the files as easily as you can in S3.
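As a sketch of that download, edit, re-upload loop (AWS SDK for JavaScript v3; the bucket name is an assumption, pagination is omitted, and the actual tag editing is left to whichever ID3 library you choose):

import { readFile, writeFile } from "node:fs/promises";
import { S3Client, ListObjectsV2Command, GetObjectCommand, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});
const Bucket = "my-mp3-bucket"; // hypothetical bucket name

// ListObjectsV2 returns at most 1,000 keys per call; with thousands of files you would paginate.
const { Contents = [] } = await s3.send(new ListObjectsV2Command({ Bucket }));

for (const { Key } of Contents) {
  if (!Key?.endsWith(".mp3")) continue;

  // 1. Download the object to local disk.
  const { Body } = await s3.send(new GetObjectCommand({ Bucket, Key }));
  const local = `/tmp/${Key.split("/").pop()}`;
  await writeFile(local, await Body!.transformToByteArray());

  // 2. Update the ID3 tags in the local file with the library of your choice (not shown).

  // 3. Upload the modified file back under the same key.
  await s3.send(new PutObjectCommand({ Bucket, Key, Body: await readFile(local) }));
}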
I need to send ~2TB of backup files to S3. I guess the most hassle-free option would be the Linux scp command (I have had difficulty with s3cmd and don't want an overkill Java/RoR solution to do this).
However, I am not sure whether this is possible: how would I use S3's keys with scp, and what would my destination IP/URL/path be?
I appreciate your hints.
As of 2015, SCP/SSH is not supported (and probably never will be for the reasons mentioned in the other answers).
Official AWS tools for copying files to/from S3
The command line tool (pip3 install awscli). Note that credentials need to be specified; I prefer to do this via environment variables rather than a config file: AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
aws s3 cp /tmp/foo/ s3://bucket/ --recursive --exclude "*" --include "*.jpg"
http://docs.aws.amazon.com/cli/latest/reference/s3/index.html
and an rsync-like command:
aws s3 sync . s3://mybucket
http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
Web interface:
https://console.aws.amazon.com/s3/home?region=us-east-1
Non-AWS methods
Any other solutions depend on third-party executables (e.g. botosync, jungledisk...) which can be great as long as they are supported. But third party tools come and go as years go by and your scripts will have a shorter shelf life.
https://github.com/ncw/rclone
EDIT: Actually, AWS CLI is based on botocore:
https://github.com/boto/botocore
So botosync deserves a bit more respect as an elder statesman than I perhaps gave it.
Here's just the thing for this, boto-rsync. From any Linux box, install boto-rsync and then use this to transfer /local/path/ to your_bucket/remote/path/:
boto-rsync -a your_access_key -s your_secret_key /local/path/ s3://your_bucket/remote/path/
The paths can also be files.
For a S3-compatible provider other than AWS, use --endpoint:
boto-rsync -a your_access_key -s your_secret_key --endpoint some.provider.com /local/path/ s3://your_bucket/remote/path/
You can't SCP.
The quickest way, if you don't mind spending money, is probably just to send it to them on a disk and they'll put it up there for you. See their Import/Export service.
Here you go (this uses bash process substitution to stream the file from the remote host straight into S3):
scp USER@REMOTE_IP:/FILE_PATH >(aws s3 cp - s3://BUCKET/SAVE_FILE_AS_THIS_NAME)
Why don't you scp it to an EBS volume and then use s3cmd from there? As long as your EBS volume and S3 bucket are in the same region, you'll only pay for inbound data transfer once (from your network to the EBS volume).
I've found that once you're inside the AWS network, s3cmd is much more reliable and the data transfer rate is far higher than uploading directly to S3.
There is an amazing tool called DragonDisk. It even works as a sync tool, not just for plain scp-style copying.
http://www.s3-client.com/
The guide to set up Amazon S3 is provided here, and after setting it up you can either copy/paste files from your local machine to S3 or set up an automatic sync. The user interface is very similar to WinSCP or FileZilla.
For our AWS backups we use a combination of duplicity and trickle: duplicity for rsync-style backups and encryption, and trickle to limit the upload speed.