Quick way of uploading many images (3000+) to a Google Cloud server - Linux

I am working on object detection for a school project. To train my CNN model I am using a Google Cloud server, because I do not own a GPU strong enough to train it locally.
The training data consists of images (.jpg files) and annotations (.txt files) and is spread over around 20 folders, because the data comes from different sources and I do not want to mix pictures from different sources, so I want to keep this directory structure.
My current issue is that I have not found a fast way of uploading them to my Google Cloud server.
My workaround was to upload the image folders as .zip files to Google Drive, then download and unzip them on the cloud server. This takes far too much time because I have to upload many folders, and Google Drive does not have a good API for downloading folders to Linux.
On my local computer, I am using Windows 10 and my cloud server runs Debian.
Therefore, I'd be really grateful if you know a fast and easy way to either upload my images directly to the server or at least to upload my zipped folders.

Couldn't you just create an infinite loop that looks for new .jpg files and scp/sftp each one directly to your server as soon as it appears? On Windows, you can achieve this using WSL.
(Sorry, this may not be your final answer, but I don't have the reputation to ask you this as a comment.)
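If you go that route, a minimal sketch under WSL, assuming inotify-tools is installed and SSH key access to the server is set up (the paths and host below are placeholders, and note that inotify may not pick up changes on /mnt/c drives in every WSL setup):
# watch a folder and copy each new .jpg to the server as soon as it is fully written
inotifywait -m -e close_write --format '%w%f' /mnt/c/data/images |
while read -r f; do
  case "$f" in
    *.jpg) scp "$f" user@your-server:/home/user/images/ ;;
  esac
done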

I would upload them to a Google Cloud Storage bucket using gsutil with multithreading. This means that multiple files are copied at once, so the only limitation is your internet speed. Gsutil installers are available for both Windows and Linux. Example command:
gsutil -m cp -r dir gs://my-bucket
Then on the VM you do exactly the opposite:
gsutil -m cp -r gs://my-bucket dir
This is very fast, and you only pay a small amount for the storage, which is cheap and may well fall within the GCP free tier.
Note: make sure you have write permission on the storage bucket and that the default compute service account (i.e. the VM's service account) has read permission on the bucket.
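A rough sketch of granting that read permission with gsutil (the service-account email and bucket name are placeholders for your own):
# give the VM's default compute service account read access to the bucket
gsutil iam ch serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com:objectViewer gs://my-bucket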

The best stack for this use case is gsutil plus a Cloud Storage bucket.
Copy the zip files to a Cloud Storage bucket and set up a sync cron job to pull the files onto the VM.
Make use of gsutil:
https://cloud.google.com/storage/docs/gsutil
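A rough sketch of such a cron entry on the VM (the bucket name and target directory are placeholders):
# pull down anything new from the bucket every 10 minutes
*/10 * * * * gsutil -m rsync -r gs://my-bucket /home/user/training-data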

Related

Move files from S3 to AWS EFS on the fly

I am building an application where the user can upload images; for this I am using S3 as file storage.
In another area of the application there is a process deployed on EC2 that needs to use the uploaded images.
This process needs the images multiple times (it generates reports with them) and it runs on multiple EC2 instances - using Elastic Beanstalk.
The process doesn't need all the images at once, but needs some subset of them for every job it gets (depending on the parameters it gets).
Every EC2 instance does an independent job - they do not share files between them, but they might need the same uploaded images.
What I am doing now is downloading all the images from S3 to the EC2 machine, because the process needs the files locally.
I have read that EFS can be mounted to an EC2 instance and then accessed as if it were local storage.
I did not find any example of uploading directly to EFS with Node.js (or another language), but I found a way to transfer files from S3 to EFS - "DataSync".
https://docs.aws.amazon.com/efs/latest/ug/transfer-data-to-efs.html
So I have 3 questions about it:
Is it true that I can't upload directly to EFS from my application (Node.js + Express)?
After I move files to EFS, will I be able to use them exactly as if they were in the local storage of the EC2 instance?
Is it a good idea to move files from S3 to EFS all the time, or is there another solution to the problem I described?
For this exact situation, we use https://github.com/kahing/goofys
It's very reliable and, additionally, offers the ability to mount S3 buckets as folders on any device - Windows and Mac as well as, of course, Linux.
Works outside of the AWS cloud 'boundary' too - great for developer laptops.
Downside is that it does /not/ work in a Lambda context, but you can't have everything!
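A minimal example of mounting a bucket on Linux (the bucket name and mount point are placeholders; credentials are picked up from the usual AWS environment/config):
mkdir -p /mnt/my-bucket
goofys my-bucket /mnt/my-bucket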
Trigger a Lambda function to start an ECS task when the file is uploaded to S3. The ECS task starts, mounts the EFS volume, and copies the file from S3 to EFS.
This won't run into the problem of Lambda timing out on really large files.
I don't have the code, but I would be interested if someone has already coded this solution.
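For what it's worth, once the EFS volume is mounted into the ECS task's container, the copy step itself is essentially a one-liner; a sketch with placeholder bucket, key, and mount path:
# inside the ECS task container, with the EFS volume mounted at /mnt/efs
aws s3 cp s3://my-bucket/uploads/image.jpg /mnt/efs/uploads/image.jpg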

Node.js API, Docker (Swarm), scalability and storage

I programmed an API with Node.js and Express, like a million others out there, and it will go live in a few weeks. The API currently runs on a single Docker host with one volume for persistent data containing images uploaded by users.
I'm now thinking about scalability and a high-availability setup, which is where the question about network volumes comes in. I've read a lot about NFS volumes and potentially the S3 driver for a Docker swarm.
From the information I gathered, I sorted out two possible solutions for the swarm setup:
Docker Volume Driver
I could connect each Docker host either to an S3 bucket or to EFS storage via the Compose file
Connection should work even if I move to a different VPS provider
Better security if I put an NFS storage on the private part of the network (no S3 or EFS)
API Multer S3 Middleware
No attached volume required since the S3 connection is done from within the container
Easier swarm and docker management
Things have to be re-programmed and a few files need to be migrated
On a GET request, the files will be served by AWS directly instead of by the API
Please tell me your opinion on this. Am I getting this right, or am I missing something? Which route should I take? Is there anything to consider regarding latency or permissions when mounting from different hosts?
Tips on S3 and EFS are definitely welcome, since I have no experience with them yet.
I would not recommend saving to disk; instead, use the S3 API directly - create buckets and write to them from your app code.
If you're thinking of mounting a single S3 bucket as your drive, there are severe limitations with that: the 5 GB limit, and any time you modify contents in any way the driver will re-upload the entire bucket. If there's any contention it has to retry. Years ago, when I tried this, the FUSE drivers weren't stable enough to use as part of a production system; they'd crash and you'd have to remount. It was a nice idea, but it could only be used as an ad hoc kind of thing on the command line.
As for NFS: for the love of god, don't do this to yourself - you're taking all of that responsibility onto yourself.
On EFS I can't really comment; by the time it became available, most people had just learned to use S3, and S3 is cheaper.

AWS S3 File Sync

I am trying to sync files from a local source to an S3 bucket. I upload the files to the bucket after calculating an MD5 checksum for each file and putting it in that file's metadata. The issue is that, while doing so, I also check the files that are already at the destination to avoid duplicate uploads. I do this by building a list of files to upload whose name and MD5 do not both match an existing object. Fetching the metadata for the S3 files, computing the MD5 of local files on the fly, and then matching them is taking a lot of time, as I have around 200,000 to 500,000 files to match.
Is there any better way to achieve this, either by using multithreading or anything else? I don't have much of an idea how to achieve it in a multithreaded environment, as I eventually need one list with multiple threads doing the processing and adding to that same list. Any code sample or help is much appreciated.
This Windows job application is written in C#, using the .NET Framework 4.6.1.
You could use the AWS Command-Line Interface (CLI), which has an aws s3 sync command that does something very similar to what you describe. However, with several hundred thousand files, it is going to be slow on the matching, too.
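For reference, a basic aws s3 sync invocation looks like this (the local path and bucket are placeholders):
aws s3 sync C:\local\source s3://my-bucket/prefix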
Or, you could use Amazon S3 Inventory to obtain a daily listing of the files in the S3 bucket (including the MD5 checksum) and then compare your files against that.

File read/write on the cloud (Heroku) using Node.js

First of all, I am a beginner with Node.js.
In Node.js, when I use functions such as fs.writeFile(), the file is created and is visible in my repository. But when this same process runs on a cloud platform such as Heroku, no file is visible in the repository (cloned via git). I know the file is being created because I am able to read it, but I cannot view it. Why is this, and how can I view the file?
I had the same issue and found out that Heroku and other cloud services generally prefer that you don't write to their file system; everything you write/save is stored in an "ephemeral filesystem" - it's like a ghost file system, really.
Usually you would want to use Amazon S3 or Redis for JSON files etc., and for other, bigger files like MP3s.
If you rent a remote server, like ECS, with a Linux system and a mounted storage space, then this might work.

How to scp to Amazon S3?

I need to send ~2 TB of backup files to S3. I guess the most hassle-free option would be the Linux scp command (I have difficulty with s3cmd and don't want an overkill Java/RoR solution).
However, I am not sure whether this is possible: how would I use S3's private and public keys with scp, and what would my destination IP/URL/path be?
I appreciate your hints.
As of 2015, SCP/SSH is not supported (and probably never will be for the reasons mentioned in the other answers).
Official AWS tools for copying files to/from S3
The aws command line tool (pip3 install awscli) - note that credentials need to be specified; I prefer to do this via environment variables rather than a credentials file: AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (see the example after the sync command below).
aws s3 cp /tmp/foo/ s3://bucket/ --recursive --exclude "*" --include "*.jpg"
http://docs.aws.amazon.com/cli/latest/reference/s3/index.html
and an rsync-like command:
aws s3 sync . s3://mybucket
http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
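As noted above, the credentials can be exported before running either command; for example (the values are placeholders):
export AWS_ACCESS_KEY_ID=YOUR_KEY_ID
export AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY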
Web interface:
https://console.aws.amazon.com/s3/home?region=us-east-1
Non-AWS methods
Any other solutions depend on third-party executables (e.g. botosync, jungledisk...), which can be great as long as they are supported. But third-party tools come and go as the years go by, and your scripts will have a shorter shelf life.
https://github.com/ncw/rclone
EDIT: Actually, AWS CLI is based on botocore:
https://github.com/boto/botocore
So botosync deserves a bit more respect as an elder statesman than I perhaps gave it.
Here's just the thing for this, boto-rsync. From any Linux box, install boto-rsync and then use this to transfer /local/path/ to your_bucket/remote/path/:
boto-rsync -a your_access_key -s your_secret_key /local/path/ s3://your_bucket/remote/path/
The paths can also be files.
For an S3-compatible provider other than AWS, use --endpoint:
boto-rsync -a your_access_key -s your_secret_key --endpoint some.provider.com /local/path/ s3://your_bucket/remote/path/
You can't SCP.
The quickest way, if you don't mind spending money, is probably just to send it to them on a disk and they'll put it up there for you. See their Import/Export service.
Here you go:
scp USER@REMOTE_IP:/FILE_PATH >(aws s3 cp - s3://BUCKET/SAVE_FILE_AS_THIS_NAME)
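If the process-substitution form gives you trouble, an equivalent variant (not from the answer above) streams the file over a plain ssh pipe into the same stdin-reading aws s3 cp - call:
ssh USER@REMOTE_IP 'cat /FILE_PATH' | aws s3 cp - s3://BUCKET/SAVE_FILE_AS_THIS_NAME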
Why don't you scp it to an EBS volume and then use s3cmd from there? As long as your EBS volume and S3 bucket are in the same region, you'll only incur inbound data charges once (from your network to the EBS volume).
I've found that, once within the AWS network, s3cmd is much more reliable and the data transfer rate is far higher than uploading directly to S3.
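A rough sketch of that second hop on the instance (the paths and bucket are placeholders):
# run on the EC2 instance once the backup has landed on the EBS volume
s3cmd put /mnt/ebs/backup.tar.gz s3://your_bucket/backups/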
There is an amazing tool called DragonDisk. It even works as a sync tool, not just as a plain scp-style copy.
http://www.s3-client.com/
The guide to set up Amazon S3 is provided there, and after setting it up you can either copy-paste files from your local machine to S3 or set up an automatic sync. The user interface is very similar to WinSCP or FileZilla.
For our AWS backups we use a combination of duplicity and trickle: duplicity for rsync-style incremental backups and encryption, and trickle to limit the upload speed.
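A rough sketch of how the two combine (the bucket, rate, and credentials are placeholders, and the exact S3 URL scheme depends on your duplicity version):
# cap the upload at ~512 KB/s while duplicity encrypts and uploads incrementally
export AWS_ACCESS_KEY_ID=YOUR_KEY_ID
export AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY
trickle -s -u 512 duplicity /path/to/backup s3+http://your_bucket_name/backups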
