Move files from S3 to AWS EFS on the fly - node.js

I am building an application where users can upload images; for this I am using S3 as file storage.
In another area of the application there is a process deployed on EC2 that needs to use the uploaded images.
This process needs the images multiple times (it generates reports from them) and it runs on multiple EC2 instances - using Elastic Beanstalk.
The process doesn't need all the images at once, but needs some subset of them for every job it gets (depending on the parameters it receives).
Every EC2 instance is doing an independent job - they don't share files between them, but they might need the same uploaded images.
What I am doing now is downloading all the images from S3 to the EC2 machine, because the process needs the files locally.
I have read that EFS can be mounted to an EC2 instance and then accessed as if it were local storage.
I did not find any example of uploading directly to EFS with Node.js (or another language), but I found a way to transfer files from S3 to EFS - "DataSync".
https://docs.aws.amazon.com/efs/latest/ug/transfer-data-to-efs.html
So I have 3 questions about it:
Is it true that I can't upload directly to EFS from my application (Node.js + Express)?
After I move files to EFS, will I be able to use them exactly as if they were in the local storage of the EC2 instance?
Is it a good idea to move files from S3 to EFS all the time, or is there another solution to the problem I described?

For this exact situation, we use https://github.com/kahing/goofys
It's very reliable, and additionally offers the ability to mount S3 buckets as folders on any device - Windows and Mac, as well as, of course, Linux.
Works outside of the AWS cloud 'boundary' too - great for developer laptops.
Downside is that it does /not/ work in a Lambda context, but you can't have everything!

Trigger a Lambda to call an ECS task when the file is uploaded to S3. The ECS task starts, mounts the EFS volume, and copies the file from S3 to EFS.
This won't run into the problem of Lambda timing out on really large files.
I don't have the code, but I would be interested if someone has already coded this solution.
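For what it's worth, the copy step itself is small. Here is a minimal Node.js sketch using the AWS SDK v3, assuming the task (or instance) already has the EFS volume mounted at /mnt/efs - the mount path, bucket and key are placeholders:

const fs = require('fs');
const path = require('path');
const { pipeline } = require('stream/promises');
const { S3Client, GetObjectCommand } = require('@aws-sdk/client-s3');

const s3 = new S3Client({ region: 'us-east-1' }); // placeholder region
const EFS_MOUNT = '/mnt/efs'; // wherever the EFS volume is mounted

// Stream one object from S3 onto the EFS mount so large files never sit in memory
async function copyToEfs(bucket, key) {
  const { Body } = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
  const target = path.join(EFS_MOUNT, key);
  fs.mkdirSync(path.dirname(target), { recursive: true });
  await pipeline(Body, fs.createWriteStream(target));
  return target;
}

copyToEfs('my-upload-bucket', 'images/example.jpg')
  .then((p) => console.log('copied to', p))
  .catch(console.error);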

Related

Nodejs API, Docker (Swarm), scalability and storage

I programmed an API with Node.js and Express, like a million others out there, and it will go live in a few weeks. The API currently runs on a single Docker host with one volume for persistent data, containing images uploaded by users.
I'm now thinking about scalability and a high-availability setup, which is where the question about network volumes comes in. I've read a lot about NFS volumes and potentially the S3 driver for a Docker swarm.
From the information I gathered, I sorted out two possible solutions for the swarm setup:
Docker Volume Driver
I could connect each Docker host either to an S3 bucket or to EFS storage via the compose file
The connection should keep working even if I move to another VPS provider
Better security if I put NFS storage on the private part of the network (no S3 or EFS)
API Multer S3 Middleware
No attached volume required since the S3 connection is done from within the container
Easier swarm and Docker management
Things have to be re-programmed and a few files need to be migrated
On a GET request, the files will be served by AWS directly instead of by the API
Please tell me your opinion on this. Am I getting this right, or am I missing something? Which route should I take? Is there anything to consider regarding latency or permissions when mounting from different hosts?
Tips on S3 and EFS are definitely welcome, since I have no experience with them yet.
I would not recommend saving to disk; instead, use the S3 API directly - create buckets and write to them in your app code.
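For example, here is a rough sketch of that from an Express route, using multer only to parse the multipart upload in memory and the AWS SDK v3 to write the object - the bucket name and route are placeholders:

const express = require('express');
const multer = require('multer');
const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');

const app = express();
const upload = multer(); // default memory storage: no attached volume needed
const s3 = new S3Client({ region: 'us-east-1' }); // placeholder region

app.post('/images', upload.single('image'), async (req, res) => {
  try {
    const key = `uploads/${Date.now()}-${req.file.originalname}`;
    await s3.send(new PutObjectCommand({
      Bucket: 'my-app-uploads', // placeholder bucket
      Key: key,
      Body: req.file.buffer,
      ContentType: req.file.mimetype,
    }));
    res.json({ key }); // serve it later via a signed URL or CloudFront instead of the API
  } catch (err) {
    res.status(500).json({ error: err.name });
  }
});

app.listen(3000);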
If you're thinking of mounting a single S3 bucket as your drive, there are severe limitations with that. There's the 5 GB upload limit, and any time you modify contents in any way the driver has to re-upload the entire file. If there's any contention it has to retry. Years ago, when I tried this, the FUSE drivers weren't stable enough to use as part of a production system; they'd crash and you'd have to remount. It was a nice idea, but it could only be used as an ad hoc kind of thing on the command line.
As for NFS: for the love of god, don't do this to yourself - you're taking on responsibility for running it yourself.
On EFS I can't really comment; by the time it was available, most people had just learned to use S3, and it is cheaper.

AWS Beanstalk NodeJS app - what happens when I save to file system?

I have an app that currently saves some data as a file to the file system. On a self-hosted server it saves it to disk. When I deploy it to the AWS Beanstalk service, where will this file end up? Does AWS use a persistent or an ephemeral file system?
My use case is very simple and I don't want to bother with setting up S3 storage - is it possible to just leave it as is? Can I access the file system somehow?
Underneath the Beanstalk wrapper there are EC2 instances running your code. So if you use file system storage, the file will be saved on the disk/attached volume. Volume data is persistent, and you can SSH into the EC2 instance to find your saved file.
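As an illustration, something like the sketch below ends up on the instance's volume next to the deployed code; on the Node.js platforms the app typically lives under /var/app/current, so that is roughly where you would look after SSHing in (the directory and file names are just placeholders):

const fs = require('fs');
const path = require('path');

// Write next to the deployed app code and log the absolute path,
// which makes the file easy to locate over SSH or in the EB logs.
const dataDir = path.join(__dirname, 'data');
fs.mkdirSync(dataDir, { recursive: true });

const file = path.join(dataDir, 'report.json');
fs.writeFileSync(file, JSON.stringify({ savedAt: new Date().toISOString() }));
console.log('saved to', file);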

Update the ID3 tags of S3 bucket files

In my AWS S3 bucket I have thousands of MP3 files, and I want to modify the ID3 tags of those files. Please suggest the best way.
Sorry to give you the bad news, but the only way to do it is to download the files one by one, update the ID3 tags, and upload them back to the S3 bucket. You cannot edit files in place, because AWS S3 is object storage, meaning it keeps data as key-value pairs: the key is the folder/filename, the value is the file content. It's not suitable as a file system, database, etc.
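A rough sketch of that loop for a single object in Node.js with the AWS SDK v3 - the bucket name and key are placeholders, and the actual tag editing is left to whichever ID3 library you prefer:

const fs = require('fs');
const os = require('os');
const path = require('path');
const { pipeline } = require('stream/promises');
const { S3Client, GetObjectCommand, PutObjectCommand } = require('@aws-sdk/client-s3');

const s3 = new S3Client({ region: 'us-east-1' }); // placeholder region

async function retagObject(bucket, key) {
  const local = path.join(os.tmpdir(), path.basename(key));

  // 1. download the mp3 to a temp file
  const { Body } = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
  await pipeline(Body, fs.createWriteStream(local));

  // 2. update the ID3 tags on the local copy with the library of your choice

  // 3. upload the modified file back under the same key
  await s3.send(new PutObjectCommand({
    Bucket: bucket,
    Key: key,
    Body: fs.readFileSync(local),
    ContentType: 'audio/mpeg',
  }));
  fs.unlinkSync(local); // clean up the temp file
}

retagObject('my-music-bucket', 'albums/track01.mp3').catch(console.error);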
If you do it this way, one warning: check whether versioning is on or off for your bucket. Sometimes it's nice to have versioning handled automatically by S3, but you should remember that each version adds to the storage space you're paying for.
If you want to edit/modify your files every now and then, you can use AWS EBS or EFS instead. You can attach either to an EC2 instance and then edit/modify your files there. The main difference is that EFS can be attached to multiple EC2 instances at the same time and share the files between them, while EBS is tied to a single instance.
One more thing about EBS and EFS, though: to reach your files you need to attach them to an EC2 instance - you can't reach your files as easily as you can in S3.

Bidirectional synchronisation between Amazon s3 bucket and physical server

We have a folder on a physical server and need to synchronise it with one of our AWS S3 buckets. The requirement is that we have to synchronise the contents both ways (changes made on the physical server should be reflected in the AWS S3 bucket and vice versa). Is it possible?
Use AWS CLI S3 sync. Note that sync is one-way, so you have to issue two separate commands switching source and target to achieve bidirectional sync.
From local directory to S3
aws s3 sync . s3://mybucket
From S3 to local directory
aws s3 sync s3://mybucket .
Running both will get you both directions of the sync.
As pointed out in the comments, each time you modify S3 or your local folder you need to sync again, or you risk a later sync in the opposite direction overwriting the updated files.
There are products that do this - the open-source ownCloud and Nextcloud run on S3 and a local computer and can sync two folders as a two-way, near-live mirror, à la Dropbox. Resilio Sync also uses BitTorrent to do fast two-way mirrors and can run on S3.

When hosting on EC2, should I use FS to store files "locally" or s3fs to store files on my s3 service "indirectly"?

I'm hosting a Node.js Express application on EC2, and I'm using Amazon's S3 storage service.
Within my application, hosted on Amazon, should I write the files locally (since the server is already running on AWS), or should I still use the s3fs package to store the files on the S3 service as if I were on a remote machine?
Thanks all!
Don't use s3fs. It's nice, but if you try to use it in production, it will be a nightmare. S3FS has to 'translate' any AWS errors into the very limited set that a filesystem can return. It also can't give you fine-grained control of retries, etc.
It's much better to write code to interact with S3. You will be able to get the full error from S3, you can decide what your retry policy is, etc.
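For instance, here is a small sketch with the AWS SDK v3: retries are configured on the client, and failures surface the full S3 error (code, HTTP status, request id) instead of a generic filesystem error - the bucket and key are placeholders:

const { S3Client, GetObjectCommand } = require('@aws-sdk/client-s3');

// maxAttempts is the SDK's built-in retry knob - something a FUSE mount never exposes
const s3 = new S3Client({ region: 'us-east-1', maxAttempts: 5 });

async function readObject(bucket, key) {
  try {
    const { Body } = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
    return await Body.transformToString();
  } catch (err) {
    console.error(err.name, err.$metadata?.httpStatusCode, err.$metadata?.requestId);
    throw err;
  }
}

readObject('my-bucket', 'config/settings.json').then(console.log).catch(() => process.exit(1));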
