I programmed an API with nodejs and express like million others out there and it will go live in a few weeks. The API currently runs on a single docker host with one volume for persistent data containing images uploaded by users.
I'm now thinking about scalability and a high availability setup where the question about network volumes come in. I've read a lot about NFS volumes and potentially the S3 Driver for a docker swarm.
From the information I gathered, I sorted out two possible solutions for the swarm setup:
Docker Volume Driver
I could connect each docker host either to an S3 Bucket or EFS Storage with the compose file
Connection should work even if I move VPS Provider
Better security if I put a NFS storage on the private part of the network (no S3 or EFS)
API Multer S3 Middleware
No attached volume required since the S3 connection is done from within the container
Easier swarm and docker management
Things have to be re-programmed and a few files needs to be migrated
On a GET request, the files will be provided by AWS directly instead of the API
Please, tell me your opposition on this. Am I getting this right or do I miss something? Which route should I take? Is there something to consider with latency or permissions when mounting from different hosts?
Tipps on S3, EFS are definitely welcome, since I have no knowledge yet.
I would not recommend saving to disk, instead use S3 API directly - create buckets and write in your app code.
If you're thinking of mounting a single S3 bucket as your drive there are severe limitations with that. The 5Gb limit. Anytime you modify contents in any way the driver will reupload the entire bucket. If there's any contention it'll have to retry. Years ago when I tried this the fuse drivers weren't stable enough to use as part of a production system, they'd crash and you have to remount. It was a nice idea but could only be used as an ad hoc kind of thing on the command line.
As far as NFS for the love of god don't do this to yourself you're taking on responsibility for this on yourself.
EFS can't really comment, by the time it was available most people just learned to use S3 and it is cheaper.
Related
Is it good practice for node.js service containers running under AWS ECS to mount a shared node_modules volume persisted on EFS? If so, what's the best way to pre-populate the EFS before the app launches?
My front-end services run a node.js app, launched on AWS Fargate instances. This app uses many node_modules. Is it necessary for each instance to install the entire body of node_modules within its own container? Or can they all mount a shared EFS filesystem containing a single copy of the node_modules?
I've been migrating to AWS Copilot to orchestrate, but the docs are pretty fuzzy on how to pre-populate the EFS. At one point they say, "we recommend mounting a temporary container and using it to hydrate the EFS, but WARNING: we don't recommend this approach for production." (Storage: AWS Copilot Advanced Use Cases)
Thanks for the question! This is pointing out some gaps in our documentation that have been opened as we released new features. There is actually a manifest field, image.depends_on which mitigates the issue called out in the docs about prod usage.
To answer your question specifically about hydrating EFS volumes prior to service container start, you can use a sidecar and the image.depends_on field in your manifest.
For example:
image:
build: ./Dockerfile
depends_on:
bootstrap: success
storage:
volumes:
common:
path: /var/copilot/common
read_only: true
efs: true
sidecars:
bootstrap:
image: public.ecr.aws/my-image:latest
essential: false #Allows the sidecar to run and terminate without the ECS task failing
mount_points:
- source_volume: common
path: /var/copilot/common
read_only: false
On deployment, you'd build and push your sidecar image to ECR. It should include either your packaged data or a script to pull down the data you need, then move it over into the EFS volume at /var/copilot/common in the container filesystem.
Then, when you next run copilot svc deploy, the following things will happen:
Copilot will create an EFS filesystem in your environment. Each service will get its own isolated access point in EFS, which means service containers all share data but can't see data added to EFS by other services.
Your sidecar will run to completion. That means that all currently running services will see the changes to the EFS filesystem whenever a new copy of the task is deployed unless you specifically create task-specific subfolders in EFS in your startup script.
Once the sidecar exits successfully, the new service container will come up on ECS and operate as normal. It will have access to the EFS volume which will contain the latest copy of your startup data.
Hope this helps.
Is it good practice for node.js service containers running under AWS
ECS to mount a shared node_modules volume persisted on EFS?
Asking if it is "good" or not is a matter of opinion. It is generally a common practice in ECS however. You have to be very cognizant of the IOPS your application is going to generate against the EFS volume however. Once an EFS volume runs out of burst credits it can really slow down and impact the performance of your application.
I have not ever seen an EFS volume used to store node_modules before. In all honesty it seems like a bad idea to me. Dependencies like that should always be bundled in your Docker image. Otherwise it's going to be difficult when it comes time to upgrade those dependencies in your EFS volume, and may require down-time to upgrade.
If so, what's the best way to pre-populate the EFS before the app
launches?
You would have to create the initial EFS volume and mount it somewhere like an EC2 instance, or another ECS container, and then run whatever commands necessary in that EC2/ECS instance to copy your files to the volume.
The quote in your question isn't present on the page you linked, so it's difficult to say exactly what other approach would the Copilot team would recommend.
I am building an application where the user can upload images, for this I am using S3 as files storage.
In other area of the application there is some process deployed on EC2 that need to use the uploaded images.
This process need the images multiple times (it generate some report with it) and it's in part of multiple EC2 - using elastic beanstalk.
The process doesn't need all the images at once, but need some subset of it every job it gets (depend the parameters it gets).
Every ec2 instance is doing an independent job - they are not sharing file between them but they might need the same uploaded images.
What I am doing now is to download all the images from s3 to the EC2 machine because it's need the files locally.
I have read that EFS can be mounted to an EC2 and then I can access it like it was a local storage.
I did not found any example of uploading directly to EFS with nodejs (or other lang) but I found a way to transfer file from S3 to EFS - "DataSync".
https://docs.aws.amazon.com/efs/latest/ug/transfer-data-to-efs.html
So I have 3 questions about it:
It is true that I can't upload directly to EFS from my application? (nodesjs + express)
After I move files to EFS, will I able to use it exactly like it in the local storage of the ec2?
Is it a good idea to move file from s3 to efs all the time or there is other solution to the problem I described?
For this exact situation, we use https://github.com/kahing/goofys
It's very reliable, and additionally, offers the ability to mount S3 buckets as folders on any device - windows, mac as well as of course Linux.
Works outside of the AWS cloud 'boundary' too - great for developer laptops.
Downside is that it does /not/ work in a Lambda context, but you can't have everything!
Trigger a Lambda to call an ECS task when the file is uploaded to s3. The ECS task starts and mounts the EFS volume and copies the file from s3 to the EFS.
This wont run into problems with really large files with Lambda getting timed out.
I dont have the code but would be interested if someone already has this solution coded.
I'm looking for the best way to switch between using the local filesystem and the Amazon S3 filesystem.
I think ideally I would like a wrapper to both filesystems that I can code against. A configuration change would tell the wrapper which filesystem to use. This is useful to me because a developer can use their local filesystem, but our hosted environments can use Amazon S3 by just changing a configuration option.
Are there any existing wrappers that do this? Should I write my own wrapper?
Is there another approach I am not aware of that would be better?
There's a project named s3fs that offers a subset of POSIX file system function on top of S3. There's no native Amazon-provided way to do this.
However, you should think long and hard about whether or not this is a sensible option. S3 is an object store, not a regular file system, and it has quite different performance and latency characteristics.
If you're looking for high iops, NAS-style storage then Amazon EFS (in preview) would be more appropriate. Or roll your own NFS/CIFS solution using EBS volumes or SoftNAS or Gluster.
I like your idea to build a wrapper that can use either the local file system or S3. I'm not aware of anything existing that would provide that for you, but would certainly be interested to hear if you find anything.
An alternative would be to use some sort of S3 file system mount, so that your application can always use standard file system I/O but the data might be written to S3 if your system has that location configured as an S3 mount. I don't recommend this approach because I've never heard of an S3 mounting solution that didn't have issues.
Another alternative is to only design your application to use S3, and then use some sort of S3 compatible local object storage in your development environment. There are several answers to this question that could provide an S3 compatible service during development.
There's a service called JuiceFS that can do what you want.
According to their documentation:
JuiceFS is a POSIX-compatible shared filesystem specifically designed
to work in the cloud.
It is designed to run in the cloud so you can utilize the cheap price
of object storage service to store your data economically.
It is a
POSIX-compatible filesystem so you can access your data seamlessly as
accessing local files.
It is a shared filesystem so you can share your
files across multiple machines.
s3 is one of the backends supported, you can even configure it to replicate files to a different object storage system on another cloud.
For some scenarios a clustered file system is just too much. This is, if I got it right, the use case for the data volume container pattern. But even CoreOS needs updates from time to time. If I'd still like to minimise the down time of applications, I'd have to move the data volume container with the app container to an other host, while the old host is being updated.
Are there best practices existing? A solution mentioned more often is the "backup" of a container with docker export on the old host and docker import on the new host. But this would include scp-ing of tar-files to an other host. Can this be managed with fleet?
#brejoc, I wouldn't call this a solution, but it may help:
Alternative
1: Use another OS, which does have clustering, or at least - doesn't prevent it. I am now experimenting with CentOS.
2: I've created a couple of tools that help in some use cases. First tool, retrieves data from S3 (usually artifacts), and is uni-directional. Second tool, which I call 'backup volume container', has a lot of potential in it, but requires some feedback. It provides a 2-way backup/restore for data, from/to many persistent data stores including S3 (but also Dropbox, which is cool). As it is implemented now, when you run it for the first time, it would restore to the container. From that point on, it would monitor the relevant folder in the container for changes, and upon changes (and after a quiet period), it would back up to the persistent store.
Backup volume container: https://registry.hub.docker.com/u/yaronr/backup-volume-container/
File sync from S3: https://registry.hub.docker.com/u/yaronr/awscli/
(docker run yaronr/awscli aws s3 etc etc - read aws docs)
I am planning to use cloudfoundry paas service (from VMWare) for hosting my node.js application. I have seen that it has support for mongo and redis in the service layer and node.js framework. So far so good.
Now I need to store my mediafiles(images uploaded by users) to a filesystem. I have the metadata stored in Mongo.
I have been searching internet, but have not yet got good information.
You cannot do that for the following reasons:
There are multiple host machines running your application. They each have their own filesystems. Each running process in your application would see a different set of files.
The host machines on which your particular application is running can change moment-to-moment. Indeed, they will change every time you re-deploy your application. Every time a process is started on a new host machine, it will see an empty set of files. Every time a process is stopped on an old host machine, all the files would be permanently deleted.
You absolutely must solve this problem in another way.
Store the media files in MongoDB GridFS.
Store the media files in an object store such as Amazon S3 or Rackspace Cloud Files.
Filesystem in most cloud solutions are "ephemeral", so you can not use FS. You will have to use solutions like S3/ DB for such purpose