AWS Lambda with Node - saving files into Lambda's file system - node.js

I need to save files I get from S3 into a Lambda function's file system, and I wanted to know if I can do that simply using fs.writeFileSync.
Or do I still have to use the context function described here:
How to Write and Read files to Lambda-AWS with Node.js
(I tried to find newer examples, but could not.)
What is the recommended method?
Please advise.

Yes, you can use the typical fs functions to read from and write to local disk, but be aware that writing is limited to the /tmp directory, and the default maximum disk space available to your Lambda function in that location is 512 MB. Also note that files written there may persist into the next (warm) Lambda invocation.
If you want to simply download an object from S3 to the local disk (assuming it will fit in the available diskspace) then you can combine AWS SDK methods and Node.js streaming to stream the content to disk.
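For illustration, a minimal sketch of that streaming approach, assuming the AWS SDK for JavaScript v3 (bundled with the Node.js 18+ Lambda runtimes); the bucket, key, and file names are placeholders:

const { S3Client, GetObjectCommand } = require("@aws-sdk/client-s3");
const { createWriteStream } = require("fs");
const { pipeline } = require("stream/promises");

const s3 = new S3Client({});

exports.handler = async () => {
  // In the Node.js runtime, Body is a readable stream, so we can pipe it
  // straight to /tmp without buffering the whole object in memory.
  const { Body } = await s3.send(
    new GetObjectCommand({ Bucket: "my-bucket", Key: "my-object.bin" })
  );
  await pipeline(Body, createWriteStream("/tmp/my-object.bin"));
  return "downloaded to /tmp/my-object.bin";
};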
Also, it's worth noting that, depending on your app, you may be able to process the entire S3 object in RAM via streaming, without any need to actually persist it to disk. This is helpful if your object size is over 512 MB.
Update: as of March 2022, you can now configure Lambda functions with more ephemeral storage in /tmp, up to 10 GB. You get 512 MB included with your Lambda function invocation and are charged for the additional configured storage above 512 MB.
If you need to persist very large files, consider using Elastic File System.

Lambda does not give you general, persistent access to the local file system; it is meant to be an ephemeral environment. You do get access to the /tmp folder, but only up to a maximum of 512 MB by default. If you want durable storage alongside your function, you will need to use Amazon S3 or Amazon EFS.
Here's an article from AWS explaining this.
Here are the docs on adding storage to Lambda.

Related

Storing node module on S3 bucket for AWS Lambda

I have developed a Node.js-based function and want to run it on AWS Lambda. The problem is that its size is greater than 50 MB, and AWS Lambda requires directly uploaded function code to be under 50 MB.
In my code the node modules are about 43 MB and the actual code is around 7 MB. So is there any way I can separate my node modules from the code? Maybe we could store the node modules in an S3 bucket and then access them from AWS Lambda? Any suggestions would be helpful. Thanks.
P.S.: Due to some dependency issues I can't run this function as a Docker image on Lambda.
If you do not want to, or cannot, use Docker packaging, you can zip up your node_modules into an S3 bucket.
Your handler (or the module containing your handler) can then download the zip archive and extract the files to /tmp. Then you require() your modules from there.
The above description may not be 100% accurate, as there are many ways of doing it, but that's the general idea.
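A rough, untested sketch of that idea, assuming the AWS SDK for JavaScript v3 and the unzipper package (which itself still has to be bundled, but is small); the bucket, key, and module names are placeholders, and the zip is assumed to contain a node_modules directory at its root:

const { S3Client, GetObjectCommand } = require("@aws-sdk/client-s3");
const unzipper = require("unzipper");

const s3 = new S3Client({});
let heavyDep; // cached so warm invocations skip the download

exports.handler = async (event) => {
  if (!heavyDep) {
    const { Body } = await s3.send(
      new GetObjectCommand({ Bucket: "my-deploy-bucket", Key: "node_modules.zip" })
    );
    // Extract the archive into /tmp, then require modules by absolute path.
    await new Promise((resolve, reject) =>
      Body.pipe(unzipper.Extract({ path: "/tmp" }))
        .on("close", resolve)
        .on("error", reject)
    );
    heavyDep = require("/tmp/node_modules/some-heavy-dependency");
  }
  return heavyDep.doSomething(event);
};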
This is one deployment method that Zappa, a tool for deploying Python/Django apps to AWS Lambda, supported long before Docker containers were allowed in Lambda.
https://github.com/Miserlou/Zappa/pull/548
You can use Lambda layers, which are a perfect fit for your use case. Some time ago we needed to use the Facebook SDK in one of our projects; we created a Lambda layer for the SDK (32 MB), and the deployment package itself then became only 4 KB.
The documentation states:
Using layers can make it faster to deploy applications with the AWS Serverless Application Model (AWS SAM) or the Serverless Framework. By moving runtime dependencies from your function code to a layer, this can help reduce the overall size of the archive uploaded during a deployment.
A single Lambda function can use up to five layers. The maximum size of the total unzipped function and all layers is 250 MB, which is well above what you need.
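For a Node.js function, the layer zip typically contains the dependencies under nodejs/node_modules, which the Lambda runtime exposes at /opt/nodejs/node_modules and adds to the module resolution path, so the function code itself stays tiny. A minimal sketch (the SDK package name is only illustrative):

// The dependency lives in the layer (nodejs/node_modules/... in the layer zip),
// not in the deployment package, yet require() finds it because the Node.js
// runtime resolves modules from /opt/nodejs/node_modules.
const bizSdk = require("facebook-nodejs-business-sdk");

exports.handler = async (event) => {
  // ... use bizSdk here; the function's own zip contains only this file.
};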

Move files from S3 to AWS EFS on the fly

I am building an application where users can upload images; I am using S3 as the file storage.
In another area of the application there is a process, deployed on EC2, that needs to use the uploaded images.
This process needs the images multiple times (it generates reports with them), and it runs on multiple EC2 instances behind Elastic Beanstalk.
The process doesn't need all the images at once, but needs some subset of them for every job it gets (depending on the parameters it receives).
Each EC2 instance does an independent job; the instances do not share files between them, but they might need the same uploaded images.
What I am doing now is downloading all the images from S3 to the EC2 machine, because the process needs the files locally.
I have read that EFS can be mounted to an EC2 instance and then accessed as if it were local storage.
I did not find any example of uploading directly to EFS with Node.js (or another language), but I did find a way to transfer files from S3 to EFS: DataSync.
https://docs.aws.amazon.com/efs/latest/ug/transfer-data-to-efs.html
So I have 3 questions about it:
Is it true that I can't upload directly to EFS from my application (Node.js + Express)?
After I move files to EFS, will I be able to use them exactly like files in the local storage of the EC2 instance?
Is it a good idea to move files from S3 to EFS all the time, or is there another solution to the problem I described?
For this exact situation, we use https://github.com/kahing/goofys
It's very reliable, and additionally offers the ability to mount S3 buckets as folders on any device: Windows and Mac as well as, of course, Linux.
It works outside of the AWS cloud 'boundary' too, which is great for developer laptops.
The downside is that it does not work in a Lambda context, but you can't have everything!
Trigger a Lambda function to run an ECS task when the file is uploaded to S3. The ECS task starts, mounts the EFS volume, and copies the file from S3 to EFS.
This won't run into problems with Lambda timing out on really large files.
I don't have the code, but I'd be interested if someone has already coded this solution.
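For anyone who wants a starting point, here is a rough, untested sketch of the Lambda side, assuming the AWS SDK for JavaScript v3; the cluster, task definition, subnet, and container names are placeholders, and the task definition is assumed to already mount the EFS volume:

const { ECSClient, RunTaskCommand } = require("@aws-sdk/client-ecs");

const ecs = new ECSClient({});

exports.handler = async (event) => {
  // Standard S3 event notification payload.
  const record = event.Records[0];
  const bucket = record.s3.bucket.name;
  const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, " "));

  // Start a Fargate task that copies the object from S3 onto the mounted EFS volume.
  await ecs.send(new RunTaskCommand({
    cluster: "my-cluster",
    taskDefinition: "s3-to-efs-copy",
    launchType: "FARGATE",
    networkConfiguration: {
      awsvpcConfiguration: {
        subnets: ["subnet-0123456789abcdef0"],
        assignPublicIp: "DISABLED",
      },
    },
    overrides: {
      containerOverrides: [{
        name: "copier",
        environment: [
          { name: "SRC_BUCKET", value: bucket },
          { name: "SRC_KEY", value: key },
        ],
      }],
    },
  }));
};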

Nodejs API, Docker (Swarm), scalability and storage

I programmed an API with Node.js and Express, like a million others out there, and it will go live in a few weeks. The API currently runs on a single Docker host with one volume for persistent data, containing images uploaded by users.
I'm now thinking about scalability and a high-availability setup, which raises the question of network volumes. I've read a lot about NFS volumes and potentially the S3 driver for a Docker swarm.
From the information I gathered, I sorted out two possible solutions for the swarm setup:
Docker Volume Driver
I could connect each Docker host either to an S3 bucket or to EFS storage via the compose file
Connection should work even if I move to another VPS provider
Better security if I put NFS storage on the private part of the network (no S3 or EFS)
API Multer S3 Middleware
No attached volume required, since the S3 connection is made from within the container
Easier swarm and Docker management
Things have to be re-programmed and a few files need to be migrated
On a GET request, the files will be served by AWS directly instead of by the API
Please tell me your opinion on this. Am I getting this right, or am I missing something? Which route should I take? Is there anything to consider regarding latency or permissions when mounting from different hosts?
Tips on S3 and EFS are definitely welcome, since I have no knowledge of them yet.
I would not recommend saving to disk; instead, use the S3 API directly: create buckets and write objects from your app code.
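As a sketch of what the Multer S3 route from the question could look like (assuming multer-s3 v3 with the AWS SDK v3 client; the bucket and field names are placeholders):

const express = require("express");
const multer = require("multer");
const multerS3 = require("multer-s3");
const { S3Client } = require("@aws-sdk/client-s3");

const app = express();

// Uploads go straight from the request to S3; nothing is written to the container's disk.
const upload = multer({
  storage: multerS3({
    s3: new S3Client({}),
    bucket: "my-upload-bucket",
    key: (req, file, cb) => cb(null, `uploads/${Date.now()}-${file.originalname}`),
  }),
});

app.post("/images", upload.single("image"), (req, res) => {
  // multer-s3 attaches the object's URL to the file metadata.
  res.json({ url: req.file.location });
});

app.listen(3000);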
If you're thinking of mounting a single S3 bucket as your drive, there are severe limitations to that: the 5 GB limit, and any time you modify contents in any way, the driver will re-upload the entire object. If there's any contention, it has to retry. Years ago when I tried this, the FUSE drivers weren't stable enough to use as part of a production system; they'd crash and you'd have to remount. It was a nice idea, but it could only be used as an ad hoc kind of thing on the command line.
As far as NFS goes: for the love of god, don't do this to yourself; you're taking all of that operational responsibility onto yourself.
I can't really comment on EFS; by the time it became available, most people had already learned to just use S3, and S3 is cheaper.

AWS S3 File Sync

I am trying to sync files from a local source to an S3 bucket. I upload the files to the S3 bucket after calculating an MD5 checksum and putting it in the metadata of each file. The issue is that, while doing so, I also check the files already at the destination to avoid duplicate uploads. I do this by building a list of files to upload that don't match on both name and MD5. This operation of fetching the metadata for the S3 files, computing the MD5 for the local files on the fly, and then matching them takes a lot of time, as I have around 200,000 to 500,000 files to match.
Is there any better way to achieve this, either with multithreading or anything else? I don't have much idea how to do it in a multithreaded environment, as I eventually need one list, with multiple threads doing the processing and adding to that same list. Any code sample or help is much appreciated.
This Windows job application is written in C#, using the .NET Framework 4.6.1.
You could use the AWS Command-Line Interface (CLI), which has an aws s3 sync command that does something very similar to what you describe. However, with several hundred thousand files, it is going to be slow on the matching, too.
Or, you could use Amazon S3 Inventory to obtain a daily listing of the objects in the S3 bucket (including MD5 checksums) and then compare your files against that.

What is the best approach in nodejs to switch between the local filesystem and Amazon S3 filesystem?

I'm looking for the best way to switch between using the local filesystem and the Amazon S3 filesystem.
I think ideally I would like a wrapper to both filesystems that I can code against. A configuration change would tell the wrapper which filesystem to use. This is useful to me because a developer can use their local filesystem, but our hosted environments can use Amazon S3 by just changing a configuration option.
Are there any existing wrappers that do this? Should I write my own wrapper?
Is there another approach I am not aware of that would be better?
There's a project named s3fs that offers a subset of POSIX file system functionality on top of S3. There's no native, Amazon-provided way to do this.
However, you should think long and hard about whether this is a sensible option. S3 is an object store, not a regular file system, and it has quite different performance and latency characteristics.
If you're looking for high-IOPS, NAS-style storage, then Amazon EFS (in preview at the time of writing) would be more appropriate. Or roll your own NFS/CIFS solution using EBS volumes, SoftNAS, or Gluster.
I like your idea to build a wrapper that can use either the local file system or S3. I'm not aware of anything existing that would provide that for you, but would certainly be interested to hear if you find anything.
An alternative would be to use some sort of S3 file system mount, so that your application always uses standard file system I/O but the data gets written to S3 when that location is configured as an S3 mount. I don't recommend this approach, because I've never heard of an S3 mounting solution that didn't have issues.
Another alternative is to design your application to use only S3, and then use some sort of S3-compatible local object storage in your development environment. There are several answers to this question that suggest tools which could provide an S3-compatible service during development.
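As an illustration of the wrapper idea, here is a minimal sketch with two interchangeable backends selected by configuration; the interface, names, and config mechanism are made up for the example, and it assumes the AWS SDK for JavaScript v3:

const fs = require("fs/promises");
const path = require("path");
const { S3Client, PutObjectCommand, GetObjectCommand } = require("@aws-sdk/client-s3");

// Local-disk backend: keys map to paths under a root directory.
function localStorage(rootDir) {
  return {
    async write(key, body) {
      const file = path.join(rootDir, key);
      await fs.mkdir(path.dirname(file), { recursive: true });
      await fs.writeFile(file, body);
    },
    async read(key) {
      return fs.readFile(path.join(rootDir, key));
    },
  };
}

// S3 backend: keys map to object keys in a bucket.
function s3Storage(bucket) {
  const s3 = new S3Client({});
  return {
    async write(key, body) {
      await s3.send(new PutObjectCommand({ Bucket: bucket, Key: key, Body: body }));
    },
    async read(key) {
      const { Body } = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: key }));
      return Buffer.from(await Body.transformToByteArray());
    },
  };
}

// The rest of the application codes against the same two methods either way.
const storage = process.env.STORAGE === "s3"
  ? s3Storage(process.env.S3_BUCKET)
  : localStorage("./data");

module.exports = storage;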
There's a service called JuiceFS that can do what you want.
According to their documentation:
JuiceFS is a POSIX-compatible shared filesystem specifically designed to work in the cloud.
It is designed to run in the cloud, so you can utilize the cheap price of object storage services to store your data economically. It is a POSIX-compatible filesystem, so you can access your data as seamlessly as accessing local files. It is a shared filesystem, so you can share your files across multiple machines.
S3 is one of the supported backends; you can even configure it to replicate files to a different object storage system on another cloud.
