Mount s3 objects in Lambda fs - node.js

I need to read S3 objects in Lambda functions as native files, without downloading them first. As soon as a program requests one of those files from the filesystem it should start reading it from the bucket, while remaining unaware of that and treating it as a native file.
My issue is that I'm spawning a program (from a binary) which reads all of the several hundred input URLs synchronously, so the accumulated HTTP connection latency is multiplied by the number of files, which becomes very significant. If the URLs pointed to local files I'd save minutes on the HTTP overhead alone, so I'm looking for a solution that opens all the connections asynchronously and lets the program consume them on demand without delay.
Perhaps there is a way to mount a file on the Linux filesystem which consumes from a Node.js stream object? It wouldn't be written to disk or held in an in-memory buffer, but would be available for consumption as a stream.
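One way to sketch this is with a named pipe (FIFO) under /tmp: the spawned program opens it like a regular file, while a background thread pumps the S3 object into it on demand. Below is a minimal illustration in Python; the question asks about Node.js, where the same pattern works with fs and S3 streams, and the bucket, key and paths are placeholders.

```python
import os
import threading
import boto3

s3 = boto3.client("s3")

def mount_as_fifo(bucket: str, key: str, fifo_path: str) -> str:
    """Expose an S3 object as a FIFO the spawned program can open like a file."""
    os.mkfifo(fifo_path)  # /tmp is the only writable path in Lambda

    def pump():
        body = s3.get_object(Bucket=bucket, Key=key)["Body"]
        # Opening the FIFO for writing blocks until the reader opens it,
        # so the transfer effectively starts when the program asks for the file.
        with open(fifo_path, "wb") as fifo:
            for chunk in body.iter_chunks(chunk_size=1024 * 1024):
                fifo.write(chunk)

    threading.Thread(target=pump, daemon=True).start()
    return fifo_path

# e.g. mount_as_fifo("my-bucket", "inputs/file-001.bin", "/tmp/file-001.bin")
```

The caveat is that FIFOs are not seekable, so this only works if the spawned program reads each input sequentially from start to end.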

Related

Boto3 Pre-Sign Upload URL takes very long to generate

I have a Python application that needs to generate many pre-signed upload URLs for a single bucket on our AWS cloud.
I noticed a strange behavior: every so often, generating the URL takes around 5 seconds instead of the nanoseconds it usually takes. I vaguely remember that generating a URL is a local operation, and I couldn't find any documentation on this.
I'm running a Falcon app on Gunicorn with boto3~=1.20.54. Gunicorn is configured with multiple sync workers and the same number of threads.
It should be noted that this is a rare occurrence, but it is very random and I can't explain it. I did find some explanations that the boto3 client is somewhat lazy, in that it loads metadata about the S3 bucket on the first operation (see the Warning section). I added a no-op call to list_buckets as the first thing when the app loads, but it didn't affect the behavior.
Any help will be appreciated.
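For reference, a minimal sketch of the pattern described above: the client is created once at module load, a warm-up call is made, and the presign itself is a local signing operation. The bucket and key names are placeholders.

```python
import boto3

# Created once at module import so per-request code reuses the same client.
s3 = boto3.client("s3")

# No-op warm-up so the client resolves its endpoint and credentials up front
# rather than on the first real request (as noted above, this alone did not
# remove the occasional slow call for the asker).
s3.list_buckets()

def make_upload_url(bucket: str, key: str, expires: int = 3600) -> str:
    # generate_presigned_url signs locally once the client has credentials;
    # it does not make a request to S3.
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires,
    )
```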

Media conversion on AWS

I have an API written in Node.js (/api/uploadS3) which is a PUT request and accepts a video file and a URL (an AWS S3 URL in this case). Once called, its task is to upload the file to the S3 URL.
Now, users are uploading files to this Node API in different formats (thanks to different browsers recording videos in different formats) and I want to convert all of these videos to .mp4 and then store them in S3.
I wanted to know what the best approach to do this is.
I have two possible solutions so far:
1. Convert on the Node server using ffmpeg -
The issue with this is that ffmpeg can only execute a single operation at a time, and since I have only one server I would have to implement a queue for multiple requests, which can lead to longer waiting times for users at the end of the queue. Apart from that, I am worried that my Node server's ability to handle traffic will be affected while a video conversion is in progress.
Can someone help me understand what the effect of other requests coming to my server will be while a video conversion is going on? How will it impact RAM, CPU usage and the speed of processing other requests?
2. Using an AWS Lambda function -
To avoid load on my Node server I was thinking of using AWS Lambda: my Node API uploads the file to S3 in the format provided by the user. Once that is done, S3 triggers a Lambda function which takes that S3 file, converts it to .mp4 using ffmpeg or AWS MediaConvert, and uploads the .mp4 file to a new S3 path. However, I don't want the output path to be just any S3 path, but the path that was received by the Node API in the first place.
Moreover, I want the user to wait while all this happens, as I have to enable other UI features based on the success or error of this upload.
The question here is: is it possible to do this using just a single API like /api/uploadS3 which --> uploads to S3 --> triggers Lambda --> converts the file --> uploads the .mp4 version --> returns success or error?
Currently, if I upload to S3 the request ends then and there. So is there a way to defer the API response until all the operations have been completed?
Also, how will the Lambda function access the path of the output S3 bucket which was passed to the Node API?
Any other better approach will be welcomed.
PS - the S3 path received by the Node API is different for each user.
Thanks for your post. The output S3 bucket generates file events when a new file arrives (i.e., is delivered from AWS MediaConvert).
This file event can trigger a second Lambda function which can move the file elsewhere using any of the supported transfer protocols, retry if necessary, log a status to AWS CloudWatch and/or AWS SNS, and then send a final API response based on the success/completion of the move.
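For illustration, a skeleton of that second, S3-event-triggered Lambda might look like the sketch below (Python; the destination bucket is a placeholder, and copying the converted file to its original path is an assumption about your setup):

```python
import urllib.parse
import boto3

s3 = boto3.client("s3")
DEST_BUCKET = "my-destination-bucket"  # hypothetical destination

def handler(event, context):
    for record in event["Records"]:
        src_bucket = record["s3"]["bucket"]["name"]
        src_key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # Copy the converted .mp4 to the path the Node API originally received,
        # then remove the intermediate object.
        s3.copy_object(
            Bucket=DEST_BUCKET,
            Key=src_key,
            CopySource={"Bucket": src_bucket, "Key": src_key},
        )
        s3.delete_object(Bucket=src_bucket, Key=src_key)
```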
AWS has a Step Functions feature which can maintain state across successive Lambda functions and automate simple workflows. This should work for what you want to accomplish. See https://docs.aws.amazon.com/step-functions/latest/dg/tutorial-creating-lambda-state-machine.html
Note that any one Lambda function has a 15-minute maximum runtime, so any single transcoding or file copy operation must complete within 15 minutes. The alternative is to run on EC2.
I hope this helps you out!

How to stream a file using curl from one server to another (limited server resources)

My API server has very limited disk space (500 MB) and memory (1 GB). One of the API calls it receives is to fetch a file: the consumer calls the API and passes the URL to be downloaded.
The "goal" of my server is to upload this file to Amazon S3. Unfortunately, I can't ask the consumer to upload the file directly to S3 (it's part of the requirements).
The problem is that sometimes these are huge files (10 GB), and saving them to disk and then uploading them to S3 is not an option (500 MB disk space limit).
My question is: how can I "pipe" the file from the input URL to S3 using the curl Linux program?
Note: I was able to pipe it in different ways, but either it first tries to download the whole file and fails, or I hit a memory error and curl quits. My guess is that the download is much faster than the upload, so the pipe buffer/memory grows and explodes (1 GB of memory on the server) when I get 10 GB files.
Is there a way to achieve what I'm trying to do using curl and piping?
Thank you,
- Jack
Another SO user asked a similar question about curl posts from stdin. See use pipe for curl data.
Once you are able to post your upload stream from the first curl process's standard output, and if you are running out of memory because you are downloading faster than you can upload, have a look at the mbuffer utility. I haven't used it myself, but it seems to be designed for exactly this sort of problem.
Lastly, if all else fails, I guess you could use curl's --limit-rate option to lock the transfer rates of the upload and download to identical, sustainable values. This potentially underutilizes bandwidth and won't scale well with multiple parallel download/upload streams, but for a one-off batch process it might be good enough.
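If curl alone keeps fighting you, the same streaming idea can be sketched in Python instead (an alternative to what the question asks for, not a curl answer): requests reads the source URL as a stream and boto3's upload_fileobj performs a multipart upload from it, so only a few chunks are buffered in memory at a time.

```python
import boto3
import requests

def relay_to_s3(source_url: str, bucket: str, key: str) -> None:
    s3 = boto3.client("s3")
    with requests.get(source_url, stream=True) as resp:
        resp.raise_for_status()
        resp.raw.decode_content = True  # transparently handle gzip/deflate
        # upload_fileobj runs a multipart upload, pulling data from the stream
        # chunk by chunk instead of reading the whole file first.
        s3.upload_fileobj(resp.raw, bucket, key)
```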

Image resizing AWS Lambda with threads

I have 20,000 images in an S3 bucket. I want to resize all of them using AWS Lambda. To do this I download each image to the Lambda's tmp folder and then upload it back to S3.
I want to optimize this, so I implemented threading. My code works fine when I use 15 threads, but with more than 15-16 threads I run into issues such as the connection pool being full. I should mention that I explicitly wait for already running threads to terminate.
What can I do to optimize the code? If more threads can be created, what's the best way of creating threads inside Lambda?
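For reference, a minimal sketch of one way to set this up (bucket name and worker count are placeholders): the "connection pool is full" warning usually comes from botocore's default pool of 10 connections, so raising max_pool_connections to match the thread count typically makes it go away, and a ThreadPoolExecutor bounds and joins the threads.

```python
from concurrent.futures import ThreadPoolExecutor
import boto3
from botocore.config import Config

WORKERS = 32  # illustrative; tune to what the Lambda's memory/CPU allows
s3 = boto3.client("s3", config=Config(max_pool_connections=WORKERS))

def resize_one(key: str) -> None:
    local = f"/tmp/{key.rsplit('/', 1)[-1]}"
    s3.download_file("my-bucket", key, local)   # bucket name is a placeholder
    # ... resize the image in place here (e.g. with Pillow) ...
    s3.upload_file(local, "my-bucket", f"resized/{key}")

def resize_all(keys):
    # The executor caps concurrency and waits for every thread to finish.
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        list(pool.map(resize_one, keys))
```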
Call a Lambda function 20k times, passing the filename it needs to work with; you don't need to wait for the results. Each Lambda invocation will process one file, so you can effectively have 20k "threads" that way.
You can also create a rule so that a Lambda function is called whenever a new file lands in S3, but the existing batch will need to be processed manually.
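A minimal sketch of that fan-out (bucket and function names are hypothetical): list the objects and asynchronously invoke the resize Lambda once per key.

```python
import json
import boto3

s3 = boto3.client("s3")
lam = boto3.client("lambda")

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-bucket"):
    for obj in page.get("Contents", []):
        lam.invoke(
            FunctionName="resize-image",      # hypothetical function name
            InvocationType="Event",           # async: don't wait for the result
            Payload=json.dumps({"key": obj["Key"]}),
        )
```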

How does upload/download of a large file affect the event loop in Node.js

I'm just learning Node.js and am confused about (at least) one thing. It appears that Node.js has no problem servicing multiple requests that upload or download large files. Each request might take minutes to complete. Why is it that the event loop is not frozen while one of these requests is being serviced?
-brian
In Node.js, even though your code runs on a single thread, the OS level uses asynchronous, non-blocking I/O. That means the task of writing/reading a buffer is multiplexed with events from other connections.
You may have noticed that IncomingMessage implements ReadableStream:
http://nodejs.org/api/http.html#http_http_incomingmessage
That's because it doesn't read all the data at once; it reads blocks. Each block of data from the HTTP request is delivered to you as an event which you have to handle.
The event loop is not frozen because an async read from a file or a socket is a service provided by the OS. You tell the OS you want to read a file, and you can check from time to time how many bytes have been read, or whether the whole file has been read (loaded into memory).
These async functions are provided by the OS because reading from a file or a socket is usually much slower than reading from RAM or cache. So while your process keeps doing other tasks, the OS takes care of the reading.
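To make the chunked, non-blocking pattern concrete, here is a small analogue in Python with asyncio (Python for consistency with the other sketches in this collection; Node's streams follow the same idea): each connection's reads are awaited in small chunks, so one slow transfer never blocks the loop from serving other connections.

```python
import asyncio

async def handle(reader: asyncio.StreamReader, writer: asyncio.StreamWriter):
    total = 0
    while True:
        chunk = await reader.read(64 * 1024)  # yields to the event loop while waiting
        if not chunk:
            break
        total += len(chunk)                   # process each block as it arrives
    writer.write(f"received {total} bytes\n".encode())
    await writer.drain()
    writer.close()

async def main():
    server = await asyncio.start_server(handle, "127.0.0.1", 8080)
    async with server:
        await server.serve_forever()

# asyncio.run(main())
```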
