Image resizing in AWS Lambda with multithreading

I have 20,000 images in an S3 bucket and I want to resize all of them using AWS Lambda. To do this I download each image to Lambda's tmp folder, resize it, and upload it back to S3.
To speed this up I implemented threading. My code works fine with 15 threads, but with more than 15-16 threads I start getting errors such as "connection pool is full". I should mention that I explicitly wait for already-running threads to terminate.
What can I do to optimize the code? If more threads can be created, what's the best way to create threads inside Lambda?
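For context, here is a minimal sketch of this kind of bounded-thread setup, assuming Python with concurrent.futures (the question does not name a language). boto3's default connection pool holds 10 connections per client, which is consistent with "connection pool is full" warnings appearing around 15 threads; the pool can be enlarged via botocore's Config (shown in a comment) to match the worker count.

```python
# Bounded thread pool that waits for all workers before returning.
# The boto3/botocore lines are commented out so the sketch stays
# self-contained; in the Lambda they would create the shared client.
from concurrent.futures import ThreadPoolExecutor

# from botocore.config import Config
# import boto3
# s3 = boto3.client("s3", config=Config(max_pool_connections=50))

def process_all(keys, worker, max_workers=15):
    """Run `worker` over every key with at most `max_workers` threads;
    the `with` block joins all threads before returning."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, keys))
```

With the pool size raised to match `max_workers`, the threads stop competing for the default ten connections.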

Call a Lambda function 20,000 times, passing each invocation the filename it needs to work with. You don't need to wait: each call processes one file, so you effectively get up to 20,000 parallel workers that way (subject to your account's Lambda concurrency limit).
You can also create a rule so that a Lambda function is invoked whenever a new file lands in S3, but the first batch will still need to be processed manually.
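A minimal sketch of that fan-out, assuming Python: one asynchronous ("Event") invocation per S3 key, so each file gets its own Lambda execution. The function name and bucket below are placeholders, and the boto3 `invoke` call is passed in as a parameter so the loop itself is shown separately from the AWS client.

```python
# Fan out one async Lambda invocation per key; "Event" means the call
# returns immediately without waiting for the worker's result.
import json

def fan_out(bucket, keys, invoke, function_name="resize-image"):
    """Fire one async invocation per key and return the payloads sent."""
    payloads = []
    for key in keys:
        payload = json.dumps({"bucket": bucket, "key": key})
        invoke(FunctionName=function_name,
               InvocationType="Event",   # async: don't wait for results
               Payload=payload)
        payloads.append(payload)
    return payloads

# Real use (assuming boto3 is available in the environment):
# import boto3
# fan_out("my-bucket", keys, boto3.client("lambda").invoke)
```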

Related

Listen to an AWS S3 bucket for file uploads

I have a problem and I need to know if anyone has an idea how to solve it.
I need to create something that listens to an S3 bucket, and when a file is uploaded there, takes that file and manipulates it on my website with the various processes I already have.
So basically: is there something that lets me listen for uploads made to S3 and then manipulate the uploaded file?
Thank you
There are many ways to achieve this.
First, enable an S3 notification triggered on S3 PutObject, and have it trigger any of these -
Lambda - gets the object and processes it (not suitable for large files; a Lambda invocation can run for at most 15 minutes)
Put new-object notifications in an SQS queue, then launch EC2 instances to process the files. You can use Auto Scaling and a CloudWatch alarm with this. Get some ideas from here.
There are other options as well.
My suggestion would be this -
s3 notification -> Trigger Lambda -> get object key and run ec2 instance -> ec2 does the hard work
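The "get object key" Lambda step in that flow reduces to parsing the S3 notification event. A minimal sketch, assuming Python (the event shape is the standard S3 notification format; the EC2 hand-off is left as a comment):

```python
# Extract bucket/key pairs from an S3 notification event.
def handler(event, context=None):
    objects = []
    for record in event.get("Records", []):
        objects.append({
            "bucket": record["s3"]["bucket"]["name"],
            "key": record["s3"]["object"]["key"],
        })
    # ...here the function would start or notify the EC2 worker,
    # passing `objects` along...
    return objects
```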
No idea is perfect; it depends heavily on your system. Look for the solution that best meets your needs.
Best wishes.

Media conversion on AWS

I have an API written in Node.js (/api/uploadS3). It is a PUT request that accepts a video file and a URL (an AWS S3 URL in this case). Once called, its task is to upload the file to the S3 URL.
Now, users upload files to this Node API in different formats (thanks to different browsers recording video in different formats), and I want to convert all these videos to mp4 and then store them in S3.
I wanted to know: what is the best approach to do this?
I have two solutions so far:
1. Convert on the Node server using ffmpeg -
The issue with this is that ffmpeg can only execute a single operation at a time. Since I have only one server, I would have to implement a queue for multiple requests, which could lead to long waiting times for users at the end of the queue. Apart from that, I am worried that my Node server's ability to handle traffic will suffer while a video conversion is in progress.
Can someone help me understand what happens to other requests arriving at my server while a video conversion is going on? How will it impact RAM, CPU usage, and the speed of processing those requests?
2. Using an AWS Lambda function -
To avoid load on my Node server I was thinking of using AWS Lambda: my Node API uploads the file to S3 in whatever format the user provided. Once done, S3 triggers a Lambda function which takes that S3 file, converts it to .mp4 using ffmpeg or AWS MediaConvert, and uploads the mp4 file to a new S3 path. However, I don't want the output path to be an arbitrary S3 path; it should be the path that was passed to the Node API in the first place.
Moreover, I want the user to wait while all this happens as I have to enable other UI features based on the success or error of this upload.
The query here is: is it possible to do this using just a single API like /api/uploadS3 which --> uploads to s3 --> triggers lambda --> converts file --> uploads the mp4 version --> returns success or error?
Currently, when I upload to S3 the request ends then and there. So is there a way to defer the API response until all the operations have completed?
Also, how will the lambda function access the path of the output s3 bucket which was passed to the node API?
Any other better approach will be welcomed.
PS - the s3 path received by the node API is different for each user.
Thanks for your post. The output S3 bucket generates file events when a new file arrives (i.e., is delivered from AWS MediaConvert).
This file event can trigger a second Lambda function which can move the file elsewhere using any of the supported transfer protocols, retry if necessary, log a status to AWS CloudWatch and/or AWS SNS, and then send a final API response based on the success of the move.
AWS has a Step Functions feature which can maintain state across successive Lambda functions, for automating simple workflows. This should work for what you want to accomplish. See https://docs.aws.amazon.com/step-functions/latest/dg/tutorial-creating-lambda-state-machine.html
Note that any one Lambda function has a 15-minute maximum runtime, so any single transcoding or file-copy operation must complete within 15 minutes. The alternative is to run on EC2.
I hope this helps you out!
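A minimal Step Functions state machine for the two-step flow described above might look like the following sketch; the state names and Lambda ARNs are placeholders, not real resources:

```json
{
  "Comment": "Convert an uploaded video, then move it and notify (sketch)",
  "StartAt": "ConvertVideo",
  "States": {
    "ConvertVideo": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:convert-video",
      "Next": "MoveAndNotify"
    },
    "MoveAndNotify": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:move-and-notify",
      "End": true
    }
  }
}
```

Each state's output becomes the next state's input, which is one way the second Lambda can receive the output path the Node API was originally given.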

Spawning hundreds of Lambda processes but waiting for them all to finish

I'm currently using AWS Step Functions in a "queue watcher" setup.
I have an initial Lambda that generates hundreds of IDs which are added to an SQS queue, which is then consumed by a "Worker" Lambda. When the "Worker" Lambda has consumed the queue I need to run a "Logout" Lambda to expire a ticket.
The problem I'm having is that sometimes the logout happens before the queue is empty.
Is there a better solution to this? I've looked into callbacks, but they don't seem usable in this scenario. Passing the payload through Step Functions instead of SQS isn't possible either, due to payload limits.
Thanks,
Step Function and Lambda flow chart: (screenshots not included in this text version)

Mount s3 objects in Lambda fs

I need to read S3 objects in Lambda functions as native files, without downloading them first, so that as soon as a program requests those files from the filesystem it starts reading them from the bucket, unaware that they are not local files.
My issue is that I'm spawning a program (from a binary) which reads all of its (several hundred) input URLs synchronously, so the accumulated HTTP connection latency is multiplied by the number of files, which becomes very significant. If the URLs were local files I'd save minutes on the HTTP overhead alone, so I'm looking for a solution that makes all the connections asynchronous and lets the program consume them on demand without delay.
Perhaps there is a way to mount a file on the Linux filesystem that consumes from a Node.js stream object, so nothing is written to disk or kept in an in-memory buffer, but the data is available for consumption as a stream?
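One mechanism that gets part of the way to what the question describes is a POSIX named pipe (FIFO): it exposes a stream of bytes at a filesystem path without touching disk, and the reading program opens it like a normal file. A minimal sketch in Python for illustration (a Node stream could feed the pipe the same way; in real use the chunks would come from an S3 GET response):

```python
# Expose a byte stream as a filesystem path via a named pipe (FIFO):
# the spawned program opens it like a normal file while the bytes are
# produced on demand by another thread.
import os
import tempfile
import threading

def serve_stream(fifo_path, chunks):
    """Write chunks into the FIFO; open() blocks until a reader appears."""
    with open(fifo_path, "wb") as w:
        for chunk in chunks:
            w.write(chunk)

fifo_dir = tempfile.mkdtemp()
fifo_path = os.path.join(fifo_dir, "object.dat")
os.mkfifo(fifo_path)  # POSIX named pipe; no data hits the disk

producer = threading.Thread(
    target=serve_stream, args=(fifo_path, [b"hello ", b"world"])
)
producer.start()

with open(fifo_path, "rb") as r:  # stand-in for the spawned program
    data = r.read()
producer.join()
```

One caveat: FIFOs are strictly sequential, so if the spawned program seeks within its inputs this won't work, and a FUSE-based mount such as s3fs would be the fallback.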

AWS Lambda async concurrency limits

I'm working on an AWS Lambda function that currently makes hundreds of API calls, but in production it will make hundreds of thousands. The problem is that I can't test at that scale.
I'm using the async module to execute my API calls with async.eachLimit so that I can limit the concurrency (I currently set it at 300).
The thing that I don't understand is the limits on AWS Lambda. Here's what the docs say:
AWS Lambda Resource Limits per Invocation
Number of file descriptors: 1,024
Number of processes and threads (combined total): 1,024
As I understand it, Node.js is single threaded so I don't think I would exceed that limit. I'm not using child processes and the async library doesn't either so OK on that front too.
Now about those file descriptors, my function strictly calls the rest of AWS's API and I'm never writing to disk so I don't think I'm using them.
The other important AWS Lambda limits are execution time and memory consumed. Those are very clearly reported on each execution and I am perfectly aware when I'm close to reaching them or not, so let's ignore these for now.
A little bit of context:
The exact nature of my function is that every time a sports match starts I need to subscribe all mobile devices to the appropriate SNS topics, so basically I'm calling our own MySQL database and then the AWS SNS endpoint repeatedly.
So the question is...
How far can I push async's concurrency in AWS Lambda in this context? Are there any practical limits or something else that might come into play that I'm not considering?
As I understand it, Node.js is single threaded so I don't think I would exceed that limit. I'm not using child processes and the async library doesn't either so OK on that front too.
Node.js is event driven, not single threaded.
The JavaScript engine runs on a single thread (the event loop) and delegates I/O operations to an internal library (libuv), which manages its own thread pool and asynchronous operations.
async doesn't open a child process on its own, but behind the scenes, whether you're making an HTTP request or interacting with the file system, you're delegating these operations to libuv.
In other words, you've answered your own question well with the resources limits:
How far can I push async's concurrency in AWS Lambda in this context? Are there any practical limits or something else that might come into play that I'm not considering?
AWS Lambda Resource Limits per Invocation
Number of file descriptors: 1,024
Number of processes and threads (combined total): 1,024
It's hard to say whether libuv would open a new thread for each I/O operation, so you might get away with a little more than the numbers listed above. But you will probably run out of memory well before reaching those limits anyway.
The bottom line is: no, you won't be able to make hundreds of thousands of calls in a single Lambda execution.
Regarding the context of your function: depending on how often the job needs to run, you might want to refactor your Lambda into multiple executions (it would also run faster), or run it on EC2 with auto scaling triggered by Lambda.
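The bounded-concurrency pattern the question relies on (async.eachLimit) can be sketched with Python's asyncio for illustration: a semaphore plays the role of the limit argument, and `fake_call` below is a placeholder for one SNS or database call.

```python
# eachLimit-style bounded concurrency: run a worker over all items,
# but allow at most `limit` of them in flight at any moment.
import asyncio

async def each_limit(items, limit, worker):
    sem = asyncio.Semaphore(limit)

    async def guarded(item):
        async with sem:  # at most `limit` workers hold the semaphore
            return await worker(item)

    # gather preserves input order in its results
    return await asyncio.gather(*(guarded(i) for i in items))

async def demo():
    peak = 0
    running = 0

    async def fake_call(i):
        nonlocal peak, running
        running += 1
        peak = max(peak, running)
        await asyncio.sleep(0)  # yield, as a real network call would
        running -= 1
        return i * 2

    results = await each_limit(list(range(10)), 3, fake_call)
    return results, peak

results, peak = asyncio.run(demo())
```

The instrumentation in `demo` shows the cap holding: `peak` never exceeds the limit of 3 even though 10 tasks are scheduled at once.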
