Media conversion on AWS - node.js

I have an API written in Node.js (/api/uploadS3) which is a PUT request and accepts a video file and a URL (an AWS S3 URL in this case). Once called, its task is to upload the file to that S3 URL.
Now, users are uploading files to this Node API in different formats (thanks to different browsers recording videos in different formats), and I want to convert all these videos to MP4 and then store them in S3.
I wanted to know: what is the best approach to do this?
I have 2 solutions so far:
1. Convert on the Node server using ffmpeg -
The issue with this is that an ffmpeg process can only execute a single conversion at a time, and since I have only one server I will have to implement a queue for multiple requests, which can lead to longer waiting times for users at the end of the queue. Apart from that, I am worried about whether my Node server's ability to handle traffic will be affected while a video conversion is in progress.
Can someone help me understand what the effect of other requests coming to my server will be while video conversion is going on? How will it impact RAM, CPU usage, and the speed of processing other requests?
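For reference, this is roughly how I would shell out to ffmpeg from Node, spawning it as a child process so the event loop itself is not blocked (the file paths here are just placeholders):

```javascript
// Rough sketch: run ffmpeg as a child process; input/output paths are placeholders.
const { spawn } = require('child_process');

function convertToMp4(inputPath, outputPath) {
  return new Promise((resolve, reject) => {
    const ffmpeg = spawn('ffmpeg', ['-y', '-i', inputPath, outputPath]);
    ffmpeg.on('error', reject); // e.g. ffmpeg binary not found
    ffmpeg.on('close', (code) => {
      if (code === 0) resolve(outputPath);
      else reject(new Error(`ffmpeg exited with code ${code}`));
    });
  });
}
```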
2. Using an AWS Lambda function -
To avoid load on my Node server, I was thinking of using AWS Lambda: my Node API uploads the file to S3 in the format provided by the user. Once done, S3 triggers a Lambda function which takes that S3 file, converts it to .mp4 using ffmpeg or AWS MediaConvert, and then uploads the MP4 file to a new S3 path. However, I don't want the output path to be just any S3 path, but the path that was received by the Node API in the first place.
Moreover, I want the user to wait while all this happens, as I have to enable other UI features based on the success or failure of this upload.
The question here is: is it possible to do this using just a single API like /api/uploadS3 which --> uploads to S3 --> triggers Lambda --> converts the file --> uploads the MP4 version --> returns success or error?
Currently, if I upload to S3, the request ends then and there. So is there a way to defer the API response until all the operations have been completed?
Also, how will the Lambda function access the output S3 path that was passed to the Node API?
Any other, better approach is welcome.
PS - the S3 path received by the Node API is different for each user.

Thanks for your post. The output S3 bucket generates File Events when a new file arrives (i.e., is delivered from AWS MediaConvert).
This file event can trigger a second Lambda function which can move the file elsewhere using any of the supported transfer protocols, retry if necessary, log a status to AWS CloudWatch and/or AWS SNS, and then send a final API response based on the success/completion of the move.
AWS has a Step Functions feature which can maintain state across successive Lambda functions, for automating simple workflows. This should work for what you want to accomplish. See https://docs.aws.amazon.com/step-functions/latest/dg/tutorial-creating-lambda-state-machine.html
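As a rough sketch of how your /api/uploadS3 handler could defer its response until the whole workflow has finished, assuming you wrap the conversion steps in a state machine (the state machine ARN, input fields, and polling numbers below are placeholders):

```javascript
// Sketch: start the conversion state machine and keep the HTTP request open
// until it finishes, so the browser gets a single success/error response.
const AWS = require('aws-sdk');
const stepfunctions = new AWS.StepFunctions();

async function runConversionWorkflow(inputBucket, inputKey, outputPath) {
  const { executionArn } = await stepfunctions.startExecution({
    stateMachineArn: process.env.CONVERT_STATE_MACHINE_ARN, // placeholder env var
    input: JSON.stringify({ inputBucket, inputKey, outputPath }),
  }).promise();

  // Poll until the execution leaves the RUNNING state (or give up after ~5 minutes).
  for (let i = 0; i < 150; i++) {
    const { status } = await stepfunctions.describeExecution({ executionArn }).promise();
    if (status !== 'RUNNING') return status; // SUCCEEDED, FAILED, TIMED_OUT, ABORTED
    await new Promise((resolve) => setTimeout(resolve, 2000));
  }
  throw new Error('Timed out waiting for the conversion workflow');
}
```

Polling describeExecution is what lets the API call return success or error only once conversion and the final copy have completed.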
Note that any one Lambda function has a 15-minute maximum runtime, so any one transcoding or file copy operation must complete within 15 minutes. The alternative is to run on EC2.
I hope this helps you out!
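And a rough sketch of the conversion Lambda itself, assuming the Node API attaches the desired output path as object metadata when it uploads (the metadata key, bucket names, and role ARN are placeholders, and the MediaConvert job settings are abbreviated):

```javascript
// Sketch only: Lambda fired by the S3 PUT event on the upload bucket.
// It reads the destination path from metadata the Node API set at upload time,
// then submits an AWS MediaConvert job targeting that path.
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

exports.handler = async (event) => {
  const record = event.Records[0];
  const bucket = record.s3.bucket.name;
  const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));

  // e.g. the Node API called s3.upload({ ..., Metadata: { 'output-path': 's3://user-bucket/user123/' } })
  const head = await s3.headObject({ Bucket: bucket, Key: key }).promise();
  const outputPath = head.Metadata['output-path'];

  // MediaConvert uses an account-specific endpoint; resolve it first.
  const { Endpoints } = await new AWS.MediaConvert()
    .describeEndpoints({ MaxResults: 1 }).promise();
  const mediaConvert = new AWS.MediaConvert({ endpoint: Endpoints[0].Url });

  await mediaConvert.createJob({
    Role: process.env.MEDIACONVERT_ROLE_ARN, // IAM role MediaConvert assumes
    Settings: {
      Inputs: [{ FileInput: `s3://${bucket}/${key}` }],
      OutputGroups: [{
        OutputGroupSettings: {
          Type: 'FILE_GROUP_SETTINGS',
          FileGroupSettings: { Destination: outputPath },
        },
        // Codec/bitrate details omitted for brevity; a real job needs them.
        Outputs: [{ ContainerSettings: { Container: 'MP4' } }],
      }],
    },
  }).promise();
};
```

Since MediaConvert writes the MP4 to whatever Destination you pass, the user-specific path from the original request ends up as the output location.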

Related

Listen to an AWS S3 bucket when a file is uploaded

I have a problem and I need to know if anyone has an idea how to solve it.
I need to create something that listens to the S3 bucket, and when a file is uploaded there, takes the uploaded file and manipulates it in my website with all kinds of processes that I already have.
So basically, is there something that lets me listen for uploads made to S3 and then manipulate them?
Thank you
There are many ways to achieve this.
First, enable an S3 notification that is triggered on S3 PutObject, and have it trigger any of these -
Lambda - gets the object and processes it (not for large files; Lambda can run for at most 15 minutes)
SQS - put new-object notifications in an SQS queue, then launch EC2 instances to process the files. You can use Auto Scaling and CloudWatch alarms with it.
Or some other option along these lines.
My suggestion would be this -
S3 notification -> trigger Lambda -> get the object key and run an EC2 instance -> EC2 does the hard work
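A very rough sketch of the Lambda step in that chain, assuming the processing code is baked into an AMI and the object location is handed over via user data (the AMI ID, instance type, and script path are placeholders):

```javascript
// Sketch: Lambda triggered by the S3 notification starts a worker EC2 instance.
const AWS = require('aws-sdk');
const ec2 = new AWS.EC2();

exports.handler = async (event) => {
  const { bucket, object } = event.Records[0].s3;

  await ec2.runInstances({
    ImageId: 'ami-xxxxxxxx',   // placeholder: an AMI with your processing code installed
    InstanceType: 't3.medium', // placeholder instance size
    MinCount: 1,
    MaxCount: 1,
    // Hand the uploaded object's location to the instance via user data.
    UserData: Buffer.from(
      `#!/bin/bash\n/opt/app/process.sh "s3://${bucket.name}/${object.key}"\n`
    ).toString('base64'),
  }).promise();
};
```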
No idea is perfect; it highly depends on your system. Look for the solution that best meets your needs.
Best wishes.

NodeJS app: best way to process heavy logic in an Azure VM

I have created a NodeJS Express POST API which processes a file stored in Azure Blob Storage. In the POST call to this API, I am just sending the filename in the body. This call gets the data from the blob, creates a local file with that data, runs a command on that file, and stores it in the blob again. But that command takes 10-15 minutes to process one file, which is why the request times out. I have 2 questions here:
Is there a way to respond to the call before processing starts, i.e., respond to the API call and then start file processing?
If 1 is not possible, what is the best solution for this problem?
Thank you in advance.
You must use a queue for long-running tasks. For that you can choose any library like agenda, bull, bee-queue, kue, etc.
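For example, a minimal sketch with Bull (the queue name, the default local Redis connection, and processFile() are placeholders; the other libraries work along the same lines):

```javascript
// Sketch: respond to the POST right away and run the long task in a background worker.
const express = require('express');
const Queue = require('bull');

const fileQueue = new Queue('file-processing'); // connects to a local Redis by default
const app = express();
app.use(express.json());

app.post('/process', async (req, res) => {
  const job = await fileQueue.add({ filename: req.body.filename });
  res.status(202).json({ jobId: job.id }); // reply before processing starts
});

// Worker: the 10-15 minute blob download / command / upload happens here.
fileQueue.process(async (job) => {
  await processFile(job.data.filename); // placeholder for your existing logic
});

async function processFile(filename) { /* download blob, run command, upload result */ }

app.listen(3000);
```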

How to create an end connection for the AWS S3 out node in Node-RED?

I have a use case where I want to upload a local file to an S3 bucket and then continue with the next process. So I am using the Amazon S3 out node. It uploads the file properly, but it does not let me perform further operations, because the node has only a one-way connection.
How can I perform the next operation using this node? Any hint would be appreciated.
You have 3 options:
1. Modify the node to have an output. If you do this, you should consider raising a pull request against the project so the authors can decide if they want to include your changes.
2. Split the flow into 2 parts, and then use the AWS watch node (included in the same package as the AWS S3 output node you are using) to watch for the event that matches the file being uploaded. You will have to store any data that is included with the file you are uploading in the context so it can be retrieved by the second flow.
3. Pretty much the same as option 2, but you can use the Status node instead to watch for changes in the status of the AWS output node to trigger the later steps.
Out of the three options, I'd probably try them in that order.

Scan files in an AWS S3 bucket for viruses using Lambda

We have a requirement to scan the files uploaded by users, check whether they contain a virus, and then tag them as infected. I checked a few blogs and other Stack Overflow answers and learned that we can use clamscan for this.
However, I'm confused about what the virus scan path should be in the clamscan config. Also, is there a tutorial I can refer to? Our application backend is in Node.js.
I'm open to other libraries/services as well
Hard to say without further info (i.e., the architecture your code runs on, etc.).
I would say the easiest possible way to achieve what you want is to hook up a trigger on every PUT event on your S3 Bucket. I have never used any virus scan tool, but I believe that all of them run as a daemon within a server, so you could subscribe an SQS Queue to your S3 Bucket event and have a server (which could be an EC2 instance or an ECS task) with a virus scan tool installed poll the SQS queue for new messages.
Once the message is processed and a vulnerability is detected, you could simply invoke the putObjectTagging API on the malicious object.
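For instance, a minimal sketch of that tagging step (the tag key and value are placeholders):

```javascript
// Sketch: tag an S3 object once the scanner reports it as infected.
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

async function tagAsInfected(bucket, key) {
  await s3.putObjectTagging({
    Bucket: bucket,
    Key: key,
    Tagging: { TagSet: [{ Key: 'scan-status', Value: 'infected' }] }, // placeholder tag
  }).promise();
}
```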
We have been doing something similar, but in our case it happens before the file is stored in S3. That's OK, I think; the solution would still work for you.
We have one EC2 instance where we have installed ClamAV. We then wrote a web service that accepts a multipart file, takes the file content, and internally invokes the ClamAV command to scan that file. In response, the service returns whether the file is infected or not.
Your solution could be:
1. Create a web service as mentioned above and host it on EC2 (let's call it the virus scan service).
2. From a Lambda function, call the virus scan service, passing the file content.
3. Based on the virus scan service response, tag your S3 file appropriately.
If you are open to a paid service too, then in the steps above, #1 won't be applicable; just replace it with a call to the virus scan service of Symantec or other such providers.
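A rough sketch of steps 2 and 3, assuming the scan service exposes an HTTP endpoint that accepts the raw file bytes and returns something like { infected: true } (the URL and response shape are placeholders):

```javascript
// Sketch: Lambda fetches the new object, posts it to the scan service on EC2,
// then tags the object based on the response.
const AWS = require('aws-sdk');
const axios = require('axios');
const s3 = new AWS.S3();

exports.handler = async (event) => {
  const { bucket, object } = event.Records[0].s3;
  const file = await s3.getObject({ Bucket: bucket.name, Key: object.key }).promise();

  // Placeholder URL for the EC2-hosted virus scan service.
  const { data } = await axios.post('http://virus-scan-service.internal/scan', file.Body, {
    headers: { 'Content-Type': 'application/octet-stream' },
  });

  await s3.putObjectTagging({
    Bucket: bucket.name,
    Key: object.key,
    Tagging: { TagSet: [{ Key: 'scan-status', Value: data.infected ? 'infected' : 'clean' }] },
  }).promise();
};
```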
I hope it helps.
You can check this solution by AWS, it will give you an idea of a similar architecture: https://aws.amazon.com/blogs/developer/virus-scan-s3-buckets-with-a-serverless-clamav-based-cdk-construct/

How to run a Python script using S3 data in AWS

I have a CSV file in S3. I want to run a Python script using the data present in S3. The S3 file changes once a week. I need to pass an input argument to my Python script, which loads the S3 file into Pandas and does some calculations to return the result.
Currently I am loading this S3 file using Boto3 on my server for each input argument. This process takes too long to return the result, and my nginx returns a 504 Gateway Timeout.
I am expecting some AWS service to do this in the cloud. Can anyone point me in the right direction as to which AWS service is suitable to use here?
You have several options:
Use AWS Lambda, but Lambda has limited local storage (500 MB) and memory (3 GB), with a 15-minute maximum run time.
Since you mentioned Pandas, I recommend using AWS Glue, which has the ability to:
Detect new files
Support large memory and CPU
Provide a visual data flow
Support Spark DataFrames
Query data from your CSV files
Connect to different database engines.
We currently use AWS Glue for our data parser processes.
