Running Background Process using FFMPEG on Google Cloud Run stopping in middle - node.js

I have an external bash script that transcodes audio files using FFmpeg and then uploads the files to Google Cloud Storage. I am using Google Cloud Run for this process, but the process stops in the middle and I am not getting any clue from the logs. I am using the Node.js spawn command to execute the bash script:
const createHLSVOD = spawn('/bin/bash', [script, file.path, file.destination, contentId, EPPO_MUSIC_HSL_URL, 'Content', speed]);
createHLSVOD.stdout.on('data', d => console.log(`stdout info: ${d}`));
createHLSVOD.stderr.on('data', d => console.log(`stderr error: ${d}`));
createHLSVOD.on('error', d => console.log(`error: ${d}`));
createHLSVOD.on('close', code => console.log(`child process ended with code ${code}`));
On Cloud Run, even starting the process takes a lot of time, whereas on my local machine transcoding and uploading are very fast. After some time the transcoding logs stop and no new logs appear. I have no clue what is happening.
So what is happening here? Why is it so slow in the first place, and why does the process stop in the middle without any error?
Node.js script
Transcoding script
Dockerfile

The issue with spawn is that you create an asynchronous process that runs in the background. That's a problem, because Cloud Run only allocates CPU to the container while a request is being processed. In your case, the sequence is:
A request arrives
A spawn is created
The spawned script runs in the background
An HTTP response is sent for the request. Cloud Run throttles the CPU
Your spawned script continues to run
There are 2 consequences
Your script processing is very, very long because the throttling limits the CPU to under 5%
After a period without activity (i.e. no requests received by the instance), Cloud Run kills the unused instance to save resources on its side. It's about 15 minutes, but that's subject to change; it's Google Cloud internal sauce
I recommend you wait for the end of the spawned script, or use a synchronous call such as execSync instead of the async spawn.
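For the first option, here is a minimal sketch of waiting for the script before answering the request, assuming an Express handler and the same script variables as in the question (the route name is illustrative only):

const { spawn } = require('child_process');

function runScript(args) {
  return new Promise((resolve, reject) => {
    const child = spawn('/bin/bash', args);
    child.stdout.on('data', d => console.log(`stdout info: ${d}`));
    child.stderr.on('data', d => console.log(`stderr error: ${d}`));
    child.on('error', reject);
    child.on('close', code => code === 0 ? resolve(code) : reject(new Error(`child process ended with code ${code}`)));
  });
}

app.post('/transcode', async (req, res) => {
  try {
    // Keep the request open until the script finishes, so Cloud Run does not throttle the CPU mid-transcode
    await runScript([script, file.path, file.destination, contentId, EPPO_MUSIC_HSL_URL, 'Content', speed]);
    res.status(200).send('transcoding finished');
  } catch (err) {
    res.status(500).send(String(err));
  }
});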

I think the best solution for this issue is to change the architecture. Instead of having the client upload the file to Cloud Storage through Cloud Run, I would suggest using a signed URL. A signed URL carries permissions and an expiry time, allowing users without credentials to perform specific actions on a resource; for example, you can give the user write access to a bucket for a limited time so they can upload a file. From the client side, this is very similar to your current process, replacing the Cloud Run URL with the signed URL.
You can check this link to read more about signed URLs, and here is a quick guide on how to generate one.
For example, you can create a signed URL to test with this command:
First create a private key
gcloud iam service-accounts keys create KEY_FILE_LOCATION/KEY_FILE_NAME \
  --iam-account=SERVICE_ACCOUNT_NAME@PROJECT_ID.iam.gserviceaccount.com
Now you authenticate the service account
gcloud auth activate-service-account --key-file KEY_FILE_LOCATION/KEY_FILE_NAME
And now create the signed URL
gsutil signurl -m PUT -d 1h -c CONTENT_TYPE -u gs://BUCKET_NAME/OBJECT_NAME
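If you prefer to generate the signed URL from your Node.js backend instead of gsutil, a minimal sketch with the @google-cloud/storage client could look like this (bucket and object names are placeholders):

const { Storage } = require('@google-cloud/storage');
const storage = new Storage();

async function getUploadUrl(bucketName, objectName, contentType) {
  // V4 signed URL that lets the client PUT the file directly, valid for 1 hour (matching -d 1h above)
  const [url] = await storage.bucket(bucketName).file(objectName).getSignedUrl({
    version: 'v4',
    action: 'write',
    expires: Date.now() + 60 * 60 * 1000,
    contentType,
  });
  return url;
}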
After that, the client uploads the file using the signed URL to the bucket in Cloud Storage. I would use a Pub/Sub notification so you know that the file was uploaded without any problem, and you can use that notification to trigger other operations in Cloud Run.
Basically, Pub/Sub notifications are sent when an object changes in a specific bucket; you can follow this guide, which should help you configure that notification.
For example, to get notified about newly uploaded objects, you can configure it this way:
gsutil notification create -e OBJECT_FINALIZE -f json gs://<bucket name>
To use this notification to trigger a Cloud Run process you can see this link.
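For reference, here is a rough sketch of what the Cloud Run endpoint receiving the Pub/Sub push could look like; the OBJECT_FINALIZE payload is base64-encoded JSON containing the bucket and object name (the route name is an assumption):

app.post('/pubsub', express.json(), (req, res) => {
  const message = req.body && req.body.message;
  if (!message || !message.data) return res.status(400).send('no Pub/Sub message received');
  // Decode the Cloud Storage object metadata carried by the notification
  const object = JSON.parse(Buffer.from(message.data, 'base64').toString());
  console.log(`new upload: gs://${object.bucket}/${object.name}`);
  // ...start the transcoding job for this object here...
  res.status(204).send();
});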

Related

Media conversion on AWS

I have an API written in Node.js (/api/uploadS3) which is a PUT request and accepts a video file and a URL (an AWS S3 URL in this case). Once called, its task is to upload the file to that S3 URL.
Now, users are uploading files to this Node API in different formats (thanks to different browsers recording videos in different formats) and I want to convert all these videos to mp4 and then store them in S3.
I wanted to know what is the best approach to do this?
I have 2 solutions so far
1. Convert on the Node server using FFmpeg -
The issue with this is that FFmpeg can only execute a single operation at a time. And since I have only one server, I will have to implement a queue for multiple requests, which can lead to longer waiting times for users at the end of the queue. Apart from that, I am worried that my Node server's traffic-handling capability will be affected during any ongoing video conversion.
Can someone help me understand what the effect of other requests coming to my server will be while a video conversion is going on? How will it impact RAM, CPU usage, and the speed of processing other requests?
2. Using an AWS Lambda function -
To avoid load on my Node server, I was thinking of using AWS Lambda: my Node API uploads the file to S3 in the format provided by the user. Once done, S3 triggers a Lambda function which can take that S3 file, convert it into .mp4 using FFmpeg or AWS MediaConvert, and once done upload the mp4 file to a new S3 path. However, I don't want the output path to be just any S3 path but the path that was received by the Node API in the first place.
Moreover, I want the user to wait while all this happens, as I have to enable other UI features based on the success or error of this upload.
The question here is: is it possible to do this using just a single API like /api/uploadS3 which --> uploads to S3 --> triggers Lambda --> converts the file --> uploads the mp4 version --> returns success or error?
Currently, if I upload to S3, the request ends then and there. So is there a way to defer the API response until all the operations have been completed?
Also, how will the Lambda function access the path of the output S3 bucket which was passed to the Node API?
Any other better approach will be welcomed.
PS - the S3 path received by the Node API is different for each user.
Thanks for your post. The output S3 bucket generates File Events when a new file arrives (i.e., is delivered from AWS MediaConvert).
This file event can trigger a second Lambda function which can move the file elsewhere using any of the supported transfer protocols, retry if necessary, log a status to AWS CloudWatch and/or AWS SNS, and then send a final API response based on the success/completion of the move.
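A minimal sketch of such a second Lambda, assuming the Node SDK and a hypothetical destination bucket (in practice the per-user path would have to be looked up, e.g. from metadata recorded by your API):

const AWS = require('aws-sdk');
const s3 = new AWS.S3();

exports.handler = async (event) => {
  // The S3 event carries the bucket and key of the file MediaConvert just wrote
  const record = event.Records[0];
  const sourceBucket = record.s3.bucket.name;
  const sourceKey = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));

  // Hypothetical destination; replace with the user-specific path
  const destinationBucket = 'my-destination-bucket';

  await s3.copyObject({
    Bucket: destinationBucket,
    Key: sourceKey,
    CopySource: `${sourceBucket}/${sourceKey}`,
  }).promise();

  return { status: 'moved', key: sourceKey };
};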
AWS has a Step Functions feature which can maintain state across successive Lambda functions, for automating simple workflows. This should work for what you want to accomplish. See https://docs.aws.amazon.com/step-functions/latest/dg/tutorial-creating-lambda-state-machine.html
Note, any one Lambda function has a 15 minute maximum runtime, so any one transcoding or file copy operation must complete within 15 minutes. The alternative is to run on EC2.
I hope this helps you out!

Is it possible to set ACK deadline for an eventarc pubsub subscription?

I am using an Eventarc trigger to send messages to my Cloud Run instances. However, the issue is that I am unable to set an ACK deadline since there is no way to set any attributes. I have also tried using pure Pub/Sub; however, I am unable to get the endpoint URL since it is only retrievable after it is created. I want to avoid having to split this into two different modules since it adds additional complexity, but I am unsure what else to do at this point.
Any help would be greatly appreciated!
Eventarc is backed by Pub/Sub, and the max acknowledgement deadline of Pub/Sub is 10 minutes. Therefore, even if the timeout were customizable on Eventarc, it wouldn't match your requirements.
You have several solutions
You can invoke a Cloud Run/Functions service that creates a Cloud Task, and the task calls your Cloud Run service (async) to perform the processing in the background (see the sketch after this list)
You can invoke a Cloud Workflow that calls your Cloud Run service (async) to perform the processing in the background
You can invoke a Cloud Run/Functions service that invokes the brand new Cloud Run jobs (with your processing code) to perform the processing in the background
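For the Cloud Tasks option, here is a rough sketch with the @google-cloud/tasks client; the project, location, queue, target URL and service account are placeholders you would replace with your own values:

const { CloudTasksClient } = require('@google-cloud/tasks');
const client = new CloudTasksClient();

async function enqueueProcessing(payload) {
  const parent = client.queuePath('PROJECT_ID', 'LOCATION', 'QUEUE_NAME');
  const [task] = await client.createTask({
    parent,
    task: {
      httpRequest: {
        httpMethod: 'POST',
        url: 'https://YOUR_CLOUD_RUN_URL/process',
        headers: { 'Content-Type': 'application/json' },
        body: Buffer.from(JSON.stringify(payload)).toString('base64'),
        // Authenticate the task against a private Cloud Run service
        oidcToken: { serviceAccountEmail: 'INVOKER_SA@PROJECT_ID.iam.gserviceaccount.com' },
      },
      // HTTP tasks can wait up to 30 minutes for a response, longer than the 10 minute Pub/Sub limit
      dispatchDeadline: { seconds: 1800 },
    },
  });
  return task.name;
}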

NodeJS app best way to process large logic in azure vm

I have created a Node.js Express POST API which processes a file stored in an Azure blob. In the POST call to this API, I just send the filename in the body. The call gets the data from the blob, creates a local file with that data, runs a command on that file, and stores the result in the blob again. But that command takes 10-15 minutes to process one file, which is why the request is timing out. I have 2 questions here:
Is there a way to respond to the call before the processing starts, i.e. respond to the API call and then start the file processing?
If 1 is not possible, what is the best solution for this problem?
Thank you in advance.
You must use a queue for long-running tasks. For that you can choose any library like Agenda, Bull, Bee-Queue, Kue, etc.
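As a rough sketch with Bull (one of the libraries mentioned), assuming Redis is reachable and processFile is your existing blob-download/run-command/upload logic:

const Queue = require('bull');
const fileQueue = new Queue('file-processing', 'redis://127.0.0.1:6379');

// API: enqueue the job and answer immediately instead of holding the request for 10-15 minutes
app.post('/process', async (req, res) => {
  const job = await fileQueue.add({ filename: req.body.filename });
  res.status(202).json({ jobId: job.id }); // the client can poll a status endpoint later
});

// Worker: runs the long command in the background (same VM or a separate process)
fileQueue.process(async (job) => {
  await processFile(job.data.filename); // your existing blob + command logic
});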

Scan files in AWS S3 bucket for virus using lambda

We have a requirement to scan the files uploaded by users, check whether they contain a virus, and then tag them as infected. I checked a few blogs and other Stack Overflow answers and got to know that we can use clamscan for this.
However, I'm confused about what the path for the virus scan should be in the clamscan config. Also, is there a tutorial that I can refer to? Our application backend is in Node.js.
I'm open to other libraries/services as well.
Hard to say without further info (i.e. the architecture your code runs on, etc.).
I would say the easiest possible way to achieve what you want is to hook up a trigger on every PUT event on your S3 Bucket. I have never used any virus scan tool, but I believe that all of them run as a daemon within a server, so you could subscribe an SQS Queue to your S3 Bucket event and have a server (which could be an EC2 instance or an ECS task) with a virus scan tool installed poll the SQS queue for new messages.
Once the message is processed and a vulnerability is detected, you could simply invoke the putObjectTagging API on the malicious object.
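For reference, tagging the object from Node.js with the AWS SDK might look like this (the tag key/value are just examples):

const AWS = require('aws-sdk');
const s3 = new AWS.S3();

async function tagAsInfected(bucket, key) {
  // Attach a tag the rest of your application can filter on
  await s3.putObjectTagging({
    Bucket: bucket,
    Key: key,
    Tagging: { TagSet: [{ Key: 'scan-status', Value: 'infected' }] },
  }).promise();
}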
We have been doing something similar, but in our case it happens before the file is stored in S3. Which is OK, I think; the solution would still work for you.
We have one EC2 instance where we have installed ClamAV. We then wrote a web service that accepts a multipart file, takes the file content, and internally invokes the ClamAV command to scan it. In response, the service returns whether the file is infected or not.
Your solution could be:
Create a web service as mentioned above and host it on EC2 (let's call it the virus scan service).
In the Lambda function, call the virus scan service, passing the file content (see the sketch after this list).
Based on the virus scan service's response, tag your S3 file appropriately.
If you are open to a paid service too, then in the above steps #1 won't be applicable; just replace it with a call to the virus scan service of Symantec or other such providers.
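A rough sketch of step 2, running in a Lambda triggered by the S3 upload event; SCAN_SERVICE_URL and the response shape of the scan service are assumptions:

const axios = require('axios');
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

exports.handler = async (event) => {
  const bucket = event.Records[0].s3.bucket.name;
  const key = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, ' '));

  // Fetch the uploaded object and send its content to the EC2-hosted scan service
  const object = await s3.getObject({ Bucket: bucket, Key: key }).promise();
  const { data } = await axios.post(process.env.SCAN_SERVICE_URL, object.Body, {
    headers: { 'Content-Type': 'application/octet-stream' },
  });

  // Assumed response shape: { infected: true | false }
  const status = data.infected ? 'infected' : 'clean';
  console.log(`scan result for ${key}: ${status}`);
  // Then tag the object, e.g. with putObjectTagging as in the earlier sketch
};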
I hope it helps.
You can check this solution by AWS, it will give you an idea of a similar architecture: https://aws.amazon.com/blogs/developer/virus-scan-s3-buckets-with-a-serverless-clamav-based-cdk-construct/

Process Redis KUE jobs within multiple kubernetes pods/instances

I'm using Sails.js for an API which I deploy from a Dockerfile in a Google Cloud Kubernetes cluster and scale the workload with 3-5 pods. The API provides endpoints to upload single image files and bigger zip files which I extract directly on the current API pod/instance.
Both single image files and the extracted archive content (100-1000 files, all together 15-85 MB of content) have to be uploaded to various Storage buckets. This is where Redis Kue comes into play. To make sure the API doesn't block the request for too long, I create delayed Kue jobs to move all the uploaded files and folders to storage buckets, or I chain jobs and create thumbnails with the help of ImageMagick first.
All this can take some time, depending on the current workload of the cluster, sometimes more and sometimes less.
All this works pretty well with a single instance, but within a cluster it's a different story. Since the Kubernetes pod serving the API can change from request to request, the uploads can land on instance A, while the job for those files is processed and handled by instance B (the worker, as well as the API itself, run on the same instance!), which might not have the uploads available, which leads to a failed job.
It takes time for Google to keep the pods in sync and to spread the uploads to all the other pods.
What I have tried is the following:
Since the name of the current pod is available via the env variable HOSTNAME, I store the HOSTNAME with every Kue job and check within the worker whether the HOSTNAME from the job matches the HOSTNAME of the current environment, and only allow the job to be processed if both HOSTNAMEs match (a rough sketch of this is below).
Uploads need to be available ASAP, which is why I can't add a job delay of a few minutes and hope that, by the time the job is processed, Google has synchronized its pods.
Pending jobs that don't match the HOSTNAME I push back to the queue with an added delay.
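For context, the workaround described above, as I understand it, roughly sketched with Kue (queue options, job names and the upload helper are illustrative):

const kue = require('kue');
const queue = kue.createQueue({ redis: process.env.REDIS_URL });

// When files are uploaded, remember which pod has them on its local disk
queue.create('move-uploads', { hostname: process.env.HOSTNAME, dir: uploadDir })
  .delay(60 * 1000)
  .attempts(5)
  .save();

// Worker: only handle jobs whose files live on this pod, otherwise re-queue with a delay
queue.process('move-uploads', (job, done) => {
  if (job.data.hostname !== process.env.HOSTNAME) {
    queue.create('move-uploads', job.data).delay(30 * 1000).save();
    return done();
  }
  moveToBucket(job.data.dir).then(() => done()).catch(done); // hypothetical upload helper
});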
What I want to have is a queue which doesn't have to take care of hostnames and conditional checks to successfully process its jobs in a cluster like mine.
for this one "which might won't have the uploads available which leads into a failed job" could you please consider using "Persistent Volumes".
In this case your jobs could work independly looking for extracted archive content into shared storage.
Hope this help. Please share with your findings.
