NodeJS app best way to process large logic in azure vm - node.js

I have created a NodeJS express post api which process the file stored in azure blob. In the post call to this api, I am just sending filename in body. This call will get the data from blob, create a local file with that data, run command on that file and store it in blob again. But that command is taking 10-15 mins to process one file. That's why the request is getting timed out. I had 2 questions here:
Is there way to respond to the call before processing starts. Like respond to the api call and then start file processing.
If 1 is not possible, which is the best solution for this problem
Thank you in advance.

You must use queue for long running tasks. For that you can choose any library like agendajs, bull, bee-queue, kue etc

Related

How to poll another server from Node.js?

I'm currently developing a Shopify app with Node/Express and a Postgres database. When a user registers an account and connects their Shopify store, I'll need to download all of their store's orders. They could have 100,000s of orders, so I'd like to use a Shopify GraphQL Bulk Operation. While Shopify is handling this, my Node server will need to poll the Shopify server to check on the progress, and when the operation is complete, Shopify will send me a link where I can download all of the data. Once the data is processed and stored in my database, I'll send the user an email to say that their account is now set up.
How should I handle polling the Shopify server? The process could take anywhere from a few mins to hours. Using setInterval() would be a bad idea right? Because if the server restarts for whatever reason, It will lose the interval? So, should I use some sort of background task? And would I need to store anything in my database? I've researched cron jobs, child processes, worker threads, the bull package -- and it's left me a little confused.
(I also know that I could use a webhook, but Shopify offers no guarantees that my app will receive the webhook.)
Upon installation, launch a background job labeled "GetCustomerOrders". As you know, background jobs are mature, and nicely handle problems. For example, they can retry themselves if something goes wrong.
The Background job itself just sets up the Bulk Download and then settles into Poll. Polling is no big deal and just happens. As you said, could be minutes, could take hours. Nevertheless, a poll gets status on a bulk download, and that can even be hot-rodded. For example, you poll with an ID. So you poll till that ID completes. Regardless of restarts.
At the end of that rather simple setup, you get an URL to download and parse JSON. Spawn another job even for that. Endless fun. Why sweat it? Background jobs are the way to go.
The Webhook idea is OK but as the documentation says, they are not 100% and CRON is bush-league in that it misses out on the mature development of jobs in queues and is more like a simple trigger. Relying on CRON to start something is fine, but gives you zero management over what it starts.
I am guessing NodeJS has a decent background job system by this time. When you look at Sidekiq for Ruby you realize what awesome is. Surely you can find a copycat in Node that comes close anyway.

Media conversion on AWS

I have an API written in nodeJS (/api/uploadS3) which is a PUT request and accepts a video file and a URL (AWS s3 URL in this case). Once called its task is to upload the file on the s3 URL.
Now, users are uploading files to this node API in different formats (thanks to the different browsers recording videos in different formats) and I want to convert all these videos to mp4 and then store them in s3.
I wanted to know what is the best approach to do this?
I have 2 solutions till now
1. Convert on node server using ffmpeg -
The issue with this is that ffmpeg can only execute a single operation at a time. And since I have only one server I will have to implement a queue for multiple requests which can lead to longer waiting times for users who are at the end of the queue. Apart from that, I am worried that during any ongoing video conversion if my node's traffic handling capability will be affected.
Can someone help me understand what will be the effect of other requests coming to my server while video conversion is going on? How will it impact the RAM, CPU usage and speed of processing other requests?
2. Using AWS lambda function -
To avoid load on my node server I was thinking of using an AWS lambda server where my node API will upload the file to S3 in the format provided by the user. Once, done s3 will trigger a lambda function which can then take that s3 file and convert it into .mp4 using ffmpeg or AWS MediaConvert and once done it uploads the mp4 file to a new s3 path. Now I don't want the output path to be any s3 path but the path that was received by the node API in the first place.
Moreover, I want the user to wait while all this happens as I have to enable other UI features based on the success or error of this upload.
The query here is that, is it possible to do this using just a single API like /api/uploadS3 which --> uploads to s3 --> triggers lambda --> converts file --> uploads the mp4 version --> returns success or error.
Currently, if I upload to s3 the request ends then and there. So is there a way to defer the API response until and unless all the operations have been completed?
Also, how will the lambda function access the path of the output s3 bucket which was passed to the node API?
Any other better approach will be welcomed.
PS - the s3 path received by the node API is different for each user.
Thanks for your post. The output S3 bucket generates File Events when a new file arrives (i.e., is delivered from AWS MediaConvert).
This file event can trigger a second Lambda Function which can move the file elsewhere using any of the supported transfer protocols, re-try if necessary; log a status to AWS CloudWatch and/or AWS SNS; and then send a final API response based on success/completion of them move.
AWS has a Step Functions feature which can maintain state across successive lambda functions, for automating simple workflows. This should work for what you want to accomplish. see https://docs.aws.amazon.com/step-functions/latest/dg/tutorial-creating-lambda-state-machine.html
Note, any one lambda function has a 15 minute maximum runtime, so any one transcoding or file copy operation must complete within 15min. The alternative is to run on EC2.
I hope this helps you out!

http api on google cloud using app engine or cloud function

i want to build an api using python and host it on google cloud. api will basically read some data in bucket and then do some processing on it and return the data back. i am hoping that i can read the data in memory and when request comes, just process it and send response back to serve it with low latency. assume i will read few thousand records from some database/storage and when request comes process and send 10 back based on request parameters. i dont want to make connection/read storage when the request comes as it will take time and i want to serve as fast as possible.
will google cloud function work for this need? or shoudl i go with app engine. (basically i want to be able to read the data once and hold it for incoming requests). data mostly will be less than 1-2 gb (max)
thanks,
Manish
You have to have the static data with your function code. Increase the Cloud Functions memory to allow it to load in memory the data to keep them warm and for having a very fast access to it.
Then, you have 2 way to achieve it:
Load the data at startup. You load them only once, the first call has a high latency to download (from GCS for instance) and load the data in memory. The advantage is: in case of data update, you don't have to redeploy your function, only update the data in their location. At the next function start, the new data will be loaded
Deploy the function with the static data in the deployment. This time the startup time is much faster (no download), only the data to load in memory. But when you want to update the data, you have to redeploy your function.
A final word if you have 2 set of static data, you must have 2 functions. The responsibility is different, so the deployment is different.

Generate big data in excel or pdf using REST API

I'm trying to generate the excel report file in micro-service using REST API.
On REST API if the generation process may take long time, connection would give time out for the users.
Is there any best practice or architecture pattern for this purpose?
EX: If data includes 10 column with 1 million rows the generation process should spend 30 seconds. Also it might depends on what technical resources we have.
You should do heavy task in asynchronous way. Client should just trigger the process and should not wait for the completion. Now question come how Client will get updated copy of Excel. There are 2 ways:-
In response of initiate call, server return a job Id. Client will keep polling for the status of job Id. Whenever job get completed, it will get the file.
Some notification mechanism like Socket.io, where server will notify whenever job is done. After getting notification, client may download the processed file.

Waiting for task to complete nodejs

I have an express route that make a thumbnail of an image, writes it locally to a file and streams it back to the user.
I'm running in cluster mode, and if consecutive requests come they all re-size the image together and write on each other,
I would like the first request to start re-sizing, and the other requests to know there is already as job running and wait for it to complete and then stream it's results.
How can I achieve that ?
You could give locker a try: https://github.com/bobrik/locker & https://github.com/bobrik/node-locker

Resources