I'm trying to find the best way to handle a long-running process in an Express app.
I have tried forks up till now. Is there any other good way to handle a long-running process in Node?
Use Cases:
Read files, parse the data, and store it to the DB; this loops every 15 min.
Polling: after some time period, loop over approx. 1000 records, read data from an external API, and save the data.
I'm using setInterval to start the process again; I need a solution that prevents memory leaks.
Thanks in advance.
Try using https://www.npmjs.com/package/node-cron: create a cron worker with the interval.
This should address both of your concerns.
You can spin up multiple cron workers.
Poll the external services in the cron job and update the DB, so the service layer always looks for the data in the DB instead of contacting 3rd-party services.
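A minimal sketch of what that could look like; the schedules and the job bodies are placeholders for your own logic:

```js
// Hedged sketch using node-cron; the two job bodies are placeholders for your
// own file-parsing and API-polling logic.
const cron = require('node-cron');

// Use case 1: read files, parse data, and store it to the DB every 15 minutes.
cron.schedule('*/15 * * * *', async () => {
  try {
    // ...read files, parse, save to the DB...
  } catch (err) {
    console.error('file sync failed:', err);
  }
});

// Use case 2: poll the external API for ~1000 records on its own schedule.
cron.schedule('0 * * * *', async () => {   // hourly, as an example interval
  try {
    // ...loop over records, call the external API, save results...
  } catch (err) {
    console.error('polling failed:', err);
  }
});
```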
Hope this helps you. Cheers.
Related
I'm currently developing a Shopify app with Node/Express and a Postgres database. When a user registers an account and connects their Shopify store, I'll need to download all of their store's orders. They could have 100,000s of orders, so I'd like to use a Shopify GraphQL Bulk Operation. While Shopify is handling this, my Node server will need to poll the Shopify server to check on the progress, and when the operation is complete, Shopify will send me a link where I can download all of the data. Once the data is processed and stored in my database, I'll send the user an email to say that their account is now set up.
How should I handle polling the Shopify server? The process could take anywhere from a few minutes to hours. Using setInterval() would be a bad idea, right? Because if the server restarts for whatever reason, it will lose the interval. So, should I use some sort of background task? And would I need to store anything in my database? I've researched cron jobs, child processes, worker threads, the bull package -- and it's left me a little confused.
(I also know that I could use a webhook, but Shopify offers no guarantees that my app will receive the webhook.)
Upon installation, launch a background job labeled "GetCustomerOrders". As you know, background jobs are mature, and nicely handle problems. For example, they can retry themselves if something goes wrong.
The background job itself just sets up the bulk download and then settles into polling. Polling is no big deal and just happens. As you said, it could take minutes or hours. Each poll gets the status of the bulk download by ID, so you keep polling until that ID completes, regardless of restarts.
At the end of that rather simple setup, you get a URL to download and parse JSON. Spawn another job even for that. Endless fun. Why sweat it? Background jobs are the way to go.
The webhook idea is OK, but as the documentation says, webhooks are not 100% reliable. Cron is bush-league by comparison: it misses out on the mature tooling of job queues and is more like a simple trigger. Relying on cron to start something is fine, but it gives you zero management over what it starts.
I am guessing NodeJS has a decent background job system by this time. When you look at Sidekiq for Ruby you realize what awesome is. Surely you can find a copycat in Node that comes close anyway.
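The bull package you mentioned is one such copycat. A rough sketch of the polling job, assuming a local Redis and hypothetical checkBulkOperationStatus / downloadAndStoreOrders helpers:

```js
// Hedged sketch using the bull package (a Redis-backed job queue);
// checkBulkOperationStatus and downloadAndStoreOrders are hypothetical helpers.
const Queue = require('bull');

const pollQueue = new Queue('poll-bulk-operation', 'redis://127.0.0.1:6379');

pollQueue.process(async (job) => {
  const { bulkOperationId } = job.data;
  const result = await checkBulkOperationStatus(bulkOperationId); // ask Shopify for status
  if (result.status !== 'COMPLETED') {
    // Not done yet: schedule another check in a minute. Because the job lives
    // in Redis, a server restart does not lose the polling loop.
    await pollQueue.add({ bulkOperationId }, { delay: 60 * 1000 });
    return;
  }
  await downloadAndStoreOrders(result.url); // download the result file and store it
});

// Started once, when the user connects their Shopify store.
async function startOrderImport(bulkOperationId) {
  await pollQueue.add({ bulkOperationId }, {
    attempts: 5,
    backoff: { type: 'exponential', delay: 5000 },
  });
}
```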
I am serving my users with data fetched from an external API. Now, I don't know when this API will have new data; what would be the best approach to check for it using Node, for example?
I have tried setInterval and node-schedule to do that and got it working, but isn't it expensive for the CPU? For example, over a day I would hit this endpoint every minute to check for new data, but it might only have new data every five minutes or more.
The thing is, this external API isn't run by me. Would the only way to check for updates be to hit it every minute? Is there any module that can do that in Node, or any approach that fits better?
Use case 1: call a weather API for every city in the country and only save data to my DB when it is going to rain in a given city.
Use case 2: send a notification to the user as soon as a given Philips Hue lamp is turned on, without having to hit the endpoint to check whether it is on or not.
I appreciate the time to discuss this.
If this external API has no means of notifying you when there's new data, then the only thing you can do is to "poll" it to check for new data.
You will have to decide what an "efficient design" for polling is in your specific application and given the type of data and the needs of the client (what is an acceptable latency for new data).
You also need to be sure that your service is not violating any terms of service with your polling scheme or running afoul of rate limiting that may deny you access to the server if you use it "too much".
Would the only way to check for updates be to hit it every minute?
Unless the API offers some notification feature, there is no scheme other than polling at some interval. Polling every minute is fairly quick. Do your clients really need information that is less than a minute old? Or would it really make no difference if the information was as much as 5 minutes old?
For example, in your example of weather, a client wouldn't really need temperature updates more often than probably every 10-15 minutes.
Is there any module that can do that in Node or any approach that fits better?
No. Not really. You'll probably just use some sort of timer (either a repeated setTimeout() or setInterval()) in a node.js app to repeatedly carry out your API operations.
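For example, a minimal self-rescheduling poll with setTimeout (fetchLatestData is a hypothetical stand-in for your API call):

```js
// Repeated setTimeout: the next poll is only scheduled after the current one
// finishes, so slow responses never overlap.
const POLL_INTERVAL_MS = 5 * 60 * 1000; // every 5 minutes, as an example

async function poll() {
  try {
    const data = await fetchLatestData(); // hypothetical call to the external API
    // ...compare with what you already have and act only on changes...
  } catch (err) {
    console.error('poll failed:', err);
  } finally {
    setTimeout(poll, POLL_INTERVAL_MS);
  }
}

poll();
```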
Use case: call a weather API for every city in the country and only save data to my DB when it is going to rain in a given city.
Trying to pre-save every possible piece of data from an external API is probably a losing proposition. You're essentially trying to "scrape" all the data from the external API. That is likely against the terms of service and will likely also run afoul of rate limits. And, it's just not very practical.
Instead, you will probably want to fetch data upon demand (when a client requests data for Phoenix, then, and only then, do you start collecting data for Phoenix) and then, once a demand for a certain type of data (temperatures in a particular city) is established, you might want to pre-cache that data more regularly so you can notify clients of changes. If, after a while, no clients are asking for data from Phoenix, you stop requesting updates for Phoenix until a client establishes demand again.
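A rough sketch of that demand-driven idea (the 15-minute TTL and the fetchWeather helper are assumptions, not a prescription):

```js
// Cache weather per city and only refresh cities that clients actually ask for.
const TTL_MS = 15 * 60 * 1000;            // assumed freshness window
const cache = new Map();                  // city -> { data, fetchedAt }

async function getWeather(city) {
  const entry = cache.get(city);
  if (entry && Date.now() - entry.fetchedAt < TTL_MS) {
    return entry.data;                    // fresh enough, no API hit
  }
  const data = await fetchWeather(city);  // hypothetical external API call
  cache.set(city, { data, fetchedAt: Date.now() });
  return data;
}
```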
I have tried setInterval and node-schedule to do that and got it working, but isn't it expensive for the CPU? For example, over a day I would hit this endpoint every minute to check for new data, but it might only have new data every five minutes or more.
Making a remote network request is not a CPU intensive operation, even if you're doing it every minute. node.js uses non-blocking networking so most of the time during a network request, node.js isn't doing anything and isn't using the CPU at all. The only time the CPU would be briefly used is when you first send the API request and then when you receive back the result from the API call and need to process it.
Whether you really need to "poll" every minute depends upon the data and the needs of the client. I'd ask yourself if your app will work just fine if you check for new data every 5 minutes.
The method I would use to update would be contained outside of the code, in a scheduled batch/PowerShell/bash file. In Windows you can schedule tasks based upon time of day or time since the last run, so what you could do is run a simple command that kills your application for five minutes, runs npm update, and then restarts your application before closing the shell.
That way you're staying out of your API and keeping code to a minimum, and if your code is inside that Node package in the update, it'll be there and ready once you make serious application changes or you need to take the server down for maintenance and updates to the low-level code.
This is a light-weight solution for you and it's a method I've used once or twice at my workplace. There are lots of options out there, and if this isn't what you're looking for I can keep looking out for you.
I have a NodeJS script that scrapes URLs every day.
The requests are throttled to be kind to the server. This results in my script running for a fairly long time (several hours).
I have been looking for a way to deploy it on GCP. Because it was previously run from cron, I naturally had a look at how to run a cron job on Google Cloud. However, according to the docs, the script has to be exposed as an API, and HTTP calls to that API can only run for up to 60 minutes, which doesn't fit my needs.
I had a look at this S.O. question, which recommends using a Cloud Function. However, I am unsure this approach would be suitable in my case, as my script requires a lot more processing than the simple server-monitoring job described there.
Does anyone have experience doing this on GCP?
N.B.: To clarify, I want to avoid deploying it on a VPS.
Edit:
I reached out to Google; here is their reply:
Thank you for your patience. Currently, it is not possible to run a cron script for 6 to 7 hours in a row, since the current limitation for cron in App Engine is 60 minutes per HTTP request.
If it is possible for your use case, you can spread the 7 hours over recurring tasks, for example, every 10 minutes or 1 hour. A cron job request is subject to the same limits as those for push task queues. Free applications can have up to 20 scheduled tasks. You may refer to the documentation for the cron schedule format.
Also, it is possible to still use Postgres and Redis with this. However, kindly take note that Postgres is still in beta.
As I can't spread the task, I had to keep on managing a dokku VPS for this.
I would suggest combining two services, GAE Cron Jobs and Cloud Tasks.
Use GAE Cron Jobs to publish a list of sites and ranges to scrape to Cloud Tasks. This initialization process doesn't need to be 'kind' to the server yet; it can simply publish all chunks of work to the Cloud Tasks queue and consider itself finished when that's done.
Follow that up with a task queue, and use the queue's rate-limiting configuration as the way to cap the overall request rate to the endpoint you're scraping. If you need less than 1 QPS, add a sleep statement in your code directly. If you're really queueing millions or billions of jobs, follow their advice of having one queue feed another.
Large-scale/batch task enqueues
When a large number of tasks, for example millions or billions, need to be added, a double-injection pattern can be useful. Instead of creating tasks from a single job, use an injector queue. Each task added to the injector queue fans out and adds 100 tasks to the desired queue or queue group. The injector queue can be sped up over time, for example start at 5 TPS, then increase by 50% every 5 minutes.
That should be pretty hands-off, and only require you to think through how the cron job picks the next sites and pages, and how small the chunks of work should be.
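A rough sketch of the fan-out step with the @google-cloud/tasks client (the project, location, queue name, and worker URL are all assumptions):

```js
// Hedged sketch: a cron handler fans chunks of URLs out to a Cloud Tasks queue,
// and a separate worker endpoint scrapes one chunk per task.
const { CloudTasksClient } = require('@google-cloud/tasks');

const client = new CloudTasksClient();
const parent = client.queuePath('my-project', 'us-central1', 'scrape-queue'); // assumed names

async function enqueueScrapeChunks(urlChunks) {
  for (const chunk of urlChunks) {
    await client.createTask({
      parent,
      task: {
        httpRequest: {
          httpMethod: 'POST',
          url: 'https://my-worker.example.com/scrape', // assumed worker endpoint
          headers: { 'Content-Type': 'application/json' },
          body: Buffer.from(JSON.stringify({ urls: chunk })).toString('base64'),
        },
      },
    });
  }
}
```

The queue's own rate and concurrency limits then control how fast the worker endpoint, and therefore the scraped site, gets hit.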
I'm also working on this kind of task. I need to crawl a website and have the same problem.
Instead of running the main crawler task on the VM, I moved the task to Google Cloud Functions. The task consists of getting the target URL, scraping the page, saving the result to Datastore, and then returning the result to the caller.
This is how it works: I have a long-running application that can be called the master. The master knows which URLs we are going to access, but instead of accessing the target website itself, it sends each URL to a crawler function in GCF. The crawling is done there and the result is sent back to the master. This way, the master only requests and receives a small amount of data and never touches the target website; GCF does the rest. You can offload your master and crawl the website in parallel via GCF, or use another method to trigger GCF instead of an HTTP request.
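Roughly, the master side can be as small as this (the function URL and payload shape are made up for illustration):

```js
// Sketch of the "master" side: it never touches the target site itself, it only
// calls the crawler Cloud Function. URL and payload shape are assumptions.
const fetch = require('node-fetch');

async function crawlViaGcf(targetUrl) {
  const res = await fetch('https://us-central1-my-project.cloudfunctions.net/crawler', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url: targetUrl }),
  });
  return res.json(); // small summary result; the full page never reaches the master
}

// Fan out a batch of URLs in parallel; GCF scales the actual scraping.
async function crawlAll(urls) {
  return Promise.all(urls.map(crawlViaGcf));
}
```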
I'm developing an API for sending SMS via an HTTP request. I use Node.js and Mongoose. I have a problem similar to the ones found in multi-threaded applications.
When a user sends an SMS, I check in the database (using Mongoose) how many SMS they have already sent; if that number doesn't exceed a limit, the SMS is sent and the count is incremented in the database (the schema stores the number of SMS sent in the current hour, day, week, and month). The issue is that I use callbacks for reading the value, incrementing it, and many other operations in my code.
So the problem (I think) is that when a user sends requests very quickly, different callbacks read the same SMS count, authorize the user to send the SMS, then increment and save the same value, so the stored count ends up wrong.
In a multi-threaded application accessing a shared variable, the solution would be to prevent other threads from reading the variable before the current thread has finished all of its work.
With Node.js's event system and data access in MongoDB, I just don't know how to solve my problem.
Thank you in advance for the answers.
PS: I don't know the solution, but it would be good if it also works with clusters, which allow Node.js to use multiple cores.
I think you should try a cache-based approach.
I'm now facing the same situation as you.
I will try to use a cache to store the record_id that is being processed.
When a new request comes in, it needs to check the cache first. If the record_id is in the cache, that means the record is being used by another request, so this one has to wait or do something else until it finishes. When the processing finishes, the record_id is removed from the cache in the callback function.
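A bare-bones sketch of that idea with an in-memory Set; note that it only works within a single process, and getSentCount / sendSms / incrementCount are hypothetical helpers:

```js
// In-process "lock": remember which user ids are currently being processed.
const inProgress = new Set();
const LIMIT = 100; // example hourly limit, an assumption

async function handleSmsRequest(userId, message) {
  if (inProgress.has(userId)) {
    throw new Error('A request for this user is already being processed; retry later');
  }
  inProgress.add(userId);
  try {
    const count = await getSentCount(userId);   // read the current count from MongoDB
    if (count >= LIMIT) throw new Error('SMS limit reached');
    await sendSms(userId, message);             // hypothetical SMS provider call
    await incrementCount(userId);               // write the new count back
  } finally {
    inProgress.delete(userId);                  // always release the "lock"
  }
}
```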
Thanks Cristy, I have solved the main part of my problem using async queue.
My application works well when I run it the default way with Node.js.
But there is another problem. I intend to run my code on a server that has 4 cores, so I want to use the Node cluster module. But when I used it, because it runs the code as 4 different processes (I used a server with 4 cores), each process uses its own queue, and the error I mentioned earlier still occurred: they read and write to the database without waiting for the others to finish the verification + update.
So I would like to know what I should do to have an optimal and fast application.
Should I stop using the cluster module and not take advantage of the multi-core server (I don't think that is the best answer)?
Should I store the queue in MongoDB (or maybe not persist it, but keep it in memory to make it faster)?
Is there a way to share the queue in the code when I use cluster?
What is my best choice?
So I'm starting to use node.js for a project I'm doing.
When a client makes a request, my node.js server fetches JSON from another server and then reformats it into new JSON that gets served to the client. However, the JSON that the node server gets from the other server can potentially be pretty big, so that "massaging" of the data is pretty CPU-intensive.
I've been reading for the past few hours how node.js isn't great for CPU tasks, and the main response that I've seen is to spawn a child process (basically a .js file running through a different instance of node) that deals with any CPU-intensive tasks that might block the main event loop.
So let's say I have 20,000 concurrent users; that would mean spawning 20,000 OS-level processes to run these child processes.
Does this sound like a good idea? (A different web server would just create 20,000 threads on the same process.)
I'm not sure if I should be running a child-process. But I do need to make a non-blocking cpu intensive task. Any ideas of what I should do?
The people who say that don't know how to architect solutions.
NodeJS is exactly what it says: it is a node, and should be treated as such.
In your example, your node instance connects to an external API and grabs JSON to process and send back.
i.e.
1. Get // server.com/getJSON
2. Process the json
3. Post // server.com/postJSON
So what do you do?
Ask yourself: is time an issue? If so, then node isn't the solution.
However, if you are more interested in overall throughput, so that instead of 1 request done in 4 seconds
you are interested in 200 requests finishing in 10 seconds (with each individual one taking about the full 10 seconds), then node fits.
Diagnose how long your JSON should take to massage. If it is less than 1 second,
just run 4 node instances instead of 1.
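Running several instances can be as simple as the built-in cluster module (a sketch; your Express app would replace the dummy server):

```js
// Fork one worker per CPU core; each worker runs its own copy of the server.
const cluster = require('cluster');
const os = require('os');
const http = require('http');

if (cluster.isMaster) {
  for (let i = 0; i < os.cpus().length; i++) {
    cluster.fork();
  }
} else {
  // In a real app this would be your Express server.
  http.createServer((req, res) => res.end('ok')).listen(3000);
}
```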
However, if it's more complex than that, break the JSON into segments and use asynchronous callbacks to process each segment:
process.nextTick(function () { doProcess(segment1); process.nextTick(function () { doProcess(segment2); }); });
each doProcess calls the next doProcess
Node js will trade time between requests.
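A slightly fuller sketch of that chunking idea, using setImmediate between segments so pending I/O still gets a turn (the segment size and massageItem are assumptions):

```js
// Process a large array in small segments, yielding back to the event loop
// between segments so other requests keep getting served.
function processInSegments(items, segmentSize, done) {
  let index = 0;
  function next() {
    const segment = items.slice(index, index + segmentSize);
    segment.forEach(massageItem); // hypothetical per-item transformation
    index += segmentSize;
    if (index < items.length) {
      setImmediate(next); // let pending I/O run before the next segment
    } else {
      done();
    }
  }
  next();
}
```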
Now take that solution and scale it to 4 node instances per server, and 2-5 servers,
and suddenly you have an extremely scalable and cost-effective solution.