I have a server-side Node.js Express app that responds to requests from front-end clients. I need to implement a batch job that will run every hour. If I implement the batch job in the same service, does that mean the service is 'occupied' until the cron job completes and cannot serve any requests in the meantime?
Should I create a separate service instead that will run the batch job?
If your batch job does not keep the CPU at 100%, the server will still serve requests. Every time the job does async I/O or waits on a timer, the event loop is free to handle incoming Express requests.
I don't know what your job and server are doing, so look at it from a separation-of-concerns angle. If the job scheduler and the server belong together, implement them together; otherwise I recommend making two services out of them.
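A minimal sketch of that idea, assuming the job is CPU-bound work over an in-memory list: the job processes a small chunk, then hands control back to the event loop with setImmediate before continuing. The names processInChunks, handleItem and chunkSize are hypothetical and not from the question.

```javascript
// Hypothetical batch helper: processes `items` in small chunks and yields to
// the event loop between chunks so pending HTTP requests can be handled.
function processInChunks(items, handleItem, chunkSize = 100) {
  return new Promise((resolve, reject) => {
    let index = 0;

    function runChunk() {
      try {
        const end = Math.min(index + chunkSize, items.length);
        for (; index < end; index++) {
          handleItem(items[index]);
        }
      } catch (err) {
        reject(err);
        return;
      }
      if (index < items.length) {
        // Let queued I/O callbacks (e.g. Express requests) run first.
        setImmediate(runChunk);
      } else {
        resolve(index); // number of items processed
      }
    }

    runChunk();
  });
}
```

An hourly trigger could then be something like setInterval(() => processInChunks(rows, handleRow), 60 * 60 * 1000), where rows and handleRow are placeholders. A smaller chunkSize keeps request latency lower at the cost of a longer total job time.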
I have worked with a lot of batch jobs written in Java, processing and moving millions of rows of data as an ETL tool. I also have experience with plain Node.js driven by Linux cron schedulers. It ran much faster than the earlier Java jobs, even on low-end servers, whereas the Java jobs ran on heavily provisioned servers and still hit memory issues.
You can implement cron jobs with Node.js: write shell (.sh) files that use the node command to run your JS files directly, and have those scripts open your DB connections or CSV files to load or transfer data.
Related
I am working on a NestJS app that makes heavy use of task scheduling using the @nestjs/schedule package, which integrates with the node-cron npm library.
At the moment, the app has been in development for over 6 months and has over 30 cron tasks running in the background simultaneously. Although most of them have distinct intervals, some crons share the same interval (e.g. they run every 30 seconds).
All cron tasks more or less follow the same behavior;
send request to external APIs to get data.
query mongo db to run some checks and update records accordingly.
some crons emit events to the client when certain condition is met by other cron tasks.
my question is:
How can I measure the performance of the Node process while it runs all these background tasks on my local development PC, and what effect might they have on requests that come from the client?
Another point: is it possible to detect a memory leak before it happens?
Basically, I have concerns about the app's performance and I want to try to prevent the problem before it happens.
Thanks.
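One possible starting point for such a measurement is Node's built-in perf_hooks module: monitorEventLoopDelay reports how long the event loop is held up (which is what client requests would feel), and process.memoryUsage() reports heap and RSS size. The sketch below is only an illustration; sampleProcessHealth and the field names in its result are hypothetical.

```javascript
const { monitorEventLoopDelay } = require('node:perf_hooks');

// Samples event-loop delay for `durationMs`, then resolves with a summary of
// the delay histogram and the current memory usage.
function sampleProcessHealth(durationMs = 1000) {
  const histogram = monitorEventLoopDelay({ resolution: 10 });
  histogram.enable();

  return new Promise((resolve) => {
    setTimeout(() => {
      histogram.disable();
      const { heapUsed, rss } = process.memoryUsage();
      resolve({
        // The histogram reports nanoseconds; convert to milliseconds.
        meanDelayMs: histogram.mean / 1e6,
        p99DelayMs: histogram.percentile(99) / 1e6,
        maxDelayMs: histogram.max / 1e6,
        heapUsedMb: heapUsed / 1024 / 1024,
        rssMb: rss / 1024 / 1024,
      });
    }, durationMs);
  });
}
```

Calling this periodically while the cron tasks run and logging the result gives a rough trend. A heapUsedMb value that keeps climbing across many samples, even after garbage collection has had time to run, is a common early hint of a leak, though it is not proof of one.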
I have to implement scheduled push notifications in backend (Node.js) which are triggered only at certain time. So I need to query DB (PostgreSQL) in interval 1-2 minutes and find only those notifications which need to be triggered.
What is the better solution?
Use internal setTimeout query function
or
External CRON script which will trigger querying function in Node.js?
Thank you
If the second option is just a cron that makes an HTTP request to your service, it's pretty much equivalent. If, instead, the solution is packaged as a script and the cron drives that script directly, it has a couple of tradeoffs, mainly around operations:
Use internal setTimeout query function or
This means you have to launch a long running service, and to keep it running. Things like memory leaks may become an issue.
External CRON script which will trigger querying function in Node.js?
This is the strategy that Google Cloud (GCP) uses for its cron offering. If the function just pings a web URL, the solutions are pretty much equivalent.
IMO, the biggest issue with both of these is being careful about coupling a background (async) workload with an online workload. If your service is serving real-time, live HTTP requests but is also running these background workloads, the background work takes resources away from serving synchronous HTTP requests. If they are two fundamentally different workloads, then it also makes sense to separate them for scaling purposes.
I've been in a situation where monitoring actually informed this decision. The company used Prometheus and didn't have the Pushgateway installed, so a cron-based solution had zero metric visibility, whereas adding metrics and alerting to the service version was trivial.
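For the in-process option, a minimal sketch of a polling loop is shown below. It schedules the next run only after the previous one has finished, so a slow DB query never overlaps with the next poll. startPolling is a hypothetical helper, and the task passed to it (for example a function that queries PostgreSQL for due notifications) is assumed, not taken from the question.

```javascript
// Runs `task` repeatedly, waiting `intervalMs` after each run finishes, so
// slow runs never overlap. Returns a function that stops the loop.
function startPolling(task, intervalMs) {
  let timer = null;
  let stopped = false;

  async function tick() {
    try {
      await task();
    } catch (err) {
      // Log and keep polling; one failed run should not kill the loop.
      console.error('polling task failed:', err);
    }
    if (!stopped) {
      timer = setTimeout(tick, intervalMs);
    }
  }

  timer = setTimeout(tick, intervalMs);

  return function stop() {
    stopped = true;
    clearTimeout(timer);
  };
}
```

Usage might look like const stop = startPolling(sendDueNotifications, 60 * 1000), where sendDueNotifications is a placeholder for the query-and-send function. The same function could equally be exported and called from a script that an external cron runs, which keeps both options open.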
I am running a Node.js server on Heroku, using the throng package to spin up workers to process API requests. I am also using more than one dyno on Heroku. For cron and job processing, I use Bull queues to distribute the load across servers, but I came across something I am not sure how to do in this distributed environment.
I want to have only one server execute code immediately on startup. In this case, I want to open up a change stream listener for mongodb. But I only want to do this on one worker on one server, not every server.
I am not sure how to do this running in heroku, any suggestions?
My node application currently has two main modules:
a scraper module
an express server
The former is a very server-intensive task which runs indefinitely in a loop. It scrapes information from more than 100 URLs, crunches the data and puts it into a MongoDB database (using Mongoose). This process runs over and over and over. :P
The latter part, my express server, responds to http/socket get requests and returns the crunched data which was written to the db by the scraper to the requesting client.
I'd like to optimize the performance of my server so that the Express requests and responses get prioritized over the server-intensive task(s). A client should be able to get the requested data ASAP, without having the scraper eat up all of my server resources.
I thought about putting the server-intensive task or the Express server into its own thread, but then I stumbled upon cluster and child processes, and now I'm totally confused about which approach would be the right one for my situation.
One benefit I have is that there is a clear separation between the writing part of my application and the reading part. The scraper writes to the DB and Express reads from the DB (no POST/PUT/DELETE/... calls are exposed). So I guess I won't run into threading problems with different threads trying to write to the same DB.
Any good suggestions? Thanks in advance!
Resources like CPU and memory required by processes are managed by the operating system. You should not waste your time writing that logic within your source code.
I think you should look at the problem from outside your source code files. Once they are running, they are processes, and processes are managed, as I said, by the OS.
First, I would split the application into two separate commands.
One being the scraper module (eg npm run scraper, that runs something like node scraper.js).
The other one being your express server (eg npm start, that runs something like node server.js).
This approach lets you configure resource limits within your OS or your cluster.
A quick way to do that is to use Docker.
Run your project in two Docker containers with CPU usage limits. This is fairly easy to do and does not require you to set up a new server, and at the same time it provides the isolation level you need to scale out to many servers in the future.
Steps to do this:
Learn a little about Docker and Docker Compose and install them on your server
Build a Docker image for your application (you can upload it to the free private repository that Docker Hub gives you)
Write a Docker Compose file for your two services using that image, with the CPU configuration you need (you can set both CPU and memory limits easily)
An alternative to that would be running the two commands (npm run scraper and npm start) with some tool like cpulimit, nice/niceness and ionice, or something else like namespaces and cgroups manually (but docker does that for you).
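As a related option, Node itself can lower the scheduling priority of the current process through os.setPriority, which has an effect similar to launching the command under nice. The sketch below assumes it is called once at the top of the scraper entry point; lowerOwnPriority is a hypothetical helper name.

```javascript
const os = require('node:os');

// Lowers the scheduling priority of the current process (similar to `nice`)
// and returns the priority the OS reports afterwards.
function lowerOwnPriority(priority = os.constants.priority.PRIORITY_LOW) {
  os.setPriority(priority); // pid is omitted, so this targets the current process
  return os.getPriority();
}
```

This only influences how the OS scheduler shares CPU time between processes; it does not cap memory or I/O, so cgroups or Docker limits remain the more complete tool.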
PS: I would also recommend rethinking your backend process. Maybe it's better to run it every 12 hours or so instead of all the time, and you could run it from cron instead of in a loop.
I have a simple nodejs webserver running, it:
Accepts requests
Spawns separate thread to perform background processing
Background thread returns results
App responds to client
Using Apache benchmark "ab -r -n 100 -c 10", performing 100 requests with 10 at a time.
Average response time of 5.6 seconds.
My logic for using Node.js is that it is typically quite resource efficient, especially when the bulk of the work is being done by another process. It seems like the most lightweight web server option for this scenario.
The Problem
With 10 concurrent requests my CPU was maxed out, which is no surprise since there is CPU intensive work going on the background.
Scaling horizontally is an easy thing to do, although I want to make the most of each server for obvious reasons.
So how can one, with Node.js, either raw or with some framework, keep this under control so as not to overload the CPU?
Potential Approach?
Could I accept the request, store it in a DB or some other persistent storage, and have a separate process that uses an async library to process x requests at a time?
In your potential approach, you're basically describing a queue. You can store incoming messages (jobs) there and have each process take one job at a time, only getting the next one when the previous job has finished. You could spawn a number of processes working in parallel, for example as many as there are cores in your system. Spawning more won't help performance, because multiple processes sharing a core will just run slower. Keeping one core free might be preferable, to keep the system responsive for administrative tasks.
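The pulling behaviour described above can be sketched in a single process as follows. This is an in-memory illustration only, not a persistent queue: runQueue is a hypothetical helper, and worker is assumed to be an async function that hands each job to a background thread or child process and resolves with its result.

```javascript
const os = require('node:os');

// Minimal in-memory job queue: at most `concurrency` jobs are in flight, and
// each worker loop pulls the next job only after finishing its current one.
// The default leaves one core free, as suggested in the text.
async function runQueue(jobs, worker, concurrency = Math.max(1, os.cpus().length - 1)) {
  const results = new Array(jobs.length);
  let next = 0;

  async function workerLoop() {
    while (next < jobs.length) {
      const i = next++; // claimed synchronously, so no two loops get the same job
      results[i] = await worker(jobs[i]);
    }
  }

  const loopCount = Math.min(concurrency, jobs.length);
  const loops = Array.from({ length: loopCount }, () => workerLoop());
  await Promise.all(loops);
  return results; // results are in the same order as `jobs`
}
```

A persistent queue adds what this sketch lacks: jobs survive a restart, and several machines can pull from the same list.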
Many different queues exist. A node-based one using redis for persistence that seems to be well supported is Kue (I have no personal experience using it). I found a tutorial for building an implementation with Kue here. Depending on the software your environment is running in though, another choice might make more sense.
Good luck and have fun!