Run scheduled tasks on a huge Mongo collection - node.js

I know this might be a broad question, but I've been trying to find the right way to do this and I don't seem to be getting anywhere.
Basically, I have a bunch of objects saved in Mongo that contain events, like the one below:
{
"date" : "2020-09-09",
"day" : 1599573600000, // epoch time
"from" : 1599595200000, // epoch time
"to" : 1599695200000 // epoch time
}
I need to fire some events, like sending a reminder SMS, before the time specified in the from field.
I know I can write a cron job that regularly checks the entire Mongo collection and finds all the events that are due, and the rest is obvious.
However, I feel there must be a better way, because this could become extremely slow once our database grows to millions of events.
So my questions are:
1- What are some other options besides cron jobs?
2- Is there any difference between running cron jobs in Node.js and running cron jobs in Google App Engine (a serverless instance)? Which one is better?
3- Is there any service out there that anyone has used for this?
Any direction would be appreciated.

I'm assuming you're trying to stay in the GCP ecosystem.
For scalability you could use cron to kick off a Google Dataflow pipeline. With this pipeline you can define a pipeline step to be executed for each record that matches the given query. Dataflow will ramp up the number of workers as it goes to handle the scale.
If you're not at that level of scale, Dataflow can be a bit heavy and may feel like overkill for your current use case. In that case you can use a combination of cron and Google Cloud Tasks, where you'd enqueue/launch a task per record. For large numbers of records, you could launch a task per batch of records (i.e., an injector pattern):
https://cloud.google.com/tasks/docs/manage-cloud-task-scaling#large-scalebatch_task_enqueues
Another option is to just use Google Cloud Tasks with the schedule_time field. Here you'd enqueue the task when you originally write the record into the DB, instead of periodically querying to see which ones need to run:
https://cloud.google.com/tasks/docs/creating-http-target-tasks
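To make that concrete, here's a minimal sketch using the @google-cloud/tasks Node.js client; the project, region, queue name, and handler URL are all invented for the example:

const { CloudTasksClient } = require('@google-cloud/tasks');
const client = new CloudTasksClient();

async function enqueueReminder(event) {
  // hypothetical project/region/queue
  const parent = client.queuePath('my-project', 'us-central1', 'reminders');
  const task = {
    httpRequest: {
      httpMethod: 'POST',
      url: 'https://example.com/send-sms', // your reminder handler
      headers: { 'Content-Type': 'application/json' },
      body: Buffer.from(JSON.stringify({ eventId: event._id })).toString('base64'),
    },
    // fire 15 minutes before the "from" time; scheduleTime wants epoch seconds
    scheduleTime: { seconds: Math.floor(event.from / 1000) - 15 * 60 },
  };
  await client.createTask({ parent, task });
}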
2- Is there any difference between running cron jobs in Node.js and running cron jobs in Google App Engine (a serverless instance)? Which one is better?
I wasn't sure what you meant by your second question, because you can run Node.js in App Engine. In my experience things do work better when you keep everything within GCP.

Related

Looking for a time-based persistent scheduler - node.js

I have been looking for a time-based persistent scheduler. I looked into some libraries (Agenda, node-cron, node-schedule), but I couldn't find anything that satisfies my criteria.
My application sends out reminders to our customers based on their event timings. I am hesitant to run a regular cron job because in this case it would have to run every 15 minutes or so, and each run makes a database call. I am trying not to use resources unnecessarily.
In addition, I am already running a lot of cron jobs. In my case, when a job is completed, I want it to be cancelled/finished, not live on in memory until the server restarts.
I tried the libraries above (Agenda, node-cron, node-schedule) with exact timestamps, but the cron lives on forever even after the job is completed, and if I restart the server, all the scheduled jobs are gone. So persistence is also an issue I am facing.
My server uses Node.js. If there are any other languages/tools that can make this work, I am all ears.
Looking forward to your help.
I tried following this solution, but it is for one predefined event. In my case, the number of reminders to be sent out is dynamic, and jobs have to be scheduled on the fly.
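For reference, a minimal sketch of the behaviour being asked for, using Agenda, which persists its jobs in MongoDB so they survive restarts; the connection string and payload here are made up:

const Agenda = require('agenda');
// jobs are stored in this MongoDB database, so they survive restarts
const agenda = new Agenda({ db: { address: 'mongodb://127.0.0.1/scheduler' } });

agenda.define('send reminder', async (job) => {
  const { phone, text } = job.attrs.data;
  // ... send the SMS/notification here ...
  await job.remove(); // delete the job document so nothing lingers after completion
});

(async () => {
  await agenda.start();
  // a one-off job scheduled on the fly; it runs once at the given time
  await agenda.schedule(new Date('2020-09-08T20:00:00Z'), 'send reminder', {
    phone: '+15550000000',
    text: 'Your event starts in an hour',
  });
})();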

Schedule jobs in database with nodejs

I have a web application where users can schedule different jobs.
I'm not sure how to proceed with this.
All the Node.js schedulers out there basically read the schedule from within the code. I can of course make this cron-like schedule be read from a database instead, but I'm not sure that's the most effective way.
If I back the solution with a database, I would need to query it, let's say every second, to see if there are any scheduled jobs that need to be handled. I can't read them once a day, because new jobs might be added on a regular basis.
Keeping them all in memory doesn't seem very efficient either.
Am I looking for a different kind of technology to handle this than a scheduler plus a database?
We are talking around 10,000 jobs for the time being (as a maximum). They are mostly about sending emails and/or notifications within the application itself.
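At that scale, polling an indexed jobs table every few seconds is cheap. A rough sketch with the MongoDB driver (collection and field names are invented; the same shape works against a SQL table):

const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://127.0.0.1:27017');
const jobs = client.db('app').collection('jobs');
// an index makes each poll a cheap range scan:
//   db.jobs.createIndex({ status: 1, runAt: 1 })

async function pollOnce() {
  const now = new Date();
  const due = await jobs.find({ status: 'pending', runAt: { $lte: now } }).toArray();
  for (const job of due) {
    // claim the job first so a second poller can't pick it up too
    const res = await jobs.updateOne(
      { _id: job._id, status: 'pending' },
      { $set: { status: 'running', startedAt: now } }
    );
    if (res.modifiedCount === 0) continue; // another poller claimed it
    // ... send the email / notification ...
    await jobs.updateOne({ _id: job._id }, { $set: { status: 'done' } });
  }
}

setInterval(() => pollOnce().catch(console.error), 5000); // poll every 5 seconds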

Tasks that need to be performed on a certain date in Azure

I am developing an application using Azure Cloud Services and Web API. I would like to allow users who create a consultation session to change the price of that session; however, I want to give all users 30 days to leave the session before the new price takes effect for everyone currently signed up.
My first thought was to use queue storage and set the visibility timeout to the 30-day limit, but this seems like it could grow the queue really fast over time, especially if messages should not run for 30 days; not to mention the ordering issues. I have looked at the task scheduler as well, but session price changes are not recurring; they happen at random times. Is the queue idea a good approach, or is there a better and more efficient way to accomplish this?
What you are trying to do should be done with a relational database. You can use timestamps to record when the price for a session changes. I wouldn't use a queue at all for this. A queue is more for passing messages in a distributed system. Your problem is just about tracking which prices changed on which sessions, and when. That data should be modeled in a database.
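As a rough illustration of that model (the names and numbers are invented): keep one record per price change, with the effective date set 30 days out, and the current price is simply the latest change that has already taken effect.

// each session keeps a list of price changes; a change takes effect 30 days after it is made
const session = {
  id: 42,
  priceChanges: [
    { price: 100, effectiveAt: new Date('2014-01-01T00:00:00Z') }, // original price
    { price: 120, effectiveAt: new Date('2014-03-02T00:00:00Z') }, // set on 2014-01-31, +30 days
  ],
};

// the current price is the latest change whose effective date has passed
function currentPrice(session, now = new Date()) {
  const applied = session.priceChanges
    .filter((c) => c.effectiveAt <= now)
    .sort((a, b) => a.effectiveAt - b.effectiveAt);
  return applied.length ? applied[applied.length - 1].price : undefined;
}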
I think this scenario is a better fit for Azure Scheduler. Programmatically create a job with a one-time recurrence, with its date set 30 days later so it runs once. When the scheduler triggers the job, have its action call back into one of your APIs/services to do the price update and any other required changes, and remove the job from the scheduler as part of that action to keep the jobs list clean. In any case, the premium plan of an Azure Scheduler job collection gives you an unlimited number of jobs.
Hope this is exactly what you were looking for...
I would consider using Azure WebJobs. A WebJob basically gives you the ability to run a .NET console application within the context of an Azure Web App. It can be run on demand, continuously, or on a recurring schedule. If your processing requirements are low, WebJobs can also run in the same process as your Web App, which saves you $$$ as they are free that way.
You could schedule the WebJob to run once or twice per day, examine the situation, and react as appropriate. Since it's really just a .NET console application, you have full flexibility.
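For the scheduling part, a triggered WebJob reads its schedule from a settings.job file deployed next to the executable; the value is a six-field cron expression (seconds, minutes, hours, day, month, day-of-week). A sketch that would run twice a day, with times picked arbitrarily:

{ "schedule": "0 0 8,20 * * *" }

This fires at 08:00 and 20:00 UTC; note that scheduled WebJobs need the Always On setting so the site isn't unloaded between runs.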

Running HDInsight jobs howto

A few questions regarding the HDInsight jobs approach.
1) How do I schedule an HDInsight job? Is there any ready-made solution for it? For example, if my system constantly collects a large number of new input files on which we need to run map/reduce jobs, what is the recommended way to implement ongoing processing?
2) From the price perspective, it is recommended to remove the HDInsight cluster whenever no job is running. As I understand it, there is no way to automate this process if we decide to run a job daily? Any recommendations here?
3) Is there a way to ensure that the same files are not processed more than once? How do you solve this issue?
4) I might be mistaken, but it looks like every HDInsight job requires a new output storage folder to store the reducer results in. What is the best practice for merging those results so that reporting always works on the whole data set?
OK, there are a lot of questions in there! Here are, I hope, a few quick answers.
There isn't really a way of scheduling job submission in HDInsight, though of course you can schedule a program to run the job submissions for you. Depending on your workflow, it may be worth taking a look at Oozie, which can be a little awkward to get going on HDInsight, but should help.
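To give a flavour of what that looks like, here is a minimal Oozie coordinator that submits a workflow once a day; the name, dates, and ${workflowAppUri} property are placeholders:

<coordinator-app name="daily-job" frequency="${coord:days(1)}"
                 start="2013-01-01T08:00Z" end="2014-01-01T08:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <!-- points at the workflow.xml that runs the MapReduce job -->
      <app-path>${workflowAppUri}</app-path>
    </workflow>
  </action>
</coordinator-app>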
On the price front, I would recommend that if you're not using the cluster, you should destroy it and bring it back when you need it (those compute hours can really add up!). Note that this will lose anything you have in HDFS, which should be mainly intermediate results; any output or input data held in asv storage will persist in an Azure Storage account. You can certainly automate this using the CLI tools, or the REST interface used by the CLI tools (see my answer on Hadoop on Azure Create New Cluster; the first one is out of date).
I would do this by making sure I only submit the job once for each file, relying on Hadoop to handle the retry and reliability side, which removes the need to manage retries in your application.
Once you have the outputs from your initial processes, if you want to reduce them to a single output for reporting, the best bet is probably a secondary MapReduce job with those outputs as its inputs.
If you don't care about the individual intermediate outputs, you can just chain these directly into the one MapReduce job (which can contain as many map and reduce steps as you like) through job chaining; see Chaining multiple MapReduce jobs in Hadoop for a Java-based example. Sadly, the .NET API does not currently support this form of job chaining.
However, you may be able to just use the ReducerCombinerBase class if your case allows for a Reducer->Combiner approach.

One-time jobs on Azure workers

What is the best way to do a one-time job on Azure?
Say we want to extend a table in the associated database with a double column. All the new entries will have this value computed by the worker(s) at insertion, but somebody has to take care of the entries that are already in the table. I thought of two alternatives:
a method called by the worker only if a database entry (say, "JobRun") is set to true, and the method would flip the entry to false.
a separate app that does the job, and which is downloaded and run manually using remote desktop (I cannot connect the local app to the Azure SQL server).
The first alternative is messy (how should I deal with the code at the next deployment? delete it? comment it out? leave it there? and what if I have another job in the future? create a new database entry, "Job2Run"?). The second one looks like a cheap hack. I am sure there is a better way that I have not thought of.
If you want to run a job once you'll need to take into account the following:
Concurrency: while the job is running, make sure no other worker picks it up and runs it at the same time (you can use blob leases for this; more info here, and a sketch follows this list).
Completion: once the job is done, you'll need to record (in Table Storage, SQL Azure, ...) that it completed successfully. The next time a worker tries to pick up the job, it will check that record, see that the job already completed, and skip it.
Failure: your worker might crash during the job, in which case another worker should be able to pick it up without any issue.
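To make the lease idea concrete, here is a minimal sketch with the current @azure/storage-blob Node.js SDK (the SDKs of the worker-role era differ, but the pattern is the same; the container and blob names are invented):

const { BlobServiceClient } = require('@azure/storage-blob');

async function runOnce() {
  const service = BlobServiceClient.fromConnectionString(process.env.AZURE_STORAGE_CONNECTION_STRING);
  const container = service.getContainerClient('jobs');
  const marker = container.getBlockBlobClient('backfill-double-column'); // one marker blob per one-time job

  // create the marker if it doesn't exist yet; its content records the job state
  try {
    await marker.upload('pending', 'pending'.length, { conditions: { ifNoneMatch: '*' } });
  } catch (e) { /* already exists */ }

  // acquire a lease so only one worker can run the job at a time
  const lease = marker.getBlobLeaseClient();
  let leaseId;
  try {
    ({ leaseId } = await lease.acquireLease(60)); // 60-second lease; renew it for longer jobs
  } catch (e) {
    return; // another worker holds the lease
  }

  try {
    const state = (await marker.downloadToBuffer()).toString();
    if (state === 'done') return; // a previous worker already finished the job
    // ... do the one-time work here (e.g. backfill the new column) ...
    await marker.upload('done', 'done'.length, { conditions: { leaseId } }); // record completion
  } finally {
    await lease.releaseLease().catch(() => {}); // best effort; the lease expires on its own anyway
  }
}

runOnce().catch(console.error);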
In your specific use case I would also consider using a tool like DbUp to manage updates to your schema and existing data with SQL scripts. Tools like this keep track of which scripts have been executed by recording them in a table in the database.
