Nested Cron Job in AWS Lambda Function - node.js

Requirement: send a reminder to n users, each at their own time, e.g. user 1 at 9:10 AM, user 2 at 10:50 PM, user 3 at 4:20 AM, and so on.
Solution in Nodejs
I have a Node.js cron job that runs at minute 55 of every hour (i.e. 9:55, 10:55, 11:55). It first deletes all the child cron jobs, then fetches data from the database and checks each user's reminder settings. Based on those settings, it creates a child cron job per user to send the reminders.
Solution in AWS Lambda
I created a Lambda function and scheduled it to run at minute 55 of every hour. Inside the Lambda I do the same thing as in Node.js, but because the Lambda's execution finishes, the child cron jobs never get executed.
I thought about Step Functions, but I am not sure how to achieve this since the schedule is dynamic. Someone also suggested triggering SNS, but that will not work in my scenario either.
Can someone please help me achieve this with AWS Lambda?

Why not have 1 cron job that runs every minute that sends all reminders that need to be sent based on the database information? I don't really see why you need nested cron jobs?
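A minimal sketch of that single every-minute job, assuming each reminder is stored as a hypothetical `{ userId, hour, minute }` record in UTC (the real schema is not shown in the question). The function only selects what is due; sending is left out.

```javascript
// Hypothetical reminder shape: { userId, hour, minute }, all in UTC.
// Returns the reminders whose hour and minute match the current minute.
function dueReminders(reminders, now) {
  const hour = now.getUTCHours();
  const minute = now.getUTCMinutes();
  return reminders.filter((r) => r.hour === hour && r.minute === minute);
}

// Sample data mirroring the times in the question.
const reminders = [
  { userId: 1, hour: 9, minute: 10 },
  { userId: 2, hour: 22, minute: 50 },
  { userId: 3, hour: 4, minute: 20 },
];
```

A Lambda scheduled every minute could call `dueReminders(reminders, new Date())` and send only the returned entries, which removes the need for child cron jobs.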
In any case, you could also use DynamoDB's time to live (TTL) attribute and a stream that triggers a Lambda function. Create a record to send a reminder at X every Y, with X being the expiration time. The Lambda function triggers when the record expires, and when it is done you create a new DDB record with X+Y as its expiration time. You might not even need a cron job Lambda in this case.
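A rough sketch of the two pieces of logic the stream-triggered Lambda would need. The record fields `expiresAt` and `intervalSeconds` are made-up names for X and Y. Note that DynamoDB deletes expired items in the background, so the deletion can lag the expiration time; this approach suits reminders that tolerate some delay.

```javascript
// TTL deletions appear in the stream as REMOVE events performed by the
// DynamoDB service principal rather than by a user.
function isTtlExpiry(streamRecord) {
  const identity = streamRecord.userIdentity || {};
  return (
    streamRecord.eventName === "REMOVE" &&
    identity.type === "Service" &&
    identity.principalId === "dynamodb.amazonaws.com"
  );
}

// Hypothetical record shape: expiresAt is the TTL attribute in epoch
// seconds (X), intervalSeconds is the repeat period (Y).
function nextReminderRecord(record) {
  return { ...record, expiresAt: record.expiresAt + record.intervalSeconds };
}
```

After sending the reminder, the handler would write `nextReminderRecord(oldRecord)` back to the table so the cycle repeats.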

I suppose you could use aws-sdk to create CloudWatch rules dynamically in your Node.js cron job.
For a cleaner separation, create two separate functions:
Main Cron Job (deletes the child cron jobs in CloudWatch, retrieves data from the database, creates CloudWatch rules that invoke the child cron job at specific times)
Child Cron Job (only sends reminders)
To read more: Nodejs Create CloudWatch
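As an illustration of what the main cron job would generate per user, here is a sketch that builds a daily schedule expression and a parameter object shaped like the input to the PutRule call. The rule name pattern `reminder-user-<id>` is an assumption, and the times are treated as UTC. The actual SDK call is deliberately left out.

```javascript
// Build a daily CloudWatch Events cron expression for a given UTC time.
// Field order: minutes hours day-of-month month day-of-week year.
// Day-of-month and day-of-week cannot both be "*", hence the "?".
function dailyCronExpression(hour, minute) {
  if (!Number.isInteger(hour) || hour < 0 || hour > 23) {
    throw new RangeError("hour must be an integer from 0 to 23");
  }
  if (!Number.isInteger(minute) || minute < 0 || minute > 59) {
    throw new RangeError("minute must be an integer from 0 to 59");
  }
  return `cron(${minute} ${hour} * * ? *)`;
}

// Hypothetical naming scheme; the object mirrors PutRule's parameters.
function reminderRuleParams(userId, hour, minute) {
  return {
    Name: `reminder-user-${userId}`,
    ScheduleExpression: dailyCronExpression(hour, minute),
    State: "ENABLED",
  };
}
```

For user 1 at 9:10 AM, `reminderRuleParams(1, 9, 10)` yields a rule with the schedule `cron(10 9 * * ? *)`.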

Related

Adding a scheduled job to a nodejs microservice - best practice

We are using a microservices approach in our backend.
We have a Node.js service that provides a REST endpoint which grabs some data from MongoDB and applies some business logic to it.
We need to add a scheduled job that runs every 15 minutes to sync the MongoDB data with a third-party data source.
The question is: is adding a scheduled job to this microservice considered an anti-pattern?
On the other hand, I was thinking that a separate service that only does the sync job would be over-engineering for a simple thing: another repo, build cycle, deployment, hardware, more complicated maintenance, etc.
Would love to hear more thoughts on this.
You can use an AWS CloudWatch event rule to schedule CloudWatch to generate an event every 15 minutes. Make a Lambda function a target of the CloudWatch event so it executes every 15 minutes to sync your data. Be aware of the VPC/NAT issues if calling your 3rd party resources from Lambda if they are external to your VPC/account.
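For reference, the schedule itself is written as a rate expression. A small helper sketch (the function name is made up) that builds one and handles the singular/plural unit rule, since a value of 1 takes the singular unit:

```javascript
// Build a CloudWatch Events rate expression such as "rate(15 minutes)".
// A value of 1 must use the singular unit: "rate(1 hour)".
function rateExpression(value, unit) {
  const units = ["minute", "hour", "day"];
  if (!units.includes(unit)) {
    throw new RangeError(`unit must be one of: ${units.join(", ")}`);
  }
  if (!Number.isInteger(value) || value < 1) {
    throw new RangeError("value must be a positive integer");
  }
  return `rate(${value} ${value === 1 ? unit : unit + "s"})`;
}
```

`rateExpression(15, "minute")` gives the expression for the 15-minute sync described in the question.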
Ideally, if it is like an ETL job, you can offload it to a Lambda function (if you are using AWS) or a serverless function to do the same.
Also, look into MongoDB Stitch that can do something similar.

Schedule a task to run at some point in the future (architecture)

We have a Python Flask app that uses Celery and AWS SQS for our async task needs.
One tricky problem that we've been facing recently is creating a task to run in x days, or in 3 hours for example. We've had several needs for something like this.
For now we create events in the database with timestamps that store the time at which they should be triggered. Then we use celery beat to run a scheduled task every second that checks whether there are any events to process (based on the trigger timestamp) and processes them. However, this queries the database every second for events, which we feel could be improved somehow.
We looked into using the eta parameter in celery (http://docs.celeryproject.org/en/latest/userguide/calling.html), which lets you schedule a task to run in x amount of time. However, it seems to be bad practice to have large etas, and AWS SQS also has a visibility timeout of about two hours, so anything longer than that would cause a conflict.
I'm scratching my head right now. On the one hand this works, and pretty decently, in that things have been separated out with SNS, SQS etc. to ensure scaling tolerance. However, it just doesn't feel right to query the database every second for events to process. Surely there's an easier way, or a service provided by Google/AWS, to schedule some event (pub/sub) to occur at some time in the future (x hours, minutes etc.)?
Any ideas?
Have you taken a look at AWS Step Functions, specifically Wait State? You might be able to put together a couple of lambda functions with the first one returning a timestamp or the number of seconds to wait to the Wait State and the last one adding the message to SQS after the Wait returns.
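A sketch of what such a state machine definition could look like in Amazon States Language, written here as a JavaScript object. It is simplified relative to the description above: the timestamp is assumed to arrive in the execution input as a hypothetical `triggerAt` field instead of coming from a first Lambda, and the Lambda ARN is a placeholder supplied by the caller.

```javascript
// Build a two-state definition: wait until the timestamp found at
// $.triggerAt in the input, then run a Lambda that enqueues the message.
function delayedEnqueueDefinition(enqueueLambdaArn) {
  return {
    Comment: "Wait until the requested time, then hand off to a Lambda",
    StartAt: "WaitUntilDue",
    States: {
      WaitUntilDue: {
        Type: "Wait",
        TimestampPath: "$.triggerAt", // ISO 8601 timestamp in the input
        Next: "EnqueueMessage",
      },
      EnqueueMessage: {
        Type: "Task",
        Resource: enqueueLambdaArn, // placeholder ARN, not a real function
        End: true,
      },
    },
  };
}
```

Each scheduled event would then be one execution started with an input such as `{ "triggerAt": "2030-01-01T09:00:00Z" }`, so nothing has to poll the database.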
Amazon's scheduling solution is the use of CloudWatch to trigger events. Those events can be placing a message in an SQS/SNS endpoint, triggering an ECS task, running a Lambda, etc. A lot of folks use the trick of executing a Lambda that then does something else to trigger something in your system. For example, you could trigger a Lambda that pushes a job onto Redis for a Celery worker to pick up.
When creating a CloudWatch rule, you can specify either a "Rate" (e.g. every 5 minutes) or an arbitrary time in cron syntax.
So my suggestion for your use case would be to create a CloudWatch rule that fires at the time your job needs to kick off (or a minute before, depending on how time-sensitive you are). That rule would then interact with your application to kick off your job. You only pay for the resources when CloudWatch triggers.
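To make the cron option concrete, here is a sketch that turns a specific point in time into a one-shot cron expression by pinning the day, month and year. The helper name is made up, and the time is interpreted as UTC. A rule created this way fires once and would still need to be cleaned up afterwards.

```javascript
// Build a cron expression that matches exactly one minute in UTC.
// Field order: minutes hours day-of-month month day-of-week year.
function oneTimeCronExpression(date) {
  const minute = date.getUTCMinutes();
  const hour = date.getUTCHours();
  const day = date.getUTCDate();
  const month = date.getUTCMonth() + 1; // getUTCMonth() is zero-based
  const year = date.getUTCFullYear();
  return `cron(${minute} ${hour} ${day} ${month} ? ${year})`;
}
```

For a job due in 3 hours, you would pass `new Date(Date.now() + 3 * 60 * 60 * 1000)`.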
Have you looked into Amazon Simple Notification Service? It sounds like it would serve your needs...
https://aws.amazon.com/sns/
From that page:
Amazon SNS is a fully managed pub/sub messaging service that makes it easy to decouple and scale microservices, distributed systems, and serverless applications. With SNS, you can use topics to decouple message publishers from subscribers, fan-out messages to multiple recipients at once, and eliminate polling in your applications. SNS supports a variety of subscription types, allowing you to push messages directly to Amazon Simple Queue Service (SQS) queues, AWS Lambda functions, and HTTP endpoints. AWS services, such as Amazon EC2, Amazon S3 and Amazon CloudWatch, can publish messages to your SNS topics to trigger event-driven computing and workflows. SNS works with SQS to provide a powerful messaging solution for building cloud applications that are fault tolerant and easy to scale.
You could start the job with apply_async, and then use a countdown, like:
xxx.apply_async(..., countdown=TTT)
It is not guaranteed that the job starts exactly at that time, depending on how busy the queue is, but that does not seem to be an issue in your use case.

Google Cloud Platform : Running several hours scraping script

I have a NodeJS script that scrapes URLs every day.
The requests are throttled to be kind to the server. This results in my script running for a fairly long time (several hours).
I have been looking for a way to deploy it on GCP. And because it was previously done in cron, I naturally had a look at how to have a cronjob running on Google Cloud. However, according to the docs, the script has to be exposed as an API and http calls to that API can only run for up to 60 minutes, which doesn't fit my needs.
I had a look at this S.O question, which recommends to use a Cloud Function. However, I am unsure this approach would be suitable in my case, as my script requires a lot more processing than the simple server monitoring job described there.
Has anyone experience in doing this on GCP ?
N.B.: To clarify, I want to avoid deploying it on a VPS.
Edit :
I reached out to google, here is their reply :
Thank you for your patience. Currently, it is not possible to run cron script for 6 to 7 hours in a row since the current limitation for cron in App Engine is 60 minutes per HTTP request.
If it is possible for your use case, you can spread the 7 hours to recurring tasks, for example, every 10 minutes or 1 hour. A cron job request is subject to the same limits as those for push task queues. Free applications can have up to 20 scheduled tasks. You may refer to the documentation for cron schedule format.
Also, it is possible to still use Postgres and Redis with this. However, kindly take note that Postgres is still in beta.
As I can't spread the task, I had to keep managing a dokku VPS for this.
I would suggest combining two services, GAE Cron Jobs and Cloud Tasks.
Use GAE Cron jobs to publish a list of sites and ranges to scrape to Cloud Tasks. This initialization process doesn't need to be 'kind' to the server yet; it can simply publish all chunks of work to the Cloud Tasks queue and consider itself finished once they are enqueued.
Follow that up with a Task Queue, and use the queue's rate-limiting configuration option to limit the overall request rate to the endpoint you're scraping. If you need less than 1 qps, add a sleep statement directly in your code. If you're really queueing millions or billions of jobs, follow their advice of having one queue fan out to another:
Large-scale/batch task enqueues
When a large number of tasks, for
example millions or billions, need to be added, a double-injection
pattern can be useful. Instead of creating tasks from a single job,
use an injector queue. Each task added to the injector queue fans out
and adds 100 tasks to the desired queue or queue group. The injector
queue can be sped up over time, for example start at 5 TPS, then
increase by 50% every 5 minutes.
That should be pretty hands-off, and only requires you to think through how the cron job pulls the next desired sites and pages, and how small the chunks of work should be.
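A sketch of the chunking step the cron job would perform before enqueueing. The payload shape `{ batchId, urls }` is an assumption, and the actual call that publishes each payload to the queue is omitted.

```javascript
// Split a list into consecutive chunks of at most `size` items.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Hypothetical task payload: one queue task per batch of URLs.
function toTaskPayloads(urls, size) {
  return chunk(urls, size).map((batch, index) => ({ batchId: index, urls: batch }));
}
```

Smaller batches keep each task well under the request time limit, while the queue's rate limit controls how fast they are worked through.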
I'm also working on this kind of task. I need to crawl a website and have the same problem.
Instead of running the main crawler task on the VM, I moved it to Google Cloud Functions. The task consists of getting the target URL, scraping the page, saving the result to Datastore, and returning the result to the caller.
This is how it works: I have a long-running application that can be called a master. The master knows which URLs we are going to access. But instead of accessing the target website itself, it sends the URL to a crawler function in GCF. The crawling task is then done there and the result is sent back to the master. In this setup, the master only sends requests and receives a small amount of data, and never touches the target website; the rest is left to GCF. You can offload your master and crawl the website in parallel via GCF. You can also use another method to trigger GCF instead of an HTTP request.
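A sketch of how the master could fan URLs out while capping the number of requests in flight. The `worker` argument stands in for whatever calls the crawler function (an HTTP request, for instance); it is passed in by the caller, so nothing here is tied to a specific GCF client.

```javascript
// Run `worker` over `items` with at most `limit` calls in flight at once.
// Results are returned in the same order as the input items.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each runner repeatedly takes the next unprocessed index until none remain.
  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index]);
    }
  }

  const runnerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}
```

Keeping `limit` low is one way to stay kind to the target server even though the crawling itself runs in parallel.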

SQS: Know remaining jobs

I'm creating an app that uses a JobQueue using Amazon SQS.
Every time a user logs in, I create a bunch of jobs for that specific user, and I want the user to wait until all of those jobs have been processed before being taken to a specific screen.
My problem is that I don't know how to query the queue to see if there are still pending jobs for a specific user, or what the correct way to implement such a solution is.
Everything regarding the queue (job creation and processing) is working as expected, but I am missing that final step.
Just for the record:
In my previous implementation I was using Redis + Kue, and I had created a key with the user ID and the job count. Every time a job was added the count was incremented, and every time a job finished or failed it was decremented. Now I want to move away from Redis + Kue and I am not sure how to implement this step.
Amazon SQS is not the ideal tool for the scenario you describe. A queueing system is normally used in a "Send and Forget" situation, where the sending system doesn't remain interested in later processing.
You could investigate Amazon Simple Workflow (SWF), which allows work to be monitored as it goes through several processes. Your existing code could mostly be re-used, just with the SWF framework added. Or even power it from Lambda, since you are already using node.js.
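For comparison, the per-user counter described in the question does not depend on the queue itself and can be kept alongside SQS. The sketch below holds the counts in memory purely for illustration; with several workers the counts would have to live in a shared store with atomic increments, which is an assumption beyond what the question states.

```javascript
// In-memory illustration of a per-user pending-job counter.
class PendingJobCounter {
  constructor() {
    this.counts = new Map();
  }

  // Number of jobs still pending for the user (0 if none are tracked).
  pending(userId) {
    return this.counts.get(userId) || 0;
  }

  // Call when a job for the user is sent to the queue.
  jobAdded(userId) {
    this.counts.set(userId, this.pending(userId) + 1);
  }

  // Call when a job finishes or fails; returns the remaining count.
  jobSettled(userId) {
    const remaining = Math.max(0, this.pending(userId) - 1);
    if (remaining === 0) {
      this.counts.delete(userId);
    } else {
      this.counts.set(userId, remaining);
    }
    return remaining;
  }
}
```

The login flow can then wait until `pending(userId)` reaches 0 before moving the user to the next screen.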

How to handle cron execution time and user wait

I have scheduled a cron job that is executed every minute.
This cron job generates a PDF file using a remote web service. This operation alone takes a few seconds (something like 3 seconds), which means the cron job can generate approximately 20 PDF files per minute.
If the visitor requests 60 documents, it will take the server 3 minutes to generate all the PDF files.
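The arithmetic behind those figures, as a small sketch. The function names are made up, and the 3 seconds per document is the approximate figure from above.

```javascript
// How many documents fit into one minute at a given time per document.
function documentsPerMinute(secondsPerDocument) {
  return Math.floor(60 / secondsPerDocument);
}

// Total sequential processing time for a request, in seconds.
function estimateWaitSeconds(documentCount, secondsPerDocument) {
  return documentCount * secondsPerDocument;
}
```

At 3 seconds each, `documentsPerMinute(3)` is 20 and `estimateWaitSeconds(60, 3)` is 180 seconds, i.e. the 3 minutes mentioned.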
Executing parallel cron jobs for this task is not possible, as all the file requests must be handled individually for database relationship and integrity reasons. Basically, the files can only be handled one by one.
Therefore, is there any logic I could apply in order to:
execute multiple occurrences of the same cron job to speed up the process and decrease the user's waiting time
and make each file creation process handled by one cron job only, so that a specific creation process is not also handled by another cron job doing the same task?
Thank you
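One way to picture the second requirement is a "claim" step: each cron occurrence marks a pending file request as its own before working on it, so no other occurrence picks up the same one. The sketch below uses an in-memory array and invented field names (`status`, `claimedBy`) only to show the idea; in a real database the claim would have to be a single atomic update, which is an assumption not covered in the question.

```javascript
// Claim the first pending job for the given worker, or return null if
// there is nothing left to claim. Mutates the job to mark ownership.
function claimNextJob(jobs, workerId) {
  const job = jobs.find((j) => j.status === "pending");
  if (!job) return null;
  job.status = "claimed";
  job.claimedBy = workerId;
  return job;
}
```

Two occurrences calling `claimNextJob` on the same list receive different jobs, and a third call on an exhausted list receives `null`.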
