Schedule a task to run at some point in the future (architecture) - python-3.x

So we have a Python Flask app that uses Celery and AWS SQS for our async task needs.
One tricky problem we've been facing recently is scheduling a task to run at some point in the future, e.g. in x days, or in 3 hours. We've had several needs for something like this.
For now we create events in the database with timestamps that store the time they should be triggered. Then we use Celery beat to run a scheduled task every second that checks whether there are any events to process (based on the trigger timestamp) and processes them. However, this queries the database every second, which we feel could be improved.
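To make the current approach concrete, the polling task looks roughly like this (a minimal sketch; sqlite3 stands in for our real database, and the `events` table and column names are illustrative):

```python
import sqlite3
from datetime import datetime, timezone

def fetch_due_events(conn, now=None):
    """Return unprocessed events whose trigger timestamp has passed,
    and mark them processed so the next poll skips them."""
    now = now or datetime.now(timezone.utc).isoformat()
    rows = conn.execute(
        "SELECT id, payload FROM events "
        "WHERE processed = 0 AND trigger_at <= ?", (now,)
    ).fetchall()
    if rows:
        conn.executemany("UPDATE events SET processed = 1 WHERE id = ?",
                         [(r[0],) for r in rows])
        conn.commit()
    return rows

# In the real app this body runs inside a Celery task that beat
# schedules every second, e.g.:
# @app.task
# def poll_events(): ...
```

The lexicographic comparison on `trigger_at` only works because all timestamps are stored in the same ISO-8601 format.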
We looked into the eta parameter in Celery (http://docs.celeryproject.org/en/latest/userguide/calling.html), which lets you schedule a task to run after a given delay. However, it seems to be bad practice to use large etas, and AWS SQS has a visibility timeout of about two hours, so anything beyond that would cause a conflict.
I'm scratching my head right now. On the one hand this works, and pretty decently, in that things have been separated out with SNS, SQS etc. to ensure scaling tolerance. However, it just doesn't feel right to query the database every second for events to process. Surely there's an easier way, or a service provided by Google/AWS, to schedule some event (pub/sub) to occur at some time in the future (x hours, minutes etc.)?
Any ideas?

Have you taken a look at AWS Step Functions, specifically Wait State? You might be able to put together a couple of lambda functions with the first one returning a timestamp or the number of seconds to wait to the Wait State and the last one adding the message to SQS after the Wait returns.

Amazon's scheduling solution is the use of CloudWatch to trigger events. Those events can be placing a message in an SQS/SNS endpoint, triggering an ECS task, running a Lambda, etc. A lot of folks use the trick of executing a Lambda that then does something else to trigger something in your system. For example, you could trigger a Lambda that pushes a job onto Redis for a Celery worker to pick up.
When creating a CloudWatch rule, you can specify either a rate (e.g., every 5 minutes) or an arbitrary time in cron syntax.
So my suggestion for your use case would be to drop in a CloudWatch rule that runs at the time your job needs to kick off (or a minute before, depending on how time-sensitive you are). That rule would then interact with your application to kick off your job. You'll only pay for the resources when CloudWatch triggers them.
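A sketch of that idea with boto3 (the rule name and target Lambda are hypothetical; the helper assumes a UTC datetime, since CloudWatch cron expressions are evaluated in UTC):

```python
from datetime import datetime, timezone

def one_shot_cron(dt):
    """Build a CloudWatch cron expression that fires once, at dt (UTC).
    Field order: minute hour day-of-month month day-of-week year."""
    return f"cron({dt.minute} {dt.hour} {dt.day} {dt.month} ? {dt.year})"

def schedule_job(run_at, lambda_arn, rule_name="one-off-job"):
    # Sketch only: requires boto3 and AWS credentials.
    import boto3
    events = boto3.client("events")
    events.put_rule(Name=rule_name, ScheduleExpression=one_shot_cron(run_at))
    events.put_targets(Rule=rule_name,
                       Targets=[{"Id": "1", "Arn": lambda_arn}])
```

For example, `one_shot_cron(datetime(2024, 5, 1, 9, 30, tzinfo=timezone.utc))` yields `cron(30 9 1 5 ? 2024)`. You would also want to clean the rule up after it fires.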

Have you looked into Amazon Simple Notification Service? It sounds like it would serve your needs...
https://aws.amazon.com/sns/
From that page:
Amazon SNS is a fully managed pub/sub messaging service that makes it easy to decouple and scale microservices, distributed systems, and serverless applications. With SNS, you can use topics to decouple message publishers from subscribers, fan-out messages to multiple recipients at once, and eliminate polling in your applications. SNS supports a variety of subscription types, allowing you to push messages directly to Amazon Simple Queue Service (SQS) queues, AWS Lambda functions, and HTTP endpoints. AWS services, such as Amazon EC2, Amazon S3 and Amazon CloudWatch, can publish messages to your SNS topics to trigger event-driven computing and workflows. SNS works with SQS to provide a powerful messaging solution for building cloud applications that are fault tolerant and easy to scale.

You could start the job with apply_async, and then use a countdown, like:
xxx.apply_async(..., countdown=TTT)
It is not guaranteed that the job starts exactly at that time, depending on how busy the queue is, but that does not seem to be an issue in your use case.
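For a countdown computed from a stored trigger timestamp, it might look like this (sketch; the task name `process_event` is hypothetical, and the SQS visibility-timeout caveat above still applies to long countdowns):

```python
from datetime import datetime, timezone

def seconds_until(target, now=None):
    """Seconds from now until target; never negative."""
    now = now or datetime.now(timezone.utc)
    return max(0, int((target - now).total_seconds()))

# With a real Celery task (hypothetical name):
# process_event.apply_async(args=[event_id],
#                           countdown=seconds_until(trigger_at))
```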

Related

Publishing over 6000 messages to SNS from a lambda

I have a script that we wanted to run in AWS as a Lambda, written in Python. This will run once every 24 hours. It describes all the EC2 instances in an environment, and I'm checking some details for compliance. After that, we want to send all the instance details to an SNS topic that other similar compliance scripts are using; from there they are sent to a Lambda and then into Sumo Logic. I'm currently testing with just one environment, but there are over 1,000 instances, and so over 1,000 messages to be published to SNS, one by one. The Lambda times out long before it can send all of these. Once all environments and their instances are checked, it could be close to 6,000 messages to publish to SNS.
I need some advice on how to architect this to work with all environments. I was thinking maybe putting all the records into an S3 bucket from my lambda, then creating another lambda that would read each record from the bucket say, 50 at a time, and push them one by one into the SNS topic. I'm not sure though how that would work either.
Any ideas appreciated!
I would recommend using Step Functions for this problem. Even if you fix the issue for now, sooner or later, with a rising number of instances or additional steps you want to perform, the 900-second maximum runtime of a single Lambda won't be sufficient anymore.
A simple approach with Step Functions could be:
Step 1: Create a list of EC2 instances to "inspect" with this first Lambda. Could be all instances or just instances with a certain tag etc. You can be as creative as you want.
Step 2: Process this list of instances with a parallel step that calls one Lambda per instance id.
Step 3: The triggered Lambda reads the details from the provided EC2 instance and then publishes the result to SNS. There might already be a pre-defined step for the publication to SNS, so you don't need to program it yourself.
With the new Workflow Studio this should be relatively easy to implement.
Note: This might not be faster than a single Lambda, but it will scale better with more EC2 instances to scan. The only bottleneck here might become the Lambda in Step 1. If that Lambda needs more than 15 minutes to "find" all the EC2 instances to "scan", then you need to become a bit creative. But this is solvable.
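Wherever the publishing ends up happening, SNS also offers publish_batch, which accepts up to 10 entries per call, so a chunking helper keeps each call within the limit (sketch; the topic ARN and message shape are made up):

```python
import json

def chunk(items, size=10):
    """Split items into lists of at most `size` (the SNS batch limit is 10)."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def publish_all(sns, topic_arn, messages):
    # `sns` is a boto3 SNS client; shown here as a sketch.
    for batch in chunk(messages):
        sns.publish_batch(
            TopicArn=topic_arn,
            PublishBatchRequestEntries=[
                {"Id": str(i), "Message": json.dumps(m)}
                for i, m in enumerate(batch)
            ],
        )
```

Batching cuts the number of API calls roughly tenfold compared to publishing one message at a time.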
I think you could use an SQS queue to solve your problem.
From SNS, send the message to an SQS queue. Then your Lambda can poll the SQS queue for messages (the default batch size is 10, but you can decrease it through the CLI).

Combine SQS messages that arrive within milliseconds of each other

I am faced with a situation that I am not quite sure how to solve. Basically my system receives data from a third-party source via API gateway, publishes this data to an SNS topic which triggers a lambda function. Based on the message parameters, the lambda function pushes the message to one of three different SQS queues. These queues trigger one of three lambda functions which perform one of three possible actions - create, update or delete items in that order in another third-party system through their API endpoints.
The usual flow would be to first create an entity on the destination system, and each subsequent action should then update/delete this entity. The problem is, sometimes I receive data for the same entity from the source within milliseconds, so my system is unable to create the entity on the destination in time, because their API requires at least 300-400 ms to do so. So when my system tries to update the entity, it doesn't exist yet, and thus my system creates it. But since I already have a create action executing, this creates a duplicate entry on my destination.
So my question is, what is the best practice to consolidate messages for the same entity that arrive within less than a second of each other?
My Thoughts so far:
I am thinking of using Redis to consolidate messages that are for the same entity before pushing them to the SNS topic, but I was hoping there would be a more straightforward approach, as I don't want to introduce another layer of logic.
Any help would be much appreciated. Thank you.
The best option would be to use an Amazon SQS FIFO queue, with each message using a Message Group ID that is set to the unique ID of the item that is being created.
In a FIFO queue, SQS will ensure that messages are processed in-order, and will only allow one message per Message Group ID to be received at a time. Thus, any subsequent messages for the same Message Group ID will wait until an existing message has been fully processed.
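Sending to a FIFO queue then looks something like this (sketch; the queue URL is hypothetical, and the dedup-id scheme is one choice -- content-based deduplication can alternatively be enabled on the queue itself):

```python
import hashlib
import json

def fifo_message_kwargs(queue_url, entity_id, body):
    """Build send_message kwargs for an SQS FIFO queue: one message
    group per entity, dedup id derived from the message body."""
    payload = json.dumps(body, sort_keys=True)
    return {
        "QueueUrl": queue_url,
        "MessageBody": payload,
        "MessageGroupId": str(entity_id),
        "MessageDeduplicationId": hashlib.sha256(payload.encode()).hexdigest(),
    }

# With boto3 (sketch):
# boto3.client("sqs").send_message(
#     **fifo_message_kwargs(QUEUE_URL, entity_id, {"action": "update"}))
```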
If this is not acceptable, then AWS Lambda now supports batch windows of up to 5 minutes for functions with Amazon SQS as an event source:
AWS Lambda now allows customers using Amazon Simple Queue Service (Amazon SQS) as an event source to define a wait period, called MaximumBatchingWindowInSeconds, to allow messages to accumulate in their SQS queue before invoking a Lambda function. In addition to Batch Size, this is a second option to send records in batches, to reduce the number of Lambda invokes. This option is ideal for workloads that are not time-sensitive, and can choose to wait to optimize cost.
Previously, Lambda functions polling from an SQS queue would send messages in batches of up to 10 before invoking the function. Now, customers can also define a time window that Lambda should wait to poll messages from their SQS queue before invoking their function. Lambda will wait for up to 300 seconds to poll messages from the SQS queue. When a batch window is defined, Lambda will also allow customers to define a batch size of up to 10,000 messages.
To get started, when creating a new Lambda function or updating an existing function with SQS as an event source, customers can set the MaximumBatchingWindowInSeconds field to any value between 0 and 300 seconds on the AWS Management Console, the AWS CLI, AWS SAM or AWS SDK for Lambda. This feature is available in all AWS Regions where AWS Lambda and Amazon SQS are available, and requires no additional charge to use.
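Wiring that up might look like the following (sketch; the queue ARN and function name are placeholders, and the helper just enforces the documented 0-300 second range):

```python
def batching_window(seconds):
    """Clamp a requested batch window to the allowed 0-300 second range."""
    return max(0, min(300, int(seconds)))

def attach_queue_to_lambda(queue_arn, function_name, window=60, batch_size=100):
    # Sketch only: requires boto3 and AWS credentials.
    import boto3
    boto3.client("lambda").create_event_source_mapping(
        EventSourceArn=queue_arn,
        FunctionName=function_name,
        BatchSize=batch_size,
        MaximumBatchingWindowInSeconds=batching_window(window),
    )
```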
the lambda function pushes the message to one of three different SQS queues
...
So when my system tries to update the entity, it's not existing yet, thus my system creates it. But since I have a create action in the process of executing, it creates a duplicate entry on my destination
By using multiple queues you created a thread race yourself, and now you are trying to patch it.
Based on the provided information and context, as already answered, a single FIFO queue with a message group ID could be more appropriate (do you really need 3 queues?).
If latency is critical, then streaming could be a solution as well.
As you described your issue, I think you don't need to combine the messages (you could indeed use Redis, AWS Kinesis Data Analytics, DynamoDB, ...), but rather avoid creating the issue in the first place.
Options
having a single fifo queue
having an idempotent and thread-safe backend service able to handle concurrent updates (transactions, atomic updates, ...)
Also, if you can create "duplicate" entries, it means unique indexes are not enforced. They exist exactly for that reason.
You did not specify the backend service (RDBMS, DynamoDB, MongoDB, other?); each has an option to handle the problem somehow.
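For the DynamoDB case, an idempotent create can be expressed as a conditional put (sketch; the table name and key attribute `pk` are made up):

```python
def idempotent_create_kwargs(table, entity_id, attrs):
    """put_item kwargs that fail (instead of silently overwriting)
    when the entity already exists -- DynamoDB's way of enforcing
    uniqueness on the partition key."""
    item = {"pk": {"S": str(entity_id)}}
    item.update(attrs)
    return {
        "TableName": table,
        "Item": item,
        "ConditionExpression": "attribute_not_exists(pk)",
    }

# With boto3 (sketch):
# client = boto3.client("dynamodb")
# try:
#     client.put_item(**idempotent_create_kwargs("entities", 42,
#                                                {"name": {"S": "foo"}}))
# except client.exceptions.ConditionalCheckFailedException:
#     ...  # entity already exists: fall back to an update
```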

Adding schedule job to nodejs microservicse best practice

We are using a microservices approach in our backend.
We have a Node.js service which provides a REST endpoint that grabs some data from MongoDB and applies some business logic to it.
We would need to add a scheduled job every 15 min to sync the MongoDB data with some 3rd-party data source.
The question here is: would adding such a scheduled job to this microservice be considered an anti-pattern?
I was thinking, from the other point of view, that having a service that just does the sync job would create some over-engineering for a simple thing: another repo, build cycle, deployment, hardware, complicated maintenance, etc.
Would love to hear more thoughts around it.
You can use an AWS CloudWatch event rule to schedule CloudWatch to generate an event every 15 minutes. Make a Lambda function a target of the CloudWatch event so it executes every 15 minutes to sync your data. Be aware of the VPC/NAT issues if calling your 3rd party resources from Lambda if they are external to your VPC/account.
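Setting that up programmatically might look like this (sketch in Python/boto3; the rule name and function ARN are placeholders):

```python
def rate_expression(minutes):
    """CloudWatch schedule expression; 'minute' vs 'minutes' matters."""
    unit = "minute" if minutes == 1 else "minutes"
    return f"rate({minutes} {unit})"

def schedule_sync(lambda_arn, rule_name="sync-every-15-min"):
    # Sketch only: requires boto3 and AWS credentials.
    import boto3
    events = boto3.client("events")
    rule_arn = events.put_rule(
        Name=rule_name, ScheduleExpression=rate_expression(15)
    )["RuleArn"]
    events.put_targets(Rule=rule_name,
                       Targets=[{"Id": "1", "Arn": lambda_arn}])
    # Allow CloudWatch Events to invoke the function:
    boto3.client("lambda").add_permission(
        FunctionName=lambda_arn, StatementId=rule_name,
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com", SourceArn=rule_arn,
    )
```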
Ideally, if it is like an ETL job, you can offload it to a Lambda function (if you are using AWS) or a serverless function to do the same.
Also, look into MongoDB Stitch that can do something similar.

Notify Lambda on CloudFront Distribution Creation End

At the moment, we are calling cloudfront.listDistributions() every minute to identify a change in the status of the distribution we are deploying. This causes Lambda to time out, because CloudFront never deploys in less than 30 minutes (while Lambda times out after 15 min).
I would like to notify a Lambda function after a CloudFront Distribution is successfully created. This would allow us to execute the post-creation actions while saving valuable Lambda exec time.
Creating a rule in CloudWatch does not offer the option to choose CloudFront. Nevertheless, it seems to accept creating a custom event pattern with the source aws.cloudformation.
Considering options:
Trigger a lambda every 5 minutes to list distributions and compare states with previous states stored in DynamoDB.
Anybody with an idea to overcome this lack of feature from AWS?
If you want and have time, there's a trickier and a bit more complex solution for doing that leveraging CloudTrail.
Disclaimer
CloudTrail is not a real-time log system, but it ensures that all API calls are reported in the console within 15 minutes (as stated in the CloudTrail FAQs). Because of this, what follows makes sense only for long-running tasks like creating a CloudFront distribution, spinning up an Aurora DB and so on.
1. You can create a CloudWatch event-based rule (let's call it CW-r1) on a specific pattern like CreateDistribution or UpdateDistribution.
2. CW-r1 triggers a Lambda (LM-1) which enables another CloudWatch event-based rule (CW-r2).
3. CW-r2, on a schedule, triggers a Lambda (LM-2) which requests the state of the specific distribution via the API. Once the distribution is "Deployed", LM-2 can send a notification via SNS, for example (you can send email, SMS, push notifications, whatever is supported by SNS).
4. Once everything is finished, LM-2 can disable the CW-r2 rule in order to stop processing.
In this way you can have an automatic notification system based on which API call you desire.
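The LM-2 body can be kept small by injecting its collaborators, which also makes it easy to test; the boto3 wiring below is a sketch with hypothetical rule and topic names:

```python
def check_and_notify(get_status, notify, disable_rule, distribution_id):
    """Check one distribution; if deployed, notify and disable the
    polling rule. Returns True once the work is done."""
    if get_status(distribution_id) == "Deployed":
        notify(f"CloudFront distribution {distribution_id} is deployed")
        disable_rule()
        return True
    return False

# Wiring with boto3 (sketch):
# cf = boto3.client("cloudfront")
# get_status = lambda d: cf.get_distribution(Id=d)["Distribution"]["Status"]
# notify = lambda msg: boto3.client("sns").publish(TopicArn=TOPIC, Message=msg)
# disable_rule = lambda: boto3.client("events").disable_rule(Name="CW-r2")
```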

Time triggered Azure Function and queue processing

I have a Function called once a day that processes all messages in a queue. But I would also like to have the retry and poison-message logic as with the Queue Trigger. Is this somehow possible?
So at that point your Function is purely a timer-triggered function, and from there you are no different from a console app in terms of how you would have to process messages from a queue, with your own client connection, message loop, retry and dead-lettering (poison) logic. It's honestly just not the right tool for that job.
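A hand-rolled loop of that kind could be sketched like this (illustrative only; `queue` is anything yielding messages with `content` and `dequeue_count`, such as azure-storage-queue's QueueClient, and the max_dequeue threshold is arbitrary):

```python
def drain(queue, handle, poison, max_dequeue=5):
    """Process every available message; route repeatedly-failing ones
    to `poison` instead of retrying forever."""
    for msg in queue.receive_messages():
        try:
            if msg.dequeue_count > max_dequeue:
                poison(msg)          # manual dead-lettering
            else:
                handle(msg.content)
            queue.delete_message(msg)
        except Exception:
            # Leave the message on the queue; it reappears after the
            # visibility timeout with dequeue_count incremented.
            pass
```

The visibility-timeout trick is what the built-in queue trigger gives you for free, which is exactly the point being made here.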
One approach I suppose you could consider, if you wanted to be creative, so that you could benefit from an Azure Function's built-in queue trigger behavior while still controlling what time the queue is processed, is actually starting and stopping the function instance itself via something like Azure Scheduler. Scheduling the starting of the function is pretty straightforward and, once started, it will immediately begin draining the queue.
The challenge is knowing how to stop it. The Azure Function runtime with the queue binding won't ever stop on its own, as it's reading off the queue using a pull model, so it's just gonna sit there waiting for new messages to arrive, right? So stopping it is really a question of understanding the need of the business. Do you stop once there are no more messages left for that day? Do you stop at a specific time of day? Etc. There's no correct answer here; it's totally domain-specific, but whatever that answer is will dictate the exact approach taken.
Honestly, as I said earlier, I'm not sure this is the right tool for the job. Yeah, it's nice that you get the retry and poison handling, but you're really going against the grain of the runtime. I would personally suggest you look into scheduling this as a simple console executable, executed in a task-like fashion using Azure Container Instances (some docs on that approach here). If you were counting on the auto-scale of Azure Functions, that's totally something you can get from Azure Container Instances as well.
