Publishing over 6000 messages to SNS from a lambda - python-3.x

I have a script that we want to run in AWS as a Lambda, written in Python, once every 24 hours. It describes all the EC2 instances in an environment and checks some details for compliance. After that, we want to send all the instance details to an SNS topic that other, similar compliance scripts already use; messages on that topic go to a Lambda and from there into Sumo Logic. I'm currently testing with just one environment, but there are over 1,000 instances and so over 1,000 messages to publish to SNS, one by one. The Lambda times out long before it can send all of them. Once all environments and their instances are checked, it could be close to 6,000 messages to publish to SNS.
I need some advice on how to architect this to work across all environments. I was thinking of having my Lambda put all the records into an S3 bucket, then creating another Lambda that would read the records from the bucket, say 50 at a time, and push them one by one into the SNS topic. I'm not sure how that would work either, though.
Any ideas appreciated!

I would recommend using Step Functions for this problem. Even if you fix the issue for now, sooner or later, with a rising number of instances or additional steps you want to perform, the 900-second maximum runtime of a single Lambda won't be sufficient anymore.
A simple approach with Step Functions could be:
Step 1: Create the list of EC2 instances to "inspect" with a first Lambda. It could be all instances or just instances with a certain tag, etc.; you can be as creative as you want.
Step 2: Process this list of instances with a parallel step that calls one Lambda per instance ID.
Step 3: The triggered Lambda reads the details of the given EC2 instance and publishes the result to SNS (a minimal sketch of such a Lambda follows below). There may already be a pre-defined step for publishing to SNS, so you don't need to program it yourself.
With the new Workflow Studio this should be relatively easy to implement.
Note: This might not be faster than a single Lambda, but it will scale better as the number of EC2 instances to scan grows. The only bottleneck might become the Lambda in Step 1: if that Lambda needs more than 15 minutes to "find" all the EC2 instances to "scan", you'll need to get a bit creative, but that is solvable.
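To make Step 3 concrete, here is a minimal sketch of what that Lambda could look like with boto3. It assumes the state machine passes the instance ID as an "instance_id" key in the event and that the topic ARN comes from an environment variable; both of those are my assumptions, not part of the answer above.

```python
import json
import os

import boto3

ec2 = boto3.client("ec2")
sns = boto3.client("sns")

def handler(event, context):
    # "instance_id" is an assumed input key from the parallel/map step.
    instance_id = event["instance_id"]
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    instance = reservations[0]["Instances"][0]

    # Keep only the fields the compliance check cares about.
    details = {
        "InstanceId": instance["InstanceId"],
        "InstanceType": instance["InstanceType"],
        "State": instance["State"]["Name"],
        "Tags": instance.get("Tags", []),
    }

    # TOPIC_ARN is an assumed environment variable.
    sns.publish(TopicArn=os.environ["TOPIC_ARN"], Message=json.dumps(details, default=str))
    return details
```

Because each invocation handles a single instance, the per-instance work stays well under the Lambda timeout no matter how many instances the environment has.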

I think you could use an SQS queue to solve your problem.
From SNS, send the messages to an SQS queue. Your Lambda can then poll messages from the queue in batches (the default batch size is 10, but you can adjust it, e.g. through the CLI).
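If you go this route, the consuming Lambda receives SQS records in batches rather than one message per invocation. A rough sketch, assuming the queue is subscribed to the SNS topic without raw message delivery (so each SQS body carries an SNS envelope) and that process() stands in for your own compliance logic:

```python
import json

def handler(event, context):
    # The SQS event source delivers up to the configured batch size per invocation.
    for record in event["Records"]:
        # SNS-to-SQS messages arrive wrapped in an SNS envelope.
        envelope = json.loads(record["body"])
        payload = json.loads(envelope["Message"])
        process(payload)

def process(payload):
    # Placeholder for whatever you do with each instance's details.
    print(payload)
```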

Related

Combine SQS messages that arrive within milliseconds of each other

I am faced with a situation that I am not quite sure how to solve. Basically my system receives data from a third-party source via API gateway, publishes this data to an SNS topic which triggers a lambda function. Based on the message parameters, the lambda function pushes the message to one of three different SQS queues. These queues trigger one of three lambda functions which perform one of three possible actions - create, update or delete items in that order in another third-party system through their API endpoints.
The usual flow would be to first create an entity on the destination system, and then each subsequent action should update or delete that entity. The problem is that sometimes I receive data for the same entity from the source within milliseconds, so my system cannot create the entity on the destination in time, because their API needs at least 300-400 ms to do so. So when my system tries to update the entity, it doesn't exist yet, and my system creates it. But since I already have a create action executing, this results in a duplicate entry on the destination.
So my question is, what is the best practice to consolidate messages for the same entity that arrive within less than a second of each other?
My Thoughts so far:
I am thinking of using Redis to consolidate messages for the same entity before pushing them to the SNS topic, but I was hoping there would be a more straightforward approach, as I don't want to introduce another layer of logic.
Any help would be much appreciated. Thank you.
The best option would be to use an Amazon SQS FIFO queue, with each message using a Message Group ID that is set to the unique ID of the item that is being created.
In a FIFO queue, SQS will ensure that messages are processed in-order, and will only allow one message per Message Group ID to be received at a time. Thus, any subsequent messages for the same Message Group ID will wait until an existing message has been fully processed.
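As a rough illustration of the FIFO approach, here is a boto3 sketch of the producer side; the queue URL, entity ID and payload shape are placeholders of my own:

```python
import json
import uuid

import boto3

sqs = boto3.client("sqs")

def enqueue(queue_url, entity_id, action, payload):
    sqs.send_message(
        QueueUrl=queue_url,                        # must be a .fifo queue
        MessageBody=json.dumps({"action": action, "payload": payload}),
        MessageGroupId=entity_id,                  # serialises processing per entity
        MessageDeduplicationId=str(uuid.uuid4()),  # or enable content-based deduplication on the queue
    )
```

With the entity's unique ID as the Message Group ID, a later "update" for the same entity cannot be received until the earlier "create" has been deleted from the queue.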
If this is not acceptable, then AWS Lambda now supports batch windows of up to 5 minutes for functions with Amazon SQS as an event source:
AWS Lambda now allows customers using Amazon Simple Queue Service (Amazon SQS) as an event source to define a wait period, called MaximumBatchingWindowInSeconds, to allow messages to accumulate in their SQS queue before invoking a Lambda function. In addition to Batch Size, this is a second option to send records in batches, to reduce the number of Lambda invokes. This option is ideal for workloads that are not time-sensitive, and can choose to wait to optimize cost.
Previously, Lambda functions polling from an SQS queue would send messages in batches of up to 10 before invoking the function. Now, customers can also define a time window that Lambda should wait to poll messages from their SQS queue before invoking their function. Lambda will wait for up to 300 seconds to poll messages from the SQS queue. When a batch window is defined, Lambda will also allow customers to define a batch size of up to 10,000 messages.
To get started, when creating a new Lambda function or updating an existing function with SQS as an event source, customers can set the MaximumBatchingWindowInSeconds field to any value between 0 and 300 seconds on the AWS Management Console, the AWS CLI, AWS SAM or AWS SDK for Lambda. This feature is available in all AWS Regions where AWS Lambda and Amazon SQS are available, and requires no additional charge to use.
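For completeness, this is roughly how the batching window could be wired up with boto3 instead of the console or CLI; the queue ARN and function name below are placeholders:

```python
import boto3

lambda_client = boto3.client("lambda")

lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:eu-west-1:123456789012:my-entity-queue",  # placeholder queue
    FunctionName="process-entity-batch",                                   # placeholder function
    BatchSize=100,                      # larger batches are allowed once a window is set
    MaximumBatchingWindowInSeconds=60,  # wait up to 60 s for messages to accumulate
)
```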
the lambda function pushes the message to one of three different SQS queues
...
So when my system tries to update the entity, it doesn't exist yet, and my system creates it. But since I already have a create action executing, this results in a duplicate entry on the destination
By using multiple queues you have created a race condition, and now you are trying to patch it.
Based on the provided information and context, and as already answered, a single FIFO queue with a message group ID would be more appropriate (do you really need three queues?).
If latency is critical, then streaming could be a solution as well.
As you describe your issue, I think you don't need to combine the messages (you could indeed use Redis, AWS Kinesis Analytics, DynamoDB, ...), but rather avoid creating the issue in the first place.
Options:
having a single FIFO queue
having an idempotent and thread-safe backend service able to handle concurrent updates (transactions, atomic updates, ...)
Also, if you can create "duplicate" entries, it means unique indexes are not enforced; they exist exactly for that reason.
You did not specify the backend service (RDBMS, DynamoDB, MongoDB, something else?); each has a way to handle this problem.
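For example, if the backend were DynamoDB, a conditional write makes the "create" idempotent, so a racing duplicate create fails cleanly instead of inserting a second entry. This is illustrative only; the table and key names are made up:

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("entities")  # hypothetical table

def create_entity(entity_id, attributes):
    try:
        table.put_item(
            Item={"entity_id": entity_id, **attributes},
            ConditionExpression="attribute_not_exists(entity_id)",  # reject duplicates atomically
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # entity already exists; fall back to an update instead
        raise
```

Most relational databases give you the same guarantee with a unique constraint plus an upsert (e.g. INSERT ... ON CONFLICT).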

Notify Lambda on CloudFront Distribution Creation End

At the moment, we call cloudfront.listDistributions() every minute to detect a change in the status of the distribution we are deploying. This causes the Lambda to time out, because CloudFront never deploys in less than 30 minutes (while Lambda times out after 15 minutes).
I would like to notify a Lambda function after a CloudFront distribution is successfully created. This would allow us to execute the post-creation actions while saving valuable Lambda execution time.
Creating a rule in CloudWatch does not offer the option to choose CloudFront. Nevertheless, it seems to accept a custom event pattern with the source aws.cloudformation.
Options I'm considering:
Trigger a Lambda every 5 minutes to list distributions and compare their states with the previous states stored in DynamoDB.
Does anybody have an idea to overcome this missing feature in AWS?
If you want and have the time, there's a trickier and a bit more complex solution leveraging CloudTrail.
Disclaimer
CloudTrail is not a real-time log system, but it ensures that all API calls are reported in the console within 15 minutes (as stated under the CloudTrail FAQs). Because of this, what follows only makes sense for long-running tasks like creating a CloudFront distribution, spinning up an Aurora DB and so on.
Create a CloudWatch event rule (let's call it CW-r1) on a specific pattern like CreateDistribution or UpdateDistribution.
CW-r1 triggers a Lambda (LM-1), which enables another CloudWatch event rule (CW-r2).
CW-r2, on a schedule, triggers a Lambda (LM-2) which requests the state of the specific distribution via the API. Once the distribution is "Deployed", LM-2 can send a notification via SNS (email, SMS, push notification, whatever SNS supports).
Once everything is finished, LM-2 can disable the CW-r2 rule in order to stop processing.
In this way you can build an automatic notification system based on whichever API call you care about.
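A rough sketch of LM-2 in Python, assuming the distribution ID, SNS topic ARN and the name of CW-r2 are provided via environment variables (all of these are my assumptions):

```python
import os

import boto3

cloudfront = boto3.client("cloudfront")
events = boto3.client("events")
sns = boto3.client("sns")

def handler(event, context):
    dist_id = os.environ["DISTRIBUTION_ID"]  # assumed env var
    status = cloudfront.get_distribution(Id=dist_id)["Distribution"]["Status"]

    if status != "Deployed":
        return status  # not ready yet; CW-r2 will trigger us again on the next schedule

    sns.publish(
        TopicArn=os.environ["TOPIC_ARN"],  # assumed env var
        Message=f"CloudFront distribution {dist_id} is deployed",
    )
    # Stop the polling loop now that the work is done.
    events.disable_rule(Name=os.environ["RULE_NAME"])  # assumed env var
    return status
```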

Handling time consuming operations using Nodejs and AWS

The current setup of the project I am working on is based on Node.js/Express and AWS. An AWS Lambda is triggered on a daily basis and calls an API endpoint which is expected to fire a varying number of emails via Sendgrid (hundreds to thousands). With a lower number of emails it worked fine, but when the number reached around 1,000, the Lambda timed out and the API crashed.
The Lambda timeout was set to 1 minute. Raising it to 5 minutes might let this case of 1,000 emails pass, but it could still fail when the number reaches several thousand. Apart from that, we would like to avoid keeping the server busy for several minutes, which is why it was set to 1 minute initially.
We are now looking for better solutions to this specific situation. What would be a better approach: would it be an option to use an SNS/SQS queue, or to move all the code that sends the emails into Lambda with Serverless?
Thanks for any inputs in advance and if more information is required please let me know.
Lambdas are not designed for long-running operations. You can use Elastic Beanstalk worker environments: https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features-managing-env-tiers.html
Briefly, the Lambda publishes the tasks to an SQS queue and an Elastic Beanstalk worker app handles them.
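A minimal Python sketch of the hand-off (the same pattern applies with the Node.js SDK): the scheduled Lambda only enqueues one message per email and returns quickly, and the worker tier does the slow Sendgrid calls. The queue URL and load_recipients() are placeholders of my own:

```python
import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/email-jobs"  # placeholder

def handler(event, context):
    recipients = load_recipients()  # placeholder for your own lookup
    # send_message_batch accepts at most 10 entries per call.
    for start in range(0, len(recipients), 10):
        entries = [
            {"Id": str(i), "MessageBody": json.dumps({"to": addr})}
            for i, addr in enumerate(recipients[start:start + 10])
        ]
        sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=entries)

def load_recipients():
    return []
```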

Schedule a task to run at some point in the future (architecture)

So we have a Python Flask app running that uses Celery and AWS SQS for our async task needs.
One tricky problem that we've been facing recently is creating a task to run in x days, or in 3 hours for example. We've had several needs for something like this.
For now, we create events in the database with timestamps that store the time they should be triggered. Then we use Celery beat to run a scheduled task every second that checks whether there are any events to process (based on the trigger timestamp) and processes them. However, this queries the database every second for events, which we feel could be done better.
We looked into the eta parameter in Celery (http://docs.celeryproject.org/en/latest/userguide/calling.html), which lets you schedule a task to run in x amount of time. However, it seems to be bad practice to have large ETAs, and AWS SQS has a visibility timeout of about two hours, so anything beyond that would cause a conflict.
I'm scratching my head right now. On the one hand this works, and pretty decently, in that things have been separated out with SNS, SQS etc. to ensure scaling tolerance. However, it just doesn't feel right to query the database every second for events to process. Surely there's an easier way, or a service provided by Google/AWS to schedule some event (pub/sub) to occur at some point in the future (x hours, minutes etc.).
Any ideas?
Have you taken a look at AWS Step Functions, specifically the Wait state? You might be able to put together a couple of Lambda functions: the first one returns a timestamp (or the number of seconds to wait) to the Wait state, and the last one adds the message to SQS after the Wait returns.
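A sketch of that state machine, written as a Python dict and created with boto3; the Lambda ARNs, role ARN and the "run_at" field name are placeholders of my own:

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

definition = {
    "StartAt": "ComputeDelay",
    "States": {
        "ComputeDelay": {  # first Lambda returns {"run_at": "<ISO-8601 timestamp>"}
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:compute-delay",
            "Next": "WaitUntil",
        },
        "WaitUntil": {
            "Type": "Wait",
            "TimestampPath": "$.run_at",  # pause until the timestamp from the previous step
            "Next": "EnqueueTask",
        },
        "EnqueueTask": {  # last Lambda pushes the message onto SQS
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:enqueue-task",
            "End": True,
        },
    },
}

sfn.create_state_machine(
    name="delayed-task",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/step-functions-role",  # placeholder
)
```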
Amazon's scheduling solution is the use of CloudWatch to trigger events. Those events can be placing a message in an SQS/SNS endpoint, triggering an ECS task, running a Lambda, etc. A lot of folks use the trick of executing a Lambda that then does something else to trigger something in your system. For example, you could trigger a Lambda that pushes a job onto Redis for a Celery worker to pick up.
When creating a CloudWatch rule, you can specify either a "rate" (e.g., every 5 minutes) or an arbitrary time in cron syntax.
So my suggestion for your use case would be to create a CloudWatch rule that runs at the time your job needs to kick off (or a minute before, depending on how time-sensitive you are). That rule would then interact with your application to kick off your job. You only pay for the resources when CloudWatch triggers them.
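As a rough boto3 sketch of that suggestion (names, ARNs and the cron time are placeholders):

```python
import boto3

events = boto3.client("events")

rule_name = "kick-off-job-1234"  # e.g. derived from your job ID
events.put_rule(
    Name=rule_name,
    ScheduleExpression="cron(30 14 21 6 ? 2024)",  # 14:30 UTC on 21 June 2024
    State="ENABLED",
)
events.put_targets(
    Rule=rule_name,
    Targets=[{
        "Id": "job-kickoff-lambda",
        "Arn": "arn:aws:lambda:eu-west-1:123456789012:function:kick-off-job",  # placeholder
        "Input": '{"job_id": "1234"}',  # passed to the Lambda as its event
    }],
)
# Note: the target Lambda also needs a resource policy that allows
# events.amazonaws.com to invoke it (lambda add-permission).
```

Once the job has run, the Lambda (or a cleanup task) can delete the rule so one-off schedules don't accumulate.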
Have you looked into Amazon Simple Notification Service? It sounds like it would serve your needs...
https://aws.amazon.com/sns/
From that page:
Amazon SNS is a fully managed pub/sub messaging service that makes it easy to decouple and scale microservices, distributed systems, and serverless applications. With SNS, you can use topics to decouple message publishers from subscribers, fan-out messages to multiple recipients at once, and eliminate polling in your applications. SNS supports a variety of subscription types, allowing you to push messages directly to Amazon Simple Queue Service (SQS) queues, AWS Lambda functions, and HTTP endpoints. AWS services, such as Amazon EC2, Amazon S3 and Amazon CloudWatch, can publish messages to your SNS topics to trigger event-driven computing and workflows. SNS works with SQS to provide a powerful messaging solution for building cloud applications that are fault tolerant and easy to scale.
You could start the job with apply_async, and then use a countdown, like:
xxx.apply_async(..., countdown=TTT)
It is not guaranteed that the job starts exactly at that time, depending on how busy the queue is, but that does not seem to be an issue in your use case.
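A concrete (hypothetical) example of the countdown approach, scheduling a task roughly three hours out; the task and argument names are made up:

```python
from celery import Celery

# Uses Celery's SQS transport (requires kombu[sqs]).
app = Celery("tasks", broker="sqs://")

@app.task
def send_reminder(user_id):
    print(f"reminding user {user_id}")

# Enqueue now, execute ~3 hours later (subject to queue load and the
# SQS visibility-timeout caveat mentioned in the question).
send_reminder.apply_async(args=[42], countdown=3 * 60 * 60)
```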

How to optimize AWS Lambda?

I'm currently building a web API using AWS Lambda with the Serverless Framework.
Each of my Lambda functions connects to Redis (ElastiCache) and a relational database (Aurora, RDS) or DynamoDB to retrieve data or write new data.
And all my lambda functions are running in my VPC.
Everything works fine except that when a Lambda function is first executed, or executed a while after its last execution, it takes quite a long time (1-3 seconds) to run, and sometimes it even responds with a gateway timeout error (around 30 seconds), even though my Lambda functions are configured with a 60-second timeout.
As stated here, I assume the 1-3 seconds is for initializing a new container. However, I wonder if there is a way to reduce this time, because 1-3 seconds or a gateway timeout is not really ideal for production use.
You've got two issues:
The 1-3 second delay. This is expected and well documented when using Lambda. As #Nick mentioned in the comments, the only way to keep your container from going cold is to use it. You can use Lambda scheduled events to execute your function as often as every minute using the rate expression rate(1 minute). If you add a parameter to your function that lets you distinguish between a real request and one of these ping requests, you can return immediately on the ping requests (see the sketch at the end of this answer), and then you've worked around your problem. It will cost you more, but we're probably talking pennies per month, if anything; Lambda has a generous free tier.
The 30-second delay is unusual. I would definitely check your CloudWatch logs. If you see logs from when your function works normally but none from when you see the 30-second timeout, I would assume the problem is with API Gateway and not with Lambda. If you do see logs, maybe they can help you troubleshoot. Another place to check is the AWS status page. I've sometimes seen Lambda functions time out and respond intermittently, and I pull my hair out only to realize that there's a problem on Amazon's end and they're working on it.
Here's a blog post with additional information on Lambda Container Reuse that, while a little old, still has some good information.
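To illustrate the keep-warm trick from the first point: a scheduled rule invokes the function every minute with a payload like {"ping": true}, and the handler returns immediately for those invocations. The "ping" key is just a convention chosen here, not an AWS feature.

```python
def handler(event, context):
    if event.get("ping"):
        return {"warmed": True}  # keep-alive invocation; skip the real work

    # Normal request handling (Redis / RDS / DynamoDB access) goes here.
    return do_real_work(event)

def do_real_work(event):
    return {"ok": True}
```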
