How to use Amazon SQS as Celery broker, without creating / listing queues? - python-3.x

For a project, my organisation is planning on shifting the Celery broker from Redis to SQS. Can somebody please guide me on how to tweak the Celery settings so that I can use a predefined SQS queue, without Celery trying to create / list queues (since I do not have those permissions)?
I have tried the settings below:
CELERY_BROKER_URL = 'sqs://'
CELERY_BROKER_TRANSPORT_OPTIONS = {
    'predefined_queues': {
        'MyQueue': {
            'url': '<SQS Queue URL>',
        }
    }
}
CELERY_TASK_DEFAULT_QUEUE = 'MyQueue'
CELERY_ROUTES = {
    'tasks.*': {
        'queue': 'MyQueue'
    }
}
After applying these settings I still get the following error whenever I try to send a message to the SQS queue through Celery: An error occurred (AccessDenied) when calling the CreateQueue operation: Access to the resource https://queue.amazonaws.com/ is denied.
Why is Celery still attempting to create a queue even though I have passed the predefined_queues setting (https://docs.celeryproject.org/en/stable/getting-started/brokers/sqs.html#predefined-queues)?
Thanks in advance!

Celery workers need to be associated with an IAM role that allows the CreateQueue action. If your Celery workers run on EC2 instances, the simplest thing to do is to use an instance profile and let the instance role execute CreateQueue actions.
Even though my company is a heavy user of AWS (literally everything is on AWS), I suggest you think twice before deciding to use an AWS service you can't run on your workstation, and SQS is one such service.
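For reference, if you stick with predefined queues, a fuller sketch of the configuration from the question, based on the predefined-queues docs linked above, would look like this (the region and the per-queue credentials are placeholders, and the credentials can be left out when the worker's instance role provides them):
CELERY_BROKER_URL = 'sqs://'
CELERY_BROKER_TRANSPORT_OPTIONS = {
    'region': 'us-east-1',  # placeholder -- use the queue's region
    'predefined_queues': {
        'MyQueue': {
            'url': '<SQS Queue URL>',
            'access_key_id': '<key id>',          # optional when an instance role is used
            'secret_access_key': '<secret key>',  # optional when an instance role is used
        },
    },
}
CELERY_TASK_DEFAULT_QUEUE = 'MyQueue'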

Related

Google Pub/Sub with distributed subscribers in Node.js

We are attempting to migrate a message processing app from Kafka to Google Pub/Sub and it's just not working as expected.
We are running in Kubernetes (Google Cloud) where there may be multiple pods processing messages on the same subscription. Topics and subscriptions are all created using terraform and are more or less permanent. They are not created/destroyed on the fly by the application.
In our development environment, where message throughput is rather low, everything works just fine. But when we scale up to production levels, everything seems to fall apart. We get big backlogs of unacked messages, and yet some pods are not receiving any messages at all. And then, all of a sudden, the backlog will just go away, but then climb again.
We are using the Node.js client library provided by Google: @google-cloud/pubsub:3.1.0
Each instance of the application subscribes to the same named subscription, and according to the documentation, messages should be distributed to each subscriber. But that is not happening. Some pods will be consuming messages rapidly, while others sit idle.
Every message is processed in a try/catch block and we are not observing any errors being thrown. So, as far as we know, every received message is getting acked.
I am suspicious that, as pods are terminated by autoscaling or updated deployments, we are not properly closing subscriptions, but there are no examples addressing a distributed environment and I have not found any document that specifically addresses how to properly manage resources. It is also worth mentioning that the app has multiple subscriptions to different topics.
When a pod shuts down, what actions should be taken on the Subscription object and the PubSub client object? Maybe that's not even the issue, but it seems like a reasonable place to start.
When we start a subscription we do something like this:
private exampleSubscribe(): Subscription {
  // one suggestion for having multiple subscriptions in the same app
  // was to use separate clients for each
  const pubSubClient = new PubSub({
    // use a regional endpoint for message ordering
    apiEndpoint: 'us-central1-pubsub.googleapis.com:443',
  });
  pubSubClient.projectId = 'my-project-id';
  const sub = pubSubClient.subscription('my-subscription-name', {
    // have tried various values for maxMessages, from 5 to the default of 1000
    flowControl: { maxMessages: 250, allowExcessMessages: false },
    ackDeadline: 30,
  });
  sub.on('message', async (message) => {
    await this.exampleMessageProcessing(message);
  });
  return sub;
}

private async exampleMessageProcessing(message: Message): Promise<void> {
  try {
    // do some cool stuff
  } catch (error) {
    // log the error
  } finally {
    message.ack();
  }
}
Upon termination of a pod, we do this:
private async exampleCloseSub(sub: Subscription) {
  try {
    sub.removeAllListeners('message');
    await sub.close();
    // note that we do nothing with the PubSub
    // client object -- should it also be closed?
  } catch (error) {
    // ignore error, we are shutting down
  }
}
When running with Kafka, we can easily keep up with the message pace with usually no more than 2 pods. So I know that we are not running into issues of it simply taking too long to process each message.
Why are messages being left unacked? Why are pods not receiving messages when there is clearly a large backlog? What is the correct way to shut down one subscriber on a shared subscription?
It turns out that the issue was an improper implementation of message ordering.
The official docs for message ordering in Pub/Sub are rather brief:
https://cloud.google.com/pubsub/docs/ordering
Not much there regarding how to implement an ordering key or the implications of message ordering on horizontal scaling.
Though they do link to some external resources, one of which is this blog post:
https://medium.com/google-cloud/google-cloud-pub-sub-ordered-delivery-1e4181f60bc8
In our case, we did not have enough distinct ordering keys to allow for proper distribution of messages across subscribers/pods.
So this was definitely an RTFM situation, or more accurately: Read The Fine Blog Post Referred To By The Manual. I would have much preferred that the important details were actually in the official documentation. Is that too much to ask for?
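To make the fix concrete: the publish side needs to spread messages across many distinct ordering keys (for example, one key per entity), otherwise traffic funnels through only a few subscriber streams. Our app is Node.js, but here is a rough sketch of the idea using the Python client purely for illustration; the project, topic, and key names are placeholders:
from google.cloud import pubsub_v1

# Ordering requires a regional endpoint and message ordering enabled on the publisher.
publisher = pubsub_v1.PublisherClient(
    client_options={"api_endpoint": "us-central1-pubsub.googleapis.com:443"},
    publisher_options=pubsub_v1.types.PublisherOptions(enable_message_ordering=True),
)
topic_path = publisher.topic_path("my-project-id", "my-topic")

def publish_event(entity_id: str, payload: bytes) -> None:
    # A per-entity ordering key preserves ordering where it matters while still
    # letting Pub/Sub distribute different keys across subscribers/pods.
    publisher.publish(topic_path, payload, ordering_key=entity_id)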

aws lambda node js not starting ec2 instance

I am writing EC2 scheduler logic to start and stop EC2 instances.
The Lambda works for stopping instances. However, the start function is not initiating the EC2 start.
The logic is to filter based on tags and the status of the EC2 instances, then start or stop them based on their current status.
Below is the code snippet to start EC2 instances. But this isn't starting the instances.
The filtering happens correctly and pushes the instances into the "stopParams" object.
The same code works if I change the logic to ec2.stopInstances and filter for instances in the running state. The role has permissions to start and stop.
Any ideas why it's not triggering the start?
if (instances.length > 0) {
    var stopParams = { InstanceIds: instances };
    ec2.startInstances(stopParams, function(err, data) {
        if (err) {
            console.log(err, err.stack);
        } else {
            console.log(data);
        }
        context.done(err, data);
    });
}
Finally got this working. There were no issues with the Node.js Lambda code. Even though it was able to stop instances, the start call never actually started them. It turned out that all the volumes are encrypted.
To start an instance through an API call, the role used by the Lambda must have permission on the KMS key used to encrypt the volumes. After adding the Lambda role ARN to the Principal section of the KMS key policy, the Lambda was able to start instances. The key permission is not necessary for stopping instances. Hope this helps.
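To show the shape of that change, here is a rough sketch of the kind of statement that gets added to the key policy, written as a Python dict purely for illustration. The role ARN is a placeholder and the action list is an assumption -- trim it to whatever your setup actually requires.
# Hypothetical key-policy statement letting the scheduler Lambda's role use the
# CMK that encrypts the instance volumes (ARN and actions are placeholders).
key_policy_statement = {
    "Sid": "AllowSchedulerLambdaToUseKey",
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::123456789012:role/ec2-scheduler-lambda-role"},
    "Action": [
        "kms:CreateGrant",
        "kms:Decrypt",
        "kms:DescribeKey",
        "kms:GenerateDataKeyWithoutPlaintext",
        "kms:ReEncrypt*",
    ],
    "Resource": "*",
}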

How to increase the AWS lambda to lambda connection timeout or keep the connection alive?

I am using the boto3 Lambda client to invoke lambda_S from lambda_M. My code looks something like this:
cfg = botocore.config.Config(
    retries={'max_attempts': 0},
    read_timeout=840,
    connect_timeout=600,
)  # also tried including region_name="us-east-1"
lambda_client = boto3.client('lambda', config=cfg)  # even tried without config
invoke_response = lambda_client.invoke(
    FunctionName=lambda_name,
    InvocationType='RequestResponse',
    Payload=json.dumps(request)
)
Lambda_S is supposed to run for about 6 minutes, and I want lambda_M to still be alive to get the response back from lambda_S, but lambda_M is timing out after logging a CloudWatch message like
"Failed to connect to proxy URL: http://aws-proxy..."
I searched and found something like "configure your HTTP client, SDK, firewall, proxy or operating system to allow for long connections with timeout or keep-alive settings". But the issue is I have no idea how to do any of these with Lambda. Any help is highly appreciated.
I would approach this a bit differently. Lambdas charge you by the second, so in general you should avoid waiting in them. One way to do that is to create an SNS topic and use it as the messenger to trigger another Lambda.
The workflow goes like this:
SNS-A -> triggers lambda-A
SNS-B -> triggers lambda-B
So if your lambda-B wants to send something to lambda-A to process and needs the results back, then from lambda-B you publish a message to the SNS-A topic and quit.
SNS-A triggers lambda-A, which does its work and at the end publishes a message to SNS-B.
SNS-B then triggers lambda-B (a minimal sketch of the two handlers follows below).
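Here is roughly what those two handlers could look like with boto3; the topic ARNs, payload shape, and the do_the_work helper are all placeholders, not the poster's actual code:
import json
import boto3

sns = boto3.client('sns')

def lambda_b_handler(event, context):
    # Instead of invoking lambda-A synchronously and waiting, publish the
    # request to SNS-A and return immediately.
    sns.publish(
        TopicArn='arn:aws:sns:us-east-1:123456789012:sns-a',  # placeholder ARN
        Message=json.dumps({'work': 'payload goes here'}),
    )

def lambda_a_handler(event, context):
    # Triggered by SNS-A: do the long-running work, then hand the result
    # back to lambda-B via SNS-B.
    request = json.loads(event['Records'][0]['Sns']['Message'])
    result = do_the_work(request)  # hypothetical worker function
    sns.publish(
        TopicArn='arn:aws:sns:us-east-1:123456789012:sns-b',  # placeholder ARN
        Message=json.dumps(result),
    )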
AWS has example documentation on what policies you should put in place, here is one.
I don't know how you are automating the deployment of native assets like SNS and Lambda; assuming you use CloudFormation:
you create your AWS::Lambda::Function,
you create your AWS::SNS::Topic,
and in its definition you add a 'Subscription' property and point it to your Lambda.
So in our example, your SNS-A will have a subscription defined for lambda-A.
Lastly, you grant SNS permission to trigger the Lambda: AWS::Lambda::Permission.
When these three are in place, you are all set to publish messages to the SNS topic, which will now be able to trigger the Lambda.
You will find SO answers to questions on how to do this in CloudFormation (example), but you can also read up on the AWS CloudFormation documentation.
If you are not worried about automating this and want to test it manually, then the aws-cli is your friend.

Timeout when writing custom metric data to CloudWatch with AWS lambda

I'm running a vanilla AWS lambda function to count the number of messages in my RabbitMQ task queue:
import boto3
from botocore.vendored import requests

cloudwatch_client = boto3.client('cloudwatch')

def get_queue_count(user="user", password="password", domain="<my domain>/api/queues"):
    url = f"https://{user}:{password}@{domain}"
    res = requests.get(url)
    message_count = 0
    for queue in res.json():
        message_count += queue["messages"]
    return message_count

def lambda_handler(event, context):
    metric_data = [{'MetricName': 'RabbitMQQueueLength', "Unit": "None", 'Value': get_queue_count()}]
    print(metric_data)
    response = cloudwatch_client.put_metric_data(MetricData=metric_data, Namespace="RabbitMQ")
    print(response)
Which returns the following output on a test run:
Response:
{
"errorMessage": "2020-06-30T19:50:50.175Z d3945a14-82e5-42e5-b03d-3fc07d5c5148 Task timed out after 15.02 seconds"
}
Request ID:
"d3945a14-82e5-42e5-b03d-3fc07d5c5148"
Function logs:
START RequestId: d3945a14-82e5-42e5-b03d-3fc07d5c5148 Version: $LATEST
/var/runtime/botocore/vendored/requests/api.py:72: DeprecationWarning: You are using the get() function from 'botocore.vendored.requests'. This dependency was removed from Botocore and will be removed from Lambda after 2021/01/30. https://aws.amazon.com/blogs/developer/removing-the-vendored-version-of-requests-from-botocore/. Install the requests package, 'import requests' directly, and use the requests.get() function instead.
DeprecationWarning
[{'MetricName': 'RabbitMQQueueLength', 'Value': 295}]
END RequestId: d3945a14-82e5-42e5-b03d-3fc07d5c5148
You can see that I'm able to interact with the RabbitMQ API just fine--the function hangs when trying to post the metric.
The lambda function uses the IAM role put-custom-metric, which uses the policies recommended here, as well as CloudWatchFullAccess for good measure.
Resources on my internal load balancer, where my RabbitMQ server lives, are protected by a VPN, so it's necessary for me to associate this function with the proper VPC/security group. Here's how it's set up right now (I know this is working, because otherwise the communication with RabbitMQ would fail):
I read this post where multiple contributors suggest increasing the function memory and timeout settings. I've done both of these, and the timeout persists.
I can run this locally without any issue and create the metric on CloudWatch in less than 5 seconds.
@noxdafox has written a brilliant plugin that got me most of the way there, but at the end of the day I ended up going with a pure Lambda-based solution. It was surprisingly tricky getting the CloudWatch plugin running with Docker, and after that I had trouble with the container shutting down its services and stopping processing of the message queue. Additionally, I wanted to be able to normalize the queue count by the number of worker services in my ECS cluster, so I was going to need to connect to at least one AWS resource from within my VPC anyhow. I figured it was best to keep everything simple and in the same place.
import os

import boto3
from botocore.vendored import requests

USER = os.getenv("RMQ_USER")
PASSWORD = os.getenv("RMQ_PASSWORD")

cloudwatch_client = boto3.client(
    service_name='cloudwatch',
    endpoint_url="https://MYCLOUDWATCHURL.monitoring.us-east-1.vpce.amazonaws.com"
)
ecs_client = boto3.client(
    service_name='ecs',
    endpoint_url="https://vpce-MYECSURL.ecs.us-east-1.vpce.amazonaws.com"
)

def get_message_count(user=USER, password=PASSWORD, domain="rabbitmq.stockbets.io/api/queues"):
    url = f"https://{user}:{password}@{domain}"
    res = requests.get(url)
    message_count = 0
    for queue in res.json():
        message_count += queue["messages"]
    print(f"message count: {message_count}")
    return message_count

def get_worker_count():
    worker_data = ecs_client.describe_services(cluster="prod", services=["worker"])
    worker_count = worker_data["services"][0]["runningCount"]
    print(f"worker count: {worker_count}")
    return worker_count

def lambda_handler(event, context):
    message_count = get_message_count()
    worker_count = get_worker_count()
    print(f"msgs per worker: {message_count / worker_count}")
    metric_data = [
        {'MetricName': 'MessagesPerWorker', "Unit": "Count", 'Value': message_count / worker_count},
        {'MetricName': 'NTasks', "Unit": "Count", 'Value': worker_count}
    ]
    cloudwatch_client.put_metric_data(MetricData=metric_data, Namespace="RabbitMQ")
Creating the VPC endpoints was easier than I thought it would be. For CloudWatch, you want to search for the "monitoring" VPC endpoint during the creation step (not "cloudwatch" or "logs"). Searching for "ecs" gets you what you need for the ECS connection.
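If you'd rather script it than click through the console, the same endpoints can be sketched with boto3; the VPC, subnet, and security-group IDs below are placeholders:
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Interface endpoints for CloudWatch metrics ("monitoring") and ECS, so a Lambda
# inside the VPC can reach both services without internet access.
for service in ("monitoring", "ecs"):
    ec2.create_vpc_endpoint(
        VpcEndpointType="Interface",
        VpcId="vpc-0123456789abcdef0",
        ServiceName=f"com.amazonaws.us-east-1.{service}",
        SubnetIds=["subnet-0123456789abcdef0"],
        SecurityGroupIds=["sg-0123456789abcdef0"],
        PrivateDnsEnabled=True,
    )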
Once your Lambda is up, you need to configure the metric and accompanying alarms, and then relate those to an auto-scaling policy, but that's probably beyond the scope of this post. Leave a comment if you have questions on how I worked that out.
The only reason you might want to use a Lambda function to achieve your goal is if you do not own the RabbitMQ cluster. The fact that your logic hangs during communication suggests a network issue, most likely due to misconfigured security groups.
If you can change the cluster configuration, I'd suggest you install and configure the CloudWatch metrics exporter plugin, which does most of the heavy lifting for you.
If your cluster runs on Docker, I believe the custom Dockerfile to be the best solution. If you run your Docker instances in AWS via ECS/Fargate, the plugin should be able to automatically infer the credentials from the Task Role through ExAws. Otherwise, just follow the README instructions on how to set the credentials yourself.

Setup webjob ServiceBusTriggers or queue names at runtime (without hard-coded attributes)?

Is there any way to configure triggers without attributes? I cannot know the queue names ahead of time.
Let me explain my scenario here. I have one Service Bus queue, and for various reasons (complicated duplicate-suppression business logic), the queue messages have to be processed one at a time, so I have ServiceBusConfiguration.OnMessageOptions.MaxConcurrentCalls set to 1. So processing a message holds up the whole queue until it is finished. Needless to say, this is suboptimal.
This 'one at a time' policy isn't so simple. The messages could be processed in parallel, they just have to be divided into groups (based on a field in message), say A and B. Group A can process its messages one at a time, and group B can process its own one at a time, etc. A and B are processed in parallel, all is good.
So I can create a queue for each group, A, B, C, ... etc. There are about 50 groups, so 50 queues.
I can create a queue for each, but how do I make this work with the Azure WebJobs SDK? I don't want to copy-paste a method for each queue with a different ServiceBusTrigger for the SDK to discover, just to enforce one-at-a-time per queue/group, then update the code with another copy-paste whenever another group is needed. Fetching a list of queues at startup and tying them to the function is preferable.
I have looked around and I don't see any way to do what I want. The ITypeLocator interface is pretty hard-set to look for attributes. I could probably abuse the INameResolver, but it seems like I'd still have to have a bunch of near-duplicate methods around. Could I somehow create what the SDK is looking for at startup/runtime?
(To be clear, I know how to use INameResolver to get the queue name, as in How to set Azure WebJob queue name at runtime?, but though similar, this isn't my problem. I want to set up triggers for multiple queues at startup for the same function to get the one-at-a-time-per-queue processing, without using the trigger attribute 50 times repeatedly. I figured I'd ask again since the SDK repo is fairly active and it's been a year.)
Or am I going about this all wrong? Being dumb? Missing something? Any advice on this dilemma would be welcome.
The Azure WebJobs host discovers and indexes the functions with the ServiceBusTrigger attribute when it starts, so there is no way to set up queue triggers at runtime.
The simpler solution for you is to create a long-running job and implement the message handling manually:
public class Program
{
    private static void Main()
    {
        var host = new JobHost();
        host.CallAsync(typeof(Program).GetMethod("Process"));
        host.RunAndBlock();
    }

    [NoAutomaticTrigger]
    public static async Task Process(TextWriter log, CancellationToken token)
    {
        var connectionString = "myconnectionstring";
        // You could also get the queue names from app settings or an Azure table
        var queueNames = new[] { "queueA", "queueB" };
        var messagingFactory = MessagingFactory.CreateFromConnectionString(connectionString);
        foreach (var queueName in queueNames)
        {
            var receiver = messagingFactory.CreateMessageReceiver(queueName);
            receiver.OnMessage(message =>
            {
                try
                {
                    // do something
                    // ...
                    // Complete the message
                    message.Complete();
                }
                catch (Exception ex)
                {
                    // Log the error
                    log.WriteLine(ex.ToString());
                    // Abandon the message so that it can be retried
                    message.Abandon();
                }
            }, new OnMessageOptions() { MaxConcurrentCalls = 1 });
        }

        // await until the job stops or restarts
        await Task.Delay(Timeout.InfiniteTimeSpan, token);
    }
}
Otherwise, if you don't want to deal with multiple queues, you can have a look at Azure Service Bus topics/subscriptions and create a SqlFilter to send your messages to the right subscription.
Another option could be to create your own trigger: the Azure WebJobs SDK provides extensibility points for creating your own trigger binding:
Binding Extensions Overview
Good luck!
Based on my understanding, your need seems to be to process batches of messages in parallel. @Thomas's solution is good, but I think the Azure Batch service with Table storage may be better, and could replace the more complex solution of a Service Bus queue + WebJobs with a trigger.
Using Azure Batch with Table storage, you can control task creation, execute tasks in parallel and at scale, and even monitor those tasks; please refer to the tutorial to learn how.
