Timeout when writing custom metric data to CloudWatch with AWS lambda - python-3.x

I'm running a vanilla AWS lambda function to count the number of messages in my RabbitMQ task queue:
import boto3
from botocore.vendored import requests

cloudwatch_client = boto3.client('cloudwatch')


def get_queue_count(user="user", password="password", domain="<my domain>/api/queues"):
    url = f"https://{user}:{password}@{domain}"
    res = requests.get(url)
    message_count = 0
    for queue in res.json():
        message_count += queue["messages"]
    return message_count


def lambda_handler(event, context):
    metric_data = [{'MetricName': 'RabbitMQQueueLength', "Unit": "None", 'Value': get_queue_count()}]
    print(metric_data)
    response = cloudwatch_client.put_metric_data(MetricData=metric_data, Namespace="RabbitMQ")
    print(response)
This returns the following output on a test run:
Response:
{
"errorMessage": "2020-06-30T19:50:50.175Z d3945a14-82e5-42e5-b03d-3fc07d5c5148 Task timed out after 15.02 seconds"
}
Request ID:
"d3945a14-82e5-42e5-b03d-3fc07d5c5148"
Function logs:
START RequestId: d3945a14-82e5-42e5-b03d-3fc07d5c5148 Version: $LATEST
/var/runtime/botocore/vendored/requests/api.py:72: DeprecationWarning: You are using the get() function from 'botocore.vendored.requests'. This dependency was removed from Botocore and will be removed from Lambda after 2021/01/30. https://aws.amazon.com/blogs/developer/removing-the-vendored-version-of-requests-from-botocore/. Install the requests package, 'import requests' directly, and use the requests.get() function instead.
DeprecationWarning
[{'MetricName': 'RabbitMQQueueLength', 'Value': 295}]
END RequestId: d3945a14-82e5-42e5-b03d-3fc07d5c5148
You can see that I'm able to interact with the RabbitMQ API just fine--the function hangs when trying to post the metric.
The lambda function uses the IAM role put-custom-metric, which uses the policies recommended here, as well as CloudWatchFullAccess for good measure.
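For reference, the custom part of that policy boils down to something like the sketch below (paraphrased, so treat it as an approximation rather than the exact JSON document I attached):
import json

# Roughly the permission this function needs: PutMetricData is the only
# CloudWatch action the handler actually calls.
put_metric_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "cloudwatch:PutMetricData",
        "Resource": "*",
    }],
}
print(json.dumps(put_metric_policy, indent=2))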
Resources on my internal load balancer, where my RabbitMQ server lives, are protected by a VPN, so it's necessary for me to associate this function with the proper VPC/security group. Here's how it's set up right now (I know this is working, because otherwise the communication with RabbitMQ would fail).
I read this post where multiple contributors suggest increasing the function memory and timeout settings. I've done both of these, and the timeout persists.
I can run this locally without any issue and create the metric on CloudWatch in less than 5 seconds.

@noxdafox has written a brilliant plugin that got me most of the way there, but at the end of the day I ended up going with a pure lambda-based solution. It was surprisingly tricky getting the CloudWatch plugin running with Docker, and afterwards I had trouble with the container shutting down its services and halting processing of the message queue. Additionally, I wanted to be able to normalize the queue count by the number of worker services in my ECS cluster, so I was going to need to connect to at least one AWS resource from within my VPC anyhow. I figured it was best to keep everything simple and in the same place.
import os

import boto3
from botocore.vendored import requests

USER = os.getenv("RMQ_USER")
PASSWORD = os.getenv("RMQ_PASSWORD")

cloudwatch_client = boto3.client(
    service_name='cloudwatch',
    endpoint_url="https://MYCLOUDWATCHURL.monitoring.us-east-1.vpce.amazonaws.com"
)
ecs_client = boto3.client(
    service_name='ecs',
    endpoint_url="https://vpce-MYECSURL.ecs.us-east-1.vpce.amazonaws.com"
)


def get_message_count(user=USER, password=PASSWORD, domain="rabbitmq.stockbets.io/api/queues"):
    url = f"https://{user}:{password}@{domain}"
    res = requests.get(url)
    message_count = 0
    for queue in res.json():
        message_count += queue["messages"]
    print(f"message count: {message_count}")
    return message_count


def get_worker_count():
    worker_data = ecs_client.describe_services(cluster="prod", services=["worker"])
    worker_count = worker_data["services"][0]["runningCount"]
    print(f"worker count: {worker_count}")
    return worker_count


def lambda_handler(event, context):
    message_count = get_message_count()
    worker_count = get_worker_count()
    print(f"msgs per worker: {message_count / worker_count}")
    metric_data = [
        {'MetricName': 'MessagesPerWorker', "Unit": "Count", 'Value': message_count / worker_count},
        {'MetricName': 'NTasks', "Unit": "Count", 'Value': worker_count}
    ]
    cloudwatch_client.put_metric_data(MetricData=metric_data, Namespace="RabbitMQ")
Creating the VPC endpoints was easier than I thought it would be. For CloudWatch, you want to search for the "monitoring" VPC endpoint during the creation step (not "cloudwatch" or "logs"). Searching for "ecs" gets you what you need for the ECS connection.
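If you'd rather script the endpoint creation than click through the console, a minimal boto3 sketch looks something like the following; the VPC, subnet, and security group IDs are placeholders, and I'm assuming us-east-1:
import boto3

ec2 = boto3.client("ec2")

# One interface endpoint for CloudWatch metrics ("monitoring") and one for ECS.
# Substitute your own VPC, subnet, and security group IDs.
for service_name in ("com.amazonaws.us-east-1.monitoring", "com.amazonaws.us-east-1.ecs"):
    ec2.create_vpc_endpoint(
        VpcEndpointType="Interface",
        VpcId="vpc-0123456789abcdef0",
        ServiceName=service_name,
        SubnetIds=["subnet-0123456789abcdef0"],
        SecurityGroupIds=["sg-0123456789abcdef0"],
        PrivateDnsEnabled=True,
    )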
Once your lambda is up, you need to configure the metric and the accompanying alarms, and then relate those to an auto-scaling policy, but that's probably beyond the scope of this post. Leave a comment if you have questions about how I worked that out.
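For what it's worth, the alarm half of that is a single put_metric_alarm call on the metric above. This is only a sketch with placeholder names and thresholds, not the values I actually run with:
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when the backlog per worker stays high for two consecutive minutes.
cloudwatch.put_metric_alarm(
    AlarmName="rabbitmq-messages-per-worker-high",  # placeholder name
    Namespace="RabbitMQ",
    MetricName="MessagesPerWorker",
    Statistic="Average",
    Period=60,
    EvaluationPeriods=2,
    Threshold=10,  # placeholder threshold
    ComparisonOperator="GreaterThanThreshold",
)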

The only reason you might want to use a Lambda function to achieve your goal is if you do not own the RabbitMQ cluster. The fact that your logic is hanging during communication suggests a network issue, most likely due to misconfigured security groups.
If you can change the cluster configuration, I'd suggest you install and configure the CloudWatch metrics exporter plugin, which does most of the heavy-lifting work for you.
If your cluster runs on Docker, I believe the custom Dockerfile to be the best solution. If you run your Docker instances in AWS via ECS/Fargate, the plugin should be able to automatically infer the credentials from the Task Role through ExAws. Otherwise, just follow the README instructions on how to set the credentials yourself.

Related

Disable logging on AWS Lambda Python 3.9

I have some code that uses the logging package to log to a file in a Lambda function. The problem is that it also sends everything to CloudWatch, and since there are a lot of logs, that is very expensive.
When I was using Python 3.7, this was working and logging only to the file:
import os
import sys
import logging

LOG_LEVEL = os.getenv('LOG_LEVEL', 'INFO')

root_logger = logging.getLogger()
root_logger.disabled = True

# create custom logger
logger = logging.getLogger('my_logger')
logger.removeHandler(sys.stdout)
logger.setLevel(logging.getLevelName(LOG_LEVEL))
for handler in logger.handlers:
    logger.removeHandler(handler)

handler = logging.FileHandler('file.log', encoding='utf-8')
handler.setLevel(logging.getLevelName(LOG_LEVEL))
logger.addHandler(handler)

sys.stdout = open(os.devnull, 'w')

run_code_that_logs_stuff()
But after upgrading to Python 3.9, the logs started to show in CloudWatch again.
I have tried changing the last 2 lines with:
with open(os.devnull, 'w') as f, contextlib.redirect_stdout(f):
    run_code_that_logs_stuff()
but same result:
START RequestId: 6ede89b6-26b0-4fac-872a-48ecf64c41d1 Version: $LATEST
2022-05-29T05:46:50.949+02:00 [INFO] 2022-05-29T03:46:50.949Z 6ede89b6-26b0-4fac-872a-48ecf64c41d1 Printing stuff I don't want to appear
The easiest way to prevent Lambda from writing to CloudWatch Logs is to remove the IAM policy that allows it to do so.
If you look at your Lambda's configuration, you'll see an execution role. If you look at that role's configuration, you'll see that it references the managed policy AWSLambdaBasicExecutionRole (or, if it's running inside a VPC, AWSLambdaVPCAccessExecutionRole). If you look at those managed policies, you'll see that they grant permissions logs:CreateLogGroup, logs:CreateLogStream, and logs:PutLogEvents.
Replace those managed policies with your own managed policies that grant everything except those three permissions.
Or, alternatively, create a managed policy that explicitly denies those three permissions (and nothing else), and attach it to the Lambda's execution role.
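If you go the explicit-deny route, a rough boto3 sketch would look like the following; the policy and role names here are made up, so substitute your own:
import json

import boto3

iam = boto3.client("iam")

# An explicit Deny overrides the Allow granted by AWSLambdaBasicExecutionRole.
deny_logs_document = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Action": [
            "logs:CreateLogGroup",
            "logs:CreateLogStream",
            "logs:PutLogEvents",
        ],
        "Resource": "*",
    }],
}

policy = iam.create_policy(
    PolicyName="DenyLambdaCloudWatchLogs",  # hypothetical policy name
    PolicyDocument=json.dumps(deny_logs_document),
)
iam.attach_role_policy(
    RoleName="my-lambda-execution-role",  # hypothetical role name
    PolicyArn=policy["Policy"]["Arn"],
)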
Update:
I noted in the comments that you can remove all handlers from the logging module at the start of your Lambda code. However, I think that's an incredibly bad idea, as it means that you will have no visibility into your function if things go wrong.
A much better approach would be to leverage the capabilities of the Python logging framework, and log your output at different levels. At the top of your function, use this code to set the level of the logger:
LOGGER = logging.getLogger(__name__)
LOGGER.setLevel(os.environ.get('LOG_LEVEL', 'WARNING'))
Then, in your code, use the appropriate level when logging:
LOGGER.debug("this message won't normally appear")
LOGGER.warning("this message will", exc_info=True)
This should trim your log messages considerably, while still giving you the ability to see what's happening in your function when things go wrong.
Docs: https://docs.python.org/3/library/logging.html
Update 2:
If you're determined to disable logging entirely (imo a bad idea), then I recommend doing it within the logging framework, rather than hacking streams:
import logging

logging.root.handlers = [logging.NullHandler()]


def lambda_handler(event, context):
    logging.warning("this message won't appear")

How to increase the AWS lambda to lambda connection timeout or keep the connection alive?

I am using boto3 lambda client to invoke a lambda_S from a lambda_M. My code looks something like
import json

import boto3
import botocore.config

cfg = botocore.config.Config(
    retries={'max_attempts': 0},
    read_timeout=840,
    connect_timeout=600,
)  # also tried including region_name="us-east-1"
lambda_client = boto3.client('lambda', config=cfg)  # even tried without config

invoke_response = lambda_client.invoke(
    FunctionName=lambda_name,
    InvocationType='RequestResponse',
    Payload=json.dumps(request)
)
Lambda_S is supposed to run for about 6 minutes, and I want lambda_M to still be alive to get the response back from lambda_S, but lambda_M is timing out after giving a CloudWatch message like
"Failed to connect to proxy URL: http://aws-proxy..."
I searched and found suggestions like "configure your HTTP client, SDK, firewall, proxy or operating system to allow for long connections with timeout or keep-alive settings". But the issue is I have no idea how to do any of these with Lambda. Any help is highly appreciated.
I would approach this a bit differently. Lambdas charge you by the second, so in general you should avoid waiting in them. One way you can do that is to create an SNS topic and use it as the messenger to trigger another lambda.
Workflow goes like this.
SNS-A -> triggers Lambda-A
SNS-B -> triggers lambda-B
So if your lambda-B wants to send something to lambda-A to process and needs the results back, then from lambda-B you send a message to the SNS-A topic and quit.
SNS-A triggers lambda-A, which does its work and at the end sends a message to SNS-B.
SNS-B triggers lambda-B.
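In boto3 terms, the two ends of that workflow look roughly like the sketch below; the topic ARNs and the do_work helper are placeholders, not real resources:
import json

import boto3

sns = boto3.client("sns")

# Hypothetical topic ARNs for the SNS-A and SNS-B topics described above.
SNS_A_ARN = "arn:aws:sns:us-east-1:123456789012:topic-a"
SNS_B_ARN = "arn:aws:sns:us-east-1:123456789012:topic-b"


def lambda_b_handler(event, context):
    # lambda-B hands the work to lambda-A via SNS-A and exits immediately,
    # so it is never billed while waiting for a response.
    sns.publish(TopicArn=SNS_A_ARN, Message=json.dumps({"job": "long-running-task"}))


def lambda_a_handler(event, context):
    # lambda-A is triggered by SNS-A, does the long-running work, and then
    # reports the result back through SNS-B (which triggers lambda-B again).
    payload = json.loads(event["Records"][0]["Sns"]["Message"])
    result = do_work(payload)  # placeholder for the actual processing
    sns.publish(TopicArn=SNS_B_ARN, Message=json.dumps(result))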
AWS has example documentation on what policies you should put in place, here is one.
I don't know how you are automating the deployment of native assets like SNS and lambda; assuming you will use CloudFormation:
you create your AWS::Lambda::Function
you create AWS::SNS::Topic
and in its definition, you add a 'Subscription' property and point it to your lambda.
So in our example, your SNS-A will have a subscription defined for lambda-A
lastly you grant SNS permission to trigger the lambda: AWS::Lambda::Permission
When these 3 are in place, you are all set to send messages to the SNS topic, which will now be able to trigger the lambda.
You will find SO answers to questions on how to do this in CloudFormation (example), but you can also read up on the AWS CloudFormation documentation.
If you are not worried about automating this and want to test it manually, then the aws-cli is your friend.

How to call Cluster API and start cluster from within Databricks Notebook?

Currently we are using a bunch of notebooks to process our data in Azure Databricks, using mainly Python/PySpark.
What we want to achieve is to make sure that our clusters are started (warmed up) before initiating the data processing. For that reason we are exploring ways to get access to the Clusters API from within Databricks notebooks.
So far we tried running the following:
import subprocess

cluster_id = "XXXX-XXXXXX-XXXXXXX"
subprocess.run(
    [f'databricks clusters start --cluster-id "{cluster_id}"'], shell=True
)
which however returns the output below, and nothing really happens afterwards. The cluster is not started.
CompletedProcess(args=['databricks clusters start --cluster-id "0824-153237-ovals313"'], returncode=127)
Is there any convenient and smart way to call the Clusters API from within a Databricks notebook, or maybe call a curl command, and how is this achieved?
Most probably the error is coming from incorrectly configured credentials.
Instead of using the command-line application, it's better to use the Start command of the Clusters REST API. This can be done with something like this:
import requests

ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
host_name = ctx.tags().get("browserHostName").get()
host_token = ctx.apiToken().get()

cluster_id = "some_id"  # put your cluster ID here

requests.post(
    f'https://{host_name}/api/2.0/clusters/start',
    json={'cluster_id': cluster_id},
    headers={'Authorization': f'Bearer {host_token}'}
)
and then you can monitor the status using the Get endpoint until it gets into the RUNNING state:
response = requests.get(
    f'https://{host_name}/api/2.0/clusters/get?cluster_id={cluster_id}',
    headers={'Authorization': f'Bearer {host_token}'}
).json()
status = response['state']
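For example, a simple polling loop on top of the same host_name, host_token and cluster_id variables (a sketch; tune the sleep interval and failure handling to your needs):
import time

while True:
    response = requests.get(
        f'https://{host_name}/api/2.0/clusters/get?cluster_id={cluster_id}',
        headers={'Authorization': f'Bearer {host_token}'}
    ).json()
    if response['state'] == 'RUNNING':
        break
    if response['state'] in ('TERMINATED', 'ERROR'):
        raise RuntimeError(f"cluster failed to start: {response['state']}")
    time.sleep(30)  # wait before polling again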

How to use Amazon SQS as Celery broker, without creating / listing queues?

For a project, my organisation is planning on shifting the Celery broker from Redis to SQS. Can somebody please guide me on how to tweak the Celery settings so that I can use a predefined SQS queue, without Celery trying to create / list queues (since I do not have those permissions)?
I have tried the settings below:
CELERY_BROKER_URL = 'sqs://'
CELERY_BROKER_TRANSPORT_OPTIONS = {
    'predefined_queues': {
        'MyQueue': {
            'url': '<SQS Queue URL>',
        }
    }
}
CELERY_TASK_DEFAULT_QUEUE = 'MyQueue'
CELERY_ROUTES = {
    'tasks.*': {
        'queue': 'MyQueue'
    }
}
After applying these settings I still get the following error whenever I try to send a message to the SQS queue through Celery: An error occurred (AccessDenied) when calling the CreateQueue operation: Access to the resource https://queue.amazonaws.com/ is denied.
Why is Celery still attempting to create a queue even though I have passed the predefined_queues setting (https://docs.celeryproject.org/en/stable/getting-started/brokers/sqs.html#predefined-queues)?
Thanks in advance!
Celery worker(s) need to be associated with an IAM role that has the CreateQueue action allowed. If your Celery workers run on EC2 instances, then the simplest thing to do is to use an instance profile and let the instance role be able to execute CreateQueue actions.
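As a rough illustration only (the exact action list depends on your Celery/kombu version), the worker role would need a policy along these lines; the queue ARN is a placeholder:
import json

worker_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "sqs:CreateQueue",
            "sqs:GetQueueUrl",
            "sqs:GetQueueAttributes",
            "sqs:SendMessage",
            "sqs:ReceiveMessage",
            "sqs:DeleteMessage",
        ],
        "Resource": "arn:aws:sqs:us-east-1:123456789012:MyQueue",  # placeholder ARN
    }],
}
print(json.dumps(worker_policy, indent=2))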
Even though my company is a heavy user of AWS (literally everything is on AWS) I suggest you think twice when you decide to use an AWS service you can't have on your workstation, and SQS is one such service.

list_datasets() method does nothing in AWS Lambda

I am trying to get the list of datasets from BigQuery inside an AWS lambda. But when executing the client.list_datasets() method, it does nothing and the lambda times out.
My code is as follows:
from google.cloud.bigquery import Client
from google.oauth2.service_account import Credentials

credentials = Credentials.from_service_account_info(service_account_dict)
client = Client(
    project=service_account_dict.get("project_id"),
    credentials=credentials
)

datasets = client.list_datasets()
print(datasets)
for dataset in datasets:
    print("dataset info", dataset.__dict__)
The output of first print statement is:
<google.api_core.page_iterator.HTTPIterator object at 0x7fbae4975550>
But, the second print for dataset.__dict__ is not being printed. Or, looping over the HTTPIterator object is not performed.
BTW, the code works perfectly fine on my local machine.
The AWS VPC that I used in the lambda function was causing this issue. The VPC blocked requests to the external API (in my case, the BigQuery API).
Configuring the VPC subnet and a NAT Gateway to give the lambda function a route to the internet (0.0.0.0/0) solved the issue.
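If you prefer to script that part, a rough boto3 sketch of the NAT Gateway plus route setup looks like the following; every ID below is a placeholder:
import boto3

ec2 = boto3.client("ec2")

# Create the NAT Gateway in a public subnet, backed by an allocated Elastic IP.
nat = ec2.create_nat_gateway(
    SubnetId="subnet-0123456789abcdef0",        # public subnet (placeholder ID)
    AllocationId="eipalloc-0123456789abcdef0",  # Elastic IP allocation (placeholder ID)
)
nat_id = nat["NatGateway"]["NatGatewayId"]
ec2.get_waiter("nat_gateway_available").wait(NatGatewayIds=[nat_id])

# Route all outbound traffic from the lambda's private subnets through the NAT Gateway.
ec2.create_route(
    RouteTableId="rtb-0123456789abcdef0",  # route table of the private subnets (placeholder ID)
    DestinationCidrBlock="0.0.0.0/0",
    NatGatewayId=nat_id,
)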
