Handling AWS lambda retries - python-3.x

I'm trying to invoke hundreds of Lambda functions asynchronously in a loop. When I do that, almost all of them are retried even though there seems to be no problem with the code. When I invoke them synchronously, they all run well and return status code 200.
Of the reasons mentioned here, the only one that seems likely is:
The function experiences resource constraints, such as out-of-memory errors or other timeouts.
How can I find the exact reason causing the retries, and how can I avoid them?

When you invoke a lambda and it returns 200, you also receive a unique request ID for every call. The x-amzn-RequestId response header gives you this unique request ID and will help you trace the logs.
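For example, a minimal boto3 sketch (the function name and payload are placeholders) that invokes the functions asynchronously and records each call's request ID so retries can be traced in the logs:
import json
import boto3

lambda_client = boto3.client("lambda")

request_ids = []
for i in range(100):
    response = lambda_client.invoke(
        FunctionName="my-function",        # placeholder function name
        InvocationType="Event",            # asynchronous invocation
        Payload=json.dumps({"index": i}),  # placeholder payload
    )
    # the x-amzn-RequestId header is surfaced in ResponseMetadata
    headers = response["ResponseMetadata"]["HTTPHeaders"]
    request_ids.append(headers["x-amzn-requestid"])

print(request_ids)  # search for these IDs in CloudWatch Logs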
You can use apilogs to query the CloudWatch logs:
apilogs get --api-id xyz123 --stage prod --start='1h ago' | grep "6605b081-6f04-11e6-97ac-c34deb0b3dd9"
More details about request IDs and apilogs are documented here.
Hope it helps.

Related

How do I register 400 errors in Azure Function Apps as failures in Application Insights?

I want to treat 4xx HTTP responses from a function app (e.g. a 400 response after sending an HTTP request to my function app) as failures in Application Insights. The function app is called by another service I control, so a 4xx response probably means an implementation error, and I'd like to capture that so I can ultimately set up an alert on it (and get an email instead of checking Azure every time).
If possible, I'd like it to appear here:
If not, are there any alternative approaches that might fit my use case?
Unless an unhandled exception occurs, the function runtime will mark the invocation as successful, whether or not the status code actually denotes an error. Since this behavior is defined by the runtime, there are two things you can do: throw an exception in the code of the function, and/or remove exception handling so the invocation is marked as unsuccessful.
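As a minimal sketch (assuming the Python programming model; the validation and field name are hypothetical), raising instead of returning a 400 makes the runtime record the invocation as failed, so it shows up under failures in Application Insights:
import azure.functions as func

def main(req: func.HttpRequest) -> func.HttpResponse:
    try:
        body = req.get_json()
    except ValueError:
        body = None
    if not body or "required_field" not in body:  # hypothetical validation
        # an unhandled exception marks this invocation as failed
        raise ValueError("Bad request: missing required_field")
    return func.HttpResponse("OK", status_code=200)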
Since you ultimately want to create an alert, you are better off alerting on this specific HTTP status code using a "Custom log search" alert:
requests
| where toint(resultCode) >= 400

GCP Cloud Run (with Python) logging to Cloud Logging

I'm trying to use Cloud Run to receive millions of log messages through HTTPS (streaming) and send them to Cloud Logging.
But I found there is some loss of data: the number of messages in Cloud Logging is less than the number Cloud Run receives.
This is the sample code that I tried:
import gzip
import json
import logging

# unzip data
data = gzip.decompress(request.data)
# split by lines
logs = data.decode('UTF-8').split('\n')
# output the logs
log_cnt = 0
for log in logs:
    try:
        # output to jsonPayload
        print(json.dumps(json.loads(log)))
        log_cnt += 1
    except Exception as e:
        logging.error(f"message: {str(e)}")
If I compare log_cnt with the number of logs in Cloud Logging, log_cnt is higher, so some print() calls are not completing delivery of their message.
I tried using the Logging API instead of print(), but the number of logs is too high to send with the Logging API (there is a limit of 12,000 calls per minute), so the latency became very bad and the service could not handle requests stably.
I suspected that the changing number of instances might cause it, so I tested while the active instances did not change, but still 3-5% of the messages are missing.
Is there something I can do for sending all of the messages to cloud logging without any loss?
(updated)
The line of data looks like this (around 1 KB):
{"key1": "ABCDEFGHIJKLMN","key2": "ABCDEFGHIJKLMN","key3": "ABCDEFGHIJKLMN","key4": "ABCDEFGHIJKLMN","key5": "ABCDEFGHIJKLMN","key6": "ABCDEFGHIJKLMN","key7": "ABCDEFGHIJKLMN","key8": "ABCDEFGHIJKLMN","key9": "ABCDEFGHIJKLMN","key10": "ABCDEFGHIJKLMN","key11": "ABCDEFGHIJKLMN","key12": "ABCDEFGHIJKLMN","key13": "ABCDEFGHIJKLMN","key14": "ABCDEFGHIJKLMN","key15": "ABCDEFGHIJKLMN","key16": "ABCDEFGHIJKLMN","key17": "ABCDEFGHIJKLMN","key18": "ABCDEFGHIJKLMN","key19": "ABCDEFGHIJKLMN","key20": "ABCDEFGHIJKLMN","key21": "ABCDEFGHIJKLMN","key22": "ABCDEFGHIJKLMN","key23": "ABCDEFGHIJKLMN","key24": "ABCDEFGHIJKLMN","key26": "ABCDEFGHIJKLMN","key27": "ABCDEFGHIJKLMN","key28": "ABCDEFGHIJKLMN","key29": "ABCDEFGHIJKLMN","key30": "ABCDEFGHIJKLMN","key31": "ABCDEFGHIJKLMN","key32": "ABCDEFGHIJKLMN","key33": "ABCDEFGHIJKLMN","key34": "ABCDEFGHIJKLMN","key35": "ABCDEFGHIJKLMN"}
I can recommend using the entries.write API, which allows you to write a batch of entries at the same time. Have a look at my test in the API Explorer.
It could solve the JSON formatting and the multiple writes at the same time. Have a try with this API and let me know if it's better for you. If not, I will remove this answer!
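For example, a minimal sketch with the google-cloud-logging Python client (the log name and handler signature are placeholders): the client's batch buffers entries locally and sends them in a single entries.write call per commit, instead of one API call per message.
import gzip
import json

from google.cloud import logging as cloud_logging

client = cloud_logging.Client()
logger = client.logger("streamed-logs")  # placeholder log name

def handle(request):
    # unzip and split the payload, as in the question's code
    data = gzip.decompress(request.data)
    batch = logger.batch()
    for line in data.decode("UTF-8").splitlines():
        if not line:
            continue
        # each JSON line becomes one structured entry (jsonPayload)
        batch.log_struct(json.loads(line))
    batch.commit()  # a single entries.write call for the whole request
    return "OK"
Batching this way keeps you far below the per-minute call quota, since the quota counts API calls rather than entries.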

AWS MSK Trigger - Lambda (consumer) running infinitely

I am working on a notification service which is built on AWS infra and uses MSK, lambda and SES.
The lambda is written in Node.js, and its trigger is an MSK topic. Now, the weird thing about this lambda is that it's getting invoked continuously even after the messages are processed. Inside the lambda is the code that fetches recipients and sends emails via SES.
I have ensured that there is no loop present inside the code, so my guess is that for some reason the messages are not getting marked as consumed.
One reason this could happen is if the code executing throws an Error at some point. But I have no error in the logs.
Could execution time be an issue (the lambda getting timed out? I don't see anything like that in the logs, though), or could the volume of messages be responsible for this behavior?
The Lambda is set up using the Serverless Framework:
notificationsKafkaConsumer:
  handler: src/consumers/notifications.consumer
  events:
    - msk:
        arn: ${ssm:/kafka/cluster_arn~true}
        topic: "notifications"
        startingPosition: LATEST
It turned out that the lambdas getting timed out was the issue. The message got lost in the huge volume of logs, and interestingly "timed out" is not an error, so filtering the logs for "ERROR" didn't work.
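As a quick check for this (the log group name below is a placeholder), you can filter the function's log group for the "Task timed out" message that Lambda writes on timeout, since it never appears with an ERROR level:
import boto3

logs = boto3.client("logs")

response = logs.filter_log_events(
    logGroupName="/aws/lambda/notificationsKafkaConsumer",  # placeholder
    filterPattern='"Task timed out"',  # Lambda's timeout message
)
for event in response["events"]:
    print(event["timestamp"], event["message"])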

Dialogflow Webhook call failed. Error: [ResourceName error] Path '' does not match template

I am using Dialogflow ES, and once I got the webhook set up, I hadn't been having issues. But after a few months, I started getting a random error. It seems to be inconsistent, in that sometimes I get it for a specific web call and other times it works fine. This is from the Raw API response:
"webhookStatus": {
"code": 3,
"message": "Webhook call failed. Error: [ResourceName error] Path '' does not match template 'projects/{project_id=*}/locations/{location_id=*}/agent/environments/{environment_id=*}/users/{user_id=*}/sessions/{session_id=*}/contexts/{context_id=*}'.."
}
The webhook is in GCP Functions in the same project. I have a simple "ping" function in the same agent that calls the webhook. That works properly and pings the function, records some notes in the function log (so I know the function is being called), and returns a response fine, so I know the webhook is connected and working for other intents in the same agent before and after I get the error above.
Other intents in the same agent work (and this one WAS working), but I get this error now. I also tried recreating the intent and I get the same behavior.
The project is linked to a billing account and I have been getting charged for it, so I don't think it is an issue with being on a trial or otherwise. The Dialogflow agent itself is on a "trial" plan, but the linked webhook function is billed.
Where can I find what this error means or where to look to resolve it?
After looking at this with fresh eyes, I found out what was happening.
The issue was a malformed output context. I was returning the bad output context only sometimes (which explains why it sometimes worked and sometimes didn't). Specifically, I was returning the parameters directly in the output context, without the output context's 'name' or 'parameters' fields. Everything looked like it was working and I didn't get any other errors, but apparently, when Dialogflow receives a bad webhook response, it generates the unhelpful error above.
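For reference, a minimal sketch of a well-formed Dialogflow ES webhook response (the project, session, and context IDs are placeholders): each output context needs a fully qualified name, with its values nested under parameters rather than placed directly on the context object.
def build_webhook_response(project_id: str, session_id: str) -> dict:
    context_name = (
        f"projects/{project_id}/agent/sessions/{session_id}"
        "/contexts/my-context"  # placeholder context ID
    )
    return {
        "fulfillmentText": "OK",
        "outputContexts": [
            {
                "name": context_name,  # full resource path is required
                "lifespanCount": 5,
                "parameters": {"some_key": "some_value"},
            }
        ],
    }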

lambdas fail to log to CloudWatch

Situation - I have a lambda that:
- is built with Node.js v8
- has console.log() statements
- is triggered by SQS events
- works properly (the downstream system receives all messages, and AWS X-Ray can see those executions)
Problem:
this lambda does not log anything!
But if the same lambda is invoked manually (using the "Test" button), all logging statements are visible in CloudWatch.
My lambda is based on this tutorial: https://www.jeremydaly.com/serverless-consumers-with-lambda-and-sqs-triggers/
A very similar situation occurs if the lambda is called from within another lambda (recursion). Only the first lambda (the one started manually) logs anything; every subsequent lambda in the recursion chain does not log at all.
An example can be found here:
https://theburningmonk.com/2016/04/aws-lambda-use-recursive-function-to-process-sqs-messages-part-1/
Any idea on how to tackle this problem will be highly appreciated.
