AWS MSK Trigger - Lambda (consumer) running infinitely - node.js

I am working on a notification service which is built on AWS infra and uses MSK, Lambda and SES.
The Lambda is written in Node.js and its trigger is an MSK topic. The weird thing about this Lambda is that it keeps getting invoked continuously even after the messages are processed. Inside the Lambda is the code to fetch recipients and send emails via SES.
I have ensured that there is no loop present inside the code, so my guess is that for some reason the messages are not getting marked as consumed.
One reason this could happen is if the executing code throws an Error at some point, but I have no errors in the logs.
Could execution time (the Lambda timing out, although I don't see anything like that in the logs) or the volume of messages be responsible for this behavior?
The Lambda is set up using the Serverless Framework:
notificationsKafkaConsumer:
  handler: src/consumers/notifications.consumer
  events:
    - msk:
        arn: ${ssm:/kafka/cluster_arn~true}
        topic: "notifications"
        startingPosition: LATEST

It turned out that the Lambda timing out was the issue. Because the batch never completes successfully, the offsets are not committed and the event source mapping keeps retrying it, which is why the function appeared to run forever. The timeout message got lost in the huge volume of logs, and interestingly "timed out" is not reported as an error, so filtering the logs for "ERROR" didn't surface it.
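If the handler legitimately needs more time to work through a batch, the function timeout can be raised in the same Serverless configuration. A minimal sketch (the 120-second value is an assumption; the default is only 6 seconds):

notificationsKafkaConsumer:
  handler: src/consumers/notifications.consumer
  timeout: 120   # seconds; assumed value, size it to the slowest batch
  events:
    - msk:
        arn: ${ssm:/kafka/cluster_arn~true}
        topic: "notifications"
        startingPosition: LATEST

Searching the CloudWatch logs for the string "Task timed out" (the text Lambda writes when an invocation hits its timeout) rather than "ERROR" also makes these cases much easier to spot.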

Related

AWS SQS FIFO message doesn't seem to be retrying

I've set up a function to hit an API endpoint for a newly created entity that isn't immediately available. If the endpoint returns a status of "pending", the function throws an error. If the endpoint returns a status of "active", the function then deletes the SQS message and triggers several other microservices to do their things using SNS. The SQS queue that triggers the function has a visibility timeout of 2 minutes, and the function itself has a 1 minute timeout.
What I'm expecting to happen is that if the endpoint returns a "pending" status, and the function throws an error, then after the 2 minute visibility timeout, the message would trigger the function again. This should happen every 2 minutes until the api call returns an "active" status and the message is deleted, or until the message retention period is surpassed (currently 1 hour). This seemed like a nice serverless way to poll my newly created entity to check if it was ready for other post-processing.
What's actually happening after adding a message to the SQS queue is that the CloudWatch logs show the function throwing an error like I'd expect, but the function is only being triggered one time. I can't tell if the message is just not visible for some reason, or if it somehow was deleted. I don't know. I'm new to using SQS as a Lambda trigger; am I thinking about this wrong?
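For reference, the polling pattern described above amounts to roughly the following Node.js sketch (the status-check helper, status values and topic ARN are assumptions taken from the description, not code from the post):

const AWS = require('aws-sdk');
const sns = new AWS.SNS();

exports.handler = async (event) => {
  for (const record of event.Records) {
    const { entityId } = JSON.parse(record.body);
    const status = await getEntityStatus(entityId); // hypothetical API call

    if (status === 'pending') {
      // Failing the invocation keeps the message in the queue; it becomes
      // visible again after the 2-minute visibility timeout and is retried.
      throw new Error(`Entity ${entityId} still pending`);
    }

    // 'active': fan out to the other microservices via SNS, then return
    // normally so the message is removed from the queue.
    await sns.publish({
      TopicArn: process.env.POST_PROCESSING_TOPIC_ARN, // assumed env var
      Message: JSON.stringify({ entityId }),
    }).promise();
  }
};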
A few possible causes here:
- your Lambda function handler did not actually throw an exception to the Lambda runtime environment, so Lambda thought the function had successfully processed the message and the Lambda service then deleted the message from the queue (so that it would not get processed again)
- your SQS queue has a configured DLQ with maximum receives set to 1, so the message is delivered once, the Lambda function fails, and the message is subsequently moved to the DLQ
- the SQS message was re-delivered to the Lambda function and was logged, but the logs went to an earlier log stream (because this invocation was warm), so it wasn't obvious that the Lambda function had actually been invoked multiple times with the same failed message
To verify this all works normally, I set up a simple test with both FIFO and non-FIFO queues and configured the queues to trigger a Lambda function that simply logged the SQS message and then threw an exception. As expected, I saw the same SQS message delivered to the Lambda function every 2 minutes (which is the queue's message visibility timeout). That continued until it hit the max receive count on the SQS redrive policy (defaults to 10 attempts) at which point the failed message was correctly moved to the associated DLQ.
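A minimal sketch of such a test handler in Node.js (the handler body is illustrative, not the exact code used in the test):

// Log each SQS record, then throw so Lambda reports the batch as failed.
// The messages then become visible again after the visibility timeout and
// are redelivered until maxReceiveCount moves them to the DLQ.
exports.handler = async (event) => {
  for (const record of event.Records) {
    console.log('Received SQS message:', record.messageId, record.body);
  }
  throw new Error('Simulated processing failure');
};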

How to increase the AWS lambda to lambda connection timeout or keep the connection alive?

I am using boto3 lambda client to invoke a lambda_S from a lambda_M. My code looks something like
import json

import boto3
import botocore.config

cfg = botocore.config.Config(
    retries={'max_attempts': 0},
    read_timeout=840,
    connect_timeout=600,
    # region_name="us-east-1"  # tried also by including the region_name
)
lambda_client = boto3.client('lambda', config=cfg)  # even tried without config
invoke_response = lambda_client.invoke(
    FunctionName=lambda_name,
    InvocationType='RequestResponse',
    Payload=json.dumps(request)
)
lambda_S is supposed to run for about 6 minutes, and I want lambda_M to stay alive to get the response back from lambda_S, but lambda_M is timing out after logging a CloudWatch message like
"Failed to connect to proxy URL: http://aws-proxy..."
I searched and found something like "configure your HTTP client, SDK, firewall, proxy or operating system to allow for long connections with timeout or keep-alive settings", but the issue is I have no idea how to do any of these with Lambda. Any help is highly appreciated.
I would approach this a bit differently. Lambdas charge you by the second, so in general you should avoid waiting in them. One way you can do that is to create an SNS topic and use that as the messenger to trigger another lambda.
The workflow goes like this (a sketch of the hand-off follows below):
SNS-A -> triggers lambda-A
SNS-B -> triggers lambda-B
So if your lambda-B wants to send something to lambda-A to process and needs the results back, from lambda-B you send a message to the SNS-A topic and quit.
SNS-A triggers lambda-A, which does its work and at the end sends a message to SNS-B.
SNS-B triggers lambda-B.
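A minimal Node.js sketch of the hand-off from lambda-B (the equivalent publish call exists in boto3 if your functions are in Python; the topic ARN and payload shape are assumptions):

const AWS = require('aws-sdk');
const sns = new AWS.SNS();

exports.handler = async (event) => {
  // Publish the work request to SNS-A and return immediately instead of
  // waiting synchronously for lambda-A to finish.
  await sns.publish({
    TopicArn: process.env.SNS_A_TOPIC_ARN,       // assumed environment variable
    Message: JSON.stringify({ request: event }), // payload for lambda-A
  }).promise();
  // lambda-A later publishes its result to SNS-B, which triggers lambda-B
  // again with the response.
};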
AWS has example documentation on what policies you should put in place; here is one.
I don't know how you are automating the deployment of native assets like SNS and Lambda; assuming you will use CloudFormation:
- you create your AWS::Lambda::Function
- you create an AWS::SNS::Topic, and in its definition you add the 'Subscription' property and point it to your lambda. So in our example, your SNS-A will have a subscription defined for lambda-A.
- lastly, you grant SNS permission to trigger the lambda: AWS::Lambda::Permission
When these three are in place, you are all set to send messages to the SNS topic, which will now be able to trigger the lambda (a minimal CloudFormation sketch follows below).
You will find SO answers to questions on how to do this in CloudFormation (example), but you can also read up on the AWS CloudFormation documentation.
If you are not worried about automating this and just want to test it manually, then the aws-cli is your friend.
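A minimal CloudFormation sketch of that wiring (logical IDs, runtime and code location are placeholders, not from the original answer):

Resources:
  LambdaA:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: lambda-A
      Handler: index.handler
      Runtime: nodejs18.x                # placeholder runtime
      Role: !GetAtt LambdaARole.Arn      # assumes an execution role defined elsewhere
      Code:
        S3Bucket: my-deployment-bucket   # placeholder
        S3Key: lambda-a.zip              # placeholder

  SnsTopicA:
    Type: AWS::SNS::Topic
    Properties:
      TopicName: SNS-A
      Subscription:
        - Protocol: lambda
          Endpoint: !GetAtt LambdaA.Arn

  SnsInvokeLambdaAPermission:
    Type: AWS::Lambda::Permission
    Properties:
      Action: lambda:InvokeFunction
      FunctionName: !Ref LambdaA
      Principal: sns.amazonaws.com
      SourceArn: !Ref SnsTopicA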

Google Pub/Sub - No event data found from local function after published message to topic

I'm using the Functions Framework with Python alongside Google Cloud Pub/Sub Emulator. I'm having issues with an event triggered from a published message to a topic, where there's no event data found for the function. See more details below.
Start Pub/Sub Emulator under http://localhost:8085 and project_id is local-test.
Spin up function with signature-type: http under http://localhost:8006.
Given a background cloud function with signature-type: event:
Topic is created as test-topic
Function is spun up under http://localhost:8007.
Create push subscription test-subscription for test-topic for endpoint: http://localhost:8007
When I publish a message to test-topic from http://localhost:8006 via POST request in Postman, I get back a 200 response to confirm the message was published successfully. The function representing http://localhost:8007 gets executed as an event as shown in the logs from the functions-framework. However, there's no actual data for event when debugging the triggered function.
Has anyone encountered this? Any ideas/suggestions on this? Perhaps this is true: #23 Functions Framework does not work with the Pub/Sub emulator
Modules Installed
functions-framework==2.1.1
google-cloud-pubsub==2.2.0
python version
3.8.8
I'll close this post, since the issue is an actual bug that was reported last year.
Update: As a workaround until this bug is fixed, I copied the code below locally into functions_framework/__init__.py, inside the view_func nested function within the _event_view_func_wrapper function.
if 'message' in event_data:
    if 'data' not in event_data:
        message = event_data['message']
        event_data['data'] = {
            'data': message.get('data'),
            'attributes': message.get('attributes')
        }
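For context, a push subscription delivers the message wrapped in an envelope roughly like the one below (values are placeholders), which is why the workaround copies message.data and message.attributes into event_data['data']:

{
  "message": {
    "data": "eyJmb28iOiAiYmFyIn0=",
    "attributes": {"origin": "local-test"},
    "messageId": "1",
    "publishTime": "2021-03-01T00:00:00.000Z"
  },
  "subscription": "projects/local-test/subscriptions/test-subscription"
}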

Elasticsearch nodejs check if queue is full

I have the following error with elasticsearch
[remote_transport_exception] [es-0][x.x.x.x:9300][indices:data/write/bulk[s]]
Or
[remote_transport_exception] [es-0][x.x.x.x:9300][indices:data/write/bulk[s][p]]
It seems that the Elasticsearch queue is full.
I am using the Node.js lib https://www.npmjs.com/package/elasticsearch and this error occurred after calling client.index.
I am calling index as a promise inside a RabbitMQ consumer; no more than 8 messages arrive at the same time.
client.index().then(...)
It seems that the then is called while the update or create is still in the queue. I tried to add {wait_for_active_shards: 'all'} but I have the same issue.
It was an issue because the Elasticsearch server was too busy.
I added a retry system for the 429 error code, and now it works fine.
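A sketch of such a retry wrapper around the Node.js client (backoff values and the error-status property are assumptions; check your client version):

const elasticsearch = require('elasticsearch');
const client = new elasticsearch.Client({ host: 'localhost:9200' });

// Retry client.index() with exponential backoff when Elasticsearch answers
// 429 (its write queue is full / too many requests).
async function indexWithRetry(params, attempt = 0) {
  try {
    return await client.index(params);
  } catch (err) {
    const status = err.statusCode || err.status; // property name varies by client version
    if (status === 429 && attempt < 5) {
      const delayMs = 500 * 2 ** attempt; // 0.5s, 1s, 2s, ...
      await new Promise((resolve) => setTimeout(resolve, delayMs));
      return indexWithRetry(params, attempt + 1);
    }
    throw err;
  }
}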

lambdas fail to log to CloudWatch

Situation - I have a lambda that:
- is built with Node.js v8
- has console.log() statements
- is triggered by SQS events
- works properly (the downstream system receives all messages, AWS X-Ray can see those executions)
Problem:
this lambda does not log anything!
But if the same lambda is invoked manually (using the "Test" button), all logging statements are visible in CloudWatch.
My lambda is based on this tutorial: https://www.jeremydaly.com/serverless-consumers-with-lambda-and-sqs-triggers/
A very similar situation occurs if the lambda is called from within another lambda (recursion). Only the first lambda (started manually) logs anything; every subsequent lambda in the recursion chain does not log at all.
an example can be found here:
https://theburningmonk.com/2016/04/aws-lambda-use-recursive-function-to-process-sqs-messages-part-1/
Any idea on how to tackle this problem will be highly appreciated.
