Increase deployVerticle Timeout - groovy

Using Vert.x, I have a verticle with a very slow startup because it depends on several slow HTTP requests.
It is completely async, but I still receive the following error because of the deployVerticle timeout:
(TIMEOUT,-1) Timed out after waiting 30000(ms) for a reply. address: d5c134e0-53dc-4d4f-b854-1c40a7905914, repliedAddress: my.dummy.project
I am deploying the verticle as
def name = "groovy:my.dummy.verticle"
def opts = new DeploymentOptions().setConfig(config());
vertx.deployVerticle(name, opts, { res ->
    if (res.failed()) {
        log.error("Failed to deploy verticle " + name)
    }
    else {
        log.info("Deployed verticle " + name)
    }
})
How can I increase those 30000ms to something more suitable for me? I know that the requests will take more than a minute.

The message you're seeing is not directly related to the deployment. It is coming from the event bus, which did not receive a reply to a sent message within 30 seconds.
You can increase that timeout using DeliveryOptions: http://vertx.io/docs/apidocs/io/vertx/core/eventbus/DeliveryOptions.html
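For example, a minimal sketch in Groovy of raising the reply timeout on the sending side with setSendTimeout (the payload is a placeholder, and depending on your Vert.x version the Groovy API may expect these options as a Map instead of the options class):
import io.vertx.core.eventbus.DeliveryOptions

// 120 s instead of the default 30 s reply timeout
def options = new DeliveryOptions().setSendTimeout(120000)
vertx.eventBus().send("my.dummy.project", payload, options, { reply ->
    if (reply.succeeded()) {
        log.info("Got reply: " + reply.result().body())
    } else {
        log.error("No reply: " + reply.cause())
    }
})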

Related

Queue triggered based function app not getting completed

We are using a queue-trigger-based function app on the Premium plan, where each message contains details such as an Azure subscription name. For each subscription we then make many API calls, especially to Azure storage accounts (around 400 to 500). Since the 'list' API call to a storage account is limited to 100 calls per 5 minutes, we get a 429 error response on the 101st call. To mitigate this we applied exponential retry logic (we tried both our own and the Polly library), which retries after a certain delay. This works for some subscriptions but fails for many, where the retry logic does not retry after the first attempt (we kept 3 retries with a 60-second delay). While monitoring the function app through live metrics we also observed that sometimes the CPU usage of a function instance drops to zero (although we do some work such as logging, or use a for loop in the delay operation, to keep the function alive), which leads to that particular instance being killed, the message being pushed back to the queue, and the process starting again on a fresh instance.
Note that since many subscriptions are processed in parallel, the function app automatically scales up as required. Also, since we are using the Premium plan, one VM is always kept on. So the killing of any instance (which makes around 400 to 500 storage API calls for a particular subscription) is strange, since in our delay the thread sleep time is only 10 seconds for around 6, 12, or 18 (Time_delay) iterations. The delay function below is used in our retry logic code.
private void Delay(int Time_delay, string requestUri, int retryCount)
{
    for (int i = 0; i < Time_delay; i++)
    {
        _logger.LogWarning($"Sleep initiated for id: {requestUri}, RetryCount: {retryCount} CurrentTimeDelay: {Time_delay}");
        Thread.Sleep(10000);
        _logger.LogWarning($"Sleep completed for id: {requestUri}, RetryCount: {retryCount} CurrentTimeDelay: {Time_delay}");
    }
}
Note: the function app is not throwing any exception other than the dependency failure for the 429 error response.
Would it be possible for you to requeue instead of using Thread.Sleep? You can use initial visibility delay when requeuing:
public class Function1
{
    [FunctionName(nameof(TryDoWork))]
    public static async Task TryDoWork(
        [QueueTrigger("some-queue")] SomeItem item,
        [Queue("some-queue")] CloudQueue queue)
    {
        var result = _SomeService.SomeWork(item);
        if (result == 429)
        {
            item.Retries++;
            var json = JsonConvert.SerializeObject(item);
            var message = new CloudQueueMessage(json);
            var delay = TimeSpan.FromSeconds(item.Retries);
            await queue.AddMessageAsync(message, null, delay, null, null);
        }
    }
}
It might be that the sleeping is causing some wonky function app behavior. I think I remember reading some issues pertaining to the usage of Thread.Sleep, but I can't find it right now.
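If you do keep an in-function delay, one hedged idea (a sketch based on your Delay method; I haven't verified it changes the instance-recycling behaviour) is to await Task.Delay instead of blocking the thread:
private async Task DelayAsync(int timeDelay, string requestUri, int retryCount)
{
    for (int i = 0; i < timeDelay; i++)
    {
        _logger.LogWarning($"Delay started for id: {requestUri}, RetryCount: {retryCount}");
        await Task.Delay(TimeSpan.FromSeconds(10)); // yields the worker thread instead of blocking it
        _logger.LogWarning($"Delay completed for id: {requestUri}, RetryCount: {retryCount}");
    }
}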
Also, you might want to add some sort of handling of messages that end up retrying more than 3 times (or however many you think is reasonable).
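As a hedged sketch of that, building on the snippet above (the poison queue binding and the retry limit are illustrative assumptions, and SomeItem/_SomeService come from the earlier sketch):
[FunctionName(nameof(TryDoWorkWithLimit))]
public static async Task TryDoWorkWithLimit(
    [QueueTrigger("some-queue")] SomeItem item,
    [Queue("some-queue")] CloudQueue queue,
    [Queue("some-queue-poison")] CloudQueue poisonQueue) // assumed extra output queue
{
    const int MaxRetries = 3;
    var result = _SomeService.SomeWork(item);
    if (result != 429) return;

    item.Retries++;
    var message = new CloudQueueMessage(JsonConvert.SerializeObject(item));

    if (item.Retries > MaxRetries)
    {
        // Give up: park the message for manual inspection instead of requeuing it again.
        await poisonQueue.AddMessageAsync(message);
        return;
    }

    // Requeue with an increasing visibility delay, as in the snippet above.
    await queue.AddMessageAsync(message, null, TimeSpan.FromSeconds(item.Retries * 60), null, null);
}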

worker thread won't respond after first message?

I'm making a server script and, to make it easier for both hosts and clients to do what they want, I made a customizable server script that runs using NW.js (with a visual interface). That script was built with web workers, since NW.js was having problems with its support for worker threads.
Now that NW.js has fixed its problems with worker threads, I've been trying to move everything that was inside the web workers to worker threads, but there's a problem: when the main thread receives the answer from the second thread, the latter stops responding to any subsequent message.
For example, running the following code with either NW.js or Node.js itself will return "pong" only once
const { Worker } = require('worker_threads');
const worker = new Worker('const { parentPort } = require("worker_threads");parentPort.once("message",message => parentPort.postMessage({ pong: message })); ', { eval: true });
worker.on('message', message => console.log(message));
worker.postMessage('ping');
worker.postMessage('ping');
How do I configure the worker so it will keep responding to whatever message it receives after the first one?
Because you use the EventEmitter.once() method. According to the documentation, this method does the following:
Adds a one-time listener function for the event named eventName. The
next time eventName is triggered, this listener is removed and then
invoked.
If you need your worker to process more than one event, then use EventEmitter.on():
const worker = new Worker('const { parentPort } = require("worker_threads");' +
'parentPort.on("message",message => parentPort.postMessage({ pong: message }));',
{ eval: true });

How do I fail a specific SQS message in a batch from a Lambda?

I have a Lambda with an SQS trigger. When it gets hit, a batch of records from SQS comes in (usually about 10 at a time, I think). If I return a failed status code from the handler, all 10 messages will be retried. If I return a success code, they'll all be removed from the queue. What if 1 out of those 10 messages failed and I want to retry just that one?
exports.handler = async (event) => {
    for (const e of event.Records) {
        try {
            let body = JSON.parse(e.body);
            // do things
        }
        catch (e) {
            // one message failed, i want it to be retried
        }
    }
    // returning this causes ALL messages in
    // this batch to be removed from the queue
    return {
        statusCode: 200,
        body: 'Finished.'
    };
};
Do I have to manually re-add that one message back to the queue? Or can I return a status from my handler that indicates that one message failed and should be retried?
As per the AWS documentation, SQS event source mappings now support handling of partial failures out of the box. The gist of the linked article is as follows:
Include ReportBatchItemFailures in your EventSourceMapping configuration
The response syntax in case of failures has to be modified to have:
{
    "batchItemFailures": [
        { "itemIdentifier": "id2" },
        { "itemIdentifier": "id4" }
    ]
}
Where id2 and id4 are the failed messageIds in the batch.
Quoting the documentation as is:
Lambda treats a batch as a complete success if your function returns
any of the following:
An empty batchItemFailures list
A null batchItemFailures list
An empty EventResponse
A null EventResponse
Lambda treats a batch as a complete failure if your function returns
any of the following:
An invalid JSON response
An empty string itemIdentifier
A null itemIdentifier
An itemIdentifier with a bad key name
An itemIdentifier value with a message ID that doesn't exist
SAM support is not yet available for this feature, as per the documentation, but one of the AWS Labs examples points to its usage in SAM, and it worked for me when I tested it.
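For illustration, here is a hedged sketch of what the question's handler could look like with partial batch responses enabled (this assumes ReportBatchItemFailures is set on the event source mapping; it is my sketch, not code from the linked article):
exports.handler = async (event) => {
    const batchItemFailures = [];
    for (const record of event.Records) {
        try {
            const body = JSON.parse(record.body);
            // do things
        } catch (err) {
            // report only this message as failed, so only it is retried
            batchItemFailures.push({ itemIdentifier: record.messageId });
        }
    }
    return { batchItemFailures };
};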
Yes, you have to manually re-add the failed messages back to the queue.
What I suggest is setting up a fail count, so that if all messages failed you can simply return a failed status for all of them; otherwise, if the fail count is less than 10, you can individually send the failed messages back to the queue.
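A hedged sketch of that approach (the queue URL environment variable and AWS SDK v2 style are my assumptions):
const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

exports.handler = async (event) => {
    const failed = [];
    for (const record of event.Records) {
        try {
            const body = JSON.parse(record.body);
            // do things
        } catch (err) {
            failed.push(record);
        }
    }
    if (failed.length === event.Records.length) {
        // everything failed: fail the whole invocation so SQS redelivers the batch
        throw new Error('All messages in the batch failed');
    }
    // otherwise, re-add only the failed messages and let the batch succeed
    for (const record of failed) {
        await sqs.sendMessage({ QueueUrl: process.env.QUEUE_URL, MessageBody: record.body }).promise();
    }
    return { statusCode: 200, body: 'Finished.' };
};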
You have to programmatically delete each message from the queue after processing it successfully.
You can also keep a flag that is set to true if any message failed, and raise an error after processing the whole batch; successfully processed messages will already have been deleted, and the remaining ones will be reprocessed based on the retry policy.
With the logic below, only failed and unprocessed messages get retried.
import boto3

sqs = boto3.client("sqs")

def handler(event, context):
    # Recommended: take the queue URL from an environment variable
    queue_url = "your queue url"
    any_failed = False
    for message in event['Records']:
        try:
            message_body = message["body"]
            print("do some processing :)")
            # delete only after the message was processed successfully
            sqs.delete_message(
                QueueUrl=queue_url,
                ReceiptHandle=message["receiptHandle"]
            )
        except Exception:
            any_failed = True
    if any_failed:
        # failing the invocation makes SQS redeliver only the undeleted messages
        raise RuntimeError("One or more messages failed and will be retried")
There is also another way: save the IDs and receipt handles of successfully processed messages into a variable and perform a batch delete operation based on them:
response = client.delete_message_batch(
    QueueUrl='string',
    Entries=[
        {
            'Id': 'string',
            'ReceiptHandle': 'string'
        },
    ]
)
You need to design your app in a different way. Here are a few ideas; not the best, but they will solve your problem.
Solution 1:
Create an SQS delivery queue - sq1
Create a delay queue, configured with the delay you require - sq2
Create a dead-letter queue - sdl
Now, inside the Lambda function, if a message from sq1 fails, delete it from sq1 and drop it onto sq2 for retry (see the sketch after the links below). Any Lambda function invoked asynchronously is retried twice before the event is discarded.
If it still fails after the given retries, move it into the dead-letter queue sdl.
AWS Lambda - processing messages in Batches
https://docs.aws.amazon.com/lambda/latest/dg/retries-on-errors.html
Note: when an SQS event source mapping is initially created and enabled, or when messages first appear after a period with no traffic, the Lambda service will begin polling the SQS queue using five parallel long-polling connections. As per the AWS documentation, the default duration for a long poll from AWS Lambda to SQS is 20 seconds.
https://docs.aws.amazon.com/lambda/latest/dg/lambda-services.html#supported-event-source-sqs
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-delay-queues.html
https://nordcloud.com/amazon-sqs-as-a-lambda-event-source/
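A hedged sketch of the requeue step in Solution 1 (the queue URL environment variable and the 60-second delay are illustrative assumptions):
const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

// On failure, drop a copy of the message onto the delay queue sq2.
// DelaySeconds keeps it invisible until the retry should happen (max 900 s).
async function requeueForRetry(record) {
    await sqs.sendMessage({
        QueueUrl: process.env.SQ2_URL,
        MessageBody: record.body,
        DelaySeconds: 60
    }).promise();
}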
Solution 2:
Use AWS Step Functions
https://aws.amazon.com/step-functions/
Step Functions will call the Lambda and handle the retry logic on failure, with configurable exponential back-off if needed (see the sketch after the links below).
https://docs.aws.amazon.com/step-functions/latest/dg/concepts-error-handling.html
https://cloudacademy.com/blog/aws-step-functions-a-serverless-orchestrator/
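For reference, a hedged sketch of what such a Retry block can look like on a Task state in the state machine definition (Amazon States Language; the values are illustrative):
"Retry": [
    {
        "ErrorEquals": ["States.TaskFailed"],
        "IntervalSeconds": 5,
        "MaxAttempts": 3,
        "BackoffRate": 2.0
    }
]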
Solution 3:
Use a CloudWatch scheduled event to trigger a Lambda function that polls for FAILED items.
Error handling for a given event source depends on how Lambda is invoked. Amazon CloudWatch Events invokes your Lambda function asynchronously.
https://docs.aws.amazon.com/lambda/latest/dg/retries-on-errors.html
https://engineering.opsgenie.com/aws-lambda-performance-series-part-2-an-analysis-on-async-lambda-fail-retry-behaviour-and-dead-b84620af406
https://dzone.com/articles/asynchronous-retries-with-aws-sqs
https://medium.com/#ron_73212/how-to-handle-aws-lambda-errors-like-a-pro-e5455b013d10
AWS supports partial batch responses. Here is an example in TypeScript:
type Result = {
    itemIdentifier: string
    status: 'failed' | 'success'
}

const isFulfilled = <T>(
    result: PromiseFulfilledResult<T> | PromiseRejectedResult
): result is PromiseFulfilledResult<T> => result.status === 'fulfilled'

const isFailed = (
    result: PromiseFulfilledResult<Result>
): result is PromiseFulfilledResult<
    Omit<Result, 'status'> & { status: 'failed' }
> => result.value.status === 'failed'

const results = await Promise.allSettled(
    sqsEvent.Records.map(async (record): Promise<Result> => {
        try {
            // do the actual work for this record here
            return { status: 'success', itemIdentifier: record.messageId }
        } catch (e) {
            console.error(e);
            return { status: 'failed', itemIdentifier: record.messageId }
        }
    })
)

// report only the failed messageIds so just those are retried
return {
    batchItemFailures: results
        .filter(isFulfilled)
        .filter(isFailed)
        .map((result) => ({
            itemIdentifier: result.value.itemIdentifier,
        })),
}

How to specify HTTP timeout for DownloadURL() in Akavache?

I am developing an application targeting mobile devices, so I have to consider bad network connectivity. In one use case, I need to reduce the timeout for a request, because if no network is available, that's okay, and I'd fall back to default data immediately, without having the user wait for the HTTP response.
I found that HttpMixin.MakeWebRequest() has a timeout parameter (with default=null), but DownloadUrl() never makes use of it, so the aforementioned function always waits for up to 15 seconds:
request.Timeout(timeout ?? TimeSpan.FromSeconds(15),
    BlobCache.TaskpoolScheduler).Retry(retries);
So actually I do not have the option to use a different timeout, or am I missing something?
Thanks for considering a helpful response.
So, after looking at the signature for DownloadUrl in
HttpMixin.cs
I saw what you are talking about. I am not sure why it is there, but it looks like the timeout is related to building the request and not a timeout for the request itself.
That being said, in order to set a timeout with a download, you have a couple options that should work.
Via TPL aka Async Await
var timeout = 1000;
var task = BlobCache.LocalMachine.DownloadUrl("http://stackoverflow.com").FirstAsync().ToTask();
if (await Task.WhenAny(task, Task.Delay(timeout)) == task) {
    // task completed within timeout
    // Do stuff with your byte data here
    //var result = task.Result;
} else {
    // timeout logic
}
Via Rx Observables
var obs = BlobCache.LocalMachine
    .DownloadUrl("http://stackoverflow.com")
    .Timeout(TimeSpan.FromSeconds(5))
    .Retry(retryCount: 2);
var result = obs.Subscribe((byteData) =>
{
    // Do stuff with your byte data here
    Debug.WriteLine("Byte Data Length " + byteData.Length);
}, (ex) => {
    Debug.WriteLine("Handle your exceptions here. " + ex.Message);
});

Amazon SQS with aws-sdk receiveMessage Stall

I'm using the aws-sdk Node module with the (as far as I can tell) approved way to poll for messages, which basically sums up to:
sqs.receiveMessage({
    QueueUrl: queueUrl,
    MaxNumberOfMessages: 10,
    WaitTimeSeconds: 20
}, function(err, data) {
    if (err) {
        logger.fatal('Error on Message Receive');
        logger.fatal(err);
    } else {
        // all good
        if (undefined === data.Messages) {
            logger.info('No Messages Object');
        } else if (data.Messages.length > 0) {
            logger.info('Messages Count: ' + data.Messages.length);
            var delete_batch = [];
            for (var x = 0; x < data.Messages.length; x++) {
                // process
                receiveMessage(data.Messages[x]);
                // flag to delete
                delete_batch.push({
                    Id: data.Messages[x].MessageId,
                    ReceiptHandle: data.Messages[x].ReceiptHandle
                });
            }
            if (delete_batch.length > 0) {
                logger.info('Calling Delete');
                sqs.deleteMessageBatch({
                    Entries: delete_batch,
                    QueueUrl: queueUrl
                }, function(err, data) {
                    if (err) {
                        logger.fatal('Failed to delete messages');
                        logger.fatal(err);
                    } else {
                        logger.debug('Deleted received ok');
                    }
                });
            }
        } else {
            logger.info('No Messages Count');
        }
    }
});
receiveMessage is my "do stuff with collected messages if I have enough collected messages" function
Occasionally, my script stalls because I don't get a response from Amazon at all. Say, for example, there are no messages in the queue to consume: instead of hitting the WaitTimeSeconds and returning a "no messages" object, the callback is never called.
(I'm chalking this up to Amazon weirdness.)
What I'm asking is: what's the best way to detect and deal with this, given I already have some code in place to stop concurrent calls to receiveMessage?
The suggested answer here: Nodejs sqs queue processor also has code that prevents concurrent message request queries (granted it's only fetching one message a time)
I do have the whole thing wrapped in
var running = false;
runMonitorJob = setInterval(function() {
    if (running) {
        // a previous receive is still in flight, skip this tick
    } else {
        running = true;
        // call SQS.receive
    }
}, 500);
(With a running = false after the delete loop (not in its callback).)
My solution would be
watchdogTimeout = setTimeout(function() {
    running = false;
}, 30000);
But surely this would leave a pile of floating sqs.receive calls lurking about, and thus consume more memory over time?
(This job runs all the time, and I left it running on Friday, it stalled Saturday morning and hung till I manually restarted the job this morning)
Edit: I have seen cases where it hangs for ~5 minutes and then suddenly gets messages, BUT with a wait time of 20 seconds it should return "no messages" after 20 seconds. So a watchdog of ~10 minutes might be more practical (depending on the rest of one's business logic).
Edit: Yes, long polling is already configured queue-side.
Edit: This is under (latest) v2.3.9 of aws-sdk and NodeJS v4.4.4
I've been chasing this (or a similar) issue for a few days now, and here's what I've noticed:
The receiveMessage call does eventually return, although only after 120 seconds.
Concurrent calls to receiveMessage are serialised by the AWS SDK library, so making multiple calls in parallel has no effect.
The receiveMessage callback does not error - in fact, after the 120 seconds have passed, it may contain messages.
What can be done about this? This sort of thing can happen for a number of reasons, and some or many of those things can't necessarily be fixed. The answer is to run multiple services, each calling receiveMessage and processing the messages as they come - SQS supports this. At any time, one of these services may hit this 120-second lag, but the other services should be able to continue on as normal.
My particular problem is that I have some critical singleton services that can't afford 120 seconds of downtime. For this I will look into either 1) using HTTP instead of SQS to push messages into my service, or 2) spawning slave processes around each of the singletons to fetch the messages from SQS and push them into the service.
I also ran into this issue, but with sendMessage rather than receiveMessage. I also saw hangups of exactly 120 seconds, and with a few other services, like Firehose.
That led me to this line in the AWS SDK:
SQS Constructor
httpOptions:
timeout [Integer] — Sets the socket to timeout after timeout milliseconds of inactivity on the socket. Defaults to two minutes (120000).
To implement a fix, I override the timeout for my SQS client that performs the sendMessage so it times out after 10 seconds, and I use another client with a 25-second timeout for receiving (where I long-poll for 20 seconds):
var sendClient = new AWS.SQS({httpOptions:{timeout:10*1000}});
var receiveClient = new AWS.SQS({httpOptions:{timeout:25*1000}});
I've had this out in production for a week now and I've noticed that all of my SQS stalling issues have been eliminated.

Resources