504 gateway timeout azure durable function

I am using this tutorial to create an Azure durable function: https://learn.microsoft.com/en-us/azure/azure-functions/durable/quickstart-js-vscode
Note: I am using the Premium plan.
host.json
{
  "version": "2.0",
  "logging": {
    "applicationInsights": {
      "samplingSettings": {
        "isEnabled": true,
        "excludedTypes": "Request"
      }
    }
  },
  "extensionBundle": {
    "id": "Microsoft.Azure.Functions.ExtensionBundle",
    "version": "[3.*, 4.0.0)"
  },
  "functionTimeout": "00:40:00"
}
Hello/index.js
function delay(sec) {
  const now = Date.now();
  // busy-wait for `sec` seconds
  while (now + sec * 1000 > Date.now());
}

module.exports = async function (context) {
  context.log("I am starting.");
  const startTime = new Date().toLocaleTimeString();
  // simulating a long-running task
  delay(5 * 60);
  const endTime = new Date().toLocaleTimeString();
  return `${context.bindings.name} - ${startTime} -> ${endTime}`;
};
Sometimes the call to the HelloOrchestrator function works fine, but sometimes it responds with a 504 gateway timeout.
What could be the issue?

You did not share your whole code, but it looks like you may be delaying an HTTP-triggered function beyond 230 seconds, which is the default idle timeout of Azure Load Balancer. That will lead to a 504 gateway timeout error.
From Azure Function app timeout duration:
Regardless of the function app timeout setting, 230 seconds is the
maximum amount of time that an HTTP triggered function can take to
respond to a request. This is because of the default idle timeout of
Azure Load Balancer.
Note that the HTTP trigger should ideally only schedule the orchestrator, which normally takes a few milliseconds, and then return an HTTP response. When the scheduled orchestrator runs, it can call an activity function, which can be the long-running background task, as in the sketch below.
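For illustration, here is a minimal sketch of that pattern based on the linked quickstart (the orchestrator name HelloOrchestrator is taken from your question; the starter function name is assumed). The HTTP starter only schedules the orchestration and immediately returns a 202 response with status-query URLs, so it never runs into the 230-second limit:
// HttpStart/index.js (sketch)
const df = require("durable-functions");

module.exports = async function (context, req) {
  const client = df.getClient(context);
  // Scheduling takes milliseconds; the long-running work happens later
  // inside activity functions called by the orchestrator.
  const instanceId = await client.startNew("HelloOrchestrator", undefined, req.body);
  context.log(`Started orchestration with ID = '${instanceId}'.`);
  // 202 Accepted plus URLs the caller can poll for status and output.
  return client.createCheckStatusResponse(context.bindingData.req, instanceId);
};
The caller then polls the statusQueryGetUri from the 202 response until the orchestration reports Completed, instead of holding the original HTTP connection open.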

Related

Microsoft Azure functions (nodejs), WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT and maxConcurrentCalls not doing what they say

Using the Consumption plan, I created a Service Bus Node.js trigger function app.
I do not use sessions. Tested with two topics, with partitioning enabled and disabled.
const timeout = ms => new Promise(res => setTimeout(res, ms));

module.exports = async function (context, mySbMsg) {
  context.log('message start:', mySbMsg);
  await timeout(60000);
  context.log('message done:', mySbMsg);
};
host.json:
{
  "version": "2.0",
  "extensionBundle": {
    "id": "Microsoft.Azure.Functions.ExtensionBundle",
    "version": "[3.3.0, 4.0.0)"
  },
  "extensions": {
    "serviceBus": {
      "prefetchCount": 1,
      "messageHandlerOptions": {
        "autoComplete": true,
        "maxConcurrentCalls": 5,
        "maxAutoRenewDuration": "00:09:30"
      }
    }
  },
  "functionTimeout": "00:09:55"
}
With WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT = 1
I expect to see 5 requests/minute per VM running.
Sending 100 messages, I expect to see 5 messages/minute.
I do see 1 VM running in the Live Metrics. However, I am seeing 1 message a minute in the logs.
Yes @Teebu, "prefetchCount": 1 is the culprit.
Example: if prefetchCount is set to 200 and maxConcurrentCalls is set to 16 (say), then 200 messages will be prefetched to a specific instance, but only 16 messages are processed at a time.
Prefetched messages are fetched by the underlying MessageReceiver, whereas maxConcurrentCalls only governs how many of them the client code processes.
maxConcurrentCalls - how many messages a single MessageReceiver will process at the same time.
prefetchCount - how many messages a single MessageReceiver can retrieve up front when it initiates a call to receive messages.
Setting the two to the same value is counterproductive; prefetchCount should be larger than the number of messages processed concurrently.
In short: prefetchCount determines the maximum number of messages prefetched by the underlying MessageReceiver used by the Azure Functions SDK. To ensure that not too many messages are prefetched, and that locks are not lost while messages wait for processing, make sure prefetchCount is configured in proportion to the value defined for maxConcurrentCalls.
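For example, a host.json fragment along these lines keeps the prefetch buffer ahead of the concurrent processing (the exact numbers are illustrative assumptions, not prescriptive values):
{
  "version": "2.0",
  "extensions": {
    "serviceBus": {
      "prefetchCount": 20,
      "messageHandlerOptions": {
        "autoComplete": true,
        "maxConcurrentCalls": 5,
        "maxAutoRenewDuration": "00:09:30"
      }
    }
  }
}
With 5 concurrent calls per instance, a prefetchCount of around 20 keeps a small buffer ready without holding locks on far more messages than the instance can process before they expire.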

Setting a per-call timeout in nodejs

Thanks in advance for any help here.
I am trying to set a per-call timeout in the Spanner nodejs client. I have read through the given Spanner documentation, the best of which is here: https://cloud.google.com/spanner/docs/custom-timeout-and-retry. This documentation leaves out that you need to be supplying gaxOptions. Or I'm misunderstanding, which would not be surprising given that I can't explain the behavior I'm seeing.
I created a small repo to house this reproduction here: https://github.com/brg8/spanner-nodejs-timeout-repro. Code also pasted below.
const PROJECT_ID_HERE = "";
const SPANNER_INSTANCE_ID_HERE = "";
const SPANNER_DATABASE_HERE = "";
const TABLE_NAME_HERE = "";

const { Spanner } = require("@google-cloud/spanner");

const client = new Spanner({
  projectId: PROJECT_ID_HERE,
});
const instance = client.instance(SPANNER_INSTANCE_ID_HERE);
const database = instance.database(SPANNER_DATABASE_HERE);

async function runQuery(additionalOptions = {}) {
  const t1 = new Date();
  try {
    console.log("Launching query...");
    await database.run({
      sql: `SELECT * FROM ${TABLE_NAME_HERE} LIMIT 1000;`,
      ...additionalOptions,
    });
    console.log("Everything finished.");
  } catch (err) {
    console.log(err);
    console.log("Timed out after", new Date() - t1);
  }
}
// Query finishes, no timeout (as expected)
runQuery();
/*
Launching query...
Everything finished.
*/

// Query times out (as expected)
// However, it only times out after 7-8 seconds
runQuery({
  gaxOptions: {
    timeout: 1,
  },
});
/*
Launching query...
Error: 4 DEADLINE_EXCEEDED: Deadline exceeded
    at Object.callErrorFromStatus (/Users/benjamingodlove/Developer/spanner-node-repro/node_modules/@grpc/grpc-js/build/src/call.js:31:26)
    at Object.onReceiveStatus (/Users/benjamingodlove/Developer/spanner-node-repro/node_modules/@grpc/grpc-js/build/src/client.js:330:49)
    at /Users/benjamingodlove/Developer/spanner-node-repro/node_modules/@grpc/grpc-js/build/src/call-stream.js:80:35
    at Object.onReceiveStatus (/Users/benjamingodlove/Developer/spanner-node-repro/node_modules/grpc-gcp/build/src/index.js:73:29)
    at InterceptingListenerImpl.onReceiveStatus (/Users/benjamingodlove/Developer/spanner-node-repro/node_modules/@grpc/grpc-js/build/src/call-stream.js:75:23)
    at Object.onReceiveStatus (/Users/benjamingodlove/Developer/spanner-node-repro/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:299:181)
    at /Users/benjamingodlove/Developer/spanner-node-repro/node_modules/@grpc/grpc-js/build/src/call-stream.js:145:78
    at processTicksAndRejections (node:internal/process/task_queues:76:11) {
  code: 4,
  details: 'Deadline exceeded',
  metadata: Metadata { internalRepr: Map(0) {}, options: {} }
}
Timed out after 7238
*/
And my package.json
{
  "name": "spanner-node-repro",
  "version": "1.0.0",
  "description": "Reproducing timeout wonkiness with Spanner.",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC",
  "dependencies": {
    "@google-cloud/spanner": "^5.15.2"
  }
}
Any insight is appreciated!
Ben Godlove
TL;DR
Add retryRequestOptions: { noResponseRetries: 0 } to your gaxOptions so you get the following options:
const query = {
  sql: 'SELECT ...',
  gaxOptions: {
    timeout: 1,
    retryRequestOptions: { noResponseRetries: 0 },
  },
};
Longer Version
What is happening under the hood is the following:
1. The (streaming) query request is sent, and the timeout occurs before the server returns any response.
2. The default retry settings include a noResponseRetries: 2 option, which means the request is retried twice if the client did not receive any response at all.
3. Each retry only starts after a randomized retry delay, and the delay increases with each attempt.
4. After retrying twice (so after sending three requests in total), the DEADLINE_EXCEEDED error is propagated to the client. The retries take approximately 7 seconds because the first retry waits roughly 2.5 seconds and the second roughly 4.5 seconds (both values include a random jitter of 1 second, so the actual total will always be between 6 and 8 seconds).
Setting noResponseRetries: 0 disables the retries of requests that do not receive a response from the server.
You will also see that if you set the timeout to a more 'reasonable' value, the query times out in a normal way, as the server has a chance to respond. Setting it to something like 1500 (meaning 1500 ms, i.e. 1.5 seconds) caused the timeout to work as expected for me using your sample code.
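For example, using the runQuery helper from your repro (the values are illustrative):
// A 1.5-second deadline with no-response retries disabled, so a
// DEADLINE_EXCEEDED surfaces after ~1.5 seconds instead of ~7.
runQuery({
  gaxOptions: {
    timeout: 1500,
    retryRequestOptions: { noResponseRetries: 0 },
  },
});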

Google App Engine Node Task Queue Retries before Finished

I am running a slow operation via a Cloud Tasks queue to delete objects from Google Cloud Storage. I have noticed that the task queue retries the task after two minutes have passed, even though the running task has neither finished nor errored.
What is the best strategy to trigger valid retries, but not retry while the task is still running?
Here's my task creator:
router.get('/start-delete-old', async (req, res) => {
  const task = {
    appEngineHttpRequest: {
      httpMethod: 'POST',
      relativeUri: `/videos/delete-old`,
    },
  };
  const request = {
    parent: taskClient.parent,
    task: task,
  };
  const [response] = await taskClient.queue.createTask(request);
  res.send(response);
});
Here's my task handler:
router.post('/delete-old', async (req, res) => {
  const cameras = await knex('cameras');
  const date = moment().subtract(365, 'days').format('YYYY-MM-DD');
  for (let i = 0; i < cameras.length; i++) {
    const camera = cameras[i];
    const prefix = `${camera.id}/${date}/`;
    try {
      await bucket.deleteFiles({ prefix: prefix, force: true });
      await knex.raw(`delete from videos where camera_id = ${camera.id} and cast(start_time as date) = '${date}'`);
    } catch (e) {
      console.log('error deleting ' + e);
    }
  }
  res.send({});
});
As per the documentation, a timeout in a task can vary depending on the environment you are using:
Standard environment
Automatic scaling: task processing must finish in 10 minutes.
Manual and basic scaling: requests can run up to 24 hours.
Flexible environment
For worker services running in the flex environment: all types have a 60 minute timeout.
So, if your handler misses the deadline, the queue assumes the task failed and retries it.
Also, the Task Queue expects to receive a status code between 200 and 299; anything else means it assumes the task has failed. Quoting the documentation:
Upon successful completion of processing, the handler must send an HTTP status code between 200 and 299 back to the queue. Any other value indicates the task has failed and the queue retries the task.
I believe that both the bucket file deletion and the knex raw query are taking a long time to process, and this is causing the handler to miss the deadline before it can return a status in the 200-299 range.
One good way to troubleshoot is to use Stackdriver logs; you will be able to gather more information about the ongoing processes and see if any of them return an error.
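One strategy that follows from this (a sketch of my own suggestion, not something from the documentation) is to fan the work out into one task per camera, so that each handler invocation finishes well within the deadline; here the camera id is assumed to travel in the task URI:
// Hypothetical fan-out: the creator enqueues one small task per camera,
// and a per-camera /videos/delete-old/:cameraId handler deletes only
// that camera's files and rows.
router.get('/start-delete-old', async (req, res) => {
  const cameras = await knex('cameras');
  for (const camera of cameras) {
    const task = {
      appEngineHttpRequest: {
        httpMethod: 'POST',
        relativeUri: `/videos/delete-old/${camera.id}`,
      },
    };
    await taskClient.queue.createTask({ parent: taskClient.parent, task });
  }
  res.send({ enqueued: cameras.length });
});
Each small task either succeeds quickly or fails and is retried on its own, so a retry never overlaps a still-running bulk job.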

How can the AWS Lambda concurrent execution limit be reached?

UPDATE
The original test code below is largely correct, but in Node.js the various AWS services should be set up a bit differently, as per the SDK link provided by @Michael-sqlbot.
// manager
const AWS = require("aws-sdk");
const https = require('https');
const agent = new https.Agent({
  maxSockets: 498 // workers hit this level; expect plus 1 for the manager instance
});
const lambda = new AWS.Lambda({
  apiVersion: '2015-03-31',
  region: 'us-east-2', // initial concurrency burst limit = 500
  httpOptions: {       // replace the default of 50 (https) by
    agent: agent       // plugging the modified Agent into the service
  }
});
// NOW begin the manager handler code
// NOW begin the manager handler code
In planning for a new service, I am doing some preliminary stress testing. After reading about the 1,000 concurrent execution limit per account and the initial burst rate (which in us-east-2 is 500), I was expecting to achieve at least the 500 burst concurrent executions right away. The screenshot below of CloudWatch's Lambda metric shows otherwise. I cannot get past 51 concurrent executions no matter what mix of parameters I try. Here's the test code:
// worker
exports.handler = async (event) => {
  // declare sleep promise
  const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
  // return after one second
  let nStart = new Date().getTime();
  await sleep(1000);
  return new Date().getTime() - nStart; // report the exact ms the sleep actually took
};
// manager
exports.handler = async (event) => {
  const invokeWorker = async () => {
    try {
      let lambda = new AWS.Lambda(); // NO! DO NOT DO THIS, SEE UPDATE ABOVE
      var params = {
        FunctionName: "worker-function",
        InvocationType: "RequestResponse",
        LogType: "None"
      };
      return await lambda.invoke(params).promise();
    } catch (error) {
      console.log(error);
    }
  };
  try {
    let nStart = new Date().getTime();
    let aPromises = [];
    // invoke workers
    for (var i = 1; i <= 3000; i++) {
      aPromises.push(invokeWorker());
    }
    // record time to complete spawning
    let nSpawnMs = new Date().getTime() - nStart;
    // wait for the workers to ALL return
    let aResponses = await Promise.all(aPromises);
    // sum all the actual sleep times
    const reducer = (accumulator, response) => { return accumulator + parseInt(response.Payload); };
    let nTotalWorkMs = aResponses.reduce(reducer, 0);
    // show me
    let nTotalET = new Date().getTime() - nStart;
    return {
      jobsCount: aResponses.length,
      spawnCompletionMs: nSpawnMs,
      spawnCompletionPct: `${Math.floor(nSpawnMs / nTotalET * 10000) / 100}%`,
      totalElapsedMs: nTotalET,
      totalWorkMs: nTotalWorkMs,
      parallelRatio: Math.floor(nTotalET / nTotalWorkMs * 1000) / 1000
    };
  } catch (error) {
    console.log(error);
  }
};
Response:
{
  "jobsCount": 3000,
  "spawnCompletionMs": 1879,
  "spawnCompletionPct": "2.91%",
  "totalElapsedMs": 64546,
  "totalWorkMs": 3004205,
  "parallelRatio": 0.021
}
Request ID:
"43f31584-238e-4af9-9c5d-95ccab22ae84"
Am I hitting a different limit that I have not mentioned? Is there a flaw in my test code? I was attempting to hit the limit here with 3,000 workers, but there was NO throttling encountered, which I guess is due to the Asynchronous invocation retry behaviour.
Edit: There is no VPC involved on either Lambda; the setting in the select input is "No VPC".
Edit: Showing CloudWatch before and after the fix
There were a number of potential suspects, particularly due to the fact that you were invoking Lambda from Lambda, but your focus on consistently seeing a concurrency of 50 — a seemingly arbitrary limit (and a suspiciously round number) — reminded me that there's an anti-footgun lurking in the JavaScript SDK:
In Node.js, you can set the maximum number of connections per origin. If maxSockets is set, the low-level HTTP client queues requests and assigns them to sockets as they become available.
Here of course, "origin" means any unique combination of scheme + hostname, which in this case is the service endpoint for Lambda in us-east-2 that the SDK is connecting to in order to call the Invoke method, https://lambda.us-east-2.amazonaws.com.
This lets you set an upper bound on the number of concurrent requests to a given origin at a time. Lowering this value can reduce the number of throttling or timeout errors received. However, it can also increase memory usage because requests are queued until a socket becomes available.
...
When using the default of https, the SDK takes the maxSockets value from the globalAgent. If the maxSockets value is not defined or is Infinity, the SDK assumes a maxSockets value of 50.
https://docs.aws.amazon.com/sdk-for-javascript/v2/developer-guide/node-configuring-maxsockets.html
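If you would rather not wire the agent into each service client, the same cap can be raised globally; this is a small sketch assuming the v2 aws-sdk package used above:
// Alternative: apply the modified Agent to every SDK v2 client at once.
const AWS = require('aws-sdk');
const https = require('https');
AWS.config.update({
  httpOptions: { agent: new https.Agent({ maxSockets: 498 }) }
});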
Lambda concurrency is not the only factor that decides how scalable your functions are. If your Lambda function is running within a VPC, it will require an ENI (Elastic Network Interface), which allows ethernet traffic to and from the container (the Lambda function).
It's possible your throttling occurred because too many ENIs were being requested (50 at a time). You can check this by viewing the logs of the manager Lambda function and looking for an error message when it tries to invoke one of the child containers. If the error looks something like the following, you'll know ENIs are your issue.
Lambda was not able to create an ENI in the VPC of the Lambda function because the limit for Network Interfaces has been reached.

Timeouts in Azure Functions when populating queues

We have a simple ETL process to extract data from an API to a Document DB which we would like to implement using functions. In brief, the process is to take a ~16,500 line file, extract an ID from each line (Function 1), build a URL for each ID (Function 2), hit an API using the URL (Function 3), store the response in a document DB (Function 4). We are using queues for inter-function communication and are seeing problems with timeouts in the first function while doing this.
Function 1 (index.js)
module.exports = function (context, odsDataFile) {
  context.log('JavaScript blob trigger function processed blob \n Name:', context.bindingData.odaDataFile, '\n Blob Size:', odsDataFile.length, 'Bytes');
  const odsCodes = [];
  odsDataFile.split('\n').forEach((line) => {
    const columns = line.split(',');
    if (columns[12] === 'A') {
      odsCodes.push({
        'odsCode': columns[0],
        'orgType': 'pharmacy',
      });
    }
  });
  context.bindings.odsCodes = odsCodes;
  context.log(`A total of: ${odsCodes.length} ods codes have been sent to the queue.`);
  context.done();
};
function.json
{
  "bindings": [
    {
      "type": "blobTrigger",
      "name": "odaDataFile",
      "path": "input-ods-data",
      "connection": "connecting-to-services_STORAGE",
      "direction": "in"
    },
    {
      "type": "queue",
      "name": "odsCodes",
      "queueName": "ods-org-codes",
      "connection": "connecting-to-services_STORAGE",
      "direction": "out"
    }
  ],
  "disabled": false
}
Full code here
This function works fine when the number of IDs is in the hundreds, but times out when it is in the tens of thousands. The building of the ID array happens in milliseconds and the function body completes, but adding the items to the queue seems to take many minutes and eventually causes a timeout at the default of 5 minutes.
I am surprised that the simple act of populating the queue takes such a long time, and that the timeout for a function seems to include time for tasks external to the function code (i.e. queue population). Is this to be expected? Are there more performant ways of doing this?
We are running under the Consumption (Dynamic) Plan.
I did some testing of this from my local machine and found that it takes ~200 ms to insert a message into the queue, which is expected. So if you have 17k messages to insert and are doing it sequentially, the total time will be:
17,000 messages * 200 ms = 3,400,000 ms, or ~56 minutes
The latency may be a bit lower when running from the cloud, but you can see how this would jump over 5 minutes pretty quickly when you are inserting that many messages.
If message ordering isn't crucial, you could insert the messages in parallel. Some caveats, though:
You can't do this with Node -- it'd have to be C#. Node doesn't expose the IAsyncCollector interface to you, so the runtime does it all behind the scenes.
You can't insert everything in parallel because the Consumption plan has a limit of 250 network connections at a time.
Here's an example of batching up the inserts 200 at a time -- with 17k messages, this took under a minute in my quick test.
public static async Task Run(string myBlob, IAsyncCollector<string> odsCodes, TraceWriter log)
{
    string[] lines = myBlob.Split(Environment.NewLine.ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
    int skip = 0;
    int take = 200;
    IEnumerable<string> batch = lines.Skip(skip).Take(take);
    while (batch.Count() > 0)
    {
        await AddBatch(batch, odsCodes);
        skip += take;
        batch = lines.Skip(skip).Take(take);
    }
}

public static async Task AddBatch(IEnumerable<string> lines, IAsyncCollector<string> odsCodes)
{
    List<Task> tasks = new List<Task>();
    foreach (string line in lines)
    {
        tasks.Add(odsCodes.AddAsync(line));
    }
    await Task.WhenAll(tasks);
}
As other answers have pointed out, because Azure Queues does not have a batch API you should consider an alternative such as Service Bus Queues. But if you are sticking with Azure Queues you need to avoid outputting the queue items sequentially, i.e. some form of constrained parallelism is necessary. One way to achieve this is to use the TPL Dataflow library.
One advantage Dataflow has over using batches of tasks with WhenAll(...) is that you will never have a scenario where a batch is almost done and you are waiting for one slow execution to complete before the next batch can start.
I compared inserting 10,000 items with task batches of size 32 and dataflow with parallelism set to 32. The batch approach completed in 60 seconds, while dataflow completed in almost half that (32 seconds).
The code would look something like this:
using System.Threading.Tasks.Dataflow;
...
var addMessageBlock = new ActionBlock<string>(async message =>
{
    await odsCodes.AddAsync(message);
}, new ExecutionDataflowBlockOptions { SingleProducerConstrained = true, MaxDegreeOfParallelism = 32 });

var bufferBlock = new BufferBlock<string>();
bufferBlock.LinkTo(addMessageBlock, new DataflowLinkOptions { PropagateCompletion = true });

foreach (string line in lines)
    bufferBlock.Post(line);
bufferBlock.Complete();

await addMessageBlock.Completion;
