Thanks in advance for any help here.
I am trying to set a per-call timeout in the Spanner Node.js client. I have read through the available Spanner documentation, the best of which is here: https://cloud.google.com/spanner/docs/custom-timeout-and-retry. That documentation leaves out that you need to supply gaxOptions. Or I'm misunderstanding, which would not be surprising given that I can't explain the behavior I'm seeing.
I created a small repo to house this reproduction here: https://github.com/brg8/spanner-nodejs-timeout-repro. Code also pasted below.
const PROJECT_ID_HERE = "";
const SPANNER_INSTANCE_ID_HERE = "";
const SPANNER_DATABASE_HERE = "";
const TABLE_NAME_HERE = "";

const { Spanner } = require("@google-cloud/spanner");

const client = new Spanner({
  projectId: PROJECT_ID_HERE,
});

const instance = client.instance(SPANNER_INSTANCE_ID_HERE);
const database = instance.database(SPANNER_DATABASE_HERE);

async function runQuery(additionalOptions = {}) {
  const t1 = new Date();
  try {
    console.log("Launching query...");
    await database.run({
      sql: `SELECT * FROM ${TABLE_NAME_HERE} LIMIT 1000;`,
      ...additionalOptions,
    });
    console.log("Everything finished.");
  } catch (err) {
    console.log(err);
    console.log("Timed out after", new Date() - t1);
  }
}
// Query finishes, no timeout (as expected)
runQuery();
/*
Launching query...
Everything finished.
*/
// Query times out (as expected)
// However, it only times out after 7-8 seconds
runQuery({
  gaxOptions: {
    timeout: 1,
  },
});
/*
Launching query...
Error: 4 DEADLINE_EXCEEDED: Deadline exceeded
    at Object.callErrorFromStatus (/Users/benjamingodlove/Developer/spanner-node-repro/node_modules/@grpc/grpc-js/build/src/call.js:31:26)
    at Object.onReceiveStatus (/Users/benjamingodlove/Developer/spanner-node-repro/node_modules/@grpc/grpc-js/build/src/client.js:330:49)
    at /Users/benjamingodlove/Developer/spanner-node-repro/node_modules/@grpc/grpc-js/build/src/call-stream.js:80:35
    at Object.onReceiveStatus (/Users/benjamingodlove/Developer/spanner-node-repro/node_modules/grpc-gcp/build/src/index.js:73:29)
    at InterceptingListenerImpl.onReceiveStatus (/Users/benjamingodlove/Developer/spanner-node-repro/node_modules/@grpc/grpc-js/build/src/call-stream.js:75:23)
    at Object.onReceiveStatus (/Users/benjamingodlove/Developer/spanner-node-repro/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:299:181)
    at /Users/benjamingodlove/Developer/spanner-node-repro/node_modules/@grpc/grpc-js/build/src/call-stream.js:145:78
    at processTicksAndRejections (node:internal/process/task_queues:76:11) {
  code: 4,
  details: 'Deadline exceeded',
  metadata: Metadata { internalRepr: Map(0) {}, options: {} }
}
Timed out after 7238
*/
And my package.json:
{
  "name": "spanner-node-repro",
  "version": "1.0.0",
  "description": "Reproducing timeout wonkiness with Spanner.",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC",
  "dependencies": {
    "@google-cloud/spanner": "^5.15.2"
  }
}
Any insight is appreciated!
Ben Godlove
TLDR
Add retryRequestOptions: {noResponseRetries: 0} to your gaxOptions, so that you end up with the following options:
const query = {
  sql: 'SELECT ...',
  gaxOptions: {
    timeout: 1,
    retryRequestOptions: {noResponseRetries: 0},
  },
};
Longer Version
What is happening under the hood is the following:
The (streaming) query request is sent and the timeout occurs before the server returns any response.
The default retry settings include a noResponseRetries: 2 option, which means that the request will be retried twice if the client did not receive any response at all.
The retry of the request will only start after a randomized retry delay. This delay also increases for each retry attempt.
After retrying twice (so after sending three requests in total), the DEADLINE_EXCEEDED error is propagated to the client. These retries take approximately 7 seconds: the first retry waits approximately 2.5 seconds and the second approximately 4.5 seconds (both values include a random jitter of up to 1 second, so the total will always be between 6 and 8 seconds).
Setting noResponseRetries: 0 disables retries for requests that did not receive any response from the server.
You will also see that if you set the timeout to a more 'reasonable' value, the query times out in the normal way, as the server has a chance to respond. Setting it to something like 1500 (meaning 1500 ms, i.e. 1.5 seconds) made the timeout work as expected for me with your sample code.
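To see where the roughly 7 seconds comes from, here is a minimal sketch of the backoff arithmetic. The formula and constants are an assumption inferred from the numbers above, not copied from the library source:

// Sketch: approximate delay before retry attempt N, assuming exponential
// backoff with up to 1 second of random jitter (an assumption, not the
// library's actual implementation).
const nextRetryDelayMs = (retryNumber) =>
  Math.pow(2, retryNumber) * 1000 + Math.floor(Math.random() * 1000);

nextRetryDelayMs(1); // first retry:  2000-3000 ms (~2.5 s on average)
nextRetryDelayMs(2); // second retry: 4000-5000 ms (~4.5 s on average)
// Total wait before DEADLINE_EXCEEDED surfaces: between 6 and 8 seconds.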
Related
I am using this tutorial to create an Azure durable function: https://learn.microsoft.com/en-us/azure/azure-functions/durable/quickstart-js-vscode
Note: I am using the Premium plan.
host.json
{
  "version": "2.0",
  "logging": {
    "applicationInsights": {
      "samplingSettings": {
        "isEnabled": true,
        "excludedTypes": "Request"
      }
    }
  },
  "extensionBundle": {
    "id": "Microsoft.Azure.Functions.ExtensionBundle",
    "version": "[3.*, 4.0.0)"
  },
  "functionTimeout": "00:40:00"
}
Hello/index.js
function delay(sec) {
  let now = Date.now();
  // busy-wait for `sec` seconds
  while (now + sec * 1000 > Date.now());
}

module.exports = async function (context) {
  context.log("I am starting.");
  const startTime = new Date().toLocaleTimeString();
  // simulating a long-running task
  delay(5 * 60);
  const endTime = new Date().toLocaleTimeString();
  return `${context.bindings.name} - ${startTime} -> ${endTime}`;
};
Sometimes the call to the HelloOrchestrator function works fine, but sometimes it responds with a 504 Gateway Timeout.
What could be the issue?
You did not share your whole code, but it looks like you may be delaying an HTTP-triggered function beyond 230 seconds, which is the default idle timeout of the Azure Load Balancer. That will lead to a 504 Gateway Timeout error.
From Azure Function app timeout duration:
Regardless of the function app timeout setting, 230 seconds is the
maximum amount of time that an HTTP triggered function can take to
respond to a request. This is because of the default idle timeout of
Azure Load Balancer.
Note that the HTTP trigger should ideally only schedule the orchestrator, which normally takes a few milliseconds, and then return an HTTP response; see the sketch below. When the scheduled orchestrator runs, it can call an activity function that performs the long-running background task.
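A minimal sketch of that starter pattern, assuming the durable-functions package and the HelloOrchestrator orchestrator from the quickstart:

const df = require("durable-functions");

module.exports = async function (context, req) {
  const client = df.getClient(context);
  // Scheduling the orchestration returns quickly; the orchestrator itself
  // runs in the background and can take as long as it needs.
  const instanceId = await client.startNew("HelloOrchestrator", undefined, req.body);
  context.log(`Started orchestration with ID = '${instanceId}'.`);
  // Responds immediately (HTTP 202) with status-query URLs the caller can poll.
  return client.createCheckStatusResponse(context.bindingData.req, instanceId);
};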
Using the Consumption plan, I created a Service Bus Node.js trigger function app.
I do not use sessions. I tested with two topics, with partitioning enabled and disabled.
const timeout = ms => new Promise(res => setTimeout(res, ms));

module.exports = async function (context, mySbMsg) {
  context.log('message start:', mySbMsg);
  await timeout(60000);
  context.log('message done:', mySbMsg);
};
host.json:
{
  "version": "2.0",
  "extensionBundle": {
    "id": "Microsoft.Azure.Functions.ExtensionBundle",
    "version": "[3.3.0, 4.0.0)"
  },
  "extensions": {
    "serviceBus": {
      "prefetchCount": 1,
      "messageHandlerOptions": {
        "autoComplete": true,
        "maxConcurrentCalls": 5,
        "maxAutoRenewDuration": "00:09:30"
      }
    }
  },
  "functionTimeout": "00:09:55"
}
With WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT = 1, I expect to see 5 requests per minute per running VM.
Sending 100 messages, I expect to see 5 messages per minute.
I do see 1 VM running in the live metrics; however, I am seeing 1 message per minute in the logs.
Yes @Teebu, "prefetchCount": 1 is the culprit.
Example:
If prefetchCount is set to 200 and maxConcurrentCalls is set to 16 (say), then up to 200 messages will be prefetched to a specific instance, but only 16 messages will be processed at a time.
MaxConcurrentCalls - how many messages a single MessageReceiver will process at the same time.
PrefetchCount - how many messages a single MessageReceiver can retrieve up front when it initiates a receive call.
Prefetched messages are fetched by the underlying MessageReceiver, whereas maxConcurrentCalls only controls how many messages your client code processes concurrently.
Setting those two to the same value is counterproductive; prefetchCount should be larger than the number of messages processed concurrently.
Put simply:
The prefetchCount determines the maximum number of messages prefetched by the underlying MessageReceiver used by the Azure Functions SDK.
To ensure that too many messages are not prefetched (and their locks lost while waiting for processing), make sure prefetchCount is configured in line with the value defined for maxConcurrentCalls, as in the host.json sketch below.
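For example, a sketch of the host.json from the question with prefetchCount raised above maxConcurrentCalls (20 is an illustrative value, not a tuned recommendation):

{
  "version": "2.0",
  "extensions": {
    "serviceBus": {
      "prefetchCount": 20,
      "messageHandlerOptions": {
        "autoComplete": true,
        "maxConcurrentCalls": 5,
        "maxAutoRenewDuration": "00:09:30"
      }
    }
  }
}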
I'm building a microservice application consisting of many microservices built with Node.js and running on Cloud Run. I use PubSub in several different ways:
For streaming data daily. The microservices responsible for gathering analytical data from different advertising services (Facebook Ads, LinkedIn Ads, etc.) use PubSub to stream data to a microservice responsible for uploading data to Google BigQuery. There also are services that stream a higher load of data (> 1 Gb) from CRMs and other services by splitting it into smaller chunks.
For messaging among microservices about different events that don't require an immediate response.
Earlier, I experienced some insignificant latency with PubSub. I know there is an open issue concerning latency of up to several seconds at low message throughput. But in my case, we are talking about several minutes of latency.
Also, I occasionally get an error message
Received error while publishing: Total timeout of API google.pubsub.v1.Publisher exceeded 60000 milliseconds before any response was received.
In this case the message is not sent at all, or is severely delayed.
This is what my code looks like.
const subscriptions = new Map<string, Subscription>();
const topics = new Map<string, Topic>();

const listenForMessages = async (
  subscriptionName: string,
  func: ListenerCallback,
  secInit = 300,
  secInter = 300
) => {
  let logger = new TestLogger("LISTEN_FOR_MSG");
  let init = true;

  const _setTimeout = () => {
    let timer = setTimeout(() => {
      console.log(`Subscription to ${subscriptionName} cancelled`);
      subscription.removeListener("message", messageHandler);
    }, (init ? secInit : secInter) * 1000);
    init = false;
    return timer;
  };

  const messageHandler = async (msg: Message) => {
    msg.ack();
    await func(JSON.parse(msg.data.toString()));
    // wait for next message
    timeout = _setTimeout();
  };

  let subscription: Subscription;
  if (subscriptions.has(subscriptionName)) {
    subscription = subscriptions.get(subscriptionName);
  } else {
    subscription = pubSubClient.subscription(subscriptionName);
    subscriptions.set(subscriptionName, subscription);
  }

  let timeout = _setTimeout();
  subscription.on("message", messageHandler);
  console.log(`Listening for messages: ${subscriptionName}`);
};
const publishMessage = async (
  data: WithAnyProps,
  topicName: string,
  options?: PubOpt
) => {
  const serializedData = JSON.stringify(data);
  const dataBuffer = Buffer.from(serializedData);
  try {
    let topic: Topic;
    if (topics.has(topicName)) {
      topic = topics.get(topicName);
    } else {
      topic = pubSubClient.topic(topicName, {
        batching: {
          maxMessages: options?.batchingMaxMessages,
          maxMilliseconds: options?.batchingMaxMilliseconds,
        },
      });
      topics.set(topicName, topic);
    }
    let msg = {
      data: dataBuffer,
      attributes: options?.attributes, // optional chaining: options may be undefined
    };
    await topic.publishMessage(msg);
    console.log(`Publishing to ${topicName}`);
  } catch (err) {
    console.error(`Received error while publishing: ${err.message}`);
  }
};
The listenForMessages function is triggered by an HTTP request.
What I have already checked
PubSub client is created only once outside the function.
Topics and Subscriptions are reused.
I made at least one instance of each container running to eliminate the possibility of delays triggered by cold start.
I tried to increase the CPU and Memory capacity of containers.
batchingMaxMessages and batchingMaxMilliseconds are set to 1.
I checked that the latest version of @google-cloud/pubsub is installed.
Notes
The high latency problem occurs only in the cloud environment; with local tests, everything works well.
The timeout error sometimes occurs in both environments.
The problem was in my understanding of the Cloud Run container lifecycle. I used to send an HTTP 202 response while PubSub kept working in the background. After the response was sent, the container switched to the idle state, which looked like high latency in my logs.
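A minimal sketch of the fix, assuming an Express-style handler ("events" is a hypothetical topic name): finish the PubSub work before responding, because the container's CPU may be throttled as soon as the response goes out.

// Sketch: await the publish *before* sending the response, so the work does
// not continue in a container whose CPU has been throttled after the response.
// "events" is a hypothetical topic name; publishMessage is the helper above.
app.post("/notify", async (req, res) => {
  await publishMessage(req.body, "events");
  res.status(200).send({}); // respond only after the publish has completed
});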
I have a simple jest test that creates a new winston logger instance using the following transport configuration:
it("using TCP protocol", async (done) => {
const sys: any = new ws.Syslog({
host: "localhost",
port: 514,
protocol: "tcp4",
path: "/dev/log",
app_name: "ESB",
facility: "local5",
eol: "\n",
});
let syslogger = winston.createLogger({
levels: winston.config.syslog.levels,
transports: [
sys
]
});
syslogger.info("test msg");
syslogger.close();
});
The test runs fine and I can see the messages being received by syslog, but the test does not seem to return and hangs after showing the test outcome. On debugging (winston-syslog.js), I see that it is trying to connect again. The event handler doesn't seem to exit. Is this intended, or could this be a bug?
.on('close', () => {
  //
  // Attempt to reconnect on lost connection(s), progressively
  // increasing the amount of time between each try.
  //
  const interval = Math.pow(2, this.retries);
  this.connected = false;
  setTimeout(() => {
    this.retries++;
    this.socket.connect(this.port, this.host); // <-- the line in question
  }, interval * 1000);
})
I understand that the line exists for attempting to reconnect on lost connections. But how do I go about actually closing the connection assuming one was created successfully, messages sent, and I no longer need the connection?
If that line is commented, jest returns as expected.
The code is run on CentOS 7 with Node v12.13.0 and:
"winston": "^3.2.1",
"winston-syslog": "^2.4.0",
"winston-transport": "^4.3.0"
Am I missing any configuration or clean up calls? Is there a way I can make the close() event handler exit?
Thanks!
I am running a slow operation via a Cloud Tasks queue to delete objects from Google Cloud Storage. I have noticed that the task queue retries the task after two minutes have passed, even though the running task has not yet finished or errored.
What is the best strategy to trigger valid retries, but not retry while the task is still running?
Here's my task creator:
router.get('/start-delete-old', async (req, res) => {
  const task = {
    appEngineHttpRequest: {
      httpMethod: 'POST',
      relativeUri: `/videos/delete-old`,
    },
  };
  const request = {
    parent: taskClient.parent,
    task: task,
  };
  const [response] = await taskClient.queue.createTask(request);
  res.send(response);
});
Here's my task handler:
router.post('/delete-old', async (req, res) => {
  let cameras = await knex('cameras');
  let date = moment().subtract(365, 'days').format('YYYY-MM-DD');
  for (let i = 0; i < cameras.length; i++) {
    let camera = cameras[i];
    let prefix = `${camera.id}/${date}/`;
    try {
      await bucket.deleteFiles({ prefix: prefix, force: true });
      await knex.raw(`delete from videos where camera_id = ${camera.id} and cast(start_time as date) = '${date}'`);
    } catch (e) {
      console.log('error deleting ' + e);
    }
  }
  res.send({});
});
As per the documentation, the task timeout varies depending on the environment you are using:
Standard environment
Automatic scaling: task processing must finish in 10 minutes.
Manual and basic scaling: requests can run up to 24 hours.
Flexible environment
For worker services running in the flex environment: all types have a 60 minute timeout.
So, if your handler misses the deadline, the queue assumes the task failed and retries it.
Also, the task queue expects to receive a status code between 200 and 299; if not, it will assume that the running task failed. Quoting the documentation:
Upon successful completion of processing, the handler must send an HTTP status code between 200 and 299 back to the queue. Any other value indicates the task has failed and the queue retries the task.
I believe that both the bucket deleteFiles call and the knex raw query are taking a long time to be processed, and this is causing the handler to return a status other than 200-299.
One good way to troubleshoot is to use Stackdriver logs; you will be able to gather more information about the ongoing processes and see whether any of them returns an error.
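Separately, a minimal sketch of the handler signaling success and failure explicitly, so that retries only happen when the work actually failed (assuming the Express router from the question; deleteForCamera is a hypothetical helper wrapping the per-camera cleanup shown above):

router.post('/delete-old', async (req, res) => {
  try {
    // deleteForCamera is a hypothetical helper wrapping the per-camera
    // bucket.deleteFiles + knex cleanup from the question.
    for (const camera of await knex('cameras')) {
      await deleteForCamera(camera);
    }
    res.status(200).send({}); // 2xx: the queue treats the task as done
  } catch (e) {
    console.log('error deleting ' + e);
    res.status(500).send({}); // non-2xx: the queue retries the task
  }
});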