How does pubsub know how many messages I published at a point in time? - node.js

Code for publishing the messages here:
async function publishMessage(topicName) {
console.log(`[${new Date().toISOString()}] publishing messages`);
const pubsub = new PubSub({ projectId: PUBSUB_PROJECT_ID });
const topic = pubsub.topic(topicName, {
batching: {
maxMessages: 10,
maxMilliseconds: 10 * 1000,
},
});
const n = 5;
const dataBufs: Buffer[] = [];
for (let i = 0; i < n; i++) {
const data = `message payload ${i}`;
const dataBuffer = Buffer.from(data);
dataBufs.push(dataBuffer);
}
const results = await Promise.all(
dataBufs.map((dataBuf, idx) =>
topic.publish(dataBuf).then((messageId) => {
console.log(`[${new Date().toISOString()}] Message ${messageId} published. index: ${idx}`);
return messageId;
})
)
);
console.log('results:', results.toString());
}
As you can see, I am going to publish 5 messages. The time to publish is await Promise.all(...), I mean, for users, We can say send messages at this moment, but for internal of pubsub library maybe not. I set maxMessages to 10, so pubsub will wait for 10 seconds(maxMilliseconds), then publish these 5 messages.
The exuection result meets my expectations:
[2020-05-05T09:53:32.078Z] publishing messages
[2020-05-05T09:53:42.209Z] Message 36854 published. index: 0
[2020-05-05T09:53:42.209Z] Message 36855 published. index: 1
[2020-05-05T09:53:42.209Z] Message 36856 published. index: 2
[2020-05-05T09:53:42.209Z] Message 36857 published. index: 3
[2020-05-05T09:53:42.209Z] Message 36858 published. index: 4
results: 36854,36855,36856,36857,36858
In fact, I think topic.publish does not directly call the remote pubsub service, but pushes the message into the memory queue. And there is a window time to calculate the count of the messages, maybe in a tick or something like:
// internal logic of #google/pubsub library
setTimeout(() => {
// if user messages to be published gte maxMessages, then, publish them immediately
if(getLength(messageQueue) >= maxMessages) {
callRemotePubsubService(messageQueue)
}
}, /* window time = */ 100);
Or using setImmediate(), process.nextTick()?

Note that the conditions for sending a message to the service is an OR not an AND. In other words, if either maxMessages messages are waiting to be sent OR maxMilliseconds has passed since the library received the first outstanding message, it will send the outstanding messages to the server.
The source code for the client library is available, so you can see exactly what it does. The library has a queue that it uses to track messages that haven't been sent yet. When a message is added, if the queue is now full (based on the batching settings), then it immediately calls publish. When the first message is added, it uses setTimeout to schedule a call that ultimately calls publish on the service. The publisher client has an instance of the queue to which it adds messages when publish is called.

Related

Cloud Run PubSub high latency

I'm building a microservice application consisting of many microservices build with Node.js and running on Cloud Run. I use PubSub in several different ways:
For streaming data daily. The microservices responsible for gathering analytical data from different advertising services (Facebook Ads, LinkedIn Ads, etc.) use PubSub to stream data to a microservice responsible for uploading data to Google BigQuery. There also are services that stream a higher load of data (> 1 Gb) from CRMs and other services by splitting it into smaller chunks.
For messaging among microservices about different events that don't require an immediate response.
Earlier, I experienced some insignificant latency with PubSub. I know it's an open issue considering up to several seconds latency with low messages throughput. But in my case, we are talking about several minutes latency.
Also, I occasionally get an error message
Received error while publishing: Total timeout of API google.pubsub.v1.Publisher exceeded 60000 milliseconds before any response was received.
I this case a message is not sent at all or is highly delayed.
This is how my code looks like.
const subscriptions = new Map<string, Subscription>();
const topics = new Map<string, Topic>();
const listenForMessages = async (
subscriptionName: string,
func: ListenerCallback,
secInit = 300,
secInter = 300
) => {
let logger = new TestLogger("LISTEN_FOR_MSG");
let init = true;
const _setTimeout = () => {
let timer = setTimeout(() => {
console.log(`Subscription to ${subscriptionName} cancelled`);
subscription.removeListener("message", messageHandler);
}, (init ? secInit : secInter) * 1000);
init = false;
return timer;
};
const messageHandler = async (msg: Message) => {
msg.ack();
await func(JSON.parse(msg.data.toString()));
// wait for next message
timeout = _setTimeout();
};
let subscription: Subscription;
if (subscriptions.has(subscriptionName)) {
subscription = subscriptions.get(subscriptionName);
} else {
subscription = pubSubClient.subscription(subscriptionName);
subscriptions.set(subscriptionName, subscription);
}
let timeout = _setTimeout();
subscription.on("message", messageHandler);
console.log(`Listening for messages: ${subscriptionName}`);
};
const publishMessage = async (
data: WithAnyProps,
topicName: string,
options?: PubOpt
) => {
const serializedData = JSON.stringify(data);
const dataBuffer = Buffer.from(serializedData);
try {
let topic: Topic;
if (topics.has(topicName)) {
topic = topics.get(topicName);
} else {
topic = pubSubClient.topic(topicName, {
batching: {
maxMessages: options?.batchingMaxMessages,
maxMilliseconds: options?.batchingMaxMilliseconds,
},
});
topics.set(topicName, topic);
}
let msg = {
data: dataBuffer,
attributes: options.attributes,
};
await topic.publishMessage(msg);
console.log(`Publishing to ${topicName}`);
} catch (err) {
console.error(`Received error while publishing: ${err.message}`);
}
};
A listenerForMessage function is triggered by an HTTP request.
What I have already checked
PubSub client is created only once outside the function.
Topics and Subscriptions are reused.
I made at least one instance of each container running to eliminate the possibility of delays triggered by cold start.
I tried to increase the CPU and Memory capacity of containers.
batchingMaxMessages and batchingMaxMilliseconds are set to 1
I checked that the latest version of #google-cloud/pubsub is installed.
Notes
High latency problem occurs only in the cloud environment. With local tests, everything works well.
Timeout error sometimes occurs in both environments.
The problem was in my understanding of Cloud Run Container's lifecycle. I used to send HTTP response 202 while having PubSub working in the background. After sending the response, the container switched to the idling state, what looked like high latency in my logs.

Any suggestions about how to publish a huge amount of messages within one round of request / response?

If I publish 50K messages using Promise.all like below:
const pubsub = new PubSub({ projectId: PUBSUB_PROJECT_ID });
const topic = pubsub.topic(topicName, {
batching: {
maxMessages: 1000,
maxMilliseconds: 100,
},
});
const n = 50 * 1000;
const dataBufs: Buffer[] = [];
for (let i = 0; i < n; i++) {
const data = `message payload ${i}`;
const dataBuffer = Buffer.from(data);
dataBufs.push(dataBuffer);
}
const tasks = dataBufs.map((d, idx) =>
topic.publish(d).then((messageId) => {
console.log(`[${new Date().toISOString()}] Message ${messageId} published. index: ${idx}`);
})
);
// publish messages concurrencly
await Promise.all(tasks);
// send response to front-end
res.json(data);
I will hit this issue: pubsub-emulator throw error and publisher throw "Retry total timeout exceeded before any response was received" when publish 50k messages
If I use for loop and async/await. The issue is gone.
const n = 50 * 1000;
for (let i = 0; i < n; i++) {
const data = `message payload ${i}`;
const dataBuffer = Buffer.from(data);
const messageId = await topic.publish(dataBuffer)
console.log(`[${new Date().toISOString()}] Message ${messageId} published. index: ${i}`)
}
// some logic ...
// send response to front-end
res.json(data);
But it will block the execution of subsequent logic because of async/await until all messages have been published. It takes a long time to post 50k messages.
Any suggestions about how to publish a huge amount of messages(about 50k) without blocking the execution of subsequent logic? Do I need to use child_process or some queue like bull to publish the huge amount of messages in the background without blocking request/response workflow of the API? This means I need to respond to the front-end as soon as possible, the 50k messages should be the background tasks.
It seems there is a memory queue inside #google/pubsub library. I am not sure if I should use another queue like bull again.
The time it will take to publish large amounts of data depends on a lot of factors:
Message size. The larger the messages, the longer it takes to send them.
Network capacity (both of the connection between wherever the publisher is running and Google Cloud and, if relevant, of the virtual machine itself). This puts an upper bound on the amount of data that can be transmitted. It is not atypical to see smaller virtual machines with limits in the 40MB/s range. Note that if you are testing via Wifi, the limits could be even lower than this.
Number of threads and number of CPU cores. When having to run a lot of asynchronous callbacks, the ability to schedule them to run can be limited by the parallel capacity of the machine or runtime environment.
Typically, it is not good to try to send 50,000 publishes simultaneously from one instance of a publisher. It is likely that the above factors will cause the client to get overloaded and result in deadline exceeded errors. The best way to prevent this is to limit the number of messages that can be outstanding for publish at one time. Some of the libraries like Java support this natively. The Node.js library does not yet support this feature, but likely will in the future.
In the meantime, you'd want to keep a counter of the number of messages outstanding and limit it to whatever the client seems to be able to handle. Start with 1000 and work up or down from there based on the results. A semaphore would be a pretty standard way to achieve this behavior. In your case the code would look something like this:
var sem = require('semaphore')(1000);
var publishes = []
const tasks = dataBufs.map((d, idx) =>
sem.take(function() => {
publishes.push(topic.publish(d).then((messageId) => {
console.log(`[${new Date().toISOString()}] Message ${messageId} published. index: ${idx}`);
sem.leave();
}));
})
);
// Await the start of publishing all messages
await Promise.all(tasks);
// Await the actual publishes
await Promise.all(publishes);

Google Cloud PubSub not ack messages

We have the system of publisher and subscriber systems based on GCP PubSub. Subscriber processing single message quite long, about 1 minute. We already set subscribers ack deadline to 600 seconds (10 minutes) (maximal one) to make sure, that pubsub will not start redelivery too earlier, as basically we have long running operation here.
I'm seeing this behavior of PubSub. While code sending ack, and monitor confirms that PubSub acknowledgement request has been accepted and acknowledgement itself completed with success status, total number of unacked messages still the same.
Metrics on the charts showing the same for sum, count and mean aggregation aligner. On the picture above aligner is mean and no reducers enabled.
I'm using #google-cloud/pubsub Node.js library. Different versions has been tried (0.18.1, 0.22.2, 0.24.1), but I guess issue not in them.
The following class can be used to check.
TypeScript 3.1.1, Node 8.x.x - 10.x.x
import { exponential, Backoff } from "backoff";
const pubsub = require("#google-cloud/pubsub");
export interface IMessageHandler {
handle (message): Promise<void>;
}
export class PubSubSyncListener {
private readonly client;
private listener: Backoff;
private runningOperations: Promise<unknown>[] = [];
constructor (
private readonly handler: IMessageHandler,
private readonly options: {
/**
* Maximal messages number to be processed simultaniosly.
* Listener will try to keep processing number as close to provided value
* as possible.
*/
maxMessages: number;
/**
* Formatted full subscrption name /projects/{projectName}/subscriptions/{subscriptionName}
*/
subscriptionName: string;
/**
* In milliseconds
*/
minimalListenTimeout?: number;
/**
* In milliseconds
*/
maximalListenTimeout?: number;
}
) {
this.client = new pubsub.v1.SubscriberClient();
this.options = Object.assign({
minimalListenTimeout: 300,
maximalListenTimeout: 30000
}, this.options);
}
public async listen () {
this.listener = exponential({
maxDelay: this.options.maximalListenTimeout,
initialDelay: this.options.minimalListenTimeout
});
this.listener.on("ready", async () => {
if (this.runningOperations.length < this.options.maxMessages) {
const [response] = await this.client.pull({
subscription: this.options.subscriptionName,
maxMessages: this.options.maxMessages - this.runningOperations.length
});
for (const m of response.receivedMessages) {
this.startMessageProcessing(m);
}
this.listener.reset();
this.listener.backoff();
} else {
this.listener.backoff();
}
});
this.listener.backoff();
}
private startMessageProcessing (message) {
const index = this.runningOperations.length;
const removeFromRunning = () => {
this.runningOperations.splice(index, 1);
};
this.runningOperations.push(
this.handler.handle(this.getHandlerMessage(message))
.then(removeFromRunning, removeFromRunning)
);
}
private getHandlerMessage (message) {
message.message.ack = async () => {
const ackRequest = {
subscription: this.options.subscriptionName,
ackIds: [message.ackId]
};
await this.client.acknowledge(ackRequest);
};
return message.message;
}
public async stop () {
this.listener.reset();
this.listener = null;
await Promise.all(
this.runningOperations
);
}
}
This is basically partial implementation of async pulling of the messages and immediate acknowledgment. Because one of the proposed solutions was in usage of the synchronous pulling.
I found similar reported issue in java repository, if I'm not mistaken in symptoms of the issue.
https://github.com/googleapis/google-cloud-java/issues/3567
The last detail here is that acknowledgment seems to work on the low number of requests. In case if I fire single message in pubsub and then immediately process it, undelivered messages number decreases (drops to 0 as only one message was there before).
The question itself - what is happening and why unacked messages number is not reducing as it should when ack has been received?
To quote from the documentation, the subscription/num_undelivered_messages metric that you're using is the "Number of unacknowledged messages (a.k.a. backlog messages) in a subscription. Sampled every 60 seconds. After sampling, data is not visible for up to 120 seconds."
You should not expect this metric to decrease immediately upon acking a message. In addition, it sounds as if you are trying to use pubsub for an exactly once delivery case, attempting to ensure the message will not be delivered again. Cloud Pub/Sub does not provide these semantics. It provides at least once semantics. In other words, even if you have received a value, acked it, received the ack response, and seen the metric drop from 1 to 0, it is still possible and correct for the same worker or another to receive an exact duplicate of that message. Although in practice this is unlikely, you should focus on building a system that is duplicate tolerant instead of trying to ensure your ack succeeded so your message won't be redelivered.

Can I filter an Azure ServiceBusService using node.js SDK?

I have millions of messages in a queue and the first ten million or so are irrelevant. Each message has a sequential ActionId so ideally anything < 10000000 I can just ignore or better yet delete from the queue. What I have so far:
let azure = require("azure");
function processMessage(sb, message) {
// Deserialize the JSON body into an object representing the ActionRecorded event
var actionRecorded = JSON.parse(message.body);
console.log(`processing id: ${actionRecorded.ActionId} from ${actionRecorded.ActionTaken.ActionTakenDate}`);
if (actionRecorded.ActionId < 10000000) {
// When done, delete the message from the queue
console.log(`Deleting message: ${message.brokerProperties.MessageId} with ActionId: ${actionRecorded.ActionId}`);
sb.deleteMessage(message, function(deleteError, response) {
if (deleteError) {
console.log("Error deleting message: " + message.brokerProperties.MessageId);
}
});
}
// immediately check for another message
checkForMessages(sb);
}
function checkForMessages(sb) {
// Checking for messages
sb.receiveQueueMessage("my-queue-name", { isPeekLock: true }, function(receiveError, message) {
if (receiveError && receiveError === "No messages to receive") {
console.log("No messages left in queue");
return;
} else if (receiveError) {
console.log("Receive error: " + receiveError);
} else {
processMessage(sb, message);
}
});
}
let connectionString = "Endpoint=sb://<myhub>.servicebus.windows.net/;SharedAccessKeyName=KEYNAME;SharedAccessKey=[mykey]"
let serviceBusService = azure.createServiceBusService(connectionString);
checkForMessages(serviceBusService);
I've tried looking at the docs for withFilter but it doesn't seem like that applies to queues.
I don't have access to create or modify the underlying queue aside from the operations mentioned above since the queue is provided by a client.
Can I either
Filter my results that I get from the queue
speed up the queue processing somehow?
Filter my results that I get from the queue
As you found, filters as a feature are only applicable to Topics & Subscriptions.
speed up the queue processing somehow
If you were to use the #azure/service-bus package which is the newer, faster library to work with Service Bus, you could receive the messages in ReceiveAndDelete mode until you reach the message with ActionId 9999999, close that receiver and then create a new receiver in PeekLock mode. For more on these receive modes, see https://learn.microsoft.com/en-us/azure/service-bus-messaging/message-transfers-locks-settlement#settling-receive-operations

can I limit consumption of kafka-node consumer?

It seems like my kafka node consumer:
var kafka = require('kafka-node');
var consumer = new Consumer(client, [], {
...
});
is fetching way too many messages than I can handle in certain cases.
Is there a way to limit it (for example accept no more than 1000 messages per second, possibly using the pause api?)
I'm using kafka-node, which seems to have a limited api comparing to the Java version
In Kafka, poll and process should happen in a coordinated/synchronized way. Ie, after each poll, you should process all received data first, before you do the next poll. This pattern will automatically throttle the number of messages to the max throughput your client can handle.
Something like this (pseudo-code):
while(isRunning) {
messages = poll(...)
for(m : messages) {
process(m);
}
}
(That is the reason, why there is not parameter "fetch.max.messages" -- you just do not need it.)
I had a similar situation where I was consuming messages from Kafka and had to throttle the consumption because my consumer service was dependent on a third party API which had its own constraints.
I used async/queue along with a wrapper of async/cargo called asyncTimedCargo for batching purpose.
The cargo gets all the messages from the kafka-consumer and sends it to queue upon reaching a size limit batch_config.batch_size or timeout batch_config.batch_timeout.
async/queue provides saturated and unsaturated callbacks which you can use to stop the consumption if your queue task workers are busy. This would stop the cargo from filling up and your app would not run out of memory. The consumption would resume upon unsaturation.
//cargo-service.js
module.exports = function(key){
return new asyncTimedCargo(function(tasks, callback) {
var length = tasks.length;
var postBody = [];
for(var i=0;i<length;i++){
var message ={};
var task = JSON.parse(tasks[i].value);
message = task;
postBody.push(message);
}
var postJson = {
"json": {"request":postBody}
};
sms_queue.push(postJson);
callback();
}, batch_config.batch_size, batch_config.batch_timeout)
};
//kafka-consumer.js
cargo = cargo-service()
consumer.on('message', function (message) {
if(message && message.value && utils.isValidJsonString(message.value)) {
var msgObject = JSON.parse(message.value);
cargo.push(message);
}
else {
logger.error('Invalid JSON Message');
}
});
// sms-queue.js
var sms_queue = queue(
retryable({
times: queue_config.num_retries,
errorFilter: function (err) {
logger.info("inside retry");
console.log(err);
if (err) {
return true;
}
else {
return false;
}
}
}, function (task, callback) {
// your worker task for queue
callback()
}), queue_config.queue_worker_threads);
sms_queue.saturated = function() {
consumer.pause();
logger.warn('Queue saturated Consumption paused: ' + sms_queue.running());
};
sms_queue.unsaturated = function() {
consumer.resume();
logger.info('Queue unsaturated Consumption resumed: ' + sms_queue.running());
};
From FAQ in the README
Create a async.queue with message processor and concurrency of one (the message processor itself is wrapped with setImmediate function so it will not freeze up the event loop)
Set the queue.drain to resume() the consumer
The handler for consumer's message event to pause() the consumer and pushes the message to the queue.
As far as I know the API does not have any kind of throttling. But both consumers (Consumer and HighLevelConsumer) have a 'pause()' function. So you could stop consuming if you get to much messages. Maybe that already offers what you need.
Please keep in mind what's happening. You send a fetch request to the broker and get a batch of message back. You can configure the min and max size of the messages (according to the documentation not the number of messages) you want to fetch:
{
....
// This is the minimum number of bytes of messages that must be available to give a response, default 1 byte
fetchMinBytes: 1,
// The maximum bytes to include in the message set for this partition. This helps bound the size of the response.
fetchMaxBytes: 1024 * 1024,
}
I was facing the same issue, initially fetchMaxBytes value was
fetchMaxBytes: 1024 * 1024 * 10 // 10MB
I just chanbed it to
fetchMaxBytes: 1024
It worked very smoothly after the change.

Resources