We are attempting to migrate a message processing app from Kafka to Google Pub/Sub and it's just not working as expected.
We are running in Kubernetes (Google Cloud) where there may be multiple pods processing messages on the same subscription. Topics and subscriptions are all created using terraform and are more or less permanent. They are not created/destroyed on the fly by the application.
In our development environment, where message throughput is rather low, everything works just fine. But when we scale up to production levels, everything seems to fall apart. We get big backlogs of unacked messages, and yet some pods are not receiving any messages at all. And then, all of a sudden, the backlog will just go away, but then climb again.
We are using the Node.js client library provided by Google: @google-cloud/pubsub:3.1.0
Each instance of the application subscribes to the same named subscription, and according to the documentation, messages should be distributed to each subscriber. But that is not happening. Some pods will be consuming messages rapidly, while others sit idle.
Every message is processed in a try/catch block and we are not observing any errors being thrown. So, as far as we know, every received message is getting acked.
I suspect that, as pods are terminated by autoscaling or updated deployments, we are not properly closing subscriptions, but there are no examples addressing a distributed environment and I have not found any document that specifically addresses how to properly manage resources. It is also worth mentioning that the app has multiple subscriptions to different topics.
When a pod shuts down, what actions should be taken on the Subscription object and the PubSub client object? Maybe that's not even the issue, but it seems like a reasonable place to start.
When we start a subscription we do something like this:
private exampleSubscribe(): Subscription {
  // one suggestion for having multiple subscriptions in the same app
  // was to use separate clients for each
  const pubSubClient = new PubSub({
    // use a regional endpoint for message ordering
    apiEndpoint: 'us-central1-pubsub.googleapis.com:443',
  });
  pubSubClient.projectId = 'my-project-id';
  const sub = pubSubClient.subscription('my-subscription-name', {
    // have tried various values for maxMessages from 5 to the default of 1000
    flowControl: { maxMessages: 250, allowExcessMessages: false },
    ackDeadline: 30,
  });
  sub.on('message', async (message) => {
    await this.exampleMessageProcessing(message);
  });
  return sub;
}
private async exampleMessageProcessing(message: Message): Promise<void> {
  try {
    // do some cool stuff
  } catch (error) {
    // log the error
  } finally {
    message.ack();
  }
}
Upon termination of a pod, we do this:
private async exampleCloseSub(sub: Subscription) {
  try {
    sub.removeAllListeners('message');
    await sub.close();
    // note that we do nothing with the PubSub
    // client object -- should it also be closed?
  } catch (error) {
    // ignore error, we are shutting down
  }
}
When running with Kafka, we can easily keep up with the message pace with usually no more than 2 pods. So I know that we are not running into issues of it simply taking too long to process each message.
Why are messages being left unacked? Why are pods not receiving messages when there is clearly a large backlog? What is the correct way to shut down one subscriber on a shared subscription?
It turns out that the issue was an improper implementation of message ordering.
The official docs for message ordering in Pub/Sub are rather brief:
https://cloud.google.com/pubsub/docs/ordering
Not much there regarding how to implement an ordering key or the implications of message ordering on horizontal scaling.
Though they do link to some external resources, one of which is this blog post:
https://medium.com/google-cloud/google-cloud-pub-sub-ordered-delivery-1e4181f60bc8
In our case, we did not have enough distinct ordering keys to allow for proper distribution of messages across subscribers/pods.
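To illustrate why too few ordering keys starve pods, here is a toy model. It assumes that every ordering key is pinned to exactly one subscriber at a time, and it picks that subscriber with a simple string hash. The hash, the function names, and the key names are made up for illustration; Pub/Sub's real assignment logic is internal and is not reproduced here.

```typescript
// Toy model, NOT Pub/Sub's real algorithm: each ordering key is pinned to
// exactly one subscriber, chosen here by a simple string hash.
function hashKey(key: string): number {
  let h = 0;
  for (let i = 0; i < key.length; i++) {
    h = (h * 31 + key.charCodeAt(i)) >>> 0; // keep it an unsigned 32-bit value
  }
  return h;
}

// Maps each subscriber index to the ordering keys it would own.
function assignKeys(keys: string[], subscriberCount: number): Map<number, string[]> {
  const assignment = new Map<number, string[]>();
  for (const key of keys) {
    const subscriber = hashKey(key) % subscriberCount;
    const owned = assignment.get(subscriber) || [];
    owned.push(key);
    assignment.set(subscriber, owned);
  }
  return assignment;
}

// Number of subscribers that own at least one key, i.e. that can receive work.
function busySubscribers(keys: string[], subscriberCount: number): number {
  return assignKeys(keys, subscriberCount).size;
}

// Hypothetical keys: three distinct keys can keep at most three of ten pods busy.
console.log(busySubscribers(["tenant-a", "tenant-b", "tenant-c"], 10));
```

In this model the number of busy pods can never exceed the number of distinct ordering keys, no matter how far the deployment scales out, which matches the idle pods and the uneven backlog described in the question.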
So this was definitely an RTFM situation, or more accurately: Read The Fine Blog Post Referred To By The Manual. I would have much preferred that the important details were actually in the official documentation. Is that too much to ask for?
Related
I am stuck on a performance issue when using Pub/Sub to trigger a function.
// this is called from index.ts
export function downloadService() {
  // References an existing subscription
  const subscription = pubsub.subscription("DOWNLOAD-sub");
  // Create an event handler to handle messages
  // let messageCount = 0;
  const messageHandler = async (message : any) => {
    console.log(`Received message ${message.id}:`);
    console.log(`\tData: ${message.data}`);
    console.log(`\tAttributes: ${message.attributes.type}`);
    // "Ack" (acknowledge receipt of) the message
    message.ack();
    await exportExcel(message); // my function
    // messageCount += 1;
  };
  // Listen for new messages until timeout is hit
  subscription.on("message", messageHandler);
}
async function exportExcel(message : any) {
  // get data from database
  const movies = await Sales.findAll({
    attributes: [
      "SALES_STORE",
      "SALES_CTRNO",
      "SALES_TRANSNO",
      "SALES_STATUS",
    ],
    raw: true,
  });
  ... processing to excel// 800k rows
  ... bucket.upload to gcs
}
The function above works fine if I trigger ONLY one Pub/Sub message.
However, it runs into memory problems or database connection timeouts if I trigger many Pub/Sub messages in a short period of time.
The problem I found is that the first run has not finished yet when further messages from Pub/Sub call the function again, so several runs are processed at the same time.
I have no idea how to resolve this. Would implementing a queue worker or using Google Cloud Tasks solve the problem?
As mentioned by @chovy in the comments, the exportExcel function calls need to be queued, since the function's execution is not keeping up with the rate of invocation. One of the modules that can be used to queue function calls is async. Please note that the async module is not officially supported by Google.
As an alternative, you can employ flow control features on the subscriber side. Data pipelines often receive sporadic spikes in published traffic, which can overwhelm subscribers as they try to catch up. The usual response to high published throughput on a subscription would be to dynamically autoscale subscriber resources to consume more messages. However, this can incur unwanted costs (for instance, you may need more VMs) and can lead to additional capacity planning. Flow control features on the subscriber side can help control the unhealthy behavior of these tasks on the pipeline by allowing the subscriber to regulate the rate at which messages are ingested. Please refer to this blog for more information on flow control features.
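The queueing idea can be sketched without any extra module. The snippet below is a minimal sketch, not the async module's API: SerialQueue and fakeExport are hypothetical names, and fakeExport merely stands in for the slow exportExcel call from the question.

```typescript
// Runs async tasks strictly one at a time, in the order they were enqueued.
class SerialQueue {
  private tail: Promise<void> = Promise.resolve();

  enqueue<T>(task: () => Promise<T>): Promise<T> {
    // Start the task only after everything queued before it has settled.
    const result = this.tail.then(task);
    // Keep the chain alive even if a task rejects.
    this.tail = result.then(
      () => undefined,
      () => undefined
    );
    return result;
  }
}

// Hypothetical stand-in for exportExcel(message); tracks how many run at once.
let running = 0;
let maxRunning = 0;
async function fakeExport(id: number): Promise<number> {
  running += 1;
  maxRunning = Math.max(maxRunning, running);
  await new Promise<void>((resolve) => setTimeout(resolve, 10));
  running -= 1;
  return id;
}

// Three "messages" arrive at once but are exported one after another.
const queue = new SerialQueue();
const done: Promise<number[]> = Promise.all(
  [1, 2, 3].map((id) => queue.enqueue(() => fakeExport(id)))
);
done.then((ids) => console.log("finished", ids, "max concurrent:", maxRunning));
```

In the handler from the question this would become something like queue.enqueue(() => exportExcel(message)), with message.ack() called only after that promise resolves. As far as I understand the client library, flow control counts messages that have not yet been acked or nacked, so acking before exportExcel finishes would let the next message in immediately and defeat the limit.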
I'm using a durable function that's triggered off a queue. I'm sending messages off the queue to a service that is pretty flaky, so I set up the RetryPolicy. Even still, I'd like to be able to see the failed messages even if the max retries has been exhausted.
Do I need to manually throw those to a dead-letter queue (and if so, it's not clear to me how I know when a message has been retried any number of times), or will the function naturally throw those to some kind of dead-letter/poison queue?
When an activity fails in Durable Functions, the exception is marshalled back to the orchestration and thrown there as a FunctionFailedException. It doesn't matter whether you used automatic retry or not: at the very end, the whole activity fails and it's up to you to handle the situation. As per the documentation:
try
{
    await context.CallActivityAsync("CreditAccount",
        new
        {
            Account = transferDetails.DestinationAccount,
            Amount = transferDetails.Amount
        });
}
catch (Exception)
{
    // Refund the source account.
    // Another try/catch could be used here based on the needs of the application.
    await context.CallActivityAsync("CreditAccount",
        new
        {
            Account = transferDetails.SourceAccount,
            Amount = transferDetails.Amount
        });
}
The only thing retry changes is the handling of transient errors (so you do not have to fall back to the safe route every time you hit, for example, a network issue).
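The same flow, retry first and then handle the final failure yourself, can be sketched generically. This is not the Durable Functions API: processWithDeadLetter, DeadLetter, and the in-memory deadLetters array are hypothetical stand-ins for whatever dead-letter store you choose, and in a real orchestration the retry loop would be the framework's retry option rather than hand-written.

```typescript
// Record kept for every payload that still fails after all retries.
interface DeadLetter<T> {
  payload: T;
  attempts: number;
  lastError: string;
}

// Tries the activity up to maxAttempts times; on final failure the payload is
// pushed to the dead-letter list instead of being lost.
async function processWithDeadLetter<T>(
  payload: T,
  activity: (payload: T) => Promise<void>,
  maxAttempts: number,
  deadLetters: DeadLetter<T>[]
): Promise<boolean> {
  let lastError = "";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await activity(payload);
      return true; // succeeded, nothing to dead-letter
    } catch (error) {
      lastError = error instanceof Error ? error.message : String(error);
    }
  }
  // Retries exhausted: this is the place to forward the message yourself.
  deadLetters.push({ payload, attempts: maxAttempts, lastError });
  return false;
}

// Hypothetical flaky service that never recovers.
const deadLetters: DeadLetter<string>[] = [];
let calls = 0;
const alwaysFails = async (_payload: string): Promise<void> => {
  calls += 1;
  throw new Error("service unavailable");
};

const outcome = processWithDeadLetter("order-42", alwaysFails, 3, deadLetters);
outcome.then((ok) => console.log("succeeded:", ok, "dead letters:", deadLetters));
```

The point of the sketch is only that the catch after the last attempt is where the failed message becomes visible; where it is stored afterwards (a separate queue, a table, a log) is an application decision.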
Background
I have several clients sending messages to an Azure Service Bus queue. To keep up, I need several machines reading from that queue and consuming the messages as they arrive, using Node.js.
Research
I have read the Azure Service Bus queues tutorial and I am aware I can use receiveQueueMessage to read a message from the queue.
However, the tutorial does not mention how one can listen to a queue and read messages as soon as they arrive.
I know I can simply poll the queue for messages, but this spams the servers with requests for no real benefit.
After searching in SO, I found a discussion where someone had a similar issue:
Listen to Queue (Event Driven no polling) Service-Bus / Storage Queue
And I know they ended up using the C# async method ReceiveAsync, but it is not clear to me whether:
that method is available for Node.js
that method reads messages from the queue as soon as they arrive, as I need.
Problem
The documentation for Node.js is close to nonexistent, with that one tutorial being the only major document I found.
Question
How can my workers be notified of an incoming message in Azure Service Bus queues?
Answer
According to Azure support, it is not possible to be notified when a queue receives a message. This is valid for every language.
Workarounds
There are 2 main workarounds for this issue:
Use Azure topics and subscriptions. This way you can have all clients subscribed to an event new-message and have them check the queue once they receive the notification. This has several problems though: first, you have to pay for yet another Azure service, and second, you can have multiple clients trying to read the same message.
Continuous polling. Have the clients check the queue every X seconds. This solution is horrible, as you end up paying for the network traffic you generate and you spam the service with useless requests. To help minimize this there is a concept called long polling, which is so poorly documented it might as well not exist. I did find this NPM module though: https://www.npmjs.com/package/azure-awesome-queue
Alternatives
Honestly, at this point, you may be wondering why you should be using this service. I agree...
As an alternative there is RabbitMQ which is free, has a community, good documentation and a ton more features.
The downside here is that maintaining a RabbitMQ fault tolerant cluster is not exactly trivial.
Another alternative is Apache Kafka which is also very reliable.
You can receive messages from the Service Bus queue via the subscribe method, which listens to a stream of messages. Example from the Azure documentation below:
const { delay, ServiceBusClient, ServiceBusMessage } = require("@azure/service-bus");
// connection string to your Service Bus namespace
const connectionString = "<CONNECTION STRING TO SERVICE BUS NAMESPACE>";
// name of the queue
const queueName = "<QUEUE NAME>";
async function main() {
  // create a Service Bus client using the connection string to the Service Bus namespace
  const sbClient = new ServiceBusClient(connectionString);
  // createReceiver() can also be used to create a receiver for a subscription.
  const receiver = sbClient.createReceiver(queueName);
  // function to handle messages
  const myMessageHandler = async (messageReceived) => {
    console.log(`Received message: ${messageReceived.body}`);
  };
  // function to handle any errors
  const myErrorHandler = async (error) => {
    console.log(error);
  };
  // subscribe and specify the message and error handlers
  receiver.subscribe({
    processMessage: myMessageHandler,
    processError: myErrorHandler
  });
  // Wait long enough to receive messages before closing the receiver
  await delay(20000);
  await receiver.close();
  await sbClient.close();
}
// call the main function
main().catch((err) => {
  console.log("Error occurred: ", err);
  process.exit(1);
});
source :
https://learn.microsoft.com/en-us/azure/service-bus-messaging/service-bus-nodejs-how-to-use-queues
I asked myself the same question; here is what I found.
Use Google PubSub, it does exactly what you are looking for.
If you want to stay with Azure, the following is possible:
Azure Functions can be triggered by Service Bus messages
trigger an Event Hub event with that function
receive the event and fetch the message from Service Bus
You can make use of serverless functions with a "ServiceBusQueueTrigger";
they are invoked as soon as a message arrives in the queue.
It's pretty straightforward in Node.js: you need bindings defined in function.json which have the type
"type": "serviceBusTrigger",
This article (https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-service-bus#trigger---javascript-example) covers it in more detail.
We have WebJobs consisting of several methods in a single Functions.cs file. They have Service Bus triggers on topics/queues, so they keep listening to the topic/queue for a BrokeredMessage. As soon as a message arrives, our processing logic does a lot of work. However, we sometimes find that all the WebJobs are suddenly reinitialized. I found a few articles saying that WebJobs do get reinitialized and that this is normal.
But I am not sure whether that is unavoidable. Can we prevent the reinitialization? We call brokeredMessage.Complete as soon as we get the BrokeredMessage, since we do not want it to be processed again and again.
Also, we have a few WebJobs in one App Service and a few in another, and the WebJobs in both App Services are reinitialized at the same time. I am not sure why.
You should design your process to be able to deal with occasional disconnects and failures, since this is a "feature" of applications living in the cloud.
Use a transaction to manage the critical area of your code.
Pseudo/commented code below, and a link to the Microsoft documentation is here.
var msg = receiver.Receive();
using (var scope = new TransactionScope())
{
    // Do whatever work is required
    // Starting with computation and business logic.
    // Finishing with any persistence or new message generation,
    // giving your application the best chance of success.
    // Keep in mind that all BrokeredMessage operations are enrolled in
    // the transaction. They will all succeed or fail.
    // If you have multiple data stores to update, you can use brokered messages
    // to send new individual messages to do the operation on each store,
    // giving eventual consistency.
    msg.Complete(); // mark the message as done
    scope.Complete(); // declare the transaction done
}
Is there any way to configure triggers without attributes? I cannot know the queue names ahead of time.
Let me explain my scenario here.. I have one service bus queue, and for various reasons (complicated duplicate-suppression business logic), the queue messages have to be processed one at a time, so I have ServiceBusConfiguration.OnMessageOptions.MaxConcurrentCalls set to 1. So processing a message holds up the whole queue until it is finished. Needless to say, this is suboptimal.
This 'one at a time' policy isn't so simple. The messages could be processed in parallel, they just have to be divided into groups (based on a field in message), say A and B. Group A can process its messages one at a time, and group B can process its own one at a time, etc. A and B are processed in parallel, all is good.
So I can create a queue for each group, A, B, C, ... etc. There are about 50 groups, so 50 queues.
I can create a queue for each, but how to make this work with the Azure Webjobs SDK? I don't want to copy-paste a method for each queue with a different ServiceBusTrigger for the SDK to discover, just to enforce one-at-a-time per queue/group, then update the code with another copy-paste whenever another group is needed. Fetching a list of queues at startup and tying to the function is preferable.
I have looked around and I don't see any way to do what I want. The ITypeLocator interface is pretty hard-set to look for attributes. I could probably abuse the INameResolver, but it seems like I'd still have to have a bunch of near-duplicate methods around. Could I somehow create what the SDK is looking for at startup/runtime?
(To be clear, I know how to use INameResolver to get queue name as at How to set Azure WebJob queue name at runtime? but though similar this isn't my problem. I want to setup triggers for multiple queues at startup for the same function to get the one-at-a-time per queue processing, without using the trigger attribute 50 times repeatedly. I figured I'd ask again since the SDK repo is fairly active and it's been a year..).
Or am I going about this all wrong? Being dumb? Missing something? Any advice on this dilemma would be welcome.
The Azure WebJobs host discovers and indexes the functions with the ServiceBusTrigger attribute when it starts, so there is no way to set up the queues to trigger on at runtime.
The simplest solution for you is to create a long-running job and implement the listening manually:
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.ServiceBus.Messaging;

public class Program
{
    private static void Main()
    {
        var host = new JobHost();
        host.CallAsync(typeof(Program).GetMethod("Process"));
        host.RunAndBlock();
    }
    [NoAutomaticTriggerAttribute]
    public static async Task Process(TextWriter log, CancellationToken token)
    {
        var connectionString = "myconnectionstring";
        // The queue names could also come from app settings or an Azure table
        var queueNames = new[] {"queueA", "queueB" };
        var messagingFactory = MessagingFactory.CreateFromConnectionString(connectionString);
        foreach (var queueName in queueNames)
        {
            var receiver = messagingFactory.CreateMessageReceiver(queueName);
            receiver.OnMessage(message =>
            {
                try
                {
                    // do something
                    // ....
                    // Complete the message
                    message.Complete();
                }
                catch (Exception ex)
                {
                    // Log the error
                    log.WriteLine(ex.ToString());
                    // Abandon the message so that it can be retried.
                    message.Abandon();
                }
            }, new OnMessageOptions() { MaxConcurrentCalls = 1});
        }
        // wait until the job is stopped or restarted
        await Task.Delay(Timeout.InfiniteTimeSpan, token);
    }
}
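The shape of that job, one receiver per queue with each receiver handling a single message at a time, can be illustrated without the Service Bus SDK. The sketch below is in TypeScript rather than C#, and KeyedSerialExecutor and work are hypothetical names: it only models the scheduling rule from the question (serial within a group, parallel across groups), not message receipt, completion, or abandonment.

```typescript
// Serializes tasks that share a group key while letting different groups
// run in parallel, mirroring "one queue per group, MaxConcurrentCalls = 1".
class KeyedSerialExecutor {
  private tails = new Map<string, Promise<void>>();

  run(group: string, task: () => Promise<void>): Promise<void> {
    const previous = this.tails.get(group) || Promise.resolve();
    const result = previous.then(task);
    // A failed task must not block later tasks of the same group.
    this.tails.set(group, result.then(() => undefined, () => undefined));
    return result;
  }
}

// Bookkeeping to observe concurrency per group and overall.
const active = new Map<string, number>();
let total = 0;
let maxPerGroup = 0;
let maxTotal = 0;

// Hypothetical unit of work for one message of the given group.
function work(group: string): () => Promise<void> {
  return async () => {
    const inGroup = (active.get(group) || 0) + 1;
    active.set(group, inGroup);
    total += 1;
    maxPerGroup = Math.max(maxPerGroup, inGroup);
    maxTotal = Math.max(maxTotal, total);
    await new Promise<void>((resolve) => setTimeout(resolve, 10));
    active.set(group, (active.get(group) || 0) - 1);
    total -= 1;
  };
}

const executor = new KeyedSerialExecutor();
const finished = Promise.all([
  executor.run("A", work("A")),
  executor.run("A", work("A")), // waits for the first A task
  executor.run("B", work("B")), // runs alongside the first A task
]);
finished.then(() => console.log({ maxPerGroup, maxTotal }));
```

Adding a group then means adding a key (or, in the job above, a queue name to the list) rather than copying a trigger method.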
Otherwise, if you don't want to deal with multiple queues, you can have a look at Azure Service Bus topics/subscriptions and create a SqlFilter to route each message to the right subscription.
Another option could be to create your own trigger: the Azure WebJobs SDK provides extensibility points for creating your own trigger binding:
Binding Extensions Overview
Good Luck !
Based on my understanding, you need to build a system that processes message batches in parallel. @Thomas's solution is good, but I think the Azure Batch service with Table storage may be a better fit and could replace the more complex combination of Service Bus queues and WebJobs with a trigger.
With Azure Batch and Table storage you can control task creation, execute the tasks in parallel and at scale, and even monitor them. Please refer to the tutorial to learn how.