Bots stuck in reserved state in storage

This happened in EIL on 12/13/2021:
[screenshot: realtime view]
None of the bots had jobs, as confirmed on the bot dashboard:
[screenshot: bot dashboard]

Apparently a message was lost because a K8s upgrade was going on and three pods were mistakenly sent to maintenance by ops. Dispatch still had in-progress requests for moving each pod/bot into storage. The solution in this case was the following:
I. Confirm that the in-progress request and the bot assignment still exist for each bot:
SELECT JobId
FROM [iHerb_Scs_Wes_Agv_Dispatch].[request].[Requests]
WHERE Id = (
    SELECT RequestId
    FROM [iHerb_Scs_Wes_Agv_Dispatch].[dispatch].[BotAssignments]
    WHERE BotId = 'c45766cb-21e7-4a91-a509-017cf0e38580'
)
II. Emit a FleetJobCompleted event for each bot, using the JobId found in the previous step, via the RabbitMQ management UI:
https://rabbit-cluster-scs-prod.iherbscs.net/#/queues/scs.wes.agv.dispatch/FleetJobCompleted
{
    "JobId": "EA5E165F-0A72-428B-9B2A-017DB3216120"
}
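If the management UI is unavailable, the event can also be published from a script. Below is a minimal sketch using Python and pika; the credentials are placeholders, the vhost (scs.wes.agv.dispatch) and queue (FleetJobCompleted) are read off the URL above, and whether the consumer expects a plain JSON body or a framework-specific envelope is an assumption that must be verified against the dispatch service first:

import json
import pika

# Hypothetical credentials; vhost and queue name are taken from the
# management-UI URL above (#/queues/<vhost>/<queue>).
params = pika.URLParameters(
    'amqps://user:password@rabbit-cluster-scs-prod.iherbscs.net/scs.wes.agv.dispatch'
)
connection = pika.BlockingConnection(params)
channel = connection.channel()

# Publish to the FleetJobCompleted queue via the default exchange.
# Verify the expected message envelope before using this in production.
channel.basic_publish(
    exchange='',
    routing_key='FleetJobCompleted',
    body=json.dumps({'JobId': 'EA5E165F-0A72-428B-9B2A-017DB3216120'}),
    properties=pika.BasicProperties(content_type='application/json'),
)
connection.close()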

Related

Apache Pulsar Client - Broker notification of Closed consumer - how to resume data feed?

TL;DR: we are using the Python client library to subscribe to a Pulsar topic. The logs show "broker notification of consumer closed" when something happens server-side. According to the logs, the subscription appears to be re-established, but we later find the backlog growing on the cluster because no messages are being sent to our subscription to consume.
We are running into an issue where an Apache Pulsar cluster we use, which is opaque to us and has a namespace defined where we publish/consume topics, is losing the connection with our consumer.
We have a python client consuming from a topic (with one Pulsar Client subscription per thread).
We have run into an issue where, due to an issue on the pulsar cluster, we see the following entry in our client logs:
"Broker notification of Closed consumer"
followed by:
"Created connection for pulsar://houpulsar05.mycompany.com:6650"
...for every thread in our agent.
Then we see the usual periodic log entries like this:
{"log":"2022-09-01 04:23:30.269 INFO [139640375858944] ConsumerStatsImpl:63 | Consumer [persistent://tenant/namespace/topicname, subscription-name, 0] , ConsumerStatsImpl (numBytesRecieved_ = 0, totalNumBytesRecieved_ = 6545742, receivedMsgMap_ = {}, ackedMsgMap_ = {}, totalReceivedMsgMap_ = {[Key: Ok, Value: 3294], }, totalAckedMsgMap_ = {[Key: {Result: Ok, ackType: 0}, Value: 3294], })\n","stream":"stdout","time":"2022-09-01T04:23:30.270009746Z"}
This gives the appearance that some connection has been re-established to some other broker.
However, no messages are being consumed. We have an alert on a Grafana dashboard which shows us the backlog on topics and subscriptions. Eventually it hits a count or rate threshold which alerts us that there is a problem. When we restart our agent, the subscription is re-established and the backlog can immediately be seen heading to 0.
Has anyone experienced such an issue?
Our code is typical:
import pulsar

client = pulsar.Client('pulsar://houpulsar05.mycompany.com:6650')

consumer = client.subscribe(
    topic='my-topic',
    subscription_name='my-subscription',
    consumer_type=my_consumer_type,   # e.g. pulsar.ConsumerType.Shared
    consumer_name=my_agent_name,
)

while True:
    msg = consumer.receive()
    ex = msg.value()
    # ... process the message, then acknowledge it ...
    consumer.acknowledge(msg)
I haven't yet found a readily available way (docker-compose or anything similar) to run a multi-cluster Pulsar installation locally on Docker Desktop, so I can try killing off a broker and see how the consumer reacts.
Currently the Python client only supports configuring one broker's address and doesn't yet support retrying the lookup. Here are two related PRs to support it:
https://github.com/apache/pulsar/pull/17162
https://github.com/apache/pulsar/pull/17410
Therefore, setting up a multi-node cluster might be no different from a standalone.
If you only specified one broker in the service URL, you can simply test with a standalone. Run a consumer and a producer sending messages periodically, then restart the standalone. The "Broker notification of Closed consumer" message appears when the broker actively closes the connection, e.g. when your consumer has sent a SEEK command (via a seek call); the broker will then disconnect the consumer and the log line appears.
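A minimal sketch of such a standalone test in Python (the broker URL, topic, and subscription names are arbitrary local values):

import time
import pulsar

client = pulsar.Client('pulsar://localhost:6650')

producer = client.create_producer('persistent://public/default/reconnect-test')
consumer = client.subscribe(
    'persistent://public/default/reconnect-test',
    subscription_name='reconnect-test-sub',
)

# Send one message per second; restart the standalone broker while this
# runs and watch whether receive() resumes after the reconnection.
while True:
    producer.send(b'ping')
    msg = consumer.receive()  # blocks across broker restarts
    consumer.acknowledge(msg)
    print('received', msg.data())
    time.sleep(1)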
BTW, it's better to show your Python client version. And GitHub issues might be a better place to track the issue.

Google Pub/Sub with distributed subscribers in Node.js

We are attempting to migrate a message processing app from Kafka to Google Pub/Sub and it's just not working as expected.
We are running in Kubernetes (Google Cloud) where there may be multiple pods processing messages on the same subscription. Topics and subscriptions are all created using terraform and are more or less permanent. They are not created/destroyed on the fly by the application.
In our development environment, where message throughput is rather low, everything works just fine. But when we scale up to production levels, everything seems to fall apart. We get big backlogs of unacked messages, and yet some pods are not receiving any messages at all. And then, all of a sudden, the backlog will just go away, but then climb again.
We are using the Node.js client library provided by Google: @google-cloud/pubsub:3.1.0
Each instance of the application subscribes to the same named subscription, and according to the documentation, messages should be distributed to each subscriber. But that is not happening. Some pods will be consuming messages rapidly, while others sit idle.
Every message is processed in a try/catch block and we are not observing any errors being thrown. So, as far as we know, every received message is getting acked.
I suspect that, as pods are terminated by autoscaling or updated deployments, we are not properly closing subscriptions; but there are no examples addressing a distributed environment, and I have not found any documentation that specifically addresses how to properly manage these resources. It is also worth mentioning that the app has multiple subscriptions to different topics.
When a pod shuts down, what actions should be taken on the Subscription object and the PubSub client object? Maybe that's not even the issue, but it seems like a reasonable place to start.
When we start a subscription we do something like this:
private exampleSubscribe(): Subscription {
    // one suggestion for having multiple subscriptions in the same app
    // was to use separate clients for each
    const pubSubClient = new PubSub({
        projectId: 'my-project-id',
        // use a regional endpoint for message ordering
        apiEndpoint: 'us-central1-pubsub.googleapis.com:443',
    });
    const sub = pubSubClient.subscription('my-subscription-name', {
        // have tried various values for maxMessages from 5 to the default of 1000
        flowControl: { maxMessages: 250, allowExcessMessages: false },
        ackDeadline: 30,
    });
    sub.on('message', async (message) => {
        await this.exampleMessageProcessing(message);
    });
    return sub;
}

private async exampleMessageProcessing(message: Message): Promise<void> {
    try {
        // do some cool stuff
    } catch (error) {
        // log the error
    } finally {
        message.ack();
    }
}
Upon termination of a pod, we do this:
private async exampleCloseSub(sub: Subscription) {
    try {
        sub.removeAllListeners('message');
        await sub.close();
        // note that we do nothing with the PubSub
        // client object -- should it also be closed?
    } catch (error) {
        // ignore error, we are shutting down
    }
}
When running with Kafka, we can easily keep up with the message pace with usually no more than 2 pods. So I know that we are not running into issues of it simply taking too long to process each message.
Why are messages being left unacked? Why are pods not receiving messages when there is clearly a large backlog? What is the correct way to shut down one subscriber on a shared subscription?
It turns out that the issue was an improper implementation of message ordering.
The official docs for message ordering in Pub/Sub are rather brief:
https://cloud.google.com/pubsub/docs/ordering
Not much there regarding how to implement an ordering key or the implications of message ordering on horizontal scaling.
Though they do link to some external resources, one of which is this blog post:
https://medium.com/google-cloud/google-cloud-pub-sub-ordered-delivery-1e4181f60bc8
In our case, we did not have enough distinct ordering keys to allow for proper distribution of messages across subscribers/pods.
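To illustrate the issue (in Python here, though the same applies to the Node.js client): with ordering enabled, Pub/Sub delivers all messages that share an ordering key to a single subscriber at a time, so a publisher with only a few distinct keys effectively serializes the stream and starves the remaining pods. A sketch, with hypothetical project, topic, and key names:

from google.cloud import pubsub_v1

# Ordering must be enabled on the publisher (and on the subscription).
publisher = pubsub_v1.PublisherClient(
    publisher_options=pubsub_v1.types.PublisherOptions(enable_message_ordering=True)
)
topic_path = publisher.topic_path('my-project-id', 'my-topic')

# BAD: one coarse key funnels every message through a single subscriber.
publisher.publish(topic_path, b'payload', ordering_key='all-messages')

# BETTER: a high-cardinality key (e.g. one per entity) preserves the
# ordering that matters while letting Pub/Sub spread keys across pods.
entity_id = 42  # hypothetical
publisher.publish(topic_path, b'payload', ordering_key=f'entity-{entity_id}')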
So this was definitely an RTFM situation, or more accurately: Read The Fine Blog Post Referred To By The Manual. I would have much preferred that the important details were actually in the official documentation. Is that too much to ask for?

Service Bus SQL Filter apparently not working in Azure Functions v3

I created a topic "Messages" in my service bus instance, and added a new subscription to it. I can send messages to this topic and the Azure Function v3 trigger is activated just fine. The message is received and displayed in an instant.
When I add an SQL filter to filter messages for a subscription, it stops working.
What I did so far:
Created an SQL filter in the Azure portal:
sys.Label = "Test" -> not working: no messages are received anymore, even though I verified that the Label property is set.
sys.Label != "Test" -> not working: no messages are received anymore, even though I verified that the Label property is set and does not match "Test".
sys.To = "Test" -> not working: no messages are received. I verified that the messages contain the To property.
sys.Label is not Null -> this, strangely, is working.
What am I doing wrong here?
As stated in the comment, the solution to this question can be found in the Microsoft Q&A.
To prevent the loss of this solution, I am going to post it here too:
The SQL filter value should be in single quotes: 'test'
Example: sys.To = 'test'
Make sure that you are defining the message system properties while sending the message to the topic.
Source
If someone is caught off guard by the single quotes being rendered in red in the portal: they are just red; that does not indicate that something is wrong. The filter will work fine.
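For completeness, a sketch of the sending side using the Python SDK (the connection string, topic, and values are illustrative): subject maps to the sys.Label system property and to maps to sys.To, so a filter such as sys.To = 'Test' (single quotes) only matches if those system properties are actually set on the message:

from azure.servicebus import ServiceBusClient, ServiceBusMessage

conn_str = '<service-bus-connection-string>'  # illustrative

with ServiceBusClient.from_connection_string(conn_str) as client:
    with client.get_topic_sender(topic_name='Messages') as sender:
        # subject populates sys.Label, to populates sys.To; the
        # subscription filter sys.To = 'Test' matches this message.
        sender.send_messages(ServiceBusMessage(b'hello', subject='Test', to='Test'))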

Azure NotificationHub - Detect failed notifications

I am trying to store failed notifications in a DB, e.g. when the client does not have internet access. This will enable me to check from a background service whether there is a missing notification, and then create it from the background service.
I therefore have the following, on my Azure App Service Mobile:
var notStat = await hub.SendWindowsNativeNotificationAsync(wnsToast, tag);
telemetry.TrackTrace("failure : " + notStat.Failure + " | Results : " + notStat.Results + " | State : " + notStat.State + " | Success : " + notStat.Success + " | trackingID : " + notStat.TrackingId);
The code snippet was to test the impact from the client, but no matter what I do the resulting log is just that the message was enqueued.
Question
So how do I detect failed notifications?
Conclusion
To sum up the discussions made to the accepted answer:
When the notification has been sent, the NotificationId and other relevant data are stored in a separate table.
The client event for receiving the notification then sends a message to the server stating that the notification was received, and the entry is removed from the table.
The notifications that were not received by the client are then found by a background task. Every time the background task fires, e.g. every 6 hours, it retrieves all the missing notifications and re-creates them, so the user will not miss any notification. A sketch of this pattern follows below.
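A minimal sketch of this tracking pattern in Python (all names and the in-memory store are hypothetical; a real implementation would back this with a database table):

import time

# Hypothetical in-memory stand-in for the tracking table.
pending_notifications = {}  # notification_id -> (payload, sent_at)

def on_notification_sent(notification_id, payload):
    # Called right after the hub send: remember what was sent.
    pending_notifications[notification_id] = (payload, time.time())

def on_client_ack(notification_id):
    # Called when the client reports that it received the notification.
    pending_notifications.pop(notification_id, None)

def background_sweep(resend, max_age_seconds=6 * 3600):
    # Fires periodically (e.g. every 6 hours): anything still pending
    # and older than the threshold is considered missed and re-created.
    now = time.time()
    for notification_id, (payload, sent_at) in list(pending_notifications.items()):
        if now - sent_at >= max_age_seconds:
            resend(payload)
            del pending_notifications[notification_id]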
The return of Enqueued is expected; please refer to the troubleshooting guidance. For more insight into what happened, try setting EnableTestSend:
"result.State will simply state Enqueued at the end of the execution without any insight into what happened to your push. Now you can use the EnableTestSend boolean" (c) documentation
But be aware that when EnableTestSend is enabled, there are some limits (described on the same page, so I will not copy-paste them here, to avoid future issues with outdated info).
You can use the Per Message Telemetry functionality or the REST API as well (Fiddler plus some documentation).
And, as a follow-up, there were some discussions on SO that you may find helpful: first and second.
And, lastly, I would highly recommend (if you have not yet) taking a look at the FAQ. It is important to know how the different platforms handle notifications, to avoid debugging something that works as designed (for example, if the device is offline and there are pending notifications, only the last one may be delivered, etc.).

Azure Servicebus AutoDeleteOnIdle

I'm trying to figure out the correct behavior when setting AutoDeleteOnIdle. I have a topic called MyGameMessages (not disclosing the game name since it might be considered advertisement).
What I do is that I create a subscription on each node in my server farm.
var manager = GetNameSpaceManager();
_subscriptionId = Guid.NewGuid().ToString();
var description = new SubscriptionDescription(topic, _subscriptionId);
description.AutoDeleteOnIdle = TimeSpan.FromHours(1);
manager.CreateSubscription(description);
Then I start up a thread that pretty much loops for eternity (or at least until signaled to quit):
while (_running)
{
    if (_subscriptionId == null)
        break;

    var message = client.Receive(TimeSpan.FromMinutes(1)); // MARK A
    if (message != null)
    {
        var body = message.GetBody<T>();
        // Do stuff with the message
        message.Complete();
    }
}
Question A:
The first implementation had no timeout at MARK A. If no message was sent to this topic within one hour, the subscription was auto-deleted. Is this the behavior to expect? The client isn't really dead; I guess it just sits around waiting for a message. Is there no keep-alive?
Question B:
Would it help to add the timeout as in MARK A, or is it a better solution to create a new subscription every 50 minutes (to create a small overlap, just in case) and abandon the old one?
Thanks
Johan
Johan, the scenario you describe above should work per your expectations. A pending receive call will keep the subscription alive even if no messages are flowing. Longer timeouts for the Receive are better, so you do not have chatty traffic when message volume is low. One thing to confirm is whether you are setting the AutoDeleteOnIdle value on the Topic: in that case, a receive on a subscription will NOT keep the Topic alive, and if no messages are sent to the Topic for one hour, it will get deleted. Deleting a Topic results in all of its Subscriptions being deleted too.
Are you still seeing this behavior of Subscriptions being deleted? If so, please create a ticket with Azure live site support so the product team can investigate the specifics.
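For reference, a sketch of the same subscription setup with the current Python management SDK rather than the older .NET NamespaceManager shown above (the connection string and subscription name are illustrative):

from datetime import timedelta

from azure.servicebus.management import ServiceBusAdministrationClient

conn_str = '<service-bus-connection-string>'  # illustrative

with ServiceBusAdministrationClient.from_connection_string(conn_str) as admin:
    # Set AutoDeleteOnIdle on the subscription, not on the topic: an
    # idle topic with AutoDeleteOnIdle set is deleted together with all
    # of its subscriptions, even if receives are pending on them.
    admin.create_subscription(
        'MyGameMessages',
        'my-node-subscription',
        auto_delete_on_idle=timedelta(hours=1),
    )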
