I'm seeing odd errors with Azure Service Bus. In long-running apps that use the batch API to read a batch of messages at once (and throttle back when no messages are available, etc.), I eventually start getting "40400: Endpoint not found" errors. These are transient in the sense that they don't stop everything, but once they appear they keep recurring intermittently.
I also regularly get MessageLockLost exceptions, with a 60-second lock timeout, when completing batches of messages (max. 100 at a time). This really shouldn't be happening, because I'm running under "test" conditions where the messages are read, nothing is done with them, and then I complete them (i.e. there is no processing logic that takes anywhere near long enough to cause the lock to expire).
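For illustration, a minimal sketch of this kind of batch receive/complete loop (using the Azure.Messaging.ServiceBus SDK; the connection string, queue name, batch size and delays are placeholders):
using System;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;

class BatchReceiveSketch
{
    static async Task Main()
    {
        await using var client = new ServiceBusClient("<connection-string>");
        ServiceBusReceiver receiver = client.CreateReceiver("global-worldwide-queue");

        while (true)
        {
            // Read up to 100 messages, waiting at most 5 seconds for them.
            var messages = await receiver.ReceiveMessagesAsync(
                maxMessages: 100, maxWaitTime: TimeSpan.FromSeconds(5));

            if (messages.Count == 0)
            {
                // Throttle back when the queue is empty.
                await Task.Delay(TimeSpan.FromSeconds(10));
                continue;
            }

            foreach (var message in messages)
            {
                // No processing in the "test" scenario; complete immediately.
                // Each Complete call is a separate round trip, so cross-region
                // latency adds up across a batch of 100 messages.
                await receiver.CompleteMessageAsync(message);
            }
        }
    }
}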
I really don't know how to work out why these occur and what I can do to prevent them.
Obviously I have all the retry logic in place so that this doesn't bring down my app, but eventually the app processes messages so slowly that, in effect, it's doing nothing at all.
My suspicion is that it's because my queue (a "global worldwide" queue) resides in North Europe while my app resides in East US, so the latency is causing the issue. If that's the case then I'm really stumped: firstly, it's one Azure data center communicating with another Azure data center (so it should be fast), and secondly, how on earth should you architect global queues for distributed access if the performance is this bad? AFAIK Service Bus doesn't support single-endpoint globally distributed queues...
Related
We have a service which pings our EP1 Premium service, and yesterday we received 3 client-side timeout errors after 2 minutes of waiting. When I open the trace in Application Insights, the requests that time out are not even logged and show no trace of ever having been received on the Azure side, so they remain unanswered. Looking at the metrics for the Azure Functions app, I found that 1-2 minutes after a request has been sent, the app loses its ability to do any work: its Total App Domains count falls to 0, as do all connections, threads and so on, and this state lasts until the next request is received, thereby "skipping" the request that happened beforehand. This is a big issue, as I need to make sure requests get answered in a timely manner.
The client service sent HTTP requests to the Azure Functions app expecting an answer, only to time out while the Azure side has no record of ever receiving the request.
I believe this issue is related to the cold start behaviour of Azure Functions on the Consumption plan. The "skipping" mechanism is explained below:
Apps may scale to zero when idle, meaning some requests may have additional latency at startup. The consumption plan does have some optimizations to help decrease cold start time, including pulling from pre-warmed placeholder functions that already have the function host and language processes running.
https://learn.microsoft.com/en-us/azure/azure-functions/functions-scale#cold-start-behavior
Please also consider having a look at this article, which explains the behaviour: https://azure.microsoft.com/en-us/blog/understanding-serverless-cold-start/
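If you stay on the Consumption plan, one common workaround (an illustration, not part of the original answer) is a timer-triggered function that keeps the host warm. The function name and schedule below are arbitrary, and this only reduces cold starts; a Premium plan with pre-warmed instances is the supported fix.
using System;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class KeepAlive
{
    // Runs every 5 minutes so the host is rarely scaled down to zero.
    [FunctionName("KeepAlive")]
    public static void Run([TimerTrigger("0 */5 * * * *")] TimerInfo timer, ILogger log)
    {
        log.LogInformation("Keep-alive ping at {time}", DateTime.UtcNow);
    }
}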
I have an Azure Function with a ServiceBusTrigger which posts the message content to a web service behind an Azure API Manager. In some cases the load on the (3rd-party) backend web server is too high and it collapses, returning error 500.
I'm looking for a proper way to implement a circuit breaker here.
I've considered the following:
Disable the Azure Function, but this might result in data loss due to multiple messages being held in memory (serviceBus.prefetchCount).
Implement an API Manager rate-limit policy, but this seems counterproductive as it runs fine in most cases.
Re-architecting the 3rd-party webservice is out of scope :)
Set the queue to ReceiveDisabled; this is the preferred solution, but it results in my InputBinding throwing a huge number of MessagingEntityDisabledExceptions which I'm (so far) unable to catch and handle myself. I've checked the docs for host.json, ServiceBusTrigger and the Run parameters but was unable to find a useful setting there.
Keep some sort of response-code result set and increase the retry time; not ideal in a serverless scenario with multiple parallel functions.
Let API Manager map 500 errors to 429 and reschedule those later; this will probably work, but since we send a lot of messages it will hammer the service for some time. In addition, it's hard to distinguish a temporary 500 error from a persistent one.
Note that this question is not about deciding whether or not to trip the circuit breaker, merely about how to handle the appropriate action afterwards.
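For reference, one possible shape of such a handler (a sketch, not the asker's code) wraps the outbound call in a Polly circuit breaker; while the circuit is open the function fails fast, the runtime abandons the message, and Service Bus redelivers it later (up to MaxDeliveryCount). The queue name, thresholds and backend URL are placeholders.
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Polly;
using Polly.CircuitBreaker;

public static class ForwardToBackend
{
    private static readonly HttpClient Http = new HttpClient();

    // Open the circuit after 5 consecutive failures and keep it open for 2 minutes.
    private static readonly AsyncCircuitBreakerPolicy Breaker =
        Policy.Handle<HttpRequestException>()
              .CircuitBreakerAsync(5, TimeSpan.FromMinutes(2));

    [FunctionName("ForwardToBackend")]
    public static async Task Run(
        [ServiceBusTrigger("incoming-queue", Connection = "ServiceBusConnection")] string message)
    {
        // While the circuit is open, ExecuteAsync throws BrokenCircuitException
        // immediately; the unhandled exception makes the runtime abandon the
        // message so it is redelivered instead of being lost.
        await Breaker.ExecuteAsync(async () =>
        {
            var response = await Http.PostAsync(
                "https://apim.example.com/backend", new StringContent(message));
            response.EnsureSuccessStatusCode(); // throws HttpRequestException on 500
        });
    }
}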
Additional info
Azure Functions V2, .NET Core 3.1, running on the Consumption plan
API Manager runs the Basic SKU
Service Bus runs in the Premium tier
Message count: 300,000
I have an Azure Functions application which once in a while "freezes" and stops processing messages and timed events.
When this happens I do not see anything in the logs (Application Insights), neither exceptions nor any kind of unfamiliar traces.
The application has following functions:
One processing messages from a Service Bus topic subscription (belonging to another application)
One processing from an internal storage queue
One timer based function triggered every half hour
Four HTTP endpoints
Our production app runs fine. This is due to an internal dashboard (on a big screen in the office) which polls one of the HTTP endpoints every 5 minutes, thereby keeping it alive.
Our test, staging and preproduction apps stall after a while, and stop processing messages and timer events.
This question is more or less the same as my previous question, but without the error message that was the focus then. There are far fewer error messages now, as our deployment has been fixed.
A more detailed analysis can be found in the GitHub issue.
On the Consumption plan, all triggers are registered with the host so that they can be handled and my functions are called at the right time. This part of the host also handles scaling.
I had two bugs:
Wrong deployment. Use zip-based deployment as described in the docs.
Malformed host.json. Comments in JSON are not valid; they happen to work in most circumstances in Azure Functions, but not all. A minimal, comment-free host.json is shown below.
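For illustration, a stripped-down host.json with no comments (the functionTimeout value is just an example; only the version property is required):
{
  "version": "2.0",
  "functionTimeout": "00:05:00"
}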
The sites now work as expected, with regard to both availability and scalability.
Thanks to the people in the Azure Functions team (Ling Toh, Fabio Cavalcante, David Ebbo) for helping me out with this.
I recently created an error manager to take logged errors from clients on our network and put them into an MSMQ for processing. I have a separate Windows Service running on the server to pick items off the queue and push them into a database.
When I wrote it and tested it, everything worked great; however, I neglected to consider that at deploy time, having 100 clients all sending to a public queue might, at best, not be performant, and at worst there could be all kinds of collisions, it seems to me.
My thought right now is to front the MSMQ with a WCF service and make everyone go through that. The logic being that at that point I could employ some locking, etc. If I went with a service I think I could employ a private queue instead of a public one, which would be tons faster, as well.
What I'm not sure is, am I overthinking it? MSMQ is pretty robust and the methods I think are thread-safe. Should I just leave it alone and see what happens? If I do put in the service, how much management would I need to have in place?
I recently created an error manager to take logged errors from clients
on our network and put them into an MSMQ for processing
I assume you're using System.Messaging for this? If so there is nothing at all wrong with your approach.
having 100 clients all sending to a public queue might not be
performant
MSMQ was designed from the bottom up to handle high load. Depending on the size of the individual messages and the storage threshold of the machine, a queue can hold tens of thousands of messages without any noticeable performance impact.
Because a "send" in MSMQ involves the queue manager on each machine writing messages locally before transmission (a store-and-forward messaging pattern), there is almost no chance of "collisions" or any other form of contention happening; if the sender is unable to transmit the message it simply "sends" it to a temporary local queue, and the actual transmission happens in the background, mediated by the fault-tolerant and very reliable MSMQ protocol.
My thought right now is to front the MSMQ with a WCF service and make
everyone go through that
This would be a valid choice if you were starting from nothing. As another poster has stated, WCF does hide some of the MSMQ voodoo from you by removing the need to use System.Messaging directly. However, you've already written the code, so I see little benefit in exposing a netMsmqBinding endpoint.
If I went with a service I think I could employ a private queue
instead of a public one
As far as I understand it from your description, there's nothing to stop you using a private queue in your current scenario. In fact I'd recommend always using private queues as they're much simpler.
If I do put in the service, how much management would I need to have
in place?
You will have more management overhead with a WCF service. Because you're wrapping each end of a send-receive with the WCF stack, there is more code to spin up and therefore potentially fail. WCF stack exceptions are famously difficult to troubleshoot without full service logging enabled.
EDIT - in response to comments
I think for a private queue you have to actually be writing FROM the
machine the queue sits on, which would not work in a networked
environment
Untrue. MSMQ supports transactional reads from and writes to any private queue, regardless of whether the queue is local or remote.
This is because any time a message is sent from one machine to another in msmq, regardless of the queue address, the following happens:
Queue manager on sending machine writes the message to a temporary local "outbound" queue.
Queue manager on sending machine contacts queue manager on receiving machine and transmits the message.
Queue manager on receiving machine places the message into the destination queue.
If you are using transactions, the above steps will comprise 3 distinct transactions.
Something to remember: the safest paradigm in exchanging messages between queues on different machines is send remote, read local.
So this means when you send a message, you're instructing msmq to send to a remote queue address. However, when someone sends something to you, they must do the same. So you end up reading only from local queues, and sending only to remote queues.
This way you get the most reliable messaging setup, because when reading, a local queue will always be available.
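As a rough illustration of "send remote, read local" with System.Messaging (the machine name, queue names and transactional settings are placeholders):
using System.Messaging;

class MsmqSketch
{
    // Sender: write to the REMOTE private queue using a direct format name.
    static void SendError(string body)
    {
        using (var remote = new MessageQueue(
            @"FormatName:DIRECT=OS:ErrorServer\private$\ErrorQueue"))
        {
            // Single-message internal transaction: the local queue manager stores
            // the message and forwards it to the remote machine in the background.
            remote.Send(body, MessageQueueTransactionType.Single);
        }
    }

    // Receiver (running on ErrorServer): read from the LOCAL private queue.
    static string ReceiveError()
    {
        const string localPath = @".\private$\ErrorQueue";
        if (!MessageQueue.Exists(localPath))
            MessageQueue.Create(localPath, transactional: true);

        using (var local = new MessageQueue(localPath))
        {
            local.Formatter = new XmlMessageFormatter(new[] { typeof(string) });
            Message msg = local.Receive(MessageQueueTransactionType.Single);
            return (string)msg.Body;
        }
    }
}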
Try it! I've been using msmq for cross machine communication for nearly 10 years and I've never used a public queue. I don't even know what they're for!
I would expose a WCF "IsOneWay" method.
Then host your WCF service in IIS.
The IsOneWay operation will wire up to MSMQ.
This way...you have the robustness of IIS hosting. You can expose any endpoint you want.
But eventually the request makes it to MSMQ.
One of the reasons is the ease of using MSMQ with WCF. Having written and used MSMQ "pre-WCF", I found the code (pulling messages off the queue and error handling) to be difficult and problematic. That alone would push me to WCF hosting.
And as you mention, the security around a local-queue is much easier to deal with.
Bottom line, let WCF handle the msmq-voodoo for you.
Simple example below.
using System.ServiceModel;

[ServiceContract]
public interface IMyControllerController
{
    // One-way operations return immediately; the message is queued in MSMQ.
    [OperationContract(IsOneWay = true)]
    void SubmitRequest(MyObject obj);
}
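For completeness, a sketch of what calling this contract over a netMsmqBinding endpoint might look like (the queue address and binding settings are illustrative, not from the original answer):
using System.ServiceModel;

class Client
{
    static void Send(MyObject obj)
    {
        // net.msmq addresses map onto a private queue on the target machine.
        var binding = new NetMsmqBinding(NetMsmqSecurityMode.None);
        var address = new EndpointAddress("net.msmq://localhost/private/MyRequestQueue");

        var factory = new ChannelFactory<IMyControllerController>(binding, address);
        IMyControllerController channel = factory.CreateChannel();

        // Returns as soon as the message is written to the queue (IsOneWay = true).
        channel.SubmitRequest(obj);

        ((IClientChannel)channel).Close();
        factory.Close();
    }
}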
http://msdn.microsoft.com/en-us/library/ms733035%28v=vs.110%29.aspx
http://msdn.microsoft.com/en-us/library/system.servicemodel.operationcontractattribute.isoneway%28v=vs.110%29.aspx
What happens in WCF to methods with IsOneWay=true at application termination
http://blogs.msdn.com/b/tomholl/archive/2008/07/12/msmq-wcf-and-iis-getting-them-to-play-nice-part-1.aspx
If there are no longer any publishers or subscribers reading nor writing to a Queue, Topic, or Subscription, because of crashes or other abnormal terminations (instance restart, etc.), is that Queue/Topic/Subscription effectively orphaned?
I tested this by creating a few Queues, and then terminating the applications. Those Queues were still on the Service Bus a long time later. It seems that they will just stay there forever. That would be wonderful if we WANTED that behavior, but in this case, we do not.
How can we detect and delete these Queues, Topics, and Subscriptions? They will count towards Azure limits, etc., and we cannot accumulate these orphaned entities every time an instance is restarted/patched/crashes.
If it helps make the question clearer, this is a unique situation in which the Queues/Topics/Subscriptions have special names, or special Filters, and a very limited set of publishers (1) and subscribers (1) for a limited time. This is not a case where we want survivability. These are instance-specific response channels. Whether we use Queues or Subscriptions is immaterial. If the instance is gone, so is the need for that Queue (or Subscription).
This is part of a solution where each web role has a dedicated response channel that it monitors. At any time, this web role may have dozens of requests pending via other messaging channels (Queues/Topics), and it is waiting for the answers on multiple threads. We need the response to come back to the thread that placed the message, so that the web role can respond to the caller. It is no good in this situation to simply have a Subscription based on the machine, because it will be receiving messages for other threads. We need each publishing thread to establish a dedicated response channel, so that the only thing on that channel is the response for that thread.
Even if we use Subscriptions (with some kind of instance-related filter) to do a long-polling receive operation on the Subscription, if the web role instance dies, that Subscription will be orphaned, correct?
This question can be boiled down like so:
If there are no more publishers or subscribers to a Queue/Topic/Subscription, then that service is effectively orphaned. How can those orphans be detected and cleaned up?
In this scenario you are looking for the Queues/Subscriptions to be "dynamic" in nature: they would be created and removed based on use, as opposed to the current explicit provisioning model for these entities. Service Bus provides you with the APIs to perform create/delete operations, so you can plug these into the role OnStart/OnStop events appropriately. If those operations fail for some reason, the orphaned entities will remain. You can then run a clean-up operation on them based on some unique identifier in the entity names. An example of this can be seen here: http://windowsazurecat.com/2011/08/how-to-simplify-scale-inter-role-communication-using-windows-azure-service-bus/
In the near future we will add more metadata and query capabilities to Queues/Topics/Subscriptions so you can see when they were last accessed and make cleanup decisions.
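A rough sketch of that create-on-start / delete-on-stop pattern using the NamespaceManager API (the instance-ID naming convention is illustrative):
using Microsoft.ServiceBus;

class ResponseChannel
{
    private readonly NamespaceManager _manager;
    private readonly string _queuePath;

    public ResponseChannel(string connectionString, string instanceId)
    {
        _manager = NamespaceManager.CreateFromConnectionString(connectionString);
        // Encode the owning instance in the name so orphans can be identified later.
        _queuePath = "response-" + instanceId;
    }

    // Call from RoleEntryPoint.OnStart.
    public void Create()
    {
        if (!_manager.QueueExists(_queuePath))
            _manager.CreateQueue(_queuePath);
    }

    // Call from RoleEntryPoint.OnStop. If this never runs (crash), a separate
    // clean-up job can delete queues whose instance ID is no longer alive.
    public void Delete()
    {
        if (_manager.QueueExists(_queuePath))
            _manager.DeleteQueue(_queuePath);
    }
}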
Service Bus Queues are built using the "brokered messaging" infrastructure designed to integrate applications or application components that may span multiple communication protocols, data contracts, trust domains, and/or network environments. This allows for a mechanism to communicate reliably with durable messaging.
If a client (publisher) sends a message to a Service Bus queue and then crashes, the message will be stored on the queue until a consumer reads it off the queue. Also, if your consumer dies and restarts, it will just poll the queue and pick up any work that is waiting for it (you can scale out and have multiple consumers reading from the queue to increase throughput). Service Bus Queues allow you to decouple your applications via a durable cloud gateway, analogous to MSMQ on-premises (or other queuing technologies).
What I'm really trying to say is that you won't get an orphaned queue; you might get poison messages that you will need to handle. This blog post gives some very detailed information re: Service Bus Queues and their capacity and quotas, which might give you a better understanding: http://msdn.microsoft.com/en-us/library/windowsazure/hh767287.aspx
Re: queue management, you can do this via Visual Studio (1.7 SDK & Tools), or there is an excellent tool called Service Bus Explorer that will make your life easier for queue management: http://code.msdn.microsoft.com/windowsazure/Service-Bus-Explorer-f2abca5a
*Note the default maximum number of queues is 10,000 (per service namespace; this can be increased via a support call).
As Abhishek Lai mentioned, there is no orphan-detection capability supported.
Orphan detection can be implemented externally in multiple ways.
For example, whenever you send/receive a message, update a timestamp in a SQL database to indicate that the queue/topic/subscription is still active. This timestamp can then be used to determine orphans.
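A minimal sketch of that idea, assuming a hypothetical EntityHeartbeat table keyed by entity path:
using System;
using System.Data.SqlClient;

static class OrphanTracking
{
    // Call this whenever a message is sent or received on the entity.
    public static void Touch(string connectionString, string entityPath)
    {
        const string sql = @"
            UPDATE EntityHeartbeat SET LastSeenUtc = @now WHERE EntityPath = @path;
            IF @@ROWCOUNT = 0
                INSERT INTO EntityHeartbeat (EntityPath, LastSeenUtc) VALUES (@path, @now);";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@path", entityPath);
            cmd.Parameters.AddWithValue("@now", DateTime.UtcNow);
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}
A periodic clean-up job can then treat anything not touched for, say, 24 hours as an orphan and delete the corresponding queue/topic/subscription.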
If your process crashes, which is entirely possible, there may be issues with message delivery within the queue, but the queue itself will still be available to process your requests. Handling application crashes and unreadable messages with Windows Azure Service Bus queues is described here:
The Service Bus provides functionality to help you gracefully recover from errors in your application or difficulties processing a message. If a receiver application is unable to process the message for some reason, then it can call the Abandon method on the received message (instead of the Complete method). This will cause the Service Bus to unlock the message within the queue and make it available to be received again, either by the same consuming application or by another consuming application.
In the event that the application crashes after processing the message but before the Complete request is issued, then the message will be redelivered to the application when it restarts. This is often called At Least Once Processing, that is, each message will be processed at least once but in certain situations the same message may be redelivered. If the scenario cannot tolerate duplicate processing, then application developers should add additional logic to their application to handle duplicate message delivery. This is often achieved using the MessageId property of the message, which will remain constant across delivery attempts.
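To make the Complete/Abandon flow and the MessageId-based duplicate check concrete, here is a rough sketch using the classic QueueClient/BrokeredMessage API (the in-memory duplicate store is a placeholder; a real one would be durable and shared):
using System;
using System.Collections.Generic;
using Microsoft.ServiceBus.Messaging;

class AtLeastOnceConsumer
{
    // Placeholder duplicate store for MessageId values already processed.
    private static readonly HashSet<string> SeenMessageIds = new HashSet<string>();

    static void ReceiveOne(string connectionString, string queuePath)
    {
        QueueClient client = QueueClient.CreateFromConnectionString(connectionString, queuePath);
        BrokeredMessage message = client.Receive(TimeSpan.FromSeconds(30));
        if (message == null) return; // nothing available right now

        try
        {
            if (SeenMessageIds.Add(message.MessageId))
            {
                ProcessMessage(message); // application-specific work
            }
            // Complete even if it was a duplicate we chose to skip.
            message.Complete();
        }
        catch (Exception)
        {
            // Unlock the message so it can be redelivered (at-least-once semantics).
            message.Abandon();
            throw;
        }
    }

    static void ProcessMessage(BrokeredMessage message) { /* ... */ }
}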
If there are no longer any processes reading nor writing to a queue, because of crashes or other abnormal terminations (instance restart, etc.), is that queue effectively orphaned?
No. The queue is in place to allow communication to occur via brokered messages; if all your apps die for some reason, the queue still exists and will be there when they come back up. It is the communication channel for loosely coupled applications. Regarding billing, 'Messages are charged based on the number of messages sent to, or delivered by, the Service Bus during the billing month', so you won't be charged if a queue exists but nobody is using it.
I tested this by creating a few queues, and then terminating the
applications. Those queues were still on the machine a long time
later.
The whole point of the queue is to guarantee message delivery for loosely coupled applications. Think of the queue as an entity or application in its own right, with high availability (SLA), as it's hosted in Azure; your producers/consumers can die/restart and the queue will remain active in Azure. *Note I got a bit confused by your wording re: "still on the machine a long time later"; the queue doesn't actually live on your machine, it sits up in Azure in a designated Service Bus namespace. You can view and manage the queues via the tools I pointed out in the previous answer.
How can we detect and delete these queues, as they will count towards
Azure limits, etc.
As stated above, the default maximum number of queues is 10,000 (per service namespace; this can be increased via a support call), and queue management can be done via the tools stated in the other answer. You should only be looking to delete queues when you no longer have producers/consumers looking to write to them (i.e. never again). You can of course create and delete queues in your producer/consumer applications via namespaceManager.QueueExists and the related APIs; more information here: How to Use Service Bus Queues
If it helps make the question clearer, this is a unique situation in which the queues have special names, and a very limited set of publishers (1) and subscribers (1) for a limited time.
It sounds like you need to use Topics & Subscriptions (How to Use Service Bus Topics/Subscriptions); this link also has a section on 'How to Delete Topics and Subscriptions'. If the lifetime is very limited then you could handle topic creation/deletion in your apps; otherwise you could have a separate Queue/Topic/Subscription setup/deletion script to handle this logic...