Encountering a strange issue with one of our queues (for production, no less). When I try to put a message onto the queue, it's throwing an exception that simply states:
A timeout has occurred during the operation
The messages do seem to be making it onto the queue, as evidenced by the fact that I can see the queue length increasing in the management portal. However, the client application is not receiving any messages.
The management portal shows that there have been several failed requests, and also several internal server exceptions; though unfortunately I don't see any way to get more details about those failed requests and errors.
I'm somewhat at a loss as to what may have caused this, how to get more information about what's wrong, and how to move ahead in troubleshooting this. Any help would be greatly appreciated.
edit: I should mention just for completeness sake, that I did not make any changes to the clients that I'm aware of; This issue just sort of started happening all of a sudden
edit #2, woke up this morning, and things have magically returned to normal. Still not sure what happened, so I'd like to change the tone of the question to solicit suggestions as to how this kind of thing may be mitigated and/or troubleshooted (troubleshot? troubleshat? :) ) better
I have experienced this scenario too. When I tried too create a new service bus namespace, and pointed my app to this new namespace, it worked for me. This suggests that it might be some hardware failure going on (on the node where your sb-namespace resides).
Be sure to use transient failure handling, for example http://www.nuget.org/packages/EnterpriseLibrary.WindowsAzure.TransientFaultHandling/
But there might as well be required too use a "second level retry" for errors that are not transient. This you have to code yourself.
Too be more fault tolerant you can also use the new feature of paired namespaces. Here is a good resource: http://msdn.microsoft.com/en-us/library/dn292562.aspx
Hth
//Peter
Related
I have an api application that is running in a docker container, and since moving to AWS, the api stops daily with the error: Erlang closed the connection. I've monitored the server during that time and no IOPS seem to be causing this issue. Beyond that though, when the api fails, it won't restart on it's own on one of our clusters. I'm not sure where to find the logs to get more context and could use any input that may be helpful. Also, more context here, is that this api worked fairly well before in our data-center/physical server space, but now in AWS, it fails daily. Any thoughts or suggestions as to why this may be failing to restart?
I've looked at the syslogs and the application server logs and don't see any kind of failures. Maybe I'm not looking in the proper place. At this point, someone from one of our teams has to manually restart the api with an init.d command. I don't want to create a cron job to "fix" this because that's a band-aid and not a true fix.
There really isn't enough information on the structure of your app or its connections, so no one can give a "true fix".
The problem may be as small as configuring your nodes differently, changing some of the server's local configurations, or you may need some "keep alive" handler towards AWS.
Maybe try adding a manual periodic dump to see if its an accumulating problem, but I believe if Erlang closed the connection there's a problem between your API and AWS, and your API can't understand where it is located.
We have a long running ASP.NET WebApp in Azure which has no real endpoints exposed – it serves a single functional purpose primarily reading and manipulating database data, effectively a batched, scheduled task, triggered by a timer every 30 seconds.
The app runs fine most of the time but we are seeing occasional issues where the CPU load for the app goes close to the maximum for the AppServicePlan, instantaneously rather than gradually, and stops executing any more timer triggers and we cannot find anything explicitly in the executing code to account for it (no signs of deadlocks etc. and all code paths have try/catch so there should be no unhandled exceptions). More often than not we see errors getting a connection to a database but it’s not clear if those are cause or symptoms.
Note, this is the only resource within the AppService Plan. The Azure SQL database is in the same region and whilst utilised by other apps is very lightly used by them and they also exhibit none of the issues seen by the problem app.
It feels like this is infrastructure related but we have been unable to find anything to explain what is happening so if anyone has any suggestions for where we should be looking they would be gratefully received. We have enabled basic Application Insights (not SDK) but other than seeing CPU load spike prior to loss of app response there is little information of interest given our limited knowledge of how to best utilise Insights.
According to your description, I thought of two points to troubleshoot your problem. First of all, you can track the running status of your program through the code, and put a log at the beginning and end of your batch scheduled tasks to record the status of each run. If possible, record request and response information and start and end information. This can completely record the time and running status of your task.
Secondly, you can record logs before the program starts database operations, and whether the database connection is successful. The best case is to be able to record, what business will trigger CPU load when operating, and track the specific operating conditions, in order to specifically analyze what causes the database connection failure.
Because you cannot reproduce your problem, you can only guess the cause of the problem. If you still can't find where the problem is through the above two points, then modify your timer appropriately, and let the program trigger once every 5 minutes instead of 30s.
At certain times during the week while I'm testing my Mobile Services app I get a 503 error (Service Unavailable). It happens whether I try to call the app from localhost or live on my Azure Website. It hangs around for 10-15 minutes and then goes away on its own. It doesn't seem to be caused by anything in particular that I am doing (i.e. I have not updated any code). The 503 error occurs when I'm trying to call one of my custom APIs in my Mobile Services account. A few of the requests make it through (strangely enough) but the majority return a 503 error.
I've seen that someone had a very similar problem here (Why does Azure give me an intermittent Error 503. The service is unavailable?) without an acceptable resolution.
I am using the free version of Mobile Services but I should be no where near pushing the limits of what the free version can handle; I am the sole user of the app right now.
It will soon be time to make the service live and I'm shuddering at the thought of support calls that will come in during one of these funky states the service gets into. Any help in debugging the problem would be greatly appreciated.
EDIT:
I've narrowed this down to a database problem. I have one main query (sproc) that I use to feed data to the UI. I noticed that when I get the 503 errors the query takes about 13 seconds (when run in SSMS). When things are running "normally", the query takes less than a second.
This doesn't solve my problem though, in fact it makes it more perplexing because I am using the Business Edition of Windows Azure SQL Database and there shouldn't be a 13 second fluctuation in execution time!
This problem seems to happen randomly. Is there some kind of caching in SQL Server that could explain this? Maybe my query really does take 13 seconds to execute and the caching superficially speeds it up.
Could you try transitioning your database/server to one of the "editions"? They have resource governance to promote predictable performance. Web/Business suffer from a noisy neighbor problem. It sounds like that may be your issue, considering it is intermittent.
Here's a link to a page describing the editions. https://msdn.microsoft.com/en-us/library/azure/dn741340.aspx
I've successfully implemented Umbraco 4.7 to a Windows Azure Website and SQL Azure, but sometimes I get errors similar to this one:
SQL helper exception in ExecuteScalar ---> System.Data.SqlClient.SqlException: A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - An existing connection was forcibly closed by the remote host.) --->
It seems that Umbraco does not manage retry logic (sql azure transient fault handling). Does anybody know of any non-traumatic way to implement this on the umbraco side?
Amhed, There is no easy way todo this, people using Umbraco in Azure have generally come up with session state problems that are causing connection issues, but as you stated you are only catching a 1% which means you are picking up transient errors. You will have probably seen the discussion here
http://our.umbraco.org/forum/core/general/27179-SQL-Azure-connectivity-issues?p=1
The net effect is that you have to take the umbraco code base and rebuild it with a retry framework in it. That comes with a serious overhead to using Umbraco and taking future releases. You would best lobbying Umbraco core team to put a full retry framework in place and supporting it. (Lets not even talk about security issues ;)
This probably not what you want to hear but effectively it is roll your own datalayer time.
Having said all that I went and had a look: ;) (Because this does interest me for something else)
As starting point though from looking at the source in Umbraco they are using
Microsoft.ApplicationBlocks.Data
As the base for making connections and executing SQL commands
from Looking at the umbraco.DataLayer.SqlHelpers.SqlServer.SqlServerHelper
using SH = Microsoft.ApplicationBlocks.Data.SqlHelper;
So my guess is that you would need to replace this block. A quick (dirty) search on the internet you get
http://www.sharpdeveloper.net/source/SqlHelper-Source-Code-cs.html
You could reflect out the class though to make sure.
This then you could rebuild out with a transient fault handling framework and then you could effectively drop in potential as a change of dll (famous last words).
But you could at least get
using (ReliableSqlConnection conn = new ReliableSqlConnection(connString, retryPolicy))
easily going in that type of class and more lovely stuff.
http://msdn.microsoft.com/en-us/library/hh680899(v=pandp.50).aspx
If I was going to do this, this is where i would start from and I would go in at that level.
I am not sure if that would cover 100% the connection set as I don't work actively in umbraco code base so this is best guess but looking at source this were I would start from and change out and looks to be your best starting point.
hths,
James
I'm working through a basic tutorial on the ServiceBus. A web role adds objects to a ServiceBus Queue, while a worker role reads those messages off the queue and marks them complete. This is all within the local environment (compute emulator).
It seems like it should be incredibly simple, but I'm seeing the following behavior:
The call QueueClient.Receive() is always timing out.
Some messages are just hanging out in the queue and are not being picked up by the worker.
What could be going on? How can I debug the state of these messages?
You can check the length of the Queue (from the portal or looking at the MessageCount property).
Another possibility is that the messages are DeadLettered. You can read from the DeadLettered subqueue using this sample code.
First of all, please make sure you indeed have some messages in the queue. I would like to suggest you to run the end solution of this tutorial: http://msdn.microsoft.com/en-us/WAZPlatformTrainingCourse_ServiceBusMessaging. If that works fine, please compare your code with the sample code. If that doesn’t work as well, it is likely to be a configuration issue or a network issue. Then I would recommend you to check whether you have properly configured the Service Bus account, and check if you’re able to access internet from your machine.
Best Regards,
Ming Xu.