I'm using NServiceBus (v 4.0.5) on an Azure virtual machine using the Azure Service Bus transport (v 4.0.5). The NServiceBus.Host service has been crashing on an occasional basis but lately has been crashing more often than not. The exception thrown is:
Application: NServiceBus.Host.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: Microsoft.ServiceBus.Common.CallbackException
Stack:
at Microsoft.ServiceBus.Common.Fx+IOCompletionThunk.UnhandledExceptionFrame(UInt32, UInt32, System.Threading.NativeOverlapped*)
at System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)
I'm using a dedicated machine running the generic host service, and I have 3 machines which send messages to it (I don't use pub/sub).
What I've tried
Rebooting / restarting the service manually.
Researching the error: not many people seem to have received this message, and for the people that have, their response did not apply to my situation.
Verifying the dead letter queue: several messages are placed in the dead letter queue (over 400 in the past 6 months), but I could not correlate any specific message types to the crash (at least 40% of my message types have been found in the dead letter queue). I'm assuming that most of these messages have been added to the DLQ because the service is failing.
Checking application logs: my application logs exceptions to a log4net log, however no exceptions were logged during the time of the crashes.
Checking event logs: nothing relevant was found except for the main error message noted above.
Upgrading NServiceBus to 4.4.2 and WindowsAzureServiceBus package to 5.1.1: due to NuGet package conflicts upgrading is proving to be painful. I'm using Microsoft.Data.OData 5.4.0 and Microsoft.Data.Edm 5.4.0, but the NServiceBus.Azure package depends on v5.2.0 of these assemblies. I could discard the nuget package dependencies and add the references myself, but I'd like to know why the WindowsAzureServiceBus package depends specifically on v5.2.0 before doing this.
Any thoughts or ideas would be helpful.
Thank you!
I will look into this. It sounds like a bug, most likely an unhandled exception surfacing from the Azure Service Bus client (though it doesn't necessarily originate there).
I've created a github issue here: https://github.com/Particular/NServiceBus.Azure/issues/133
Are you able to reproduce the issue? And what has changed between the time when you saw it occasionally and now, when it happens often?
One thing you could do is add an event handler for all unhandled exceptions occurring in the AppDomain and log those as well. That should theoretically catch anything, and if this CallbackException has an inner exception you could capture it this way.
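The idea of a last-chance handler that also logs the inner exceptions can be sketched as follows. This is only an illustration in Python, not NServiceBus or .NET code: in .NET the corresponding hook would be the `AppDomain.CurrentDomain.UnhandledException` event and the chain would be walked through `InnerException`. The names `describe_chain`, `log_unhandled` and `install_hooks` are hypothetical.

```python
import logging
import sys
import threading

log = logging.getLogger("unhandled")


def describe_chain(exc):
    """Return one line per exception, outermost first, following inner causes."""
    lines = []
    seen = set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))  # guard against cyclic chains
        lines.append(f"{type(exc).__name__}: {exc}")
        # __cause__ is set by "raise ... from ..."; __context__ is the implicit link
        exc = exc.__cause__ or exc.__context__
    return lines


def log_unhandled(exc_type, exc, tb):
    """Log the whole chain so the original (inner) failure is not lost."""
    for depth, line in enumerate(describe_chain(exc)):
        log.critical("unhandled[%d] %s", depth, line)


def install_hooks():
    """Route unhandled exceptions from the main thread and worker threads to the log."""
    sys.excepthook = log_unhandled
    threading.excepthook = lambda args: log_unhandled(
        args.exc_type, args.exc_value, args.exc_traceback
    )
```

Calling `install_hooks()` once at startup is enough. The handler only records the failure; it does not keep the process alive.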
On the strict package dependencies: this is mostly done because the NuGet package manager does not apply binding redirects to the app.config of worker roles, which tripped up way too many users in the past (it often manifests itself as an infinitely rebooting worker role). So go ahead and override.
I have 6 Web Apps (ASP.NET, Windows) running on Azure, and they have been running for years. I do tweak them from time to time, but there have been no major changes.
About a week ago, all of them started to leak handles, as shown in the image: this is just the last 30 days, but the constant curve goes back "forever". Now, while I did make some minor changes to some of the sites, there are at least 3 sites that I did not touch at all.
But still, major leakage started for all sites a week ago. Any ideas what would be causing this?
I would like to add that one of the sites has only a single aspx page, and another site does not have any code at all. It's just there to run a webjob containing the letsencrypt script. That hasn't changed for several months.
So basically, I'm looking for any pointers, but I doubt this can have anything to do with my code, given that 2 of the sites do not contain any of my code and still show the same symptom.
Final information from the product team:
The Microsoft Azure Team has investigated the issue you experienced and which resulted in increased number of handles in your application. The excessive number of handles can potentially contribute to application slowness and crashes.
Upon investigation, engineers discovered that the recent upgrade of Azure App Service with improvements for monitoring of the platform resulted into a leak of registry key handles in application worker processes. The registry key handle in question is not properly closed by a module which is owned by platform and is injected into every Web App. This module ensures various basic functionalities and features of Azure App Service like correct processing HTTP headers, remote debugging (if enabled and applicable), correct response returning through load-balancers to clients and others. This module has been recently improved to include additional information passed around within the infrastructure (not leaving the boundary of Azure App Service, so this mentioned information is not visible to customers). This information includes versions of modules which processed every request so internal detection of issues can be easier and faster when caused by component version changes. The issue is caused by not closing a specific registry key handle while reading the version information from the machine’s registry.
As a workaround/mitigation in case customers see any issues (like an application increased latency), it is advised to restart a web app which resets all handles and instantly cleans up all leaks in memory.
Engineers prepared a fix which will be rolled out in the next regularly scheduled upgrade of the platform. There is also a parallel rollout of a temporary fix which should finish by 12/23. Any apps restarted after this temporary fix is rolled out shouldn’t observe the issue anymore as the restarted processes will automatically pick up a new version of the module in question.
We are continuously taking steps to improve the Azure Web App service and our processes to ensure such incidents do not occur in the future, and in this case it includes (but is not limited to):
• Fixing the registry key handle leak in the platform module
• Fix the gap in test coverage and monitoring to ensure that such regression will not happen again in the future and will be automatically detected before they are rolled out to customers
So it appears this is a problem with azure. Here is the relevant part of the current response from azure technical support:
==>
We had discussed with PG team directly and we had observed that, few other customers are also facing this issue and hence our product team is actively working on it to resolve this issue at the earliest possible. And there is a good chance, that the fixes should be available within few days unless something unexpected comes in and prevent us from completing the patch.
<==
Will add more info as it becomes available.
We are running Node.js in the App Engine standard environment, and while we try to be perfect programmers, we sometimes have a bug. The issue we're running into is that App Engine completely crashes the server every time this happens and throws a 203 error.
We've tried to do all the standard error handling things for Node, but it seems like App Engine is a special case. Has anyone seen this or handled this issue before?
As it is stated in the answer https://stackoverflow.com/a/51769527/10301041:
"The error 203 means that Google App Engine detected that the RPC channel has closed unexpectedly and shuts down the instance. The request failure is caused by the instance shutting down."
An error in your code can be the cause of that. Another cause might be one of the project quotas.
If you are still running into the issue and can't identify the source of the error, I would suggest contacting GCP support, as is also suggested in the answer above.
Encountering a strange issue with one of our queues (for production, no less). When I try to put a message onto the queue, it's throwing an exception that simply states:
A timeout has occurred during the operation
The messages do seem to be making it onto the queue, as evidenced by the fact that I can see the queue length increasing in the management portal. However, the client application is not receiving any messages.
The management portal shows that there have been several failed requests, and also several internal server exceptions; though unfortunately I don't see any way to get more details about those failed requests and errors.
I'm somewhat at a loss as to what may have caused this, how to get more information about what's wrong, and how to move ahead in troubleshooting this. Any help would be greatly appreciated.
edit: I should mention just for completeness sake, that I did not make any changes to the clients that I'm aware of; This issue just sort of started happening all of a sudden
edit #2, woke up this morning, and things have magically returned to normal. Still not sure what happened, so I'd like to change the tone of the question to solicit suggestions as to how this kind of thing may be mitigated and/or troubleshooted (troubleshot? troubleshat? :) ) better
I have experienced this scenario too. When I created a new Service Bus namespace and pointed my app to it, it worked for me. This suggests that there might be some hardware failure going on (on the node where your Service Bus namespace resides).
Be sure to use transient failure handling, for example http://www.nuget.org/packages/EnterpriseLibrary.WindowsAzure.TransientFaultHandling/
You may also need a "second-level retry" for errors that are not transient. This you have to code yourself.
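The two retry levels can be sketched roughly like this. It is a language-neutral illustration in Python, not the Enterprise Library API: `TransientError`, `first_level` and `second_level` are hypothetical names, and the attempt counts and delays are arbitrary assumptions.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying right away (e.g. a timeout)."""


def first_level(operation, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Retry transient failures with exponential backoff; re-raise the last one."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


def second_level(operation, rounds=3, pause=30.0, sleep=time.sleep, **first_level_args):
    """When a whole first-level cycle fails, wait much longer and run the cycle again."""
    for round_no in range(rounds):
        try:
            return first_level(operation, sleep=sleep, **first_level_args)
        except Exception:
            if round_no == rounds - 1:
                raise  # give up: let the caller dead-letter or alert
            sleep(pause)
```

The `sleep` parameter is injectable only so the behaviour can be exercised without real waiting. In production the long pause would more likely be implemented by deferring the message than by blocking a thread.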
To be more fault tolerant you can also use the new paired namespaces feature. Here is a good resource: http://msdn.microsoft.com/en-us/library/dn292562.aspx
Hth
//Peter
I have a strange problem with my multithreaded server. It is a Windows service and works similarly to an FTP server, managing socket connections to many clients. It was created using Delphi 2006 (Turbo Delphi) and works well on most machines. Unfortunately, on some machines it sometimes crashes without leaving any trace of its own (exceptions should be saved to the log, but they are not). Sometimes the system shows a MessageBox (not one from my service; I think it is a system MessageBox), but most often I see this information in the System EventLog:
Application popup: ht_switch.exe - Application Error : The exception unknown software exception (0x0eedfade) occurred in the application at location 0x77e4bef7.
In Application EventLog I can see:
Faulting application ht_switch.exe, version 1.2.0.2, faulting module kernel32.dll, version 5.2.3790.5069, fault address 0x0000bef7.
Sometimes such entries appear in the Application or System EventLog but nothing happens -- my server works as usual -- but sometimes it simply disappears. Then Service Manager reports in the EventLog that my service unexpectedly stopped.
I see no "common" scenario for this problem. It appears on some WinXP, Win2003 and Win2008 machines. All test machines have all MS patches applied.
I have read the answers to "0x0eedfade kernelbase.dll faulting module in d7 windows service", but I do not use the Dialog unit.
What can I do to repair it? How to trace such 0x0eedfade exception?
EDIT
I tested my server for some days with both EurekaLog and madExcept.
EurekaLog:
The server works without problems. No exception is reported in the EventLog. No exception is reported in %AppData%\EurekaLab s.a.s\EurekaLog\Bug Reports\ (there should be a directory for my program, but it was not created -- I don't know whether it should have been created or whether this is a EurekaLog error).
EurekaLog 7 has a problem with setting "Application Type" to Windows Service. It is a known problem and the authors are working on it. My service compiled with it works on WinXP but was not able to run on Win2003. It simply does not start.
madExcept:
Server worked for 4 hours and crashed. I have caught this exception in my thread:
EAccessViolation: Access violation at address 7C90100B in module 'ntdll.dll'. Read of address 00000018!!!
I haven't noticed any madExcept report on this exception. After this exception one thread was lost with its socket in the CLOSE_WAIT state (the other side had closed the connection). Then I restarted my service, and for the next few hours it worked without problems.
disabled EurekaLog and madExcept:
After 10-30 minutes I see a MessageBox with the error. But the 0x0eedfade error is cryptic and does not give me any hint about the source of the problem. It is also very strange because after displaying such a message the service works without problems (most of the time).
Summary of exception interceptors:
EurekaLog and madExcept are probably good at handling exceptions raised by Delphi, but it seems that they either change the behavior of my service so that the error magically disappears, or they report the exception to a place I cannot find.
EDIT: Problem solved
After some debugging that led me nowhere (call stacks pointing to very strange places) I gave up on it and started to inspect the most recently committed changes. One change was a string operation where the string (AnsiString) can be of length 64 or 128 (some kind of bit mask). I set the 70th character of a string that had earlier been allocated with SetLength(buffer, 64). That was the problem. I think I would have saved time by enabling range checking.
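What range checking buys here can be illustrated with a small sketch. It is written in Python rather than Delphi, and `set_flag` is a hypothetical helper: every write is validated against the allocated length, so writing position 70 into a 64-element buffer fails at the faulty line instead of silently corrupting neighbouring memory. In Delphi the comparable safety net is the `{$R+}` range-checking directive.

```python
def set_flag(buffer: bytearray, position: int, value: int) -> None:
    """Write `value` at 1-based `position`, refusing anything outside the buffer."""
    if not 1 <= position <= len(buffer):
        raise IndexError(
            f"position {position} outside buffer of length {len(buffer)}"
        )
    buffer[position - 1] = value


short_mask = bytearray(64)   # like SetLength(buffer, 64)
long_mask = bytearray(128)   # the other length the mask can have

set_flag(long_mask, 70, 1)   # fine: 70 <= 128
try:
    set_flag(short_mask, 70, 1)  # the bug from the post: 70 > 64
except IndexError as err:
    print("caught at the faulty write:", err)
```

The explicit check is there because positions are 1-based, as in Delphi strings, so the helper cannot rely on Python's own indexing rules alone.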
How to trace such 0x0eedfade exception?
This is the code for a Delphi exception. Clearly you are raising a Delphi exception that is not being handled and that is bringing your process down.
You should add madExcept, EurekaLog, JCLDebug or similar to your process. These tools will produce diagnostics reports when your process fails. The most useful part of those reports will be the stack trace at the point of failure. You should be able then to work out where the failure occurs, at the very least, and this usually is enough to work out what is wrong with your code.
Behavior:
Application is loaded and being used as expected.
Suddenly, a particular DLL can no longer be loaded. The error message is:
ActiveX component cannot create object.
In each case, the object had been created successfully many times before failure. All objects are marked for "retain in memory".
This error is cleared when the application pool is recycled. It may be hours or months before it is seen again.
Issue has happened within two hours of a refresh, as well as never happened in months of uptime.
Issue has happened with hundreds of simultaneous users (heavy usage) and also with 1-3 users.
While the issue is occurring, the process running that application pool cannot create the object that is failing. However it can create any other objects. Memory, CPU, and other resources all remain at normal usage. In addition, other processes (such as a stand-alone exe) can successfully create the object.
The first instance of the issue appeared in mid 2008. There have been fewer than fifty instances since then, despite a pool of hundreds of servers for it to occur on. All instances except one have failed on the same DLL.
DLL Failure Info:
most common - generic data structure implementing a b-tree, has no references other than to its interface. Code consists of arrays and one use of the vb6 Event functionality. The object has not been changed in any way since 2005.
one-time - interop to a .NET module. the failure is occurring when trying to create the interop object, not the .NET object. This object is updated a few times each year.
Application Environment:
IIS hosted application
VB6, classic ASP, some interop to minor .NET components
Windows Server 2003 / Windows Server 2008 (both have independently had the problem)
Attempts to Reproduce:
Using scripts (and real-life humans) to run the same end-user workflows that our logs reported the days before the issue occurred.
Using scripts to create/destroy suspected objects as fast as possible from multiple simultaneous sessions.
Wild speculation.
No intentional success, but it does manifest randomly on the servers on its own.
Troubleshooting:
Code reviews
Test harnesses to investigate upper limits of object creation / destruction
Verification of ability to create object outside of the process experiencing the issue
Monitoring of resources over time on servers under load
Review of IIS, error, and event logs to determine events leading up to issue
Questions:
Any ideas on how to reproduce the issue?
What could cause this behavior?
Ideas for bypassing the first two questions in favor of a fast solution?
The DLL isn't on a network drive, is it? You can get "glitches" where the drive is momentarily unavailable, which means COM can't do what it needs and could then fail to notice that the drive is available again.
I used Process Monitor to debug a similar problem when accessing the ADO/OLEDB stack. It turned out the environment got corrupted at some point, and the ADO classes are registered with InprocServer32 being a REG_EXPAND_SZ value pointing to %CommonProgramFiles%\System\ado\msado15.dll, or similar on x64 OSes.
Also, when you register an application with Restart Manager, on failure the process gets restarted by the winlogon process, whose environment is different from explorer's and unfortunately is missing %CommonProgramFiles% -- ouch!
This seems like a random failure; some race condition.
Try VMware to record the state of the machine you run this DLL on. When the error happens you can then replay the recording and inspect the memory contents. That way you won't have to play try-and-catch-the-error. At least you will have a solid record of it.
While I can't provide a solution, try catching the error and retrying the DLL load after refreshing the environment.