DebugDiag Analysis hangs while "Dumping Thread Data" - iis

I'm trying to run DebugDiag Analysis (v2.3) on my Windows 10 laptop using crash-dump files generated from a w3wp.exe process on a Windows Server 2016 box... but the application never passes "Dumping Thread Data" (the progress animation continues, so it's not a GUI hang.)
(I am not allowed to installed DebugDiag on the server, as it is production box and owned by a client. They allowed me to change the registry settings to generate the crash-dump files, which I then copied to my local machine.)
IIS-10 is running an ASP.Net 4.5.2 web application, and the corresponding event log entries on the Server (when the crash-dump files were generated) suggest there is an unhandled stackoverflow (0xc000000fd).
I've tried all of the crash-dump files, which are approximately 50Mb in size - but I end up just quitting DebugDiag because it never passes "Dumping Thread Data".
According to the only answer on this question a file of 400Mb "can take 4-5 minutes to run"... however, I've left my (quite powerful) laptop running the analysis of 50Mb for well over 30 minutes with no progress.
I've also seen that an internet connection is required - and I can confirm that I have a quick connection, so that shouldn't be the issue.
If required I can place a file somewhere to download, but don't want to do that unless it's really necessary.
I'm doubtful anybody will be able to help, but I'm struggling to see how to fix this

Related

Application pool disabling

I have an application in the Production environment which is Windows Server 2012/IIS 8 and is load balanced.
Recently out of nowhere, the website app pool suddenly started gettig disabled. The System Windows Logs logged the following error message by the Resource-Exhaustion-Detector ...
Application Pool 'x' is being automatically disabled due to a series of failures in the process(es) serving that application pool.
Windows successfully diagnosed a low virtual memory condition. The following programs consumed the most virtual memory: w3wp.exe (6604) consumed 5080641536 bytes, w3wp.exe (1572) consumed 477335552 bytes, and w3wp.exe (352) consumed 431423488 bytes.
Anyone got any idea how I figure out what is happening? We've never come across this issue before and the application has been running for a good couple of years.
Also, this isn't something that happens regularly but instead seems to happen one every day or so, and even that is at a random time. The Virtual Memory was initially 4GB but because of the issue above, we increased it to 8GB. Recently it spiked at using about 6.8GB out of 8GB, which it has no reason to do so.
Any help would be really appreciated!
The answer is easy here, obviously and certainly you have two issues here
1- You have a serious bug in your process/code that happens intermittently "you need to debug it to find how/when that happens" or at least run a ProcDump
such that you keep it listening on the server on the process W3WP till an exception happens and then analyze this dump to find where the code get stuck and consume that memory/otherwise just debug the code and see what changes were made in last few months "not days"
2- the application get stopped because you have configured/it is configured by default to get disabled break after a certain number of failure repeats, and that's a normal behavior but the main issue as I said is not the application pool itself, its inside the process
please let me know if you need a further explanation or help on this matter

IIS 7.5 connection problems

In the last couple of weeks my site has hung. Connections increase but seem to never release. Once it hits 700+ current connections (using the performance tool) the whole site hangs and I have no choice but to do iisreset to get it working again. Normally it's around 100 concurrent connections at peak time. No errors or warnings in the event log when it stops releasing so that isn't helpful. This problem has been happening after copying new DLLs over the old ones to do a site update. I have a single server so no choice but to copy over live. But in the case of today it was fine after the update but then the problem happened two hours later. It's .NET 4.5 site on Windows 2008 R2 64bit. Is there a way I can find out what's causing this, like other log files some place or something I should try doing when it happens? What I have tried is recycling the app pool (doesn't help), turning off the app pool and back on (doesn't turn on, gives exception), restarting the site (doesn't help), and iisreset (works every time).
Usually some requests are "stuck" when you see these syndromes, as app.pool recycle and site restart will wait for the pending requests to finish in general (graceful exit), while iisreset will actually kill the w3wp.exe process after 20 (or 30?) seconds if it doesn't exit on its own (non-graceful exit), this why iisreset works for you.
Listing the active requests (appcmd.exe list request /?) should give you a clue about why is this happening.

Creating objects suddenly begins failing after they have been loaded in memory successfully

Behavior:
Application is loaded and being used as expected.
Suddenly, a particular DLL can no longer be loaded. The error message is:
ActiveX component cannot create object.
In each case, the object had been created successfully many times before failure. All objects are marked for "retain in memory".
This error is cleared when the application pool is recycled. It may be hours or months before it is seen again.
Issue has happened within two hours of a refresh, as well as never happened in months of uptime.
Issue has happened with hundreds of simultaneous users (heavy usage) and also with 1-3 users.
While the issue is occurring, the process running that application pool cannot create the object that is failing. However it can create any other objects. Memory, CPU, and other resources all remain at normal usage. In addition, other processes (such as a stand-alone exe) can successfully create the object.
The first instance of the issue appeared in mid 2008. There have been less than fifty instances since then, despite a pool of hundreds of servers for it to occur on. All instances except one have failed on the same DLL.
DLL Failure Info:
most common - generic data structure implementing a b-tree, has no references other than to its interface. Code consists of arrays and one use of the vb6 Event functionality. The object has not been changed in any way since 2005.
one-time - interop to a .NET module. the failure is occurring when trying to create the interop object, not the .NET object. This object is updated a few times each year.
Application Environment:
IIS hosted application
VB6, classic ASP, some interop to minor .NET components
Windows Server 2003 / Windows Server 2008 (both have independently had the problem)
Attempts to Reproduce:
Using scripts (and real-life humans) to run the same end-user workflows that our logs reported the days before the issue occurred.
Using scripts to create/destroy suspected objects as fast as possible from multiple simultaneous sessions.
Wild speculation.
No intentional success, but it does manifest randomly on the servers on its own.
Troubleshooting:
Code reviews
Test harnesses to investigate upper limits of object creation / destruction
Verification of ability to create object outside of the process experiencing the issue
Monitoring of resources over time on servers under load
Review of IIS, error, and event logs to determine events leading up to issue
Questions:
Any ideas on how to reproduce the issue?
What could cause this behavior?
Ideas for bypassing the first two questions in favor of a fast solution?
The DLL isn't on a network drive is it? You can get "glitches" where the drive is not available momentarily that then means COM can't do what it needs and could then fail to notice the drive is available again.
I used Process Monitor to debug similar problem when accessing ADO/OLEDB stack. Turned out environment got corrupted at some point and ADO classes are registered with InprocServer32 being REG_EXPAND_SZ pointing to %CommonProgramFiles%\System\ado\msado15.dll or similar ot x64 OSes.
Also when you register an application with Restart Manager, on failure the process gets restarted by winlogon process whose environment is different than explorer's one and unfortunately is missing %CommonProgramFiles% -- ouch!
This seems like a random failure; some race condition.
Try VMWARE to record the state of the machine you run this dll on. When the error happens you can then replay the record and inspect the memory contents. That why you won't have to play try and catch the error. At least you will have a solid record of it.
While I can't provide a solution, try catching the error and retry loading the dll when this happens after a refresh to the environment.

IIS7: Faulting application w3wp.exe, what is the root cause of these crashes?

Our Website is in .NET but with some old ASP and 32bits libraries too in it. It had been working fine for a while (2 years). But for the past month, we have seen the following error on our IIS7 server, which we have been unable to track down and fix:
"Faulting application w3wp.exe, version 7.0.6001.18000, time stamp 0x47919413, faulting module kernel32.dll, version 6.0.6001.18215, time stamp 0x4995344f, exception code 0xe053534f, fault offset 0x0002f328, process id 0x%9, application start time 0x%10."
We are able to reproduce the error:
One of our .ASPX pages starts loading, executing code and queries (we have response.flush() all over the page to track where the code breaks), then it suddenly stops and we get the above error in IIS.
The page stops loading and, without the response.flush(), it's not redirecting to our error.aspx page (as configured in web.config)
The error does NOT happen all the time. Sometimes, it happens 3 times in a row, then it's working fine for 15 minutes non-stop with a proper redirection to error.aspx.
The error we get then is a classic: "Either BOF or EOF is True, or the current record has been deleted."
When the error occurs, the page hangs and all other session on the same computer from any browsers have hanging web pages as well (BTW, we only allow 1 worker process while we are testing). From other computers, the site loads fine.
I can recycle the Application Pool, kill w3wp.exe, restart IIS. Nothing will do. The only way to successfully load the page again is to Restart MS SQL which handles our Session States. I don't know why this is, but we guessed that the Session Cookies on the users browsers points to a thread which was not terminated properly (due to the above crash) and IIS is waiting for it to terminate to process more code (?). If someone can explain this better, that would be really helpful. Is there a timeout which we can set to "terminate" threads? Is it a MS SQL related issue?
I have also looked at the Private and Virtual Memory usages, because I think our code is not the most effective and I am certain we have remaining memory leaks. However, I saw the page crash even though both Private and Virtual Memories were still quite low (under 100MB each).
I have used Debug Diag and WinDbg as indicated here: http://blogs.msdn.com/b/tess/archive/2009/03/20/debugging-a-net-crash-with-rules-in-debug-diag.aspx, but we are not able to make windbg work, this is what we are trying to do at the moment.
If someone could help us or point us toward the right direction that would be really great, thank you.
"Either BOF or EOF is True, or the current record has been deleted" means the table is empty and you are attempting to do a MoveNext. So check for eof before you do any moves.
IIS is notorious for throwing kernel errors in w3wp.exe like this one. All your errors in session state are just symptoms of the crashed process. Multiple APP pools won't help much - they just spread the error around.
I''d wager it is SQL deadlocks due to your user environment changing. This will cause a 10-second lag as SQL tries to determine which query to kill off. One wins, one loses. The loser gets back a pointer to an unexpectedly empty table and you try a move and subsequent crash. You maybe could point your DB to an ODBC connection and turn on tracing, or figure out a way to get SQL to log it.
I had all the same symptoms as above in Perl. I was able to make a wrapper fn() to do all SQL queries and log all sql, + params and any errors to disk to track down the problem. It was deadlocks, then we were able to code in auto-retry, and eventually we recoded the query order and scanned columns to eliminate the deadlocks.
It's entirely possible one of your referenced/linked assemblies somewhere has randomly gone corrupt (it can happen) on disk. Can you try a replicate the problem on a new, clean machine with the same stats, fresh installs of the latest xyz drivers you're using?
I solved a mysterious problem that took me months to isolate this way. It seemed clean, new machines with the same specs and prerequired drivers would work just fine - only some older machines with the same specs were failing consistently. I ended up uninstalling everything (IIS, ASP.NET, .NET, database and client) and starting from scratch. The end cause when I isolated it was that the db client driver was corrupt on the older machines (and all the older machines were clones of each other, so I assume they were cloned after the corruption occured), and it seemed to be messing with the .NET memory space even when I wasn't calling it directly. I have yet to even reply to my "help me debug this monster" post with this answer because I doubted it would ever help anyone.
We started receiving this error after installing windows updates on a Windows Server 2008R2 machine. Windows Process Activation Service (WAS) installs some additional site bindings that caused issues for our setup.
We removed net.tcp, net.pipe, net.msmq, and msmq.formatname bindings from our website and no longer got the faulting application exception.
This is probably an edge case, but just in case someone is coming here and they are using MVCMailer , I was getting this same error due to the .SendAsync() method on the mailers.
I switched them all to .Send() and the crashing stopped.
See this SO answer for ways to use the mailer async and avoid the crash (allegedly, I did not personally implement it)

Web Server slows down (ASP.NET)

We have a really strange problem. One of the servers in the server farm becomes really slow. We see a number of timeouts in the logs and overall response time is not where it should be (and is on other servers in the farm).
What is also strange is that it is not just the web app - Just logging into the server takes up to 1.5 min to show you the desktop. Once you are in, the system is as responsive as ever - unless you try to launch something, i.e. notepad - it takes another minute to launch and after launch it works fine.
I checked a number of things - memory utilization is reasonable, CPU is below 15%, windows handles, event logs do not show anything.
Recycling the aps.net process does not fix it - it still takes over a minute to log in. Rebooting the server helped, but now it started to slow down again.
After a closer look we found out that Windows Temp directory is full of temp files - over 65k files. This is certainly something to take care of. But my question is could it be the root cause of the sluggishness, or there is still something else lurking in the shadows?
Edit
After more digging I am zeroing in on the issue related to the size of temp directories. This article: describes something very similar. I am still not too sure because the fact that the server is slow to open even Notepad remains unexplained.
Is it possible that under such conditions creating a new temp file takes over a minute?
You might want to check how many threads your using in the ASP.NET thread pool when the timeouts occur. Another idea might be to look at the GC information in perfmon and see if the GC is running a gen2 collection?
Ok, It is official, all of this was grief caused by this issue. When one of our servers was again behaving badly we cleaned the temp directory and it fixed the problem, including the slow login.
This last part still baffles me - I do not understand how excessive number of files in a temp directory can cause login to take over 1 min, leave alone launching a program, but whatever it is clearing the directory fixed it and I can live with it.
Did you check virtual memory as well ? paging ? does you app logs a lot of data in different files ? also - check - maybe the utilization happens in kernel mode and not user mode.

Resources