How to investigate cherrypy crashing? - cherrypy

We have a CherryPy service that integrates with several backend web services. During load testing, the CherryPy process regularly crashes after a while (45 minutes). We know the bottleneck is the backend web services we are using. Before the crash we see 500 and 503 errors when accessing the backend services, but I can't figure out why CherryPy itself would crash (the whole process was killed). Can you give me ideas on how to investigate where the problem is? Is it possible that the thread_pool (50) is queueing up too many requests?

In my early CherryPy days I had it crash once. I mean a Python process crash caused by a segfault. When I investigated, I found that I had mishandled MySQLdb connections by caching them in objects that were accessed by CherryPy threads interchangeably. Because a MySQLdb connection is not thread-safe, it should be accessed only from the thread it was created in. Also, because of the concurrency involved, the crashes seemed nondeterministic and only appeared under load testing. So load testing can work as a debugging tool here: try Apache JMeter or Locust (Pythonic).
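A minimal sketch of the "one connection per thread" rule described above. It uses the standard library's sqlite3 purely as a stand-in for MySQLdb, and the helper name get_connection is hypothetical, not something from CherryPy or MySQLdb:

```python
import sqlite3
import threading

# Per-thread storage: each thread sees its own `conn` attribute.
_local = threading.local()


def get_connection(db_path=":memory:"):
    """Return a connection that belongs to the calling thread only.

    sqlite3 is a stand-in here; with MySQLdb you would call
    MySQLdb.connect(...) at the same place.
    """
    conn = getattr(_local, "conn", None)
    if conn is None:
        conn = sqlite3.connect(db_path)
        _local.conn = conn
    return conn
```

Each request handler would call get_connection() instead of reading a connection cached on a shared object, so a connection is never touched by a thread other than the one that opened it.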
When a process crashes, you can instruct Linux to write a core dump, which will contain a stack trace (e.g. on the MySQLdb C-code side in my example). However alien the low-level C environment is to you (it is to me), the stack trace can help you find which library is causing the crash, or at least narrow the circle of suspects. Here is an article about it.
I also want to note that the problem is unlikely to be in CherryPy itself. It is actually very stable.
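As a complement to the core dump, and not something the answer itself used, Python's standard faulthandler module can write the Python-level traceback of every thread when the process receives a fatal signal such as SIGSEGV. A rough sketch, assuming a Unix system; the function name and the log path are hypothetical:

```python
import faulthandler
import resource


def enable_crash_diagnostics(log_path):
    """Dump Python tracebacks on fatal signals and allow core dumps (Unix only)."""
    # Keep the file open for the life of the process: faulthandler writes
    # to its file descriptor directly from the signal handler.
    log_file = open(log_path, "a")
    faulthandler.enable(file=log_file, all_threads=True)

    # Raise the soft core-dump limit to the hard limit, similar to running
    # `ulimit -c` in the shell. If the hard limit is 0, no core is written.
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return log_file
```

Calling this once at startup, before the server begins serving requests, means a segfault in a C extension leaves both a Python traceback in the log and, if the system permits it, a core file for the C side.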

Related

NodeJS Performance Issue

I'm running an API server using NodeJS 6.10.3 LTS on Ubuntu 14.04 (trusty). I've noticed that my API server tops out at ~600 reqs/min running on a c4.large EC2 instance. By "tops out" I mean that I see the CPU go up to 100%. Note: I know that I could utilize the instance more fully by using the cluster module, but that's OK for now.
I took a .cpuprofile dump of my API server for 10 seconds, and noticed that every second, for ~300ms, the profiler shows my NodeJS code is sitting (idle).
Does anyone know what that (idle) implies? Is it a GC issue? Or is it an internal (to V8) lock that I'm triggering? Any help or pointers to tools to help debug this would be nice. I'm working on anonymizing some of the stack traces in the cpuprofile so I can share it.
The packages I'm using are mainly ExpressJS 4, the Couchbase NodeJS SDK, and Socket.IO. The code paths mainly read requests and push to Couchbase, and finally query Couchbase via the Views API and push some aggregated data onto a Socket.IO channel. So it is all pretty I/O-bound, async-friendly stuff. I've made sure that I'm not calling any synchronous functions. There are no patterns of function calls before the (idle) in the CPU profile.
It could also just be I/O wait, meaning none of the sockets have data ready to read yet and so the time is spent idle. If you are using a load testing library you should check that the requests are evenly distributed within a second.
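One way to check how evenly the load generator spreads its requests is to bucket the request timestamps into small time slots and look at the counts. This is only an illustrative sketch in Python, not part of the Node.js app; the function name and the assumption that timestamps are available as integer milliseconds are mine:

```python
from collections import Counter


def requests_per_slot(timestamps_ms, slot_ms=100):
    """Count requests per time slot.

    timestamps_ms: request arrival times as integer milliseconds.
    Returns {slot_index: count}, ordered by slot index.
    """
    counts = Counter(ts // slot_ms for ts in timestamps_ms)
    return dict(sorted(counts.items()))
```

A bursty load test shows up as a few slots with large counts separated by missing slots, which would be consistent with the process sitting idle for part of every second.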
Take a look at https://www.npmjs.com/package/gc-stats to check GC data. There are flags to increase heap space, and to change when GC runs, if the problem turns out to be GC related.

Node.js Clusters with Additional Processes

We use clustering with our Express apps on multi-CPU boxes. It works well, and we get the maximum use out of AWS Linux servers.
We inherited an app we are fixing up. It's unusual in that it has two processes. It has an Express API portion to take incoming requests. But the process that acts on those requests can run for several minutes, so it was built as a separate background process, with Node calling Python and Maya.
Originally the two were tightly coupled, with the python script called by the request to upload the data. But this of course was suboptimal, as it would leave the client waiting for a response for the time it took to run, so it was rewritten as a background process that runs in a loop, checking for new uploads, and processing them sequentially.
So my question is this: if we have this separate Node process running in the background, and we run clusters, which start up a process for each CPU, how is that going to work? Are we not going to get two Node processes competing for the same CPU? We were getting a bit of weird behaviour and crashing yesterday, without a lot of error messages (god I love Node), so it's a bit concerning. I'm assuming Linux will just swap the processes in and out as they are being used. But I wonder if it will be problematic, and I also wonder about someone getting their web session swapped out for several minutes while the longer-running process runs.
The smart thing to do would be to rewrite this to run on two different servers, but the files that maya uses/creates are on the server's file system, and we were not given the budget to rebuild the way we should. So, we're stuck with this architecture for now.
Any thoughts on possible problems and how to avoid them would be appreciated.
From an overall architecture perspective, spawning one Node.js process per core is a great way to go. You have a lot of interdependencies, though: the Node.js processes are calling Maya, which may use multiple threads (keep that in mind).
The part that is concerning to me is your random crashes and your "process that runs in a loop". If that process is just checking the file system you probably have a race condition where the nodejs processes are competing to work on the same input/output files.
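If several workers do poll the same directory, one common way to avoid that race is to "claim" a file with an atomic rename before processing it. The sketch below is in Python rather than Node.js, and the directory layout and function name are hypothetical; it assumes both directories are on the same filesystem, where a POSIX rename is atomic:

```python
import os


def claim_next_upload(incoming_dir, processing_dir):
    """Atomically move one pending file into processing_dir, or return None."""
    for name in sorted(os.listdir(incoming_dir)):
        src = os.path.join(incoming_dir, name)
        dst = os.path.join(processing_dir, name)
        try:
            # Only one worker can succeed in renaming a given source file;
            # the others find it already gone.
            os.rename(src, dst)
        except FileNotFoundError:
            continue  # another worker claimed this file first
        return dst
    return None
```

A worker then processes only the path it got back. Note that on POSIX a rename silently replaces an existing destination, so file names would need to be unique per upload.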
In theory, one Node.js process per core will work well and should help you use all of your CPU capacity. Linux always swaps processes in and out, so that is not an issue. You could start multiple Node.js processes per core and still not have an issue.
One last note: be sure to keep an eye on your memory usage. Several Linux distributions on EC2 do not have a swap file enabled by default, and running out of memory can be another silent app killer. It is best to add a swap file in case you run into memory issues.

Troubleshooting a Hanging Java Web App

I have a web application that hangs under high loads. I'm not going to go into the specifics of the code because I really just want some troubleshooting advice and tooling recommendations.
It's a web app, so each request gets a thread. Under a high load test, the app begins to consume all of the CPU while becoming unresponsive. I suspect that the request threads are hanging in the new code that we are testing. Because of the CPU consumption, I'm assuming the problem must be on my app side. My understanding, which could be wrong, is that total CPU consumption indicates my first troubleshooting efforts should go into looking at the code that's consuming those cycles.
What are some tools and/or methods for inspecting which threads are hanging and on what lines of code? Again, I can easily force the app into the problematic behavior.
I've found and been trying out VisualVM. It seems like the perfect tool. I'm still open to suggestions though. I looked at Eclipse TPTP, and it seems to be reaching end of life as well as requiring a more heavyweight deployment.
You can insert logging messages where a thread starts and where it finishes. Then you start the application and inspect the output while exercising the code.
Another way is to look for memory leaks. If you are sure you don't have one, you can increase the memory available to your JVM.
@chad: do you have a database in the picture? You may want to start by looking at what is happening on the DB side: DB locks, current sessions, etc.

IIS7: Faulting application w3wp.exe, what is the root cause of these crashes?

Our website is in .NET, but it also includes some old ASP and 32-bit libraries. It had been working fine for a while (2 years). But for the past month, we have seen the following error on our IIS7 server, which we have been unable to track down and fix:
"Faulting application w3wp.exe, version 7.0.6001.18000, time stamp 0x47919413, faulting module kernel32.dll, version 6.0.6001.18215, time stamp 0x4995344f, exception code 0xe053534f, fault offset 0x0002f328, process id 0x%9, application start time 0x%10."
We are able to reproduce the error:
One of our .ASPX pages starts loading, executing code and queries (we have response.flush() all over the page to track where the code breaks), then it suddenly stops and we get the above error in IIS.
The page stops loading and, without the response.flush(), it's not redirecting to our error.aspx page (as configured in web.config)
The error does NOT happen all the time. Sometimes, it happens 3 times in a row, then it's working fine for 15 minutes non-stop with a proper redirection to error.aspx.
The error we get then is a classic: "Either BOF or EOF is True, or the current record has been deleted."
When the error occurs, the page hangs, and all other sessions on the same computer, from any browser, have hanging web pages as well (BTW, we only allow 1 worker process while we are testing). From other computers, the site loads fine.
I can recycle the Application Pool, kill w3wp.exe, or restart IIS, and none of it helps. The only way to successfully load the page again is to restart MS SQL, which handles our Session States. I don't know why this is, but we guessed that the session cookies in the users' browsers point to a thread which was not terminated properly (due to the above crash) and IIS is waiting for it to terminate before processing more code (?). If someone can explain this better, that would be really helpful. Is there a timeout which we can set to "terminate" threads? Is it an MS SQL related issue?
I have also looked at the Private and Virtual Memory usages, because I think our code is not the most effective and I am certain we have remaining memory leaks. However, I saw the page crash even though both Private and Virtual Memories were still quite low (under 100MB each).
I have used Debug Diag and WinDbg as indicated here: http://blogs.msdn.com/b/tess/archive/2009/03/20/debugging-a-net-crash-with-rules-in-debug-diag.aspx, but we have not been able to make WinDbg work. That is what we are trying to do at the moment.
If someone could help us or point us toward the right direction that would be really great, thank you.
"Either BOF or EOF is True, or the current record has been deleted" means the table is empty and you are attempting to do a MoveNext. So check for eof before you do any moves.
IIS is notorious for throwing kernel errors in w3wp.exe like this one. All your errors in session state are just symptoms of the crashed process. Multiple APP pools won't help much - they just spread the error around.
I'd wager it is SQL deadlocks due to your user environment changing. This will cause a 10-second lag as SQL tries to determine which query to kill off. One wins, one loses. The loser gets back a pointer to an unexpectedly empty table, you try a move, and a crash follows. You could perhaps point your DB to an ODBC connection and turn on tracing, or figure out a way to get SQL to log it.
I had all the same symptoms as above in Perl. I was able to make a wrapper fn() to run all SQL queries and log all SQL, plus params and any errors, to disk to track down the problem. It was deadlocks. We were then able to code in auto-retry, and eventually we recoded the query order and scanned columns to eliminate the deadlocks.
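The wrapper idea can be sketched roughly as follows. This is a Python illustration, not the original Perl code; it uses sqlite3 as a stand-in database, and the function name, retry count, and backoff are assumptions. A real version would retry only on the driver's specific deadlock error rather than on every operational error:

```python
import logging
import sqlite3
import time

log = logging.getLogger("sql")


def run_query(conn, sql, params=(), retries=3, delay=0.1):
    """Run one statement, logging it and retrying on OperationalError."""
    for attempt in range(1, retries + 1):
        try:
            log.info("SQL: %s params=%r (attempt %d)", sql, params, attempt)
            rows = conn.execute(sql, params).fetchall()
            conn.commit()
            return rows
        except sqlite3.OperationalError as exc:
            # Log the failing statement with its parameters so the
            # offending query can be identified afterwards.
            log.error("SQL failed: %s params=%r error=%s", sql, params, exc)
            conn.rollback()
            if attempt == retries:
                raise
            time.sleep(delay * attempt)  # simple linear backoff
```

Routing every query through one function like this gives a single place to see which statement and parameters preceded each failure.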
It's entirely possible that one of your referenced/linked assemblies has randomly become corrupt on disk (it can happen). Can you try to replicate the problem on a new, clean machine with the same specs and fresh installs of the latest xyz drivers you're using?
I solved a mysterious problem that took me months to isolate this way. It seemed that clean, new machines with the same specs and prerequisite drivers would work just fine, while only some older machines with the same specs were failing consistently. I ended up uninstalling everything (IIS, ASP.NET, .NET, database and client) and starting from scratch. The cause, when I finally isolated it, was that the db client driver was corrupt on the older machines (and all the older machines were clones of each other, so I assume they were cloned after the corruption occurred), and it seemed to be messing with the .NET memory space even when I wasn't calling it directly. I have yet to even reply to my "help me debug this monster" post with this answer because I doubted it would ever help anyone.
We started receiving this error after installing Windows updates on a Windows Server 2008 R2 machine. Windows Process Activation Service (WAS) installs some additional site bindings that caused issues for our setup.
We removed net.tcp, net.pipe, net.msmq, and msmq.formatname bindings from our website and no longer got the faulting application exception.
This is probably an edge case, but just in case someone is coming here and they are using MVCMailer: I was getting this same error due to the .SendAsync() method on the mailers.
I switched them all to .Send() and the crashing stopped.
See this SO answer for ways to use the mailer async and avoid the crash (allegedly; I did not personally implement it).

Memory Leaks and Apache

My VPS account has been occasionally running out of memory. It's using Apache on Linux. Support says it's a slow memory leak and has enabled MaxRequestsPerChild to deal with it.
I have a few questions about this. When a child process dies, will it cause my scripts to lose session data? Does anyone have advice on how I can track down this memory leak?
Thanks
No, when a child process dies you will not lose any data unless it was in the middle of a request at the time (which should not happen if it exits due to MaxRequestsPerChild).
You should try to reproduce the memory leak using an identical software stack on your test system. You can use tools such as Valgrind to try to detect it.
You can also try a debug build of your web server and its modules, which will enable you to detect what's going on.
It's difficult to reproduce the behaviour of production systems in non-production ones. If you have automated test coverage of your web application, you could try running your full test suite, but in practice this is unlikely to cover every code path and therefore may miss the leaky one.
When a child process dies, will it cause my scripts to lose session data?
Without knowing what scripting language and session handler you are using (and the actual code), it is rather hard to say.
In most cases, when using scripting languages in modules or via [fast] cgi, it's very unlikely that the session data would actually be lost, although if the process dies in the middle of processing a request it may not get the chance to write the updated session back to whatever is storing the session. And in the very unlikely event that it dies during the writeback, it may corrupt the session data. These are quite exceptional circumstances.
OTOH if your application logic is implemented via a daemon (e.g. a Java container) then it's quite probable that memory leaks could accumulate (although these would be reported against a different process).
Note that if the problem is alleviated by setting MaxRequestsPerChild, then it implies that the problem is occurring in an Apache module.
The production releases of Apache itself are, in my experience, very stable and without memory leaks. However, I've not used all the modules. I'm not sure if ExtendedStatus gives a breakdown of memory usage by module, but it might be worth checking.
I've previously seen problems with the memory management of modules loaded by the PHP module not respecting PHP's memory limits - these did clear down at the end of the request though.
C.

Resources