Troubleshooting a Hanging Java Web App

Troubleshooting a Hanging Java Web App - multithreading

I have a web application that hangs under high loads. I'm not going to go into the specifics of the code because I really just want some troubleshooting advice and tooling recommendations.
It's a web app, so each request get's a thread. Under a high load test, the app begins to consume all of the cpu, while becoming unresponsive. I suspect that the request threads are hanging in the new code that we are testing. Due to the fact of the cpu consumption, I'm assuming this must be on my app side. My understanding, which could be wrong, is that total cpu consumption indicated my first troubleshooting efforts should be in looking at the code that's consuming those cycles.
What are some tools and/or methods for inspecting which threads are hanging and on what lines of code? Again, I can easily force the app into the problematic behavior.
I've found and been trying out visualvm. Seems like the perfect tool. Still open for suggestions though. I looked at eclipse TPTP and it seems to be end-of-life-ing as well as requiring a more heavy weight deployment.

You can insert logging messages at starting a thread and closing a thread. Then you start the application and inspect the output while penetrating the code.
Another way is to look for memory leaks. If you are sure you haven't one, you can extend the virtual memory of your JVM.

#chad: do you have Database in whole picture...you may want to start by looking what is happening at DB side...you can very well look into DB locks, current sessions etc.

Related

Server performance degrading over time (a short time) - Xpages

I have an xpages 9.0.1 server running multiple online form type apps. The server runs fine and performance response is quite good. Pages load fast, users are happy.
Over time (yet to determine how long), the server performance degrades and ultimately grinds to an almost stop.
Each night I am scheduling -c "tel http restart" and it is getting me out of trouble.
I am not sure what page is causing the problem as the degrading happens over a couple of days.
Most of our xpages are SSJS, all of our java (of which there is not much) is appropriately recycled.
It does not seem to be effecting RAM memory - it bounces up a bit and down a bit but well with limits. There is no correlation with the increased response times to more memory used.
So where do I look and what tools can I use to isolate the problem. We are more Dev than Admin.
Cheers
Damien

There are profiling tools available that may help pinpoint which application is causing problems. From OpenNTF, XPages Toolbox is specifically for XPages and was contributed by Philippe Riand, who at the time was Chief Architect for XPages http://www.openntf.org/main.nsf/project.xsp?r=project/XPages%20Toolbox.
There are more heavy-duty, Java-specific tools like YourKit available.
Chapter 20 of Mastering XPages second edition specifically covers performance and there is also a lot of information in XPages Portable Command Guide about performance tuning.
If performance is degrading over time, it could be session timeout. By default, that's 30 minutes. You can extend it, but the danger then is that a browser cannot tell the server it's closing the session when the user closes the browser. So those sessions hang around. Equally if there are very long-running tasks, they would hang around until they complete and the session would then still be active until the timeout.

Are you recycling your SSJS?
If you go into the server tasks of Domino Admin what do you see the CPU usage of the HTTP task doing. Also what is the memory usage of your nHTTP task? You may want to watch that.
Have you gone into the console to see if you can see if there us anything that looks bad?
If you can't pinpoint a problem you may want to think of putting some of your pages on a different server to determine if which app if not all is causing this.

Are you using scoped variables that are session or application scope? Application scope variables stay alive so if you are creating those and have some sort of issue where you end up creating a bunch that can affect memory.
Also there is a server and application setting for when the XPages stay in memory. The suggested setting to Keep only the current page in memory and save the others to disk. This is in the XSP properties.

Node.js app has periodic slowness and/or timeouts (does not accept incoming requests)

This problem is killing the stability of my production servers.
To recap, the basic idea is that my node server(s) sometimes intermittently slow down, sometimes resulting in Gateway Timeouts. As best as I can tell from my logs, something is blocking the node thread (meaning that the incoming request is not accepted), but I cannot for the life of me figure out what.
The problem ranges in severity. Sometimes what should be <100ms requests take ~10 seconds to complete; sometimes they never even get accepted by the node server at all. In short, it is as though some random task is working and blocking the node thread for a period of time, thus slowing down (or even blocking) incoming requests; the one thing I can say for sure is that the need-to-fix-symptom is a "Gateway Timeout".
The issue comes and goes without warning. I have not been able to correlate it against CPU usage, RAM usage, uptime, or any other relevant statistic. I've seen the servers handle a large load fine, and then have this error with a small load, so it does not even appear to be load-related. It is not unusual to see the error around 1am PST, which is the smallest load time of the day! Restarting the node app does seem to maybe make the problem go away for a while, but that really doesn't tell me much. I do wonder if it might be a bug in node.js... not very comforting, considering it is killing my production servers.
The first thing I did was to make sure I had upgraded node.js to the latest (0.8.12), as well as all my modules (here they are). Of course, I also have plenty of error catchers in place. I'm not doing anything funky like printing out lots to the console or writing to lots of files.
At first, I thought it was outbound HTTP requests blocking the incoming socket, because the express middleware was not even picking up the inbound request, but I gave up the theory because it looks like the node thread itself became busy.
Next, I went through all my code with JSHint and fixed literally every single warning, including a few accidental globals (forgetting to write "var") but this didn't help
After that, I assumed that perhaps I was running out of memory. But, my heap snapshots via nodetime are looking pretty good now (described below).
Still thinking that memory might be an issue, I took a look at garbage collection. I enabled the --nouse-idle-notification flag and did some more code optimization to NULL objects when they were not needed.
Still convinced that memory was the issue, I added the --expose-gc flag and executed the gc(); command every minute. This did not change anything, except to occasionally make requests a bit slower perhaps.
In a desperate attempt, I setup the "cluster" module to use 2 workers and automatically restart them every 30 min. Still, no luck.
I increased the ulimit to over 10,000 and kept an eye on the open files. There seem to be < 300 open files (or sockets) per node.js app, and increasing the ulimit thus had no impact.
I've been logging my server with nodetime and here's the jist of it:
CentOS 5.2 running on the Amazon Cloud (m1.large instance)
Greater than 5000 MB free memory at all times
Less than 150 MB heap size at all times
CPU usage is less than 60% at all times
I've also checked my MongoDB servers, which have <5% CPU usage and no requests are taking > 100ms to complete, so I highly doubt there's a bottleneck.
I've wrapped (almost) all my code using Q-promises (see code sample), and of course have avoided Sync() calls like the plague. I've tried to replicate the issue on my testing server (OSX), but have had little luck. Of course, this may be just because the production servers are being used by so many people in so many unpredictable ways that I simply cannot replicate via stress tests...

Many months after I first asked this question, I found the answer.
In a nutshell, the problem was that I was not piping a big asset when transferring it from one server to another. In other words, I was downloading an image from one server, before uploading it to a S3 bucket. Instead of streaming the download into the upload, I downloaded the file into memory, and then uploaded it.
I'm not sure why this did not show up as a memory spike, or elsewhere in my statistics.

My guess is Mongoose. If you are storing large payloads in Mongo, Mongoose can be pretty slow due to how it builds the Mongoose objects. See https://github.com/LearnBoost/mongoose/issues/950 for more details on the problem. If this is the problem you wouldn't see it in Mongo itself since the query returns quickly, but object instantiation could take 75x the query time.
Try setting up timers around (process.hrtime()) before and after you the Mongoose objects are being created to see if that might be the problem. If this is the problem, I would switch to using the node Mongo driver directly instead of going through Mongoose.

You are heavily leaking memory, try setting every object to null as soon as you don't need it anymore! Read this.
More information about hunting down memory leaks can be found here.
Give special attention to having multiple references to the same object and check if you have circular references, those are a pain to debug but will help you very much.
Try invoking the garbage collector manually every minute or so (I don't know if you can do this in node.js cause I'm more of a c++ and php coder). From my years of experience working with c++ I can tell you the most likely cause of your application slowing down over time is memory leaks, find them and plug them, you'll be ok!
Also assuming you're not caching and/or processing images, audio or video in memory or anything like that 150M heap is a lot! Those could be hundreds of thousands or even millions of small objects.
You don't have to be running out of memory for your application to slow down... just searching for free memory with that many objects already allocated is a huge job for the memory allocator, it takes a lot of time to allocate each new object and as you leak more and more memory that time only increases.

Is "--nouse-idle-connection" a mistake? do you really mean "--nouse_idle_notification".
I think it's maybe some issues about gc with too many tiny objects.
node is single process, so watch the most busy cpu core is much important than the load.
when your program is slow, you can execute "gdb node pid" and "bt" to see what node is busy doing.

What I'd do is set up a parallel node instance on the same server with some kind of echo service and test that one. If it runs fine, you narrow down your problem to your program code (and not a scheduler/OS-level problem). Then, step by step, include the modules and test again. Certainly this is a lot of work, takes long and I dont know if it is doable on your system.

If you need to get this working now, you can go the NASA redundancy route:
Bring up a second copy of your production servers, and put a proxy in front of them which routes each request to both stacks and returns the first response. I don't recommend this as a perfect long-term solution but it should help significantly reduce issues in production now, and help you gather log data that you could replay to recreate the issues on non-production servers.
Obviously, this is straight-forward for read requests, but more complex for commands which write to the db.

We have a similar problem with our Node.js server. It didn't scale well for weeks and we had tried almost everything as you had. Our problem was in the implicit backlog value which is set very low for high-concurrent environments.
http://nodejs.org/api/http.html#http_server_listen_port_hostname_backlog_callback
Setting the backlog to a significantly higher value (e.g. 10000) as well as tune networking in our kernel (/etc/sysctl.conf on Linux) as described in manual section helped a lot. From this time forward we don't have any timeouts in our Node.js server.

Is it possible to force termination of backgrounding apps on iOS?

I've written an app which is handling videos. As we know, video processing takes a huge amount of memory while dealing with HD resolution. My App always seemed to crash. But actually I am 100% sure, that there is no memory leak in my code. Instruments is showing no leak.
At the beginning I am startin up one OpenGLES view and the video engine. For a very short time the memory consumption is high, but falling down to normal level after the initializations are done. I am always getting memory warnings during this period. Normally this is no problem. But if I have a lot of apps in suspended mode running, the App seems to be crashing. Watching into the crash log and using the debugger shows up, that I am only running out of memory.
My customers are flooding my support mail with "app is crashing" mails. But I do know, that they have too much Apps running in the background, so there is no memory left to go. I think it's bad style programing saying the customer that he has to close Background tasks before running the app.
According to this post this is a common problem.
My question is: Is it possible to tell the OS that one needs a lot of memory so the OS should terminate some suspended Apps? This memory stuff makes me crazy, because it's no bug I could fix.

No. It is not possible to affect anything outside of your sandbox without API calls. None exist for affecting other processes in the public API.
Have you tried to minimize your memory usage? In my experience once a memory warning it thrown apps can be more likely to have problems once they are in the background, even when memory usages drops.
If you are using OpenGLES and textures, if you haven't already compress your textures. What is the specific cause of your memory allocation spike?

IIS Application Pool Optimisation

We are experiencing some performance issues with one of our websites and one of the things I am looking at is optimising the application pool. Are there any recommended methods of calculating maximum virtual/used memory to allow the pool?

The default settings of IIS are pretty well thought out. I'd first start by placing the application in its own application pool to make sure there aren't any other websites that might be causing the problem.
From there, I'd work at analyzing the code to make sure it is the resources the application has access to that are limiting the website. If the resources are what is limiting, it should be easy to figure out what needs priority in the application pool.
If your application is running massive memory software, it would be apparent by the spike and eventual ceiling hit of no more memory available, thus, you'd up your memory for that pool. Likewise, if the site just bogs down over time, you might check how long your sessions are staying alive and possibly shorten them. Just some things to start out with.
As far as hard and fast rules, I haven't come across any and usually it is so 'per-application' specific, that it is hard to define. I would just pay attention to performance counters and then turn one thing on, check them again, and then try another thing, making sure to keep all tests agnostic so you know what is working and what is not.

Isolating a rampant process in IIS

I have a webserver that is pegged and I've been able to isolate it to a particular website instance. I'd like to dig deeper and isolate the particular page/process that is causing the issue.. Any tips?

You can take a memory dump of the process and poke around with windbg.
There are posts on this issue from Tess Ferrandez blog. Just do as she say.

Which version of IIS are you using? Some of the higher ones allow for a separation of which process gets used to handle requests such as a worker process that you could isolate a bit more that way. I'd also suggest reading through the IIS logs to see what requests were being handled, how long they took, etc.
There are many different quirks to each IIS version. The really low ones just had a start/stop functionality, but the newer ones have really given administrators much more control and power, IMO.

You should try using a profiler to identify what is using up the most resources. I've used dotTrace Profiler, although that can be expensive if you're on a tight budget.
It allows you to see exactly what processes and method calls use of the most processing time of a request really well so you can isolate the most resource intensive operations.
You should really be able to use any profiler to do this, not just dotTrace. I just happen to only have experience with this one in particular.

Change your web garden setting to 10 or greater. Then watch your CPU and memory utilization on the web server.
Continue to increase the web garden setting until either the app is completely responsive with less than 5% average utilization OR you have actually maxed your web server's memory.
UPDATE
It's not about diagnosing, it's about properly configuring the IIS server. Web Gardens are one of the top misunderstood features of IIS. By increasing the available threads to handle new requests you remove the appearance of contention at the web server level and place it squarely where it belongs. In this case at your database. Instead of masking a problem it actually highlights exactly where the problem is.

This turned out to be a SQL problem (sql 2005). The solution was found by using SQL activity monitor to identify a suspended process with a Async_network_io wait type. We then ran SQL profiler to narrow it down to two massive queries which were returning an over abundance of results.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string