When I save something in an administration page in Drupal, for example when I save on
http://drupal62/admin/build/modules
it takes a very long time. It says,
Executed 2980 queries in 51606.38 milliseconds. Queries taking longer than 5 ms and queries executed more than once, are highlighted. Page execution time was 52547.06 ms.
I know that this question is vague. I don't think it is a MySQL problem. Maybe you have seen it before.
I've had exactly the same problem as you and although it's not exactly a MySQL problem, it's related to Drupal 6 database optimisation. Are you running on localhost? If so, then could you give some more information as to OS, and if you're using XAMPP, WAMP, etc.
I am currently running Drupal 6 on Windows 7 and WAMP Server with none of the lag you're experiencing. If it's the same issue, I've got it documented so will let you know the config changes.
You're right it's too vague to be fully answered here without more information.
It will depend, of course, on your specific configuration: the modules you use, how many of them, your PHP memory limit, and possible errors in your database.
A common debugging method is simply to disable most, if not all, of the modules and re-enable them one by one until you can single out what goes wrong. You can also start by clearing all caches, though how much that helps again depends on the modules you're using.
I have a problem with a bunch (around 50) of classic ASP-Sites on Win2012R2 with Access-Databases, which drives us crazy.
All asp-pages of all sites on this server run smoothly for around 45 seconds, after that period they (all) completely stop responding to any click for 15 to 20 seconds, then this delay disappears again for the next 45 seconds like it never existed before, it re-appears again - and so on. This effect started out of nothing a few weeks ago, after several months without any problems.
Static HTML-pages are not affected, and it seems, even asp-pages without connecting to their database run fine. We, therefore, tried testing to convert from Access to SQLExpress, but this didn't change anything - even the converted site was affected in the same way (so it seems not to be Access).
We then tried stopping all sites in IIS and re-activating just one single site with very few visitors, to see if the effect only appears when many requests are sent to the server. But the effect still showed up, even after restarting IIS and even after restarting the whole machine with just one website activated in IIS. It seems to be completely independent of the number of requests, just like the server (rather: the asp-engine of IIS) being busy with itself in a periodic pattern.
What we can see in Performance Monitor (see screenshot): while requests/sec goes down to 0 at the moment the effect starts, the number of requests executed continuously accumulates from a normal level (which looks "logical" to me, but only describes the effect and does not explain where it comes from). A few seconds before the effect vanishes, requests/sec grows again and these counters revert to normal values.
We had a similar problem a year ago on a Windows 2008 server, where the sites ran without any problems for several years and then it started out of nothing. After testing some of the sites on a server of another hoster, we found out that the problems didn't appear on his server with Windows 2012 R2 (and still haven't for a full year, while hosting 3 of our sites there). On another hoster's virtual Windows 2012 R2 server we have another single site hosted with more traffic than most of our others, and even there the problem hasn't appeared for a full year now. So our hoster switched over to Windows Server 2012 R2 and - bingo - all the problems were gone. All sites performed like a charm again from that moment on without changing anything but the OS.
We then stopped investigating the issue, thinking the problem was related to the OS. But around 9 months later, it re-appeared, and after hours and hours of investigating we have no idea what to search for and what to do (besides moving all our sites to the other hoster's server, which isn't a real solution to the problem, and we cannot guarantee the effect will not re-appear on that machine sometime in the future).
I definitively found a solution by myself, but in a totally random way. After weeks of searching for a solution to the problem, I worked on cleaning up the server's hard disk and deleted all files in Windows/temp-folder (> 18.000 files!). And since this moment (4 days ago), the described response lag never showed up any more! But a small bunch of new .tmp-files were created in the folder.
My theory is: maybe every time a user visits one of the websites (which opens connections to its Access database, causing a .ldb-file in the database folder), a randomly(?) named .tmp-file (like: jet12f0.tmp) is created in Windows/temp folder in parallel. These files are "normally" deleted again, as database connection closes and the .ldb disappears. Maybe some of the connections are not closed correctly, therefore the corresponding .tmp-file in the Windows/temp folder resides there as an "orphan", literally forever. As time goes by, the folder fills up with these orphaned files. And then it comes, that a new .tmp-file should be generated, but with a name of a still existing "orphaned" .tmp-file. This now causes the server to stop all actions, because it is not possible to establish the new file, named like an existing. After 15 to 20 seconds the conflict is solved by some mechanism (unknown to me) and all runs perfect again, until the next conflict arises around 45 seconds later. And so on...
I must admit: this is only an "amateur" theory; I'm not a server "guru".
Cleaning up this temp-folder from time to time seems to prevent the server from getting into this situation, because there are no file/naming conflicts.
I agree: The real solution would be finding the problem in the code (if there is one), but we can live with that situation, comparing the effort to find the problem with just cleaning up the temp-folder once in a month or so ;-)
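The periodic clean-up described above could be scripted rather than done by hand. Below is a minimal sketch in Node.js, not what was actually run on this server: the function name `removeStaleFiles`, the `.tmp` extension filter and the age threshold are all assumptions for illustration. It only deletes files older than a given age, so temp files belonging to live connections are left alone.

```javascript
const fs = require('fs');
const path = require('path');

// Delete files in `dir` whose names end in `ext` and whose last-modified time
// is older than `maxAgeDays`. Returns the names of the files that were deleted.
// (Hypothetical helper; the name and parameters are illustrative only.)
function removeStaleFiles(dir, ext, maxAgeDays, now = Date.now()) {
  const cutoff = now - maxAgeDays * 24 * 60 * 60 * 1000;
  const removed = [];
  for (const name of fs.readdirSync(dir)) {
    if (!name.toLowerCase().endsWith(ext)) continue;
    const full = path.join(dir, name);
    try {
      const stat = fs.statSync(full);
      if (stat.isFile() && stat.mtimeMs < cutoff) {
        fs.unlinkSync(full);
        removed.push(name);
      }
    } catch (err) {
      // Skip files that vanished or are locked; they may still be in use.
    }
  }
  return removed;
}

// Example call (path and age are assumptions):
// removeStaleFiles('C:\\Windows\\Temp', '.tmp', 1);
```

Run from a scheduled task, this would replace the monthly manual clean-up. It does not address why the files are orphaned in the first place.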
I have an xpages 9.0.1 server running multiple online form type apps. The server runs fine and performance response is quite good. Pages load fast, users are happy.
Over time (yet to determine how long), the server performance degrades and ultimately grinds to an almost stop.
Each night I am scheduling -c "tel http restart" and it is getting me out of trouble.
I am not sure what page is causing the problem as the degrading happens over a couple of days.
Most of our xpages are SSJS, all of our java (of which there is not much) is appropriately recycled.
It does not seem to be affecting RAM; it bounces up a bit and down a bit but stays well within limits. There is no correlation between the increased response times and more memory being used.
So where do I look, and what tools can I use to isolate the problem? We are more Dev than Admin.
Cheers
Damien
There are profiling tools available that may help pinpoint which application is causing problems. From OpenNTF, XPages Toolbox is specifically for XPages and was contributed by Philippe Riand, who at the time was Chief Architect for XPages http://www.openntf.org/main.nsf/project.xsp?r=project/XPages%20Toolbox.
There are more heavy-duty, Java-specific tools like YourKit available.
Chapter 20 of Mastering XPages second edition specifically covers performance and there is also a lot of information in XPages Portable Command Guide about performance tuning.
If performance is degrading over time, it could be session timeout. By default, that's 30 minutes. You can extend it, but the danger then is that a browser cannot tell the server it's closing the session when the user closes the browser. So those sessions hang around. Equally if there are very long-running tasks, they would hang around until they complete and the session would then still be active until the timeout.
Are you recycling your SSJS?
If you go into the server tasks of Domino Admin, what do you see the CPU usage of the HTTP task doing? Also, what is the memory usage of your nHTTP task? You may want to watch that.
Have you gone into the console to see if there is anything that looks bad?
If you can't pinpoint a problem, you may want to think about putting some of your pages on a different server to determine which app, if not all of them, is causing this.
Are you using scoped variables that are session or application scope? Application scope variables stay alive so if you are creating those and have some sort of issue where you end up creating a bunch that can affect memory.
Also, there is a server and application setting for when the XPages stay in memory. The suggested setting is to keep only the current page in memory and save the others to disk. This is in the XSP properties.
I'm having trouble with my Meteor app when it gets to its peak amount of traffic (peak for this is nothing, 1k visits, maybe 2,500 pageviews in a day). CPU usage spikes and never recovers, so I've taken to using Nodetime to monitor usage and I've been reloading the process (forever restart) to get things back to normal.
I'm fairly new to profiling, so finding the underlying cause has me at a loss for where to start. I'm fairly certain it has to do with my app's server code, but the profiling seems to point to the Fibers module as a "hotspot" which I understand aids in making my server code synchronous.
Below is a snippet from the profiling results. I hope someone can guide me in the right direction in troubleshooting this!
While I don't have a specific answer to your question, I have experience dealing with CPU issues for our production meteor app, so I can give you a list of things to investigate.
Upgrade to the latest version of meteor and the appropriate node version (see the changelog). As of this writing that's meteor 0.8.2 and node 0.10.28.
Read this and this article. The latter makes a great point that you really should always try to delay activation of subscriptions until you need them. In particular you may not need to publish anything for users who are not logged in. In my experience, meteor CPU problems have everything to do with subscriptions.
Be careful with observe and observeChanges. These are expensive and are easy to abuse. In particular:
Make sure you are calling stop() on your handles when they are no longer needed (consider using a package like publish-with-relations so this is done for you).
Fetch only the collections and fields that you absolutely need. Observe works by continually diffing objects (requires lots of CPU). The fewer and smaller objects you have, the less there is to compute.
Consider using smart-collections before it is retired. Use oplog tailing - this can make for a night and day difference in performance and CPU usage in your app.
Consider making some things not reactive (also mentioned in the articles above). For us that was a big win. We had one extremely expensive join that was used on two frequently accessed pages on the site. When it got to the point where the CPU was pegged at 100% about every 30 minutes I gave up on reactivity for that element and just did the join on the server and shipped the data to the client via a method call. I also created a server-side expiring cache for these results and stored them by user (special thanks to Matt DeBergalis for this suggestion).
Do a preventative nightly restart. I have a cron job that tells forever to restart our app once a day in the middle of the night. That brings the CPU down from ~10% to 1%. This seems like black magic, but the fact that the CPU usage changes after a reset leads me to believe this is a good idea.
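To make the "server-side expiring cache, stored by user" idea from the list above more concrete, here is a minimal sketch in plain JavaScript. It is not our production code: the class name `ExpiringCache`, the TTL parameter and the injectable clock are assumptions for illustration. The `compute` callback stands in for the expensive server-side join.

```javascript
// A minimal in-memory cache keyed by user ID; each entry expires after `ttlMs`.
// (Hypothetical helper; names and parameters are illustrative only.)
class ExpiringCache {
  constructor(ttlMs, now = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;            // injectable clock, handy for testing
    this.entries = new Map();  // userId -> { value, expiresAt }
  }

  // Return the cached value for `userId`, or compute and store a fresh one
  // if there is no entry or the entry has expired.
  get(userId, compute) {
    const hit = this.entries.get(userId);
    if (hit && hit.expiresAt > this.now()) return hit.value;
    const value = compute(userId);
    this.entries.set(userId, { value, expiresAt: this.now() + this.ttlMs });
    return value;
  }
}

// Possible use inside a Meteor method (expensiveJoin is a placeholder):
// const cache = new ExpiringCache(5 * 60 * 1000);
// Meteor.methods({
//   joinedData() { return cache.get(this.userId, expensiveJoin); }
// });
```

Note that this sketch only replaces expired entries when they are next requested; it never evicts entries for users who do not come back, so a long-running process would want some periodic pruning as well.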
Updated thoughts (1/13/14)
We migrated to oplog tailing as soon as it was available (meteor 0.7) and that made a big difference. Note that in order to get access to the oplog, you'll probably need to either host your own db or run a dedicated instance on the hosting provider of your choice. I'd also recommend adding the facts package to actually tell if it's working.
There was a memory leak discovered in publish-with-relations, and as of this writing the atmosphere version (v0.1.5) hasn't been bumped to reflect these changes. If you are using it in production, I strongly recommend checking out the HEAD version and running it locally.
We stopped doing nightly restarts a couple of weeks ago. So far everything has been fine (fingers crossed).
Updated thoughts (7/2/14)
A few months ago we switched over to using an Elastic Deployment on mongohq. It's affordable, the performance has been great, and they even have a blog post which tells you how to enable oplog tailing.
I'd strongly recommend checking out kadira to help diagnose performance issues in your app. Also check out the academy articles which have a number of good tips in them.
I'm also having this problem. Actually there is an issue with 0.6.6.1, I run meteor --release 0.6.6 and the cpu is back to normal now.
This problem is killing the stability of my production servers.
To recap, the basic idea is that my node server(s) sometimes intermittently slow down, sometimes resulting in Gateway Timeouts. As best as I can tell from my logs, something is blocking the node thread (meaning that the incoming request is not accepted), but I cannot for the life of me figure out what.
The problem ranges in severity. Sometimes what should be <100ms requests take ~10 seconds to complete; sometimes they never even get accepted by the node server at all. In short, it is as though some random task is working and blocking the node thread for a period of time, thus slowing down (or even blocking) incoming requests; the one thing I can say for sure is that the need-to-fix-symptom is a "Gateway Timeout".
The issue comes and goes without warning. I have not been able to correlate it against CPU usage, RAM usage, uptime, or any other relevant statistic. I've seen the servers handle a large load fine, and then have this error with a small load, so it does not even appear to be load-related. It is not unusual to see the error around 1am PST, which is the smallest load time of the day! Restarting the node app does seem to maybe make the problem go away for a while, but that really doesn't tell me much. I do wonder if it might be a bug in node.js... not very comforting, considering it is killing my production servers.
The first thing I did was to make sure I had upgraded node.js to the latest (0.8.12), as well as all my modules (here they are). Of course, I also have plenty of error catchers in place. I'm not doing anything funky like printing out lots to the console or writing to lots of files.
At first, I thought it was outbound HTTP requests blocking the incoming socket, because the express middleware was not even picking up the inbound request, but I gave up the theory because it looks like the node thread itself became busy.
Next, I went through all my code with JSHint and fixed literally every single warning, including a few accidental globals (forgetting to write "var"), but this didn't help.
After that, I assumed that perhaps I was running out of memory. But, my heap snapshots via nodetime are looking pretty good now (described below).
Still thinking that memory might be an issue, I took a look at garbage collection. I enabled the --nouse-idle-notification flag and did some more code optimization to NULL objects when they were not needed.
Still convinced that memory was the issue, I added the --expose-gc flag and executed the gc(); command every minute. This did not change anything, except to occasionally make requests a bit slower perhaps.
In a desperate attempt, I set up the "cluster" module to use 2 workers and automatically restart them every 30 min. Still, no luck.
I increased the ulimit to over 10,000 and kept an eye on the open files. There seem to be < 300 open files (or sockets) per node.js app, and increasing the ulimit thus had no impact.
I've been logging my server with nodetime and here's the gist of it:
CentOS 5.2 running on the Amazon Cloud (m1.large instance)
Greater than 5000 MB free memory at all times
Less than 150 MB heap size at all times
CPU usage is less than 60% at all times
I've also checked my MongoDB servers, which have <5% CPU usage and no requests are taking > 100ms to complete, so I highly doubt there's a bottleneck.
I've wrapped (almost) all my code using Q-promises (see code sample), and of course have avoided Sync() calls like the plague. I've tried to replicate the issue on my testing server (OSX), but have had little luck. Of course, this may be just because the production servers are being used by so many people in so many unpredictable ways that I simply cannot replicate via stress tests...
Many months after I first asked this question, I found the answer.
In a nutshell, the problem was that I was not piping a big asset when transferring it from one server to another. In other words, I was downloading an image from one server, before uploading it to a S3 bucket. Instead of streaming the download into the upload, I downloaded the file into memory, and then uploaded it.
I'm not sure why this did not show up as a memory spike, or elsewhere in my statistics.
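For anyone hitting the same thing, here is a minimal sketch of the streaming approach, assuming a current Node.js version where `stream.pipeline` and `Readable.from` are available (on the 0.8 release discussed above, the equivalent would have been `source.pipe(destination)`). The names `transfer`, `demoSource` and `countingSink` are made up for illustration; in the real case the source would be the HTTP response of the image download and the destination an upload stream to S3.

```javascript
const { pipeline, Readable, Writable } = require('stream');

// Stream `source` into `destination` chunk by chunk instead of buffering the
// whole asset in memory. pipeline() applies backpressure and destroys both
// streams if either one fails.
function transfer(source, destination) {
  return new Promise((resolve, reject) => {
    pipeline(source, destination, (err) => (err ? reject(err) : resolve()));
  });
}

// Stand-in source for illustration: emits the given chunks one at a time.
function demoSource(chunks) {
  return Readable.from(chunks);
}

// Stand-in destination for illustration: counts bytes and chunks received.
function countingSink(stats) {
  return new Writable({
    write(chunk, _encoding, callback) {
      stats.bytes += chunk.length;
      stats.chunks += 1;
      callback();
    },
  });
}
```

With this shape, only one chunk at a time sits in memory, regardless of how large the asset is.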
My guess is Mongoose. If you are storing large payloads in Mongo, Mongoose can be pretty slow due to how it builds the Mongoose objects. See https://github.com/LearnBoost/mongoose/issues/950 for more details on the problem. If this is the problem you wouldn't see it in Mongo itself since the query returns quickly, but object instantiation could take 75x the query time.
Try setting up timers (process.hrtime()) before and after the point where the Mongoose objects are being created to see if that might be the problem. If it is, I would switch to using the node Mongo driver directly instead of going through Mongoose.
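A minimal sketch of such timers, using `process.hrtime()` as suggested. The helper names `timed` and `timedAsync` are hypothetical, and what you wrap in them (for example a query that builds full Mongoose documents versus the same query through the raw driver) is up to you.

```javascript
// Run a synchronous function and return its result plus elapsed milliseconds.
function timed(fn) {
  const start = process.hrtime();             // [seconds, nanoseconds]
  const result = fn();
  const [sec, nano] = process.hrtime(start);  // elapsed time since `start`
  return { result, ms: sec * 1e3 + nano / 1e6 };
}

// Same idea for a function that returns a promise.
async function timedAsync(fn) {
  const start = process.hrtime();
  const result = await fn();
  const [sec, nano] = process.hrtime(start);
  return { result, ms: sec * 1e3 + nano / 1e6 };
}

// Example (runQuery is a placeholder for your own data access code):
// const { ms } = await timedAsync(() => runQuery());
// console.log('query + object creation took', ms.toFixed(1), 'ms');
```

Comparing the measured time against the query time reported by Mongo itself should show whether object creation is where the time goes.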
You are heavily leaking memory, try setting every object to null as soon as you don't need it anymore! Read this.
More information about hunting down memory leaks can be found here.
Give special attention to having multiple references to the same object and check if you have circular references, those are a pain to debug but will help you very much.
Try invoking the garbage collector manually every minute or so (I don't know if you can do this in node.js cause I'm more of a c++ and php coder). From my years of experience working with c++ I can tell you the most likely cause of your application slowing down over time is memory leaks, find them and plug them, you'll be ok!
Also, assuming you're not caching and/or processing images, audio or video in memory or anything like that, a 150 MB heap is a lot! Those could be hundreds of thousands or even millions of small objects.
You don't have to be running out of memory for your application to slow down... just searching for free memory with that many objects already allocated is a huge job for the memory allocator, it takes a lot of time to allocate each new object and as you leak more and more memory that time only increases.
Is "--nouse-idle-connection" a mistake? Do you really mean "--nouse_idle_notification"?
I think it may be a gc issue caused by too many tiny objects.
Node runs in a single process, so watching the busiest CPU core is much more important than watching the overall load.
When your program is slow, you can execute "gdb node pid" and "bt" to see what node is busy doing.
What I'd do is set up a parallel node instance on the same server with some kind of echo service and test that one. If it runs fine, you narrow down your problem to your program code (and not a scheduler/OS-level problem). Then, step by step, include the modules and test again. Certainly this is a lot of work and takes long, and I don't know if it is doable on your system.
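Before attaching gdb, a cheap in-process check is to measure how late a repeating timer fires: if something blocks the node thread, the timer callback is delayed by roughly the length of the block. This is only a sketch; the function name `monitorEventLoopLag`, the interval and the threshold are assumptions.

```javascript
// Fire a timer every `intervalMs` and call `onLag(lagMs)` whenever a tick
// arrives more than `thresholdMs` later than it was scheduled.
// Returns a function that stops the monitor.
function monitorEventLoopLag(intervalMs, thresholdMs, onLag) {
  let expected = Date.now() + intervalMs;
  const timer = setInterval(() => {
    const lag = Date.now() - expected;
    if (lag > thresholdMs) onLag(lag);
    expected = Date.now() + intervalMs;
  }, intervalMs);
  timer.unref(); // do not keep the process alive just for monitoring
  return () => clearInterval(timer);
}

// Example (values are assumptions):
// monitorEventLoopLag(100, 200, (lag) => console.error('event loop blocked for ~' + lag + ' ms'));
```

Logging the lag together with a timestamp lets you line the stalls up against your request logs, which tells you when to run gdb.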
If you need to get this working now, you can go the NASA redundancy route:
Bring up a second copy of your production servers, and put a proxy in front of them which routes each request to both stacks and returns the first response. I don't recommend this as a perfect long-term solution but it should help significantly reduce issues in production now, and help you gather log data that you could replay to recreate the issues on non-production servers.
Obviously, this is straight-forward for read requests, but more complex for commands which write to the db.
We had a similar problem with our Node.js server. It didn't scale well for weeks and we had tried almost everything, as you have. Our problem was the implicit backlog value, which is set very low for high-concurrency environments.
http://nodejs.org/api/http.html#http_server_listen_port_hostname_backlog_callback
Setting the backlog to a significantly higher value (e.g. 10000), as well as tuning networking in our kernel (/etc/sysctl.conf on Linux) as described in the manual section, helped a lot. Since then we haven't had any timeouts in our Node.js server.
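As a minimal sketch of what passing the backlog looks like (the helper name `startServer`, the host and the handler are assumptions for illustration, not our actual server):

```javascript
const http = require('http');

// Start an HTTP server with an explicit listen backlog, i.e. the maximum
// length of the queue of pending connections.
function startServer(port, backlog) {
  const server = http.createServer((req, res) => {
    res.end('ok\n');
  });
  // Signature: server.listen(port, host, backlog, callback)
  server.listen(port, '127.0.0.1', backlog);
  return server;
}

// Example (port and backlog values are assumptions):
// startServer(8080, 10000);
```

Keep in mind that on Linux the kernel caps the effective backlog at net.core.somaxconn, so a large value here has no effect unless that kernel setting is raised as well.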
We have a really strange problem. One of the servers in the server farm becomes really slow. We see a number of timeouts in the logs and overall response time is not where it should be (and is on other servers in the farm).
What is also strange is that it is not just the web app - Just logging into the server takes up to 1.5 min to show you the desktop. Once you are in, the system is as responsive as ever - unless you try to launch something, i.e. notepad - it takes another minute to launch and after launch it works fine.
I checked a number of things - memory utilization is reasonable, CPU is below 15%, windows handles, event logs do not show anything.
Recycling the ASP.NET process does not fix it - it still takes over a minute to log in. Rebooting the server helped, but now it has started to slow down again.
After a closer look we found out that Windows Temp directory is full of temp files - over 65k files. This is certainly something to take care of. But my question is could it be the root cause of the sluggishness, or there is still something else lurking in the shadows?
Edit
After more digging I am zeroing in on the issue related to the size of temp directories. This article describes something very similar. I am still not too sure, because the fact that the server is slow to open even Notepad remains unexplained.
Is it possible that under such conditions creating a new temp file takes over a minute?
You might want to check how many threads you're using in the ASP.NET thread pool when the timeouts occur. Another idea might be to look at the GC information in perfmon and see if the GC is running a gen2 collection.
Ok, It is official, all of this was grief caused by this issue. When one of our servers was again behaving badly we cleaned the temp directory and it fixed the problem, including the slow login.
This last part still baffles me - I do not understand how an excessive number of files in a temp directory can cause login to take over 1 min, let alone launching a program, but whatever it is, clearing the directory fixed it and I can live with it.
Did you check virtual memory as well? Paging? Does your app log a lot of data to different files? Also check whether the utilization happens in kernel mode rather than user mode.
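Since the problem crept back after a reboot, it may be worth watching the size of the temp directory rather than waiting for the slowdown. A minimal sketch in Node.js follows; the function name `checkTempDir` and the limit are assumptions, not a known threshold at which Windows starts to misbehave.

```javascript
const fs = require('fs');

// Count the entries in `dir` and report whether the count exceeds `limit`.
// (Hypothetical helper; the limit is something you would pick yourself.)
function checkTempDir(dir, limit) {
  const count = fs.readdirSync(dir).length;
  return { dir, count, overLimit: count > limit };
}

// Example (path and limit are assumptions):
// console.log(checkTempDir('C:\\Windows\\Temp', 10000));
```

Run on a schedule, this gives an early warning well before the directory reaches the 65k files seen here.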