I have a very specific question regarding some interesting behaviour I observed while performing experiments with L1D cache miss rates for certain programs.
Basically, I tried to find out how high the L1D load miss rate of a Tomcat9 server is when it's running alone on an isolated core. Additionally, I wanted to compare this value to the miss rate when I invalidate the L1D cache using the IA32_FLUSH_CMD MSR upon each context switch.
As you would expect, the miss rate is higher when I invalidate the cache.
Now here comes the interesting part: I tried the same thing running both my Tomcat server and, additionally, an apache2 web server. The result was that the miss rate for the Tomcat server is actually HIGHER than when I invalidate the cache on each context switch, which I don't really understand. I would expect it to be at most as high as the miss rate measured with cache invalidation because, while the web server may evict some (maybe even all) of the Tomcat server's lines, it should most likely leave some in place as well. And even if it evicted all of the lines, I'd expect that to give me a miss rate similar to the one I get when I invalidate the cache.
Some important information about the system:
SMT is disabled, so there is no cache interference from a sibling hardware thread on the same core
For the tests I'm running benchmarks to send continuous requests to both servers.
i7-7700K (32 KB L1D cache)
Ubuntu Server 20.04, Kernel v5.13
If you have any idea about why this happens, I'd really appreciate the input.
Thank you!
Context; I'm writing a monitoring/management app for a VPS, which is running Linux
Reasons; I need to quickly be able to identify overloaded threads, high ram usage, badly behaving tasks.
Problems and current stage; Right now my code works well. I'm using the systeminformation npm module to gather system information such as CPU usage, memory usage, disk status and the task list; I put it into an object and send it to all connected clients on a socket.io server. The problem is that this approach brings the host machine to its knees (both server and client are running locally, because I'm still working on them): my CPU usage goes from 6% to 80% in an instant, which is ridiculous. I want this to update at least once a second, but if possible 60 times a second. The point is, I need to find a different way of retrieving the usage data for the CPU (ideally for each thread as well), memory, disks and the list of tasks. I know this question is not very specific, but I believe this is something more people than just me encounter: NodeJS just kills the machine (irony). Looking forward to any help!
I tried different approaches before but they seemed to lower the usage by a bit or just up it because of the need to have more modules loaded. This generally leads me to the conclusion I just need a better module to handle this stuff.
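One lower-overhead option for the CPU part might be Node's built-in os.cpus(), which returns cumulative tick counters per logical core. The sketch below is only an illustration: the function names (sampleCpuTimes, cpuUsagePercent, startCpuMonitor) are made up for this example, and it covers per-core CPU usage only, not memory, disks or the task list.

```javascript
const os = require('os');

// Snapshot the cumulative per-core tick counters reported by the OS.
function sampleCpuTimes() {
  return os.cpus().map((cpu) => ({ ...cpu.times }));
}

// Compute the busy percentage of each core between two snapshots.
// `prev` and `curr` are arrays of { user, nice, sys, idle, irq } objects.
function cpuUsagePercent(prev, curr) {
  return curr.map((times, i) => {
    const before = prev[i];
    const total = Object.keys(times).reduce(
      (sum, key) => sum + (times[key] - before[key]),
      0
    );
    const idle = times.idle - before.idle;
    return total > 0 ? (100 * (total - idle)) / total : 0;
  });
}

// Hypothetical wiring: sample once per interval and pass the per-core
// percentages to a callback (for example, one that emits them to clients).
function startCpuMonitor(intervalMs, onUsage) {
  let prev = sampleCpuTimes();
  const timer = setInterval(() => {
    const curr = sampleCpuTimes();
    onUsage(cpuUsagePercent(prev, curr));
    prev = curr;
  }, intervalMs);
  return () => clearInterval(timer); // call this to stop monitoring
}
```

Something like startCpuMonitor(1000, (usage) => io.emit('cpu', usage)) would push one update per second, assuming io is the existing socket.io server object.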
I have a WordPress website hosted on a shared Red Hat 4.4.7-18 Linux box, and from time to time I get huge memory and I/O spikes exceeding my allowed memory limit, making my website unresponsive.
I have cPanel installed, but it only shows me the spikes; I can't find a way to see exactly which process is causing them.
I think this is being caused by one of my plugins and I'd really like to know which one. I have 15+ plugins installed, so disabling them one by one and checking whether the issue persists is not an option, as this apparently happens randomly.
I'm having trouble with my Meteor app when it gets to its peak amount of traffic (peak for this is nothing, 1k visits, maybe 2,500 pageviews in a day). CPU usage spikes and never recovers, so I've taken to using Nodetime to monitor usage and I've been reloading the process (forever restart) to get things back to normal.
I'm fairly new to profiling, so finding the underlying cause has me at a loss for where to start. I'm fairly certain it has to do with my app's server code, but the profiling seems to point to the Fibers module as a "hotspot" which I understand aids in making my server code synchronous.
Below is a snippet from the profiling results. I hope someone can guide me in the right direction in troubleshooting this!
While I don't have a specific answer to your question, I have experience dealing with CPU issues for our production meteor app, so I can give you a list of things to investigate.
Upgrade to the latest version of meteor and the appropriate node version (see the changelog). As of this writing that's meteor 0.8.2 and node 0.10.28.
Read this and this article. The latter makes a great point that you really should always try to delay activation of subscriptions until you need them. In particular you may not need to publish anything for users who are not logged in. In my experience, meteor CPU problems have everything to do with subscriptions.
Be careful with observe and observeChanges. These are expensive and are easy to abuse. In particular:
Make sure you are calling stop() on your handles when they are no longer needed (consider using a package like publish-with-relations so this is done for you).
Fetch only the collections and fields that you absolutely need. Observe works by continually diffing objects (requires lots of CPU). The fewer and smaller objects you have, the less there is to compute.
Consider using smart-collections before it is retired. Use oplog tailing - this can make for a night and day difference in performance and CPU usage in your app.
Consider making some things not reactive (also mentioned in the articles above). For us that was a big win. We had one extremely expensive join that was used on two frequently accessed pages on the site. When it got to the point where the CPU was pegged at 100% about every 30 minutes I gave up on reactivity for that element and just did the join on the server and shipped the data to the client via a method call. I also created a server-side expiring cache for these results and stored them by user (special thanks to Matt DeBergalis for this suggestion).
Do a preventative nightly restart. I have a cron job that tells forever to restart our app once a day in the middle of the night. That brings the CPU down from ~10% to 1%. This seems like black magic, but the fact that the CPU usage changes after a reset leads me to believe this is a good idea.
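The server-side expiring cache mentioned above (for the non-reactive join) could look roughly like the following. This is a minimal sketch under my own assumptions, not the code we actually ran: createExpiringCache and getOrCompute are hypothetical names, and the clock is injectable only so the expiry logic is easy to test.

```javascript
// A tiny in-memory cache whose entries expire after `ttlMs` milliseconds.
// `now` is injectable so the expiry logic can be exercised without waiting.
function createExpiringCache(ttlMs, now = Date.now) {
  const entries = new Map(); // key -> { value, expiresAt }

  return {
    // Return the cached value for `key` if it is still fresh;
    // otherwise call `compute`, store its result, and return it.
    getOrCompute(key, compute) {
      const hit = entries.get(key);
      if (hit && hit.expiresAt > now()) {
        return hit.value;
      }
      const value = compute();
      entries.set(key, { value, expiresAt: now() + ttlMs });
      return value;
    },
    // Number of stored entries (expired ones are only replaced on access).
    size() {
      return entries.size;
    },
  };
}

// Hypothetical usage inside a server method, keyed by user id:
//   const joinCache = createExpiringCache(5 * 60 * 1000);
//   const rows = joinCache.getOrCompute(userId, () => runExpensiveJoin(userId));
```

Note that this sketch never evicts entries for users who do not come back, so a real version would probably want a periodic sweep.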
Updated thoughts (1/13/14)
We migrated to oplog tailing as soon as it was available (meteor 0.7) and that made a big difference. Note that in order to get access to the oplog, you'll probably need to either host your own db or run a dedicated instance on the hosting provider of your choice. I'd also recommend adding the facts package to actually tell if it's working.
There was a memory leak discovered in publish-with-relations, and as of this writing the atmosphere version (v0.1.5) hasn't been bumped to reflect these changes. If you are using it in production, I strongly recommend checking out the HEAD version and running it locally.
We stopped doing nightly restarts a couple of weeks ago. So far everything has been fine (fingers crossed).
Updated thoughts (7/2/14)
A few months ago we switched over to using an Elastic Deployment on mongohq. It's affordable, the performance has been great, and they even have a blog post which tells you how to enable oplog tailing.
I'd strongly recommend checking out kadira to help diagnose performance issues in your app. Also check out the academy articles which have a number of good tips in them.
I'm also having this problem. Actually, there is an issue with 0.6.6.1; I ran meteor --release 0.6.6 and the CPU is back to normal now.
This problem is killing the stability of my production servers.
To recap, the basic idea is that my node server(s) sometimes intermittently slow down, sometimes resulting in Gateway Timeouts. As best as I can tell from my logs, something is blocking the node thread (meaning that the incoming request is not accepted), but I cannot for the life of me figure out what.
The problem ranges in severity. Sometimes what should be <100ms requests take ~10 seconds to complete; sometimes they never even get accepted by the node server at all. In short, it is as though some random task is working and blocking the node thread for a period of time, thus slowing down (or even blocking) incoming requests; the one thing I can say for sure is that the need-to-fix-symptom is a "Gateway Timeout".
The issue comes and goes without warning. I have not been able to correlate it against CPU usage, RAM usage, uptime, or any other relevant statistic. I've seen the servers handle a large load fine, and then have this error with a small load, so it does not even appear to be load-related. It is not unusual to see the error around 1am PST, which is the smallest load time of the day! Restarting the node app does seem to maybe make the problem go away for a while, but that really doesn't tell me much. I do wonder if it might be a bug in node.js... not very comforting, considering it is killing my production servers.
The first thing I did was to make sure I had upgraded node.js to the latest (0.8.12), as well as all my modules (here they are). Of course, I also have plenty of error catchers in place. I'm not doing anything funky like printing out lots to the console or writing to lots of files.
At first, I thought it was outbound HTTP requests blocking the incoming socket, because the express middleware was not even picking up the inbound request, but I gave up the theory because it looks like the node thread itself became busy.
Next, I went through all my code with JSHint and fixed literally every single warning, including a few accidental globals (forgetting to write "var"), but this didn't help.
After that, I assumed that perhaps I was running out of memory. But, my heap snapshots via nodetime are looking pretty good now (described below).
Still thinking that memory might be an issue, I took a look at garbage collection. I enabled the --nouse-idle-notification flag and did some more code optimization to NULL objects when they were not needed.
Still convinced that memory was the issue, I added the --expose-gc flag and executed the gc(); command every minute. This did not change anything, except to occasionally make requests a bit slower perhaps.
In a desperate attempt, I set up the "cluster" module to use 2 workers and automatically restart them every 30 min. Still, no luck.
I increased the ulimit to over 10,000 and kept an eye on the open files. There seem to be < 300 open files (or sockets) per node.js app, and increasing the ulimit thus had no impact.
I've been logging my server with nodetime and here's the gist of it:
CentOS 5.2 running on the Amazon Cloud (m1.large instance)
Greater than 5000 MB free memory at all times
Less than 150 MB heap size at all times
CPU usage is less than 60% at all times
I've also checked my MongoDB servers, which have <5% CPU usage and no requests are taking > 100ms to complete, so I highly doubt there's a bottleneck.
I've wrapped (almost) all my code using Q-promises (see code sample), and of course have avoided Sync() calls like the plague. I've tried to replicate the issue on my testing server (OSX), but have had little luck. Of course, this may be just because the production servers are being used by so many people in so many unpredictable ways that I simply cannot replicate via stress tests...
Many months after I first asked this question, I found the answer.
In a nutshell, the problem was that I was not piping a big asset when transferring it from one server to another. In other words, I was downloading an image from one server, before uploading it to a S3 bucket. Instead of streaming the download into the upload, I downloaded the file into memory, and then uploaded it.
I'm not sure why this did not show up as a memory spike, or elsewhere in my statistics.
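For anyone hitting the same thing, the difference is between buffering the whole file and piping it. A minimal sketch of the streaming version, assuming a recent Node.js release where stream/promises exists (far newer than the 0.8 version mentioned in the question); streamCopy, download and upload are hypothetical names, not my original code:

```javascript
const { pipeline } = require('stream/promises');

// Copy `source` into `destination` chunk by chunk instead of holding the
// whole asset in memory. pipeline() applies backpressure and destroys both
// streams if either side fails.
async function streamCopy(source, destination) {
  await pipeline(source, destination);
}

// Hypothetical usage: `download` is a readable stream (for example an HTTP
// response) and `upload` is a writable stream provided by the storage client.
//   await streamCopy(download, upload);
```

With this shape only a few chunks are in flight at any time, regardless of how large the asset is.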
My guess is Mongoose. If you are storing large payloads in Mongo, Mongoose can be pretty slow due to how it builds the Mongoose objects. See https://github.com/LearnBoost/mongoose/issues/950 for more details on the problem. If this is the problem you wouldn't see it in Mongo itself since the query returns quickly, but object instantiation could take 75x the query time.
Try setting up timers (process.hrtime()) before and after the Mongoose objects are created to see if that might be the problem. If it is, I would switch to using the node Mongo driver directly instead of going through Mongoose.
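A small timing helper along these lines might be enough to check. It is only a sketch: msSince and timeIt are made-up names, and where exactly you take the start and end readings around the Mongoose calls is up to you.

```javascript
// Milliseconds elapsed since `start`, where `start` came from process.hrtime().
// process.hrtime(start) returns the difference as [seconds, nanoseconds].
function msSince(start) {
  const [seconds, nanoseconds] = process.hrtime(start);
  return seconds * 1e3 + nanoseconds / 1e6;
}

// Run a synchronous function, log how long it took, and return both
// its result and the elapsed time.
function timeIt(label, fn) {
  const start = process.hrtime();
  const result = fn();
  const elapsedMs = msSince(start);
  console.log(`${label}: ${elapsedMs.toFixed(3)} ms`);
  return { result, elapsedMs };
}

// For asynchronous code, take the start reading before issuing the query
// and call msSince(start) inside the callback that receives the documents.
```

If the time measured around document creation dwarfs the query time that Mongo itself reports, that would point at object instantiation rather than the database.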
You are heavily leaking memory; try setting every object to null as soon as you don't need it anymore! Read this.
More information about hunting down memory leaks can be found here.
Pay special attention to multiple references to the same object and check whether you have circular references. Those are a pain to debug, but finding them will help you very much.
Try invoking the garbage collector manually every minute or so (I don't know if you can do this in node.js because I'm more of a C++ and PHP coder). From my years of experience working with C++ I can tell you the most likely cause of your application slowing down over time is memory leaks; find them and plug them and you'll be OK!
Also assuming you're not caching and/or processing images, audio or video in memory or anything like that 150M heap is a lot! Those could be hundreds of thousands or even millions of small objects.
You don't have to be running out of memory for your application to slow down... just searching for free memory with that many objects already allocated is a huge job for the memory allocator, it takes a lot of time to allocate each new object and as you leak more and more memory that time only increases.
Is "--nouse-idle-connection" a mistake? Do you really mean "--nouse_idle_notification"?
I think it may be a GC issue caused by too many tiny objects.
Node runs in a single process, so watching the busiest CPU core is much more important than watching the overall load.
When your program is slow, you can execute "gdb node pid" and "bt" to see what node is busy doing.
What I'd do is set up a parallel node instance on the same server with some kind of echo service and test that one. If it runs fine, you have narrowed the problem down to your program code (and not a scheduler/OS-level problem). Then, step by step, include the modules and test again. Admittedly this is a lot of work and takes a long time, and I don't know if it is doable on your system.
If you need to get this working now, you can go the NASA redundancy route:
Bring up a second copy of your production servers, and put a proxy in front of them which routes each request to both stacks and returns the first response. I don't recommend this as a perfect long-term solution but it should help significantly reduce issues in production now, and help you gather log data that you could replay to recreate the issues on non-production servers.
Obviously, this is straight-forward for read requests, but more complex for commands which write to the db.
We had a similar problem with our Node.js server. It didn't scale well for weeks and we tried almost everything, as you did. Our problem was the implicit backlog value, which is set very low for highly concurrent environments.
http://nodejs.org/api/http.html#http_server_listen_port_hostname_backlog_callback
Setting the backlog to a significantly higher value (e.g. 10000), as well as tuning the networking settings in our kernel (/etc/sysctl.conf on Linux) as described in the manual section, helped a lot. Since then we haven't had any timeouts in our Node.js server.
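Passing the backlog explicitly looks roughly like this. It is a minimal sketch rather than our production setup: startServer, the handler and the default values are placeholders. Keep in mind that the kernel may cap the effective value (on Linux via net.core.somaxconn), which is why the sysctl tuning matters as well.

```javascript
const http = require('http');

// Start an HTTP server with an explicit accept backlog.
// The third argument to listen() is the backlog; the OS may silently
// reduce it to its own limit (net.core.somaxconn on Linux).
function startServer({ port = 0, host = '127.0.0.1', backlog = 10000 } = {}) {
  const server = http.createServer((req, res) => {
    res.end('ok\n'); // placeholder handler
  });
  return new Promise((resolve, reject) => {
    server.once('error', reject);
    server.listen(port, host, backlog, () => resolve(server));
  });
}
```

For example, startServer({ port: 8080, backlog: 10000 }) resolves with the listening server once it is ready to accept connections.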
When running any kind of server under load there are several resources that one would like to monitor to make sure that the server is healthy. This is specifically true when testing the system under load.
Some examples for this would be CPU utilization, memory usage, and perhaps disk space.
What other resource should I be monitoring, and what tools are available to do so?
As many as you can afford to, and can then graph/understand/look at the results. Monitoring resources is useful for not only capacity planning, but anomaly detection, and anomaly detection significantly helps your ability to detect security events.
You have a decent start with your basic graphs. I'd want to also monitor the number of threads, number of connections, network I/O, disk I/O, page faults (arguably this is related to memory usage), context switches.
I really like munin for graphing things related to hosts.
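If you want to pull a couple of these counters yourself on Linux before setting up a full graphing tool, /proc/stat already exposes the context-switch count and process counters. A rough sketch (parseProcStat and readProcStat are names invented for this example):

```javascript
const fs = require('fs');

// Extract a few system-wide counters from the text of /proc/stat (Linux only).
function parseProcStat(text) {
  const wanted = new Set(['ctxt', 'processes', 'procs_running', 'procs_blocked']);
  const counters = {};
  for (const line of text.split('\n')) {
    const [name, value] = line.trim().split(/\s+/);
    if (wanted.has(name)) {
      counters[name] = Number(value);
    }
  }
  return counters;
}

// Read the live counters. `ctxt` and `processes` are cumulative since boot,
// so a per-second rate needs the difference between two reads.
function readProcStat() {
  return parseProcStat(fs.readFileSync('/proc/stat', 'utf8'));
}
```

Sampling readProcStat() at a fixed interval and subtracting consecutive ctxt values gives context switches per interval, which is the kind of series munin would graph.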
I use Zabbix extensively in production, which comes with a stack of useful defaults. Some examples of the sorts of things we've configured it to monitor:
Network usage
CPU usage (% user,system,nice times)
Load averages (1m, 5m, 15m)
RAM usage (real, swap, shm)
Disc throughput
Active connections (by port number)
Number of processes (by process type)
Ping time from remote location
Time to SSL certificate expiry
MySQL internals (query cache usage, num temporary tables in RAM and on disc, etc)
Anything you can monitor with Zabbix, you can also attach triggers to - so it can restart failed services; or page you to alert about problems.
Collect the data now, before performance becomes an issue. When it does, you'll be glad of the historical baselines, and the fact you'll be able to show what date and time problems started happening for when you need to hunt down and punish exactly which developer made bad changes :)
I ended up using dstat which is vmstat's nicer looking cousin.
This will show most everything you need to know about a machine's health, including:
CPU
Disk
Memory
Network
Swap
Use "df -h" to make sure that no partition runs full, which can lead to all kinds of funky problems. Watching the syslog is of course also useful; for that I recommend installing "logwatch" (Logwatch Website) on your server, which sends you an email if weird things start showing up in your syslog.
Cacti is a good web-based monitoring/graphing solution. Very complete, very easy to use, with a large userbase including many large Enterprise-level installations.
If you want more 'alerting' and less 'graphing', check out nagios.
As for 'what to monitor', you want to monitor systems at both the system and application level, so yes: network/memory/disk i/o, interrupts and such over the system level. The application level gets more specific, so a webserver might measure hits/second, errors/second (non-200 responses), etc and a database might measure queries/second, average query fulfillment time, etc.
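As a concrete illustration of the application-level side, a web server could count hits and non-200 responses and report them as per-second rates. This is only a sketch: createRequestStats, record and report are hypothetical names, and real setups would usually hand these numbers to whatever graphing tool is already in place.

```javascript
// Count requests and non-200 responses, and turn the counts into
// per-second rates over the time elapsed since the previous report.
// `now` is injectable so the arithmetic can be tested deterministically.
function createRequestStats(now = Date.now) {
  let hits = 0;
  let errors = 0;
  let windowStart = now();

  return {
    // Call once per completed response.
    record(statusCode) {
      hits += 1;
      if (statusCode !== 200) {
        errors += 1;
      }
    },
    // Return the rates for the current window and start a new one.
    report() {
      const seconds = (now() - windowStart) / 1000;
      const rates = {
        hitsPerSecond: seconds > 0 ? hits / seconds : 0,
        errorsPerSecond: seconds > 0 ? errors / seconds : 0,
      };
      hits = 0;
      errors = 0;
      windowStart = now();
      return rates;
    },
  };
}
```

Calling report() from a timer (say every 10 seconds) yields one data point per interval for each metric.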
Beware the aforementioned slowquerylog in MySQL. It should only be used when trying to figure out why some queries are slow. It has the side-effect of making ALL your queries slow while it's enabled. :P It's intended for debugging, not logging.
Think 'passive monitoring' whenever possible. For instance, sniff the network traffic rather than monitor it from your server -- have another machine watch the packets fly back and forth and record statistics about them.
(By the way, that's one of my favorites -- if you watch connections being established and note when they end, you can find a lot of data about slow queries or slow anything else, without putting any load on the server you care about.)
In addition to top and auth.log, I often look at mtop, and enable mysql's slowquerylog and watch mysqldumpslow.
I also use Nagios to monitor CPU, Memory, and logged in users (on a VPS or dedicated server). That last lets me know when someone other than me has logged in.
Network, of course :) Use MRTG to get some nice bandwidth graphs. They're just pretty most of the time... until a spammer finds a hole in your security and the traffic suddenly increases.
Nagios is good for alerting, as mentioned, and is easy to set up. You can then use the mrtg plugin to get alerts for your network traffic too.
I also recommend ntop as it shows where your network traffic is going.
A good link to get you going with Munin and Monit: link text
I typically watch top and tail -f /var/log/auth.log.