Currently I have an issue in which w3wp.exe locks the CPU at 100%. I need to solve this issue, but in the mean time, i would like to trigger an IIS reset when the CPU remains above 95% for more than 2 minutes.
I have been playing with performance monitor, but i need something which will allow for a time condition. (It may allow you to do this, but so far i haven't worked it out)
Any ideas to get this to work would be appreciated.
Depending on which OS system you are running on this will be more or less complicated.
You should be collecting thread dumps and OS statistics on the process broken down by threads inside that process.
Try a thread dump every second for about 10 to 20 iterations when CPU is maxed, you should be able to identify which thread is making your host cpu bound.
Of course you have to work on the specifics of getting this information depending on your OS and IIS.
Related
I am observing strange effects with the CPU percentage as shown in e.g. top or htop on Linux (Ubuntu 16.04) for one special application. The application uses many threads (around 1000). Each thread has one computational task. About half of these tasks need to be computed once per "trigger" - the trigger is an external event received exactly every 100ms. The other threads are mostly sleeping (waiting for user interaction) and hence do not play a big role here. So to summarise: many threads are waking up basically simultaneously within a short period of time, doing there (relatively short) computation and going back to sleep again.
Since the machine running this application has 8 virtual CPUs (4 cores each 2 threads, it's an i7-3612QE), only 8 threads can really wake up at a time, so many threads will have to wait. Also some of these tasks have interdependencies, so they anyway have to wait, but I think as an approximation one can think of this application as a bunch of threads going to the runnable state at the same time every 100ms and each doing only a short computation (way below 1ms of CPU time each).
Now coming to the strange effect: If I look at the CPU percentage in "top", it shows something like 250%. As far as I know, top looks on the CPU time (user + system) the kernel accounts for this process, so 250% would mean the process uses 3 virtual CPUs on average. So far so good. Now, if I use taskset to force the entire process to use only a single virtual CPU, the CPU percentage drops to 80%. The application has internal accounting which tells me that still all data is being processed. So the application is doing the same amount of work, but it seemingly uses less CPU resources. How can that be? Can I really trust the kernel CPU time accounting, or is this an artefact of the measurement?
The CPU percentage also goes down, if I start other processes which take a lot of CPU, even if the do nothing ("while(true);") and are running at low priority (nice). If I launch 8 of these CPU-eating processes, the application reaches again 80%. With fewer CPU-eaters, I get gradually higher CPU%.
Not sure if this plays a role: I have used the profiler vtune, which tells me my application is actually quite inefficient (only about 1 IPC), mostly because it's memory bound. This does not change if I restrict the process to a single virtual CPU, so I assume the effect is not caused by a huge increase in efficiency when running everything on the same core (which would be strange anyway).
My question was essentially already answered by myself in the last paragraph: The process is memory bound. Hence not the CPU is the limited resource but the memory bandwidth. Allowing such process to run on multiple CPU cores in parallel will mainly have the effect that more CPU cores are waiting for data to arrive from RAM. This is counted as CPU load, since the CPU is executing the thread, but just quite slowly. All my other observations go along with this.
I have a C++ program running with 20 threads (boost threads) on one of the RHEL6.5 systems virtualized in dell server. The result is deterministic, but the cpu time and wall time varies a lot in different runs. Sometimes, it takes 200s cpu time to finish, sometimes it may take up to 300s cpu time to finish. This bothers me as performance is a criterion for our testing.
I've changed the originally used boost::timer::cpu_timer for wall/cpu time calc and use sys apis 'clock_gettime' and 'getrusage'. It doesn't help.
Is it because of the 'steal time' by hypervisor (Vmware)? Is steal time included in the user/sys time collected by 'getrusage'?
Anyone have knowledge on this? Many Thanks.
It would be useful if you provided some extra information. For example are your threads dependent? meaning is there any synchronization going among them?
Since you are using a virtual machine, how is your CPU shared with other users of the server. It might be that even the same single CPU core is shared, thus not each time you have the same allocation of CPU resources [this is the steal time you mention above].
Also you mention that CPU time is different: this is the time spent in user code. If you have sync among threads (such as a mutex, etc) then depending on how operating system wakes up threads etc, the over all time might vary.
I have a Linux embedded system built with PREEMPT_RT (real time patch) that creates multiple SCHED_FIFO threads, with a priority of 90 each. The goal is that they execute without being preempted, but with the same priority.
Each thread does a little bit of work, then goes to sleep using std::this_thread::sleep_for() for a few milliseconds, then gets scheduled back and executes the same amount of work.
Most of the time, each thread latency is impeccable, but once every minute or so (not an exact regular interval) all threads get hogged at the same time for one second or more (instead of the low milliseconds they usually get called at).
I have made sure Power management is disabled in the kernel kconfig, I have called mlockall() to avoid memory getting paged out, to no avail.
I have tried to use ftrace with wakeup_rt as the tracer, but the highest latency recorded was around 5ms, not nearly enough time to be the cause of the issue.
I am not sure what tool would be best to identify where the latency is coming from. Does anyone have ideas please?
Reading CLR via C# 2.0 (I dont have 3.0 with me at the moment)
Is this still the case:
If there is only one CPU in a computer, only one thread can run at any one time. Windows has to keep track of the thread objects, and every so often, Windows has to decide which thread to schedule next to go to the CPU. This is additional code that has to execute once every 20 milliseconds or so. When Windows makes a CPU stop executing one thread's code and start executing another thread's code, we call this a context switch. A context switch is fairly expensive because the operating system has to:
So circa CLR via C# 2.0 lets say we are on Pentium 4 2.4ghz 1 core non-HT, XP. Every 20 milliseconds? Where a CLR thread or Java thread is mapped to an OS thread only a maximum of 50 threads per second may get a chance to to run?
I've read that context switching is very fast in mircoseconds here on SO, but how often roughly (magnitude style guesses) will say a modest 5 year old server Windows 2003 Pentium Xeon single core give the OS the opportunity to context switch? 20ms in the right area?
I dont need exact figures I just want to be sure that's in the right area, seems rather long to me.
The Quantum as it's called is dependant on a few things, including performance tweaks the operating system makes as it goes along; for instance the foreground process is given a higher priority and can be given [a quantum 3 times longer than default. There is a also a difference between Server and Client SKU typically a client would have a default quantum of 30ms where a server would be 180ms.
So a foreground process that wants as much CPU as it can get may get a quantum of 90ms before a context switch.. and then the OS may decide it doesn't need to switch and let the Quantum continue.
Your "50 threads at a time" math is wrong. You assume that each of those threads is in a 100% CPU state. Most threads are in fact asleep waiting for IO or other events. Even then, most threads don't use their entire 20 ms before going into IO mode or otherwise giving up its slice.
Try this. Write an app with an inifinite loop (eats its entire CPU window). Run 50 instances of it. See how Windows reacts.
I just did a test I got 43 threads seeing its share in a second (after warming up) which makes Richter statement pretty accurate (with overhead) I say. Quadcore/Win7/64bit. Yes these were 100% cpu threads so obviously they weren't given themselves back before their 20ms. Interesting
Given a machine with 1 CPU and a lot of RAM. Besides other kinds of applications (web server etc.), there are 2 other server applications running on that machine doing the exact same kind of processing although one uses 10 threads and the other users 1 thread. Assume the processing logic for each request is 100% CPU-bound and typically takes no longer than 2 seconds to finish. The question is whose throughput, in terms of transactions processed per minute, might be better? Why?
Note that the above is not a real environment, I just make up the data to make the question clear. My current thinking is that there should be no difference because the apps are 100% CPU-bound and therefore if the machine can handle 30 requests per minute for the 2nd app, it will also be able to handle 3 requests per minute for each of the 10 threads of the 1st app. But I'm glad to be proven wrong, given the fact that there are other applications running in the machine and one application might not be always given 100% CPU time.
There's always some overhead involved in task switching, so if the threads aren't blocking on anything, fewer threads is generally better. Also, if the threads aren't executing the same part of code, you'll get some cache flushing each time you swtich.
On the other hand, the difference might not be measurable.
Interesting question.
I wrote a sample program that does just this. It has a class that will go do some processor intensive work, then return. I specify the total number of threads I want to run, and the total number of times I want the work to run. The program will then equally divide the work between all the threads (if there's only one thread, it just gets it all) and start them all up.
I ran this on a single proc VM since I could find a real computer with only 1 processor in it anymore.
Run independently:
1 Thread 5000 Work Units - 50.4365sec
10 Threads 5000 Work Units - 49.7762sec
This seems to show that on a one proc PC, with lots of threads that are doing processor intensive work, windows is smart enough not to rapidly switch them back and fourth, and they take about the same amount of time.
Run together (or as close as I could get to pushing enter at the same time):
1 Thread 5000 Work Units - 99.5112sec
10 Threads 5000 Work Units - 56.8777sec
This is the meat of the question. When you run 10 threads + 1 thread, they all seem to be scheduled equally. The 10 threads each took 1/10th longer (because there was an 11th thread running) while the other thread took almost twice its time (really, it got 1/10th of its work done in the first 56sec, then did the other 9/10ths in the next 43sec...which is about right).
The result: Window's scheduler is fair on a thread level, but not on a process level. If you make a lot of threads, it you can leave the other processes that weren't smart enought to make lots of threads high and dry. Or just do it right and us a thread pool :-)
If you're interested in trying it for yourself, you can find my code:
http://teeks99.com/ThreadWorkTest.zip
The scheduling overhead could make the app with 10 threads slower than the one with 1 thread. You won't know for sure unless you create a test.
For some background on multithreading see http://en.wikipedia.org/wiki/Thread_(computer_science)
This might very well depend on the operating system scheduler. For example, back in single-thread days the scheduler knew only about processes, and had measures like "niceness" to figure out how much to allocate.
In multithreaded code, there is probably a way in which one process that has 100 threads doesn't get 99% of the CPU time if there's another process that has a single thread. On the other hand, if you have only two processes and one of them is multithreaded I would suspect that the OS may give it more overall time. However, AFAIK nothing is really guaranteed.
Switching costs between threads in the same process may be cheaper than switching between processes (e.g., due to cache behavior).
One thing you must consider is wait time on the other end of the transaction. Having multiple threads will allow you to be waiting for a response on one while preparing the next transaction on the next. At least that's how I understand it. So I think a few threads will turn out better than one.
On the other hand you must consider the overhead involved with dealing on multiple threads. The details of the application are important part of the consideration here.