I've been using pstack (called periodically in a loop) as a substitute for a real profiling tool. I've noticed that even though top shows more than 85% CPU usage for that pid, pstack shows the pid being blocked on I/O more often than being CPU bound.
How's pstack implemented? Is there any reason why pstack would be more susceptible to attaching to the pid when it's actually blocked on I/O?
You say you're calling pstack periodically in a loop - i.e. in a separate process (B) from the one you are profiling (A). If they are running on a single core, then B is more likely to "wake up" when A is blocked.
Regardless, I would trigger pstack manually, on the theory that not many samples are needed; rather, the samples I do get need to be scrutinized, not just lumped together.
In general, it's good to take samples during I/O time as well as CPU time, because both I/O and CPU wastage can make your program slow.
If it somewhat inflates one or the other, that's fairly harmless, assuming your real goal is to precisely identify things to optimize, rather than just get precise measurements of fuzzy things like functions.
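For reference, pstack-style tools are generally built on ptrace (some versions are simply scripts that drive gdb, which itself uses ptrace). Below is a minimal sketch of just the attach step - the stack-walking part is omitted, and the PID on the command line is whatever process you want to sample:

    /* Sketch of how a pstack-like sampler attaches to a process.
     * Real tools also read registers and unwind the stack, which is
     * omitted here. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    int main(int argc, char *argv[])
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <pid>\n", argv[0]);
            return 1;
        }
        pid_t pid = (pid_t)atoi(argv[1]);

        /* PTRACE_ATTACH stops the target; whatever it was doing at that
         * instant (running, or blocked in a syscall) is what the sample
         * will show. */
        if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == -1) {
            perror("ptrace attach");
            return 1;
        }
        waitpid(pid, NULL, 0);   /* wait until the target has stopped */

        /* ... a real tool would use PTRACE_GETREGS / PTRACE_PEEKDATA
         * here to capture the stack trace ... */

        ptrace(PTRACE_DETACH, pid, NULL, NULL);  /* let it run again */
        return 0;
    }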
The GDB manual states that when using all-stop mode for debugging a multithreaded application, it is not possible to advance every thread in lock-step by exactly one statement. This makes sense since a step in GDB essentially allows all threads to be scheduled by the OS (however the OS decides to do this) until the next statement is reached by the thread for which the step was called.
My question is this: Is it reasonable to assume that the average scheduling behavior of the OS in between GDB steps is comparable to the average scheduling behavior of the OS when not stepping (while still using GDB to keep as many variables constant as possible), or does the stepping muck with the scheduling enough that the advancement of threads is not (on average) the same as without stepping?
If the stepping does affect the behavior, how can I get an accurate representation of multithreaded program flow and program state at discrete points in my program? Will recording and playing back be viable?
Is it reasonable to assume that the average scheduling behavior of the OS in between GDB steps is comparable to the average scheduling behavior of the OS when not stepping
Not really. The "not stepping" average behavior will have threads either running out of their time quanta, or blocking on system calls. In the "stepping" case, the threads are unlikely to every run out of their time quanta (because the time distance between steps is likely to be very short). So the average behavior is likely to be very different.
how can I get an accurate representation of multithreaded program flow and program state at discrete points in my program?
In general, you shouldn't care about multithreaded program flow. It is impossible to debug multithreaded programs that way.
When doing multithreaded programming, you must care about preserving invariants (every resource that can be accessed by multiple threads is protected against data races, etc.). If you do, your program will just work (TM). If you don't, you are unlikely to find all ways that the program misbehaves anyway.
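As a minimal illustration of what "preserving invariants" means in practice (a generic sketch, not tied to the question's program): the shared counter below is only ever touched while holding one mutex, so the result is correct no matter how GDB or the OS interleaves the threads.

    /* Compile with: cc -pthread invariant.c  (file name is just illustrative) */
    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;
    static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        for (int i = 0; i < 100000; i++) {
            /* Invariant: counter is only modified under counter_lock. */
            pthread_mutex_lock(&counter_lock);
            counter++;
            pthread_mutex_unlock(&counter_lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);  /* always 200000 */
        return 0;
    }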
I have two theoretical questions about Linux system programming in C, concerning nanosleep and process destruction.
So, the first one:
Is it really possible to create 97% CPU load just by using nanosleep? For example, consider a for-loop that iterates 50 times with a delay of 1 second, in a child process, where the delay is obtained using nanosleep. What I observed, on a devboard running Debian Linux, is that after somewhere between 15 and 20 iterations, nanosleep blocks and the CPU load is 90% (I used top to see the value).
The second question is somewhat related to the first one. With the same code, a for-loop running 50 times in a child process, I observed that when nanosleep blocks (freezes) at 90% CPU load, the child process becomes a zombie process.
Is there a kernel mechanism that tries to kill a process that is using too much CPU?
Again, sorry that I can't post the code; it's not mine. But I found these two cases curious and couldn't find anything about them on the Internet, or maybe I didn't know how to search. I just want to know, theoretically, whether it's possible to have 90% CPU load just by using nanosleep, and secondly whether the kernel has a safety mechanism that tries to kill processes that use too much CPU.
I'm interested in opinions about these cases, and maybe recommendations for alternative functions.
PS: I don't want to see comments asking for source code, since this question is purely theoretical.
I'm not an expert, but I assume that whether your kernel kills a process because it's consuming too many resources depends solely on the distro you're using.
About the use of CPU: theoretically you could continuously get processes onto the CPU and just sleep them. In that case the OS will be dispatching the processes from the CPU to the wait queue and back (overhead), and depending on the type of queuing your distro uses to dispatch processes (round robin, queues with aging, etc.; I can't remember right now where you set this parameter), it could eventually starve other processes.
Anyway, this is not really a C question, just an OS one.
Your question makes no sense (sorry).
If nanosleep blocks, it won't use any CPU, because it will be blocking, i.e. waiting in the kernel for something else to happen. That's what blocking means.
For a process to be using 100% CPU, it must be busy waiting.
If it's busy waiting and calling nanosleep, we can conclude that each call to nanosleep is of very short duration.
An alternative explanation is that it's using a large amount of CPU doing something else and only very occasionally calling nanosleep, or that there is more than one thread running and a thread other than the one calling nanosleep is using lots of CPU.
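For comparison, here is a bare sketch of the loop as the question describes it, assuming a plain one-second nanosleep per iteration. A process like this sits at essentially 0% CPU in top, so any high load has to come from work done between the sleeps, from much shorter sleeps, or from another thread:

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        struct timespec one_second = { 1, 0 };   /* 1 s, 0 ns */

        for (int i = 0; i < 50; i++) {
            /* The process is blocked here for about a second and
             * consumes essentially no CPU while it waits. */
            nanosleep(&one_second, NULL);
            printf("iteration %d\n", i);
        }
        return 0;
    }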
I am running a Java program which does heavy computational work and needs lots of memory and CPU attention.
I took a snapshot of Task Manager while the program was running, and this is what it looks like.
Clearly this program is making use of all 8 cores available on my machine, but if you look at the CPU usage graph, you can see dips in the CPU usage, and these dips are consistent across all cores.
My question is: is there some way of avoiding these dips? Can I make sure that all my cores are used consistently, without any dips, and come to rest only after my program has finished?
This looks so familiar. Obviously, your threads are blocking for some reason. Here are my suggestions:
Check to see if you have any thread blocking (synchronization). Thread synchronization is easy to do wrong and can stop computation for extended periods of time.
Make sure you aren't waiting on I/O (file, network, devices, etc). Often the default for network or other I/O is to block.
Don't block on message passing or remote procedure calls.
Use a more sophisticated profiler to get a better look. I use Intel VTune, but then I have access to it. There are other low-level profiling tools that are just as capable but more difficult to use.
Check for other processes that might be using the system. I've had situations where that other process doesn't use the processor (blocks) but doesn't give the context up (doesn't swap out and allow another process to run).
When I say "don't block", I don't mean that you should poll. That's even worse as it consumes processing without doing anything useful. Restructure your algorithm to hide latency. Use a new algorithm that permits more latency hiding. Find alternate ways of thread synchronization that minimizes or eliminates blocking.
My two cents.
At first glance, my question might look a bit trivial. Please bear with me and read it completely.
I have identified a busy loop in my Linux kernel module. Because of it, other processes (e.g. sshd) are not getting CPU time for long spans of time (like 20 seconds). This is understandable, as my machine has only a single CPU and the busy loop gives other processes no chance to be scheduled.
Just to experiment, I added schedule() after each iteration of the busy loop. Even though this would still keep the CPU busy, it should let other processes run, since I am calling schedule(). But that doesn't seem to be happening: my user-level processes are still hanging for long spans of time (20 seconds).
In this case, the kernel thread has a nice value of -5 and the user-level threads have a nice value of 0. Even with the lower priority of the user-level threads, I think 20 seconds is far too long to go without CPU.
Can someone please explain why this could be happening?
Note: I know how to remove busy loop completely. But, I want to understand the behaviour of kernel here. Kernel version is 2.6.18 and kernel pre-emption is disabled.
The schedule() function simply invokes the scheduler - it doesn't take any special measures to arrange that the calling thread will be replaced by a different one. If the current thread is still the highest priority one on the run queue then it will be selected by the scheduler once again.
It sounds as if your kernel thread is doing very little work in its busy loop and it's calling schedule() every time round. Therefore, it's probably not using much CPU time itself and hence doesn't have its priority reduced much. Negative nice values carry heavier weight than positives, so the difference between a -5 and a 0 is quite pronounced. The combination of these two effects means I'm not too surprised that user space processes miss out.
As an experiment you could try calling the scheduler every Nth iteration of the loop (you'll have to experiment to find a good value of N for your platform) and see if the situation improves - calling schedule() too often will just waste lots of CPU time in the scheduler. Of course, this is just an experiment - as you have already pointed out, avoiding busy loops is the correct option in production code, and if you want to be sure your thread is replaced by another then set it to TASK_INTERRUPTIBLE before calling schedule() so that it removes itself from the run queue (as has already been mentioned in comments).
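A sketch of that experiment for a 2.6.18-era kernel thread, where do_work() is a hypothetical stand-in for one iteration of the real loop:

    #include <linux/kthread.h>
    #include <linux/sched.h>

    static int busy_thread(void *data)
    {
        unsigned long i = 0;

        while (!kthread_should_stop()) {
            do_work();                      /* hypothetical work item */

            if (++i % 1000 == 0) {          /* tune N for your platform */
                /* A bare schedule() may just pick this thread again if it
                 * is still the best runnable candidate, so leave the run
                 * queue properly for at least one jiffy. */
                set_current_state(TASK_INTERRUPTIBLE);
                schedule_timeout(1);
            }
        }
        return 0;
    }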
Note that your kernel (2.6.18) is using the O(1) scheduler which existed until the Completely Fair Scheduler was added in 2.6.23 (the O(1) scheduler having been added in 2.6 to replace the even older O(n) scheduler). The CFS doesn't use run queues and works in a different way, so you might well see different behaviour - I'm less familiar with it, however, so I wouldn't like to predict exactly what differences you'd see. I've seen enough of it to know that "completely fair" isn't the term I'd use on heavily loaded SMP systems with a large number of both cores and processes, but I also accept that writing a scheduler is a very tricky task and it's far from the worst I've seen, and I've never had a significant problem with it on a 4-8 core desktop machine.
I have an analysis that can be parallelized over a varying number of processes. It is expected to be both IO and CPU intensive (very high-throughput short-read DNA alignment, if anyone is curious).
The system running this is a 48-core Linux server.
The question is how to determine the optimum number of processes such that total throughput is maximized. At some point the processes will presumably become IO bound such that adding more processes will be of no benefit and possibly detrimental.
Can I tell from standard system monitoring tools when that point has been reached?
Would the output of top (or maybe a different tool) enable me to distinguish between an IO-bound and a CPU-bound process? I suspect that a process blocked on IO might still show 100% CPU utilization.
When a process is blocked on IO, it isn't running, so no time is accounted against it. If there's another process that can run, then that will run instead; if there isn't, the time is counted as 'IO wait', which is accounted as a global statistic.
IO wait would be a useful thing to monitor. It shows up in top's header as the wa (iowait) percentage. You can monitor it in more detail with tools like iostat and vmstat. Serverfault might be a better place to ask about that.
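If you want to read the same number programmatically, it is the iowait field on the first line of /proc/stat (the value top rolls up into its CPU summary line). A rough sketch, assuming the usual /proc/stat layout; a real monitor would sample twice and look at the difference rather than the totals since boot:

    #include <stdio.h>

    int main(void)
    {
        unsigned long long user, nice, sys, idle, iowait;
        FILE *f = fopen("/proc/stat", "r");
        if (!f) {
            perror("/proc/stat");
            return 1;
        }
        /* First line: "cpu  user nice system idle iowait irq softirq ..." */
        if (fscanf(f, "cpu %llu %llu %llu %llu %llu",
                   &user, &nice, &sys, &idle, &iowait) != 5) {
            fprintf(stderr, "unexpected /proc/stat format\n");
            fclose(f);
            return 1;
        }
        fclose(f);

        /* Approximate: ignores the irq/softirq/steal fields. */
        unsigned long long total = user + nice + sys + idle + iowait;
        printf("iowait: %.1f%% of CPU time since boot\n",
               100.0 * iowait / total);
        return 0;
    }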
Even a single IO-bound process will rarely show high CPU utilization because the operating system has scheduled its IO and is usually just waiting for it to complete. So top cannot accurately distinguish between an IO-bound process and a non-IO-bound process that merely periodically uses the CPU. In fact, a system horribly overloaded with all IO-bound processes, barely able to accomplish anything can exhibit very low CPU utilization.
Using only top, as a first pass, you can indeed merely keep adding threads/processes until CPU utilization levels off to determine the approximate configuration for a given machine.
You can use tools like iostat and vmstat to show how much time processes are spending blocked on I/O. There's generally no harm in adding more processes than you need, but the benefit decreases. You should measure throughput against the number of processes as your overall measure of efficiency.
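A rough sketch of that measurement: fork N workers, time the whole batch, and repeat for different values of N, keeping the N with the best throughput. Here do_work() is a hypothetical stand-in for one unit of the real CPU-plus-IO workload.

    #include <stdio.h>
    #include <sys/time.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void do_work(void)
    {
        /* Hypothetical stand-in for one unit of the real workload. */
        volatile unsigned long x = 0;
        for (unsigned long i = 0; i < 50000000UL; i++)
            x += i;
    }

    static double run_batch(int nprocs)
    {
        struct timeval start, end;
        gettimeofday(&start, NULL);

        for (int i = 0; i < nprocs; i++) {
            if (fork() == 0) {       /* child: one unit of work, then exit */
                do_work();
                _exit(0);
            }
        }
        while (wait(NULL) > 0)       /* reap all children */
            ;

        gettimeofday(&end, NULL);
        return (end.tv_sec - start.tv_sec) +
               (end.tv_usec - start.tv_usec) / 1e6;
    }

    int main(void)
    {
        for (int n = 1; n <= 64; n *= 2) {
            double secs = run_batch(n);
            printf("%2d processes: %.2f s elapsed, %.2f units/s\n",
                   n, secs, n / secs);
        }
        return 0;
    }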