I have a Windows Service that runs 20 threads. Each thread manages a remote device. The threads basically just access a database (SQL Server) and send and receive messages over TCP/IP. No fancy algorithms or heavy processing. Yet this process is using 80% of an 8-core CPU on a machine with plenty of memory.
I haven't run a profiler on this service, but I've seen the code a million times... It's just a while loop with a 500 ms sleep at the end, so the body of the loop executes at most twice per second.
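For reference, a minimal sketch of the kind of loop described (all names here are placeholders, not the actual service code): each worker does its I/O-bound work and then sleeps for 500 ms, so it should be idle most of the time.

    #include <windows.h>
    #include <stdbool.h>

    static volatile bool running = true;

    /* Placeholder for the real work: SQL Server access and TCP messaging. */
    static void do_device_work(void *device)
    {
        (void)device;
    }

    static DWORD WINAPI device_worker(LPVOID arg)
    {
        while (running) {
            do_device_work(arg);   /* I/O-bound work */
            Sleep(500);            /* thread is blocked here, costing no CPU */
        }
        return 0;
    }

    int main(void)
    {
        HANDLE threads[20];
        for (int i = 0; i < 20; i++)
            threads[i] = CreateThread(NULL, 0, device_worker, NULL, 0, NULL);
        WaitForMultipleObjects(20, threads, TRUE, INFINITE);
        return 0;
    }

A loop shaped like this should spend almost all of its time blocked in Sleep or in I/O waits, which is why the 80% figure is worth profiling rather than guessing at.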
I was wondering: if the process is I/O bound, why is the CPU utilization so heavy? What would happen if I lowered the process priority?
Related
If a single thread costs 25% of a multi-core CPU's workload, will 4 threads fill the CPU?
Besides those 4 threads, can I do anything else with the CPU?
If a single thread uses 25% of the CPU, then what is using the other 75%?
If the answer is that nothing is using the other 75% (the CPU is idle 75% of the time), then the single thread is not CPU-bound. That is to say, something other than the availability of CPU cycles is limiting the speed of the program.
What is blocking the thread? Probably the thread is awaiting some resource: messages from the network, disk I/O, user input, etc. Whatever it is, if there are at least four such resources, and if your program can be structured to use four of them concurrently, then running four threads may allow the program to run four times as fast.
An example would be, if your program is some kind of server and it spends a lot of time awaiting messages from clients, then adding more threads would allow it to await more messages simultaneously, which would improve throughput.
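As a rough sketch of that pattern (the names and port below are assumptions, not taken from any particular program): a thread-per-connection echo server in which each thread spends most of its time blocked in recv(), so adding threads mainly adds concurrent waiting rather than CPU work.

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    /* Each client connection gets its own thread; the thread consumes no CPU
     * while it is blocked in recv() waiting for the next message. */
    static void *handle_client(void *arg)
    {
        int fd = (int)(intptr_t)arg;
        char buf[1024];
        ssize_t n;
        while ((n = recv(fd, buf, sizeof buf, 0)) > 0)
            send(fd, buf, (size_t)n, 0);   /* echo back as a stand-in for real work */
        close(fd);
        return NULL;
    }

    int main(void)
    {
        int srv = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = {0};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(9000);           /* arbitrary example port */
        bind(srv, (struct sockaddr *)&addr, sizeof addr);
        listen(srv, 16);

        for (;;) {
            int fd = accept(srv, NULL, NULL);  /* blocks until a client connects */
            if (fd < 0)
                continue;
            pthread_t t;
            pthread_create(&t, NULL, handle_client, (void *)(intptr_t)fd);
            pthread_detach(t);
        }
    }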
A counterexample would be if your program is processing data from a single disk file, and it spends most of its time awaiting data from the disk. Since there's only one disk, and the disk can only go so fast, adding more threads would not help in that case.
I am observing strange effects with the CPU percentage as shown in e.g. top or htop on Linux (Ubuntu 16.04) for one particular application. The application uses many threads (around 1000). Each thread has one computational task. About half of these tasks need to be computed once per "trigger"; the trigger is an external event received exactly every 100 ms. The other threads are mostly sleeping (waiting for user interaction) and hence do not play a big role here. So to summarise: many threads wake up basically simultaneously within a short period of time, do their (relatively short) computation and go back to sleep again.
Since the machine running this application has 8 virtual CPUs (4 cores with 2 threads each; it's an i7-3612QE), only 8 threads can really wake up at a time, so many threads will have to wait. Also, some of these tasks have interdependencies, so they have to wait anyway, but I think as an approximation one can think of this application as a bunch of threads going to the runnable state at the same time every 100 ms, each doing only a short computation (well below 1 ms of CPU time each).
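To make that wake-up pattern concrete, here is a minimal sketch assuming the trigger is delivered by a condition-variable broadcast (the real application may use a different mechanism entirely):

    #include <pthread.h>
    #include <unistd.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  trigger = PTHREAD_COND_INITIALIZER;
    static unsigned long   trigger_count = 0;

    /* Worker: sleep until the next trigger, do a short computation, repeat. */
    static void *worker(void *arg)
    {
        unsigned long seen = 0;
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            while (trigger_count == seen)
                pthread_cond_wait(&trigger, &lock);   /* blocked, no CPU used */
            seen = trigger_count;
            pthread_mutex_unlock(&lock);
            /* ... well under 1 ms of computation would go here ... */
        }
        return NULL;
    }

    int main(void)
    {
        for (int i = 0; i < 500; i++) {              /* roughly half of ~1000 threads */
            pthread_t t;
            pthread_create(&t, NULL, worker, NULL);
            pthread_detach(t);
        }
        for (;;) {                                   /* stand-in for the 100 ms external trigger */
            usleep(100 * 1000);
            pthread_mutex_lock(&lock);
            trigger_count++;
            pthread_cond_broadcast(&trigger);        /* all workers become runnable at once */
            pthread_mutex_unlock(&lock);
        }
    }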
Now coming to the strange effect: if I look at the CPU percentage in top, it shows something like 250%. As far as I know, top looks at the CPU time (user + system) that the kernel accounts to this process, so 250% would mean the process uses 3 virtual CPUs on average. So far so good. Now, if I use taskset to force the entire process onto a single virtual CPU, the CPU percentage drops to 80%. The application's internal accounting tells me that all data is still being processed. So the application is doing the same amount of work, but it seemingly uses fewer CPU resources. How can that be? Can I really trust the kernel's CPU time accounting, or is this an artefact of the measurement?
The CPU percentage also goes down if I start other processes that use a lot of CPU, even if they do nothing ("while(true);") and run at low priority (nice). If I launch 8 of these CPU-eating processes, the application again reaches 80%. With fewer CPU-eaters, I get a gradually higher CPU percentage.
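The CPU-eater itself is trivial; assuming it is the literal busy loop quoted above, started at low priority with nice:

    /* cpueater.c: burns one CPU with an empty loop.
     * Launch 8 copies at low priority, e.g.:
     *   for i in $(seq 8); do nice -n 19 ./cpueater & done */
    int main(void)
    {
        for (;;)
            ;
        return 0;
    }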
Not sure if this plays a role: I have used the VTune profiler, which tells me my application is actually quite inefficient (only about 1 IPC), mostly because it is memory bound. This does not change if I restrict the process to a single virtual CPU, so I assume the effect is not caused by a huge increase in efficiency when everything runs on the same core (which would be strange anyway).
I essentially answered my own question in the last paragraph: the process is memory bound. Hence the limited resource is not the CPU but memory bandwidth. Allowing such a process to run on multiple CPU cores in parallel mainly means that more CPU cores sit waiting for data to arrive from RAM. This is counted as CPU load, since the CPU is executing the thread, just quite slowly. All my other observations are consistent with this.
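A rough way to reproduce the effect in isolation (a sketch, not the original application): a kernel that strides through a buffer much larger than the last-level cache is limited by memory bandwidth, so running it in N threads tends to make top report close to N full CPUs' worth of time even though the combined throughput grows far less than N-fold; the extra "CPU time" is largely cores stalled on RAM.

    #include <pthread.h>
    #include <stdlib.h>
    #include <string.h>

    /* 256 MiB of longs: far larger than any last-level cache. */
    #define BUF_WORDS (256UL * 1024 * 1024 / sizeof(long))

    static long *buf;

    /* Touch one word per cache line so almost every access pulls a new line
     * from RAM; the thread is "busy" but mostly stalled on memory. */
    static void *memory_bound_worker(void *arg)
    {
        volatile long sum = 0;
        (void)arg;
        for (int pass = 0; pass < 20; pass++)
            for (size_t i = 0; i < BUF_WORDS; i += 64 / sizeof(long))
                sum += buf[i];
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int nthreads = (argc > 1) ? atoi(argv[1]) : 1;
        if (nthreads < 1 || nthreads > 64)
            nthreads = 1;
        buf = malloc(BUF_WORDS * sizeof(long));
        memset(buf, 1, BUF_WORDS * sizeof(long));   /* force real page allocation */

        pthread_t t[64];
        for (int i = 0; i < nthreads; i++)
            pthread_create(&t[i], NULL, memory_bound_worker, NULL);
        for (int i = 0; i < nthreads; i++)
            pthread_join(t[i], NULL);
        free(buf);
        return 0;
    }

Comparing top while this runs with 1 thread versus 8 should show the same kind of accounting behaviour described above, under the stated assumption that memory bandwidth, not the cores, is the bottleneck.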
I have a system that does a lot of blocking network calls. As a result, the CPU idle time is relatively high, and so is the overall system load. When I add more CPUs to the system, the idle time remains the same but the system load drops. Why is that happening?
Is the following what is happening behind the scenes?
When a process that wants to do blocking network I/O is dispatched, the CPU prepares and propagates the call to the I/O bus, sets up the interrupt handler and then switches to another process, as there is nothing else to be done. When the I/O call returns, the I/O subsystem raises an interrupt and the CPU gets back to the original process in order to complete its execution.
Therefore, the job of the CPU here consists of interfacing with the I/O subsystem and of context switching. When many processes need to go through this, the load increases as a result of processes waiting for their I/O and for their context switch. When I add more CPUs to the system, there are fewer context switches per CPU and therefore less waiting and lower load.
Is this a correct explanation? If yes, in which state are the processes that wait for their network I/O to finish? They should be runnable in order to affect the system load.
All is well. The load average is the number of processes waiting for CPU time; the more CPUs there are, the more the processes are spread across them, so the shorter the queue. This works just like the way the number of cashiers affects the line of customers waiting to check out.
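One way to watch this directly on Linux (a small sketch): the fourth field of /proc/loadavg reports the number of currently runnable scheduling entities over the total number, which is roughly the instantaneous quantity that the load averages smooth over time.

    #include <stdio.h>

    /* Print the load averages and the runnable/total task counts from
     * /proc/loadavg (Linux-specific). */
    int main(void)
    {
        double l1, l5, l15;
        long runnable, total;
        FILE *f = fopen("/proc/loadavg", "r");
        if (!f) {
            perror("fopen /proc/loadavg");
            return 1;
        }
        if (fscanf(f, "%lf %lf %lf %ld/%ld", &l1, &l5, &l15, &runnable, &total) == 5)
            printf("load: %.2f %.2f %.2f  runnable: %ld of %ld tasks\n",
                   l1, l5, l15, runnable, total);
        fclose(f);
        return 0;
    }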
I've got a program that has about 80 threads. It's running on a roughly 50-core machine on Linux 3.36. At most 2 of these programs run at once, and they are identical. Nothing else is running on the machine.
The threads themselves are real-time linux pthreads with SCHED_RR (round robin) policy.
10 are highest priority (yes, I set ulimit to 99) and have CPU affinity set to 10 of the cores. In other words, they are each pinned to their own core.
About 60 are medium priority.
About 10 are low priority.
The 10 highest-priority threads are constantly using CPU.
The rest are doing network I/O as well as some work on the CPU. Here's the problem: I'm seeing one of the low-priority threads being starved, sometimes for over 15 seconds at a time. This specific thread is waiting on a TCP socket for some data. I know the data has been fully sent because I can see that the server on the other end of the connection has sent it (i.e., it logs a timestamp after sending the data). Usually the thread takes milliseconds to receive and process it, but sporadically it takes 15 seconds after the other server has successfully sent the data. Note that increasing the priority of the thread and pinning it to a CPU has eradicated this issue, but this is not a long-term solution. I would not expect this behavior in the first place - 15 seconds is a very long time.
Does anyone know why this would be happening? We have ruled out that it is any of the logic in the program/threads. Also note that the program is written in C.
I would not expect this behavior in the first place - 15 seconds is a very long time.
If your 60 medium-priority threads were all runnable, then that's exactly what you'd expect: with realtime scheduling, lower-priority threads won't run at all while there are higher-priority threads still runnable.
You might be able to use perf timechart to analyse exactly what's going on.
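For reference, the workaround mentioned in the question (raising the thread's realtime priority and pinning it to a core) would look roughly like this; the priority value and CPU number below are examples only:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>

    /* Give the calling thread a SCHED_RR realtime priority and pin it to one
     * CPU. Needs sufficient privileges (e.g. a suitable RLIMIT_RTPRIO, as
     * mentioned in the question). Returns 0 on success or an errno value. */
    static int boost_and_pin(int rt_priority, int cpu)
    {
        struct sched_param sp = { .sched_priority = rt_priority };
        int err = pthread_setschedparam(pthread_self(), SCHED_RR, &sp);
        if (err != 0)
            return err;

        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    }

    int main(void)
    {
        int err = boost_and_pin(50, 3);   /* example priority and CPU number */
        if (err != 0) {
            fprintf(stderr, "boost_and_pin: %s\n", strerror(err));
            return 1;
        }
        puts("thread is now SCHED_RR and pinned to one core");
        return 0;
    }

As noted above, this only sidesteps the starvation: under SCHED_RR a lower-priority realtime thread simply does not get scheduled while higher-priority realtime threads remain runnable on the CPUs it is allowed to use.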
What happens if a process keeps creating threads especially when the number of threads exceeds the limit of the OS? What will Windows and Linux do?
If the threads aren't doing any work (i.e. you don't start them), then on Windows you're subject to resource limitations as pointed out in the blog post that Hans linked. A Linux system, too, will have some limit on the number of threads it can create; after all, your computer doesn't have infinite virtual memory, so at some point the call to create a thread is going to fail.
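A small sketch of the first case, creating idle threads until the system refuses; on Linux, pthread_create typically fails with EAGAIN once it hits a limit such as available address space, threads-max, or RLIMIT_NPROC:

    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Each thread just sleeps; it does no work, so the only cost is the
     * per-thread resources (stack, kernel structures). */
    static void *idle_thread(void *arg)
    {
        (void)arg;
        pause();   /* block until a signal arrives */
        return NULL;
    }

    int main(void)
    {
        unsigned long count = 0;
        for (;;) {
            pthread_t t;
            int err = pthread_create(&t, NULL, idle_thread, NULL);
            if (err != 0) {
                fprintf(stderr, "pthread_create failed after %lu threads: %s\n",
                        count, strerror(err));
                break;
            }
            count++;
        }
        return 0;
    }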
If the threads are actually doing work, what usually happens is that the system starts thrashing. Each thread (including the program's main thread) gets a small timeslice (typically measured in tens of milliseconds) and is then swapped out for the next available thread. With so many threads, their working sets are large enough to occupy all available RAM, so every thread context switch requires paging the currently running thread's working set out to disk and paging the next thread's working set back in. The system then spends more time on thread context switches than it does actually running the threads.
The threads will continue to execute, but very very slowly, and eventually you will run out of virtual memory. However, it's likely that it would take an exceedingly long time to create that many threads. You would probably give up and shut the machine off.
Most often, a machine that's suffering from this type of thrashing acts exactly like a machine that's stuck in an infinite loop on all cores. Even pressing Control+Break (or similar) won't take effect immediately because the thread that's handling that signal has to be in memory and running in order to process it. And after the thread does respond to such a signal, it takes an exceedingly long time for it to terminate all of the threads and clean up virtual memory.