Did cpu workload increased linearly in multithreading? - multithreading

If a single thread cost 25% workload of a multi-core CPU, are 4 threads will fill the CPU?
Besides the 4 threads, can I do something else with the CPU?

If a single thread uses 25% of the CPU, then what is using the other 75%?
If the answer is, nothing is using the other 75%—if the CPU is idle 75% of the time—then the single thread is not cpu-bound. That is to say, something other than the availability of CPU cycles is limiting the speed of the program.
What is blocking the thread? Probably the thread is awaiting some resource; messages from the network, disk I/O, user input, etc. Whatever it is, if there are at least four of them, and if your program can be structured to concurrently use four of them, then maybe running four threads will allow the program to run four times as fast.
An example would be, if your program is some kind of a server, and it spends a lot of time awaiting messages from clients, then adding more threads would allow it to simultaneously await more messages, and it would improve the throughput.
A counterexample would be if your program is processing data from a single disk file, and it spends most of its time awaiting data from the disk. Since there's only one disk, and the disk can only go so fast, adding more threads would not help in that case.

Related

Concurrent threads, processes and multiple cores

I'm trying to understand the usage of CPU cores with regard to concurrent threads and processes. Please see the below questions:
Assume I have 2 CPU cores. When there are 2 processes running, each process has only 1 thread. Are the two processes using the 2 cores?
Assume I have 2 CPU cores. When there is 1 process running, which has 2 threads. Are the two threads using the 2 cores?
Assume I have 2 CPU cores. When there are 2 processes running, each process has 2 threads. How are the two cores used by those processes and cores?
How to calculate the maximum real concurrent execution given CPU cores? What are other factor I should take into account?
1,2: Quite likely but not definitely. A portion of the system software determines what runs where. It would be unlikely to choose to keep a process or thread waiting for cpu attention when there is one that is otherwise idle, it isn't absolute.
Most processing involves some sort of transfer to and from a device, network, etc.. Typically this necessitates a period of inactivity waiting for the transfer to complete. During this inactivity, another process / thread can run on that cpu. So, if a given process is 30% cpu time and 70% I/O time, then I can run about 3 of them concurrently on a single cpu without degrading performance.
3,4: Like the paragraph above implies, depending upon the workload, their could be any distribution of the threads among the cpus. If the threads were all compute bound (100% cpu), most operating systems switch between them at a small enough granularity that all remain lively, and large enough that the switching has a minimal impact on them.
This scheduling may take other notions into consideration, such as data affinity. Recently touched bits of data are likely to remain in the cpu cache when a thread has relinquished it. The next time the thread is to be scheduled, it would be best to put it onto that cpu, to retain the effort required to warm the cache for it. It might also think that two threads of one process (address space) are more likely to share data, so should prefer the same cpu.
4: depending upon your system, there are likely to be many performance analysis tools available. Top, on UNIX-inspired systems is a simple tool which gives system wide utilization information, and the simple tool time will show how much time a process spent on a cpu vs real-world time. If you run each of your tasks sequentially, noting the cpu-time that they take, then time them running concurrently, the ratio between these cpu-times indicates the scaling factor of your concurrent app. Note that real-world time can be misleading because of io-overlap.

CPU percentage and heavy multi-threading

I am observing strange effects with the CPU percentage as shown in e.g. top or htop on Linux (Ubuntu 16.04) for one special application. The application uses many threads (around 1000). Each thread has one computational task. About half of these tasks need to be computed once per "trigger" - the trigger is an external event received exactly every 100ms. The other threads are mostly sleeping (waiting for user interaction) and hence do not play a big role here. So to summarise: many threads are waking up basically simultaneously within a short period of time, doing there (relatively short) computation and going back to sleep again.
Since the machine running this application has 8 virtual CPUs (4 cores each 2 threads, it's an i7-3612QE), only 8 threads can really wake up at a time, so many threads will have to wait. Also some of these tasks have interdependencies, so they anyway have to wait, but I think as an approximation one can think of this application as a bunch of threads going to the runnable state at the same time every 100ms and each doing only a short computation (way below 1ms of CPU time each).
Now coming to the strange effect: If I look at the CPU percentage in "top", it shows something like 250%. As far as I know, top looks on the CPU time (user + system) the kernel accounts for this process, so 250% would mean the process uses 3 virtual CPUs on average. So far so good. Now, if I use taskset to force the entire process to use only a single virtual CPU, the CPU percentage drops to 80%. The application has internal accounting which tells me that still all data is being processed. So the application is doing the same amount of work, but it seemingly uses less CPU resources. How can that be? Can I really trust the kernel CPU time accounting, or is this an artefact of the measurement?
The CPU percentage also goes down, if I start other processes which take a lot of CPU, even if the do nothing ("while(true);") and are running at low priority (nice). If I launch 8 of these CPU-eating processes, the application reaches again 80%. With fewer CPU-eaters, I get gradually higher CPU%.
Not sure if this plays a role: I have used the profiler vtune, which tells me my application is actually quite inefficient (only about 1 IPC), mostly because it's memory bound. This does not change if I restrict the process to a single virtual CPU, so I assume the effect is not caused by a huge increase in efficiency when running everything on the same core (which would be strange anyway).
My question was essentially already answered by myself in the last paragraph: The process is memory bound. Hence not the CPU is the limited resource but the memory bandwidth. Allowing such process to run on multiple CPU cores in parallel will mainly have the effect that more CPU cores are waiting for data to arrive from RAM. This is counted as CPU load, since the CPU is executing the thread, but just quite slowly. All my other observations go along with this.

Threads vs cores when threads are asleep

I am looking to confirm my assumptions about threads and CPU cores.
All the threads are the same. No disk I/O is used, threads do not share memory, and each thread does CPU bound work only.
If I have CPU with 10 cores, and I spawn 10 threads, each thread will have its own core and run simultaneously.
If I launch 20 threads with a CPU that has 10 cores, then the 20 threads will "task switch" between the 10 cores, giving each thread approximately 50% of the CPU time per core.
If I have 20 threads but 10 of the threads are asleep, and 10 are active, then the 10 active threads will run at 100% of the CPU time on the 10 cores.
An thread that is asleep only costs memory, and not CPU time. While the thread is still asleep. For example 10,000 threads that are all asleep uses the same amount of CPU as 1 thread asleep.
In general if you have a series of threads that sleep frequently while working on a parallel process. You can add more threads then there are cores until get to a state where all the cores are busy 100% of the time.
Are any of my assumptions incorrect? if so why?
Edit
When I say the thread is asleep, I mean that the thread is blocked for a specific amount of time. In C++ I would use sleep_for Blocks the execution of the current thread for at least the specified sleep_duration
If we assume that you are talking about threads that are implemented using native thread support in a modern OS, then your statements are more or less correct.
There are a few factors that could cause the behavior to deviate from the "ideal".
If there are other user-space processes, they may compete for resources (CPU, memory, etcetera) with your application. That will reduce (for example) the CPU available to your application. Note that this will include things like the user-space processes responsible for running your desktop environment etc.
There are various overheads that will be incurred by the operating system kernel. There are many places where this happens including:
Managing the file system.
Managing physical / virtual memory system.
Dealing with network traffic.
Scheduling processes and threads.
That will reduce the CPU available to your application.
The thread scheduler typically doesn't do entirely fair scheduling. So one thread may get a larger percentage of the CPU than another.
There are some complicated interactions with the hardware when the application has a large memory footprint, and threads don't have good memory locality. For various reasons, memory intensive threads compete with each other and can slow each other down. These interactions are all accounted as "user process" time, but they result in threads being able to do less actual work.
So:
1) If I have CPU with 10 cores, and I spawn 10 threads, each thread will have its own core and run simultaneously.
Probably not all of the time, due to other user processes and OS overheads.
2) If I launch 20 threads with a CPU that has 10 cores, then the 20 threads will "task switch" between the 10 cores, giving each thread approximately 50% of the CPU time per core.
Approximately. There are the overheads (see above). There is also the issue that time slicing between different threads of the same priority is fairly coarse grained, and not necessarily fair.
3) If I have 20 threads but 10 of the threads are asleep, and 10 are active, then the 10 active threads will run at 100% of the CPU time on the 10 cores.
Approximately: see above.
4) An thread that is asleep only costs memory, and not CPU time. While the thread is still asleep. For example 10,000 threads that are all asleep uses the same amount of CPU as 1 thread asleep.
There is also the issue that the OS consumes CPU to manage the sleeping threads; e.g. putting them to sleep, deciding when to wake them, rescheduling.
Another one is that the memory used by the threads may also come at a cost. For instance if the sum of the memory used for all process (including all of the 10,000 threads' stacks) is larger than the available physical RAM, then there is likely to be paging. And that also uses CPU resources.
5) In general if you have a series of threads that sleep frequently while working on a parallel process. You can add more threads then there are cores until get to a state where all the cores are busy 100% of the time.
Not necessarily. If the virtual memory usage is out of whack (i.e. you are paging heavily), the system may have to idle some of the CPU while waiting for memory pages to be read from and written to the paging device. In short, you need to take account of memory utilization, or it will impact on the CPU utilization.
This also doesn't take account of thread scheduling and context switching between threads. Each time the OS switches a core from one thread to another it has to:
Save the the old thread's registers.
Flush the processor's memory cache
Invalidate the VM mapping registers, etcetera. This includes the TLBs that #bazza mentioned.
Load the new thread's registers.
Take performance hits due to having to do more main memory reads, and vm page translations because of previous cache invalidations.
These overheads can be significant. According to https://unix.stackexchange.com/questions/506564/ this is typically around 1.2 microseconds per context switch. That may not sound much, but if your application is switching threads rapidly, that could amount to many milliseconds in each second.
As already mentioned in the comments, it depends on a number of factors. But in a general sense your assumptions are correct.
Sleep
In the bad old days a sleep() might have been implemented by the C library as a loop doing pointless work (e.g. multiplying 1 by 1 until the required time had elapsed). In that case, the CPU would still be 100% busy. Nowadays a sleep() will actually result in the thread being descheduled for the requisite time. Platforms such as MS-DOS worked this way, but any multitasking OS has had a proper implementation for decades.
10,000 sleeping threads will take up more CPU time, because the OS has to make scheduling judgements every timeslice tick (every 60ms, or thereabouts). The more threads it has to check for being ready to run, the more CPU time that checking takes.
Translate Lookaside Buffers
Adding more threads than cores is generally seen as OK. But you can run into a problem with Translate Lookaside Buffers (or their equivalents on other CPUs). These are part of the virtual memory management side of the CPU, and they themselves are effectively content address memory. This is really hard to implement, so there's never that much of it. Thus the more memory allocations there are (which there will be if you add more and more threads) the more this resource is eaten up, to the point where the OS may have to start swapping in and out different loadings of the TLB in order for all the virtual memory allocations to be accessible. If this starts happenging, everything in the process becomes really, really slow. This is likely less of a problem these days than it was, say, 20 years ago.
Also, modern memory allocators in C libraries (and thence everything else built on top, e.g. Java, C#, the lot) will actually be quite careful in how requests for virtual memory are managed, minising the times they actually have to as the OS for more virtual memory. Basically they seek to provide requested allocations out of pools they've already got, rather than each malloc() resulting in a call to the OS. This takes the pressure of the TLBs.

No of threads using epoll

Let's say I have a high performance server application that's running 50 threads where each of those threads is listening for data on 1000 sockets using epoll(). When some data comes on a socket the thread processes it. Assume that this processing is very very fast i.e. does not block much. As a result all 50 threads can scale pretty well. Given this situation, I've heard that if I were to go in and reduce the number of threads to 5 (i.e. make the 1/10th) and increase the sockets per thread to 10000 (i.e. make them 10 times) the system would be faster overall although the number of connections remains the same. I'm imagining this happens because when the number of threads is reduced by 1/10th the number of context switches and the chance of CPU caches getting stale is significantly reduced which results in the speed-up.
However, what I'm wondering if this will really help.
I'm concerned about how other threads in the system outside of this particular application will influence the picture. I'm assuming all threads, regardless of in which process they are, compete with each other for the CPUs from the scheduler's point of view. If the total threads in the system in the first scenario where 1050 and after my change they come down to 1005, it doesn't change much system wide. You still have roughly similar number of threads competing for the CPUs. So in this case my application won't benefit much. My question is is my thinking correct? In other words whenever you talk about number of threads in an epoll() based application you shouldn't be thinking of total threads in the system not just the application you're concerned with?

Will my program get more cpu time if it has more threads

If currently kernel is scheduling 60 threads, they belong to 3 processes:
A: 10 threads
B: 20 threads
C: 30 threads
And they are all doing computing (no disk IO)
Will C gets more stuff done than B, and B gets more stuff done than A?
It does not seem very fair to me. If I am a irresponsible programmer, I can just spawn more threads to eat more CPU resource.
How this relate to golang:
In go, typically the go scheduler has a thread pool of #of-CPU-cores threads. Why does this make sense if a process with more threads gets more stuff done?
The situation you describe is an overloaded machine. It is only true that the process with more threads will get more work done if the CPU has no spare time.
Go was not designed to fight other processes for a greater share of an overloaded CPU. You are free to set GOMAXPROCS to any number you like if you desire to participate in such a fight.
In a more typical system where the total work is less than the total CPU time; a Go process with 8 threads and 30 goroutines will perform about equally to a process with 30 threads running at the same time.
It does not seem very fair to me. If I am a irresponsible programmer,
I can just spawn more threads to eat more CPU resource.
you can also allocate your entire free memory, cause OS-failure and hijack the network card. you can do sorts of things but then who will want to consume your softwares?
How this relate to golang: In go, typically the go scheduler has a
thread pool of #of-CPU-cores threads. Why does this make sense if a
process with more threads gets more stuff done?
Golang goroutines are basically a threadpool. each goroutine is a thread-pool work item. many things can make your thread-pool thread block, like using synchronous IO, waiting on a (non-spin-lock) lock and manually sleeping or yielding. in these cases, which are very common, having more threads than CPU's usually increase the performance of your application.
Do note that not all IO is a disk IO. writing to your console is an IO operation, but it isn't really a "Disk IO".
another thing is that context switching may not consume a large portion of your CPU, and having more threads may not make your tasks-throughput degrade. so in this case having more threads means that your parallelism is higher, and yet you don't loose performance. this is somewhat common situation. context switching between threads these days is very cheap. having a bit more threads than your cores may not necessary kill your performance or degrade it in some way.

Resources