I am running an HTTP server using the Netty I/O library on a quad-core Linux machine. With the default worker thread pool size (which Netty sets internally to 2 × the number of cores), performance analysis shows that throughput caps at 1k requests/second, and further increases in request rate cause latency to grow almost linearly.
Since maximum CPU utilization is only around 60%, I increased the number of worker threads as in the code below. However, there is hardly any change in performance, and CPU utilization is still capped at 60-70%. The process is not bounded by memory, I/O, or network bandwidth. Why does increasing the worker threads make no difference in performance? What else can I do to increase my server's capacity?
EventLoopGroup group = new NioEventLoopGroup(100);
ServerBootstrap serverBootstrap = new ServerBootstrap();
serverBootstrap.group(group)
.channel(NioServerSocketChannel.class)
.localAddress(..)
...
If your code is using purely non-blocking I/O, you should be reaching more than 1k TPS on a quad-core. You should analyse what the Netty threads are doing, i.e. whether they are getting blocked by any call made inside the event loop. VisualVM should already give you a good idea of what's happening; for example, it can show Vert.x threads (which use Netty behind the scenes) sleeping.
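As a minimal, Netty-free sketch of the same diagnosis, the JDK's `java.lang.management` API can dump thread states programmatically, the way VisualVM or jstack would. The thread name `fake-event-loop-1` is a made-up stand-in for a real event-loop thread:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadStateProbe {
    public static void main(String[] args) throws Exception {
        // Stand-in for an event-loop thread that accidentally blocks;
        // the name "fake-event-loop-1" is invented for this demo.
        Thread blocked = new Thread(() -> {
            try {
                Thread.sleep(10_000);
            } catch (InterruptedException e) {
                // interrupted by main(): just exit
            }
        }, "fake-event-loop-1");
        blocked.start();
        Thread.sleep(200); // give it time to enter TIMED_WAITING

        // Dump all thread states, like VisualVM or jstack would.
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info.getThreadName().startsWith("fake-event-loop")) {
                // An event-loop thread sitting in TIMED_WAITING or BLOCKED,
                // rather than RUNNABLE or parked in the selector, is a red flag.
                System.out.println(info.getThreadName() + " -> " + info.getThreadState());
            }
        }
        blocked.interrupt();
        blocked.join();
    }
}
```

Running this on your real server (minus the fake thread) and filtering for Netty's worker-thread names would show whether the event loops are being blocked.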
Another thing you could try is to disable hyperthreading and check how CPU utilisation behaves: https://serverfault.com/questions/235825/disable-hyperthreading-from-within-linux-no-access-to-bios
I am looking to confirm my assumptions about threads and CPU cores.
All the threads are the same. No disk I/O is used, threads do not share memory, and each thread does CPU bound work only.
If I have a CPU with 10 cores, and I spawn 10 threads, each thread will have its own core and run simultaneously.
If I launch 20 threads with a CPU that has 10 cores, then the 20 threads will "task switch" between the 10 cores, giving each thread approximately 50% of the CPU time per core.
If I have 20 threads but 10 of the threads are asleep, and 10 are active, then the 10 active threads will run at 100% of the CPU time on the 10 cores.
A thread that is asleep costs only memory, not CPU time, for as long as it stays asleep. For example, 10,000 sleeping threads use the same amount of CPU as 1 sleeping thread.
In general, if you have a series of threads that sleep frequently while working on a parallel process, you can add more threads than there are cores until you reach a state where all the cores are busy 100% of the time.
Are any of my assumptions incorrect? If so, why?
Edit
When I say a thread is asleep, I mean that the thread is blocked for a specific amount of time. In C++ I would use sleep_for, which "blocks the execution of the current thread for at least the specified sleep_duration".
If we assume that you are talking about threads that are implemented using native thread support in a modern OS, then your statements are more or less correct.
There are a few factors that could cause the behavior to deviate from the "ideal".
If there are other user-space processes, they may compete for resources (CPU, memory, etcetera) with your application. That will reduce (for example) the CPU available to your application. Note that this will include things like the user-space processes responsible for running your desktop environment etc.
There are various overheads that will be incurred by the operating system kernel. There are many places where this happens including:
Managing the file system.
Managing physical / virtual memory system.
Dealing with network traffic.
Scheduling processes and threads.
That will reduce the CPU available to your application.
The thread scheduler typically doesn't do entirely fair scheduling. So one thread may get a larger percentage of the CPU than another.
There are some complicated interactions with the hardware when the application has a large memory footprint, and threads don't have good memory locality. For various reasons, memory intensive threads compete with each other and can slow each other down. These interactions are all accounted as "user process" time, but they result in threads being able to do less actual work.
So:
1) If I have a CPU with 10 cores, and I spawn 10 threads, each thread will have its own core and run simultaneously.
Probably not all of the time, due to other user processes and OS overheads.
2) If I launch 20 threads with a CPU that has 10 cores, then the 20 threads will "task switch" between the 10 cores, giving each thread approximately 50% of the CPU time per core.
Approximately. There are the overheads (see above). There is also the issue that time slicing between different threads of the same priority is fairly coarse grained, and not necessarily fair.
3) If I have 20 threads but 10 of the threads are asleep, and 10 are active, then the 10 active threads will run at 100% of the CPU time on the 10 cores.
Approximately: see above.
4) A thread that is asleep costs only memory, not CPU time, for as long as it stays asleep. For example, 10,000 sleeping threads use the same amount of CPU as 1 sleeping thread.
There is also the issue that the OS consumes CPU to manage the sleeping threads; e.g. putting them to sleep, deciding when to wake them, rescheduling.
Another one is that the memory used by the threads may also come at a cost. For instance, if the sum of the memory used by all processes (including all of the 10,000 threads' stacks) is larger than the available physical RAM, then there is likely to be paging. And that also uses CPU resources.
5) In general, if you have a series of threads that sleep frequently while working on a parallel process, you can add more threads than there are cores until you reach a state where all the cores are busy 100% of the time.
Not necessarily. If the virtual memory usage is out of whack (i.e. you are paging heavily), the system may have to idle some of the CPU while waiting for memory pages to be read from and written to the paging device. In short, you need to take account of memory utilization, or it will impact on the CPU utilization.
This also doesn't take account of thread scheduling and context switching between threads. Each time the OS switches a core from one thread to another it has to:
Save the old thread's registers.
Flush the processor's memory cache.
Invalidate the VM mapping registers, etcetera. This includes the TLBs that #bazza mentioned.
Load the new thread's registers.
Take performance hits due to having to do more main-memory reads and virtual memory page translations because of the previous cache invalidations.
These overheads can be significant. According to https://unix.stackexchange.com/questions/506564/ this is typically around 1.2 microseconds per context switch. That may not sound much, but if your application is switching threads rapidly, that could amount to many milliseconds in each second.
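To get a feel for that overhead, here is a rough sketch that bounces a token between two threads through `SynchronousQueue`s; each round trip forces at least two thread handoffs, so the measured figure is an upper bound on the per-handoff cost (it also includes queue overhead, and the class name `HandoffCost` is invented for this demo):

```java
import java.util.concurrent.SynchronousQueue;

public class HandoffCost {
    public static void main(String[] args) throws Exception {
        final int handoffs = 100_000;
        SynchronousQueue<Integer> ping = new SynchronousQueue<>();
        SynchronousQueue<Integer> pong = new SynchronousQueue<>();

        // Partner thread: take each token and bounce it straight back.
        Thread partner = new Thread(() -> {
            try {
                for (int i = 0; i < handoffs; i++) {
                    pong.put(ping.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        partner.start();

        long start = System.nanoTime();
        for (int i = 0; i < handoffs; i++) {
            ping.put(i);   // wakes the partner, blocks us
            pong.take();   // partner wakes us back
        }
        long elapsed = System.nanoTime() - start;
        partner.join();

        // Each round trip involves at least two thread wake-ups/handoffs.
        System.out.println("ns per handoff (upper bound): " + elapsed / (2L * handoffs));
    }
}
```

The exact number varies wildly by hardware and OS, but it makes the point that switching threads is far from free.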
As already mentioned in the comments, it depends on a number of factors. But in a general sense your assumptions are correct.
Sleep
In the bad old days a sleep() might have been implemented by the C library as a loop doing pointless work (e.g. multiplying 1 by 1 until the required time had elapsed); platforms such as MS-DOS worked this way. In that case, the CPU would still be 100% busy. Nowadays a sleep() will actually result in the thread being descheduled for the requisite time; any multitasking OS has had a proper implementation for decades.
10,000 sleeping threads will take up slightly more CPU time, because the OS has to make scheduling decisions every timeslice tick (every few milliseconds to tens of milliseconds, depending on the OS). The more threads it has to check for being ready to run, the more CPU time that checking takes.
Translation Lookaside Buffers
Adding more threads than cores is generally seen as OK. But you can run into a problem with Translation Lookaside Buffers (or their equivalents on other CPUs). These are part of the virtual memory management side of the CPU, and they are effectively content-addressable memory. That is really hard to implement, so there's never very much of it. Thus the more memory allocations there are (which there will be if you add more and more threads), the more this resource is eaten up, to the point where the OS may have to start swapping different loadings of the TLB in and out in order for all the virtual memory allocations to be accessible. If this starts happening, everything in the process becomes really, really slow. This is likely less of a problem these days than it was, say, 20 years ago.
Also, modern memory allocators in C libraries (and hence everything else built on top, e.g. Java, C#, the lot) are actually quite careful about how requests for virtual memory are managed, minimising the number of times they actually have to ask the OS for more virtual memory. Basically, they try to serve requested allocations out of pools they've already got, rather than having each malloc() result in a call to the OS. This takes the pressure off the TLBs.
I have two questions..
1. What is the difference between a thread and a thread pool? Can I have multiple thread pools (not threads) in my system?
2. I have been reading that the general size of a thread pool should be the same as the number of processors, or one more than that. I am using a quad-core processor, which means I could have 4 or 5 threads in a thread pool. However, Task Manager shows more than 1000 threads active on my system at any time. How is that possible?
What is the difference between a thread and a thread pool?
A thread is a single flow of execution. A thread pool is a group of these threads; usually the threads in a thread pool are kept alive indefinitely (i.e. until program shutdown) so that as each new work-request comes in, it can be handed to the next available thread in the thread-pool for processing. (This is beneficial because it's more efficient to just wake up an existing thread and hand it some work than it is to always create a new thread every time a new work-request comes in, and then destroy the thread afterwards)
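The distinction can be sketched with the JDK's `Executors` API; this is a minimal, self-contained example, not tied to any particular framework:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolVsThread {
    public static void main(String[] args) throws Exception {
        // A single thread: one flow of execution, which dies when its task ends.
        Thread single = new Thread(() -> System.out.println("single thread done"));
        single.start();
        single.join();

        // A thread pool: 4 long-lived threads that are handed work repeatedly,
        // avoiding the cost of creating/destroying a thread per request.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < 100; i++) {
            pool.execute(() -> completed.incrementAndGet());
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("tasks completed: " + completed.get());
    }
}
```

Here 100 work-requests are processed by only 4 threads, which is exactly the wake-an-existing-thread pattern described above.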
Can I have multiple thread pools (not threads) in my system?
Yes.
I have been reading that the general size of a thread pool should be the same as the number of processors, or one more than that.
That's a good heuristic, but it's not a requirement; your thread pool can have as many or as few threads in it as you like. The reason people suggest that number is that if you have fewer threads in your thread pool than you have physical CPU cores, then under heavy load not all of your CPU cores will get used (e.g. if you have a 3-thread pool and 4 CPU cores, then under heavy load you'll have 3 CPU cores busy and 1 CPU core idle/wasted, and your program will take ~33% longer to finish the work than it would have with 4 threads in the pool). On the other hand, if you have more threads than CPU cores, then under heavy load the "extra" threads merely end up time-sharing a CPU core, slowing each other down and not providing any additional benefit in terms of work-completion rate.
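As a sketch of that heuristic, `Runtime.getRuntime().availableProcessors()` gives the logical core count to size a CPU-bound pool against (the class name `PoolSizing` is invented for this demo):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizing {
    public static void main(String[] args) {
        // For CPU-bound work, a common starting point is one thread per core.
        // Note this returns *logical* cores (hyperthreads count double).
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService cpuPool = Executors.newFixedThreadPool(cores);
        System.out.println("pool sized to " + cores + " core(s)");
        cpuPool.shutdown();
    }
}
```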
However, under Task Manager my system shows more than 1000 threads active at any time?
The thing to notice about those 1000 threads is that probably 99% of them are actually asleep at any given moment. Most threads aren't doing work all the time; rather they spend most of their lives waiting for some particular event to occur, quickly handling it, and then going back to sleep until the next event comes along for them to handle. That's the reason why you can have 1000 threads present on just a handful of CPU cores without everything bogging down.
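A small sketch of that point (scaled down to 200 threads so the demo runs quickly): threads parked waiting for an event sit in the WAITING state, costing memory but essentially no CPU:

```java
import java.util.concurrent.CountDownLatch;

public class SleepyThreads {
    public static void main(String[] args) throws Exception {
        final int n = 200; // scaled-down stand-in for the "1000 threads"
        CountDownLatch wakeUp = new CountDownLatch(1);
        Thread[] threads = new Thread[n];
        for (int i = 0; i < n; i++) {
            threads[i] = new Thread(() -> {
                try {
                    wakeUp.await(); // sleep until an event arrives
                } catch (InterruptedException e) {
                    // just exit
                }
            });
            threads[i].start();
        }
        Thread.sleep(200); // give them all time to park

        int waiting = 0;
        for (Thread t : threads) {
            if (t.getState() == Thread.State.WAITING) waiting++;
        }
        System.out.println(waiting + " of " + n + " threads are asleep");

        wakeUp.countDown(); // the "event": all wake, handle it, and exit
        for (Thread t : threads) t.join();
    }
}
```

All 200 threads coexist happily on a handful of cores because they spend almost all of their time descheduled.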
I have a process that spawns multiple threads (say 6 threads).
What will be the impact on its performance if I run it on a server machine with 6 CPUs or 4 CPUs?
What is the relation between threads, CPUs, and the cores inside each CPU?
I have read that threads only run on different cores inside one CPU. Is that true?
It depends.
If your tasks are CPU-bound with no pipeline stalls, then you'll get the best performance from spawning one thread per physical CPU core.
If your CPU-bound tasks have pipeline stalls from cache misses, branch mispredictions, dependencies, etc, then you can take advantage of Hyperthreading and spawn one thread per virtual core. On a CPU without Hyperthreading the number of virtual cores is equal to the number of physical cores.
If your tasks block for IO, then you can benefit from spawning many more threads than CPU cores. The Apache web server is an example of this approach.
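A sketch of that last case, using `Thread.sleep` as a stand-in for blocking IO (the class name and numbers are invented for this demo):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class IoBoundThreads {
    public static void main(String[] args) throws Exception {
        final int tasks = 16;
        final long blockMillis = 100;

        // 16 tasks that each "block on IO" for 100 ms. Run one after another
        // they would take ~1600 ms; on a 16-thread pool the blocking overlaps
        // and they take ~100 ms, even on a machine with far fewer than 16 cores.
        ExecutorService pool = Executors.newFixedThreadPool(tasks);
        long start = System.nanoTime();
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                try {
                    Thread.sleep(blockMillis); // stand-in for a blocking IO call
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        System.out.println("elapsed ~" + elapsedMillis + " ms for "
                + tasks + " blocking tasks");
    }
}
```

This is why thread-per-connection servers like Apache run many more threads than cores: while one thread is blocked on the network, another can use the CPU.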
I'm trying to make a Scala application server based on the classic worker pool model.
Given that:
the machine has a quad-core processor
there is a scheduler actor dedicated to listening on the network with blocking I/O
worker actors are all non-blocking.
What would be the best value for corePoolSize to maximize the performance?
Ideally the performance is maximized when the size of the worker thread pool is equal to the number of processor cores.
So in this case, I guess the best value would be 5 (1 for the scheduler and the other 4 for the workers), or alternatively I could set the value to 4 and override the scheduler method of the scheduler actor so that it will not share the thread pool with the workers.
Is this correct? Any advice appreciated.
Thanks!
Just some hints.
Ideally the performance is maximized when the size of the worker thread pool is equal to the number of processor cores.
Not really. Here is how you could estimate the number of threads at which you can get maximum throughput:
N = C * U * (1 + W/C)
where N = number of threads, C = number of CPU cores, U = target CPU utilization rate, W/C = Waiting time to Computing time ratio (waiting time means IO etc.).
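The formula can be turned into a small helper; the method name `estimateThreads` is mine, not from any library:

```java
public class PoolSizeEstimate {
    // N = C * U * (1 + W/C_task): number of threads for a target CPU
    // utilization, where W/C_task is each task's wait-to-compute ratio.
    static int estimateThreads(int cores, double targetUtilization,
                               double waitToComputeRatio) {
        return (int) Math.ceil(cores * targetUtilization * (1 + waitToComputeRatio));
    }

    public static void main(String[] args) {
        // Quad-core, full utilization, tasks that wait as long as they compute:
        System.out.println(estimateThreads(4, 1.0, 1.0)); // prints 8
        // Purely CPU-bound tasks (no waiting): one thread per core.
        System.out.println(estimateThreads(4, 1.0, 0.0)); // prints 4
    }
}
```

So for the quad-core machine above, the right pool size depends entirely on how much time the workers spend waiting rather than computing.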
Note however that the above equation only considers CPU, and CPU isn't the only resource to manage. Tuning for response time would also be a bit different matter.
The cliché answer is that you have to test in order to see what's the best option. You can probably use the above formula as a starting point. Note also that core pool size != max pool size.