I am running an HTTP server using the Netty I/O library on a quad-core Linux machine. With the default worker thread pool size (which Netty sets internally to 2 x the number of cores), performance analysis shows that throughput caps at 1k requests/second, and further increases in the request rate cause latency to grow almost linearly.
Since maximum CPU utilization is only around 60%, I increased the number of worker threads as in the code below. However, there is hardly any change in performance, and CPU is still capped at 60-70%. The process is not bound by memory, disk I/O or network bandwidth. Why is there no change in performance from increasing the worker threads? What else can I do to improve the server's performance and increase its capacity?
EventLoopGroup group = new NioEventLoopGroup(100);
ServerBootstrap serverBootstrap = new ServerBootstrap();
serverBootstrap.group(group)
    .channel(NioServerSocketChannel.class)
    .localAddress(..)
    ...
If your code is using purely non-blocking I/O, you should be reaching more than 1k TPS with a quad-core. You should analyse what the Netty threads are doing, i.e. whether they are getting blocked by any call made within the event loop. VisualVM should already give you a good idea of what's happening, e.g. whether the event-loop threads are mostly sleeping or stuck in blocking calls (the same technique applies to Vert.x, which uses Netty behind the scenes).
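If the profiler does show event-loop threads stuck in blocking calls, the usual fix is to move that work off the event loop rather than adding more event-loop threads. Here is a minimal sketch using Netty's DefaultEventExecutorGroup; BlockingDatabaseHandler is a hypothetical stand-in for whatever blocking handler you have:

import io.netty.channel.ChannelInitializer;
import io.netty.channel.socket.SocketChannel;
import io.netty.handler.codec.http.HttpObjectAggregator;
import io.netty.handler.codec.http.HttpServerCodec;
import io.netty.util.concurrent.DefaultEventExecutorGroup;
import io.netty.util.concurrent.EventExecutorGroup;

public class ServerInitializer extends ChannelInitializer<SocketChannel> {
    // Separate pool for handlers that may block (DB calls, file access, etc.)
    private static final EventExecutorGroup BLOCKING_GROUP = new DefaultEventExecutorGroup(16);

    @Override
    protected void initChannel(SocketChannel ch) {
        ch.pipeline()
          .addLast(new HttpServerCodec())
          .addLast(new HttpObjectAggregator(65536))
          // BlockingDatabaseHandler is a placeholder for your own blocking handler;
          // it runs on BLOCKING_GROUP threads, not on the event loop.
          .addLast(BLOCKING_GROUP, new BlockingDatabaseHandler());
    }
}

That keeps the NioEventLoopGroup threads free for I/O, which is usually more effective than simply raising the worker thread count.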
Another thing you could try is disabling hyperthreading and checking how CPU utilisation behaves: https://serverfault.com/questions/235825/disable-hyperthreading-from-within-linux-no-access-to-bios
Related
If a single thread costs 25% of the workload of a multi-core CPU, will 4 threads fill the CPU?
Besides running the 4 threads, can I do something else with the CPU?
If a single thread uses 25% of the CPU, then what is using the other 75%?
If the answer is, nothing is using the other 75%—if the CPU is idle 75% of the time—then the single thread is not cpu-bound. That is to say, something other than the availability of CPU cycles is limiting the speed of the program.
What is blocking the thread? Probably the thread is awaiting some resource; messages from the network, disk I/O, user input, etc. Whatever it is, if there are at least four of them, and if your program can be structured to concurrently use four of them, then maybe running four threads will allow the program to run four times as fast.
For example, if your program is some kind of server and it spends a lot of time awaiting messages from clients, then adding more threads would allow it to await more messages simultaneously, which would improve throughput.
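As a rough illustration (nothing here is from the question; the port and echo logic are arbitrary), a classic blocking-I/O server spends nearly all of each thread's time parked in accept() or readLine(), so a small pool of threads lets several of those waits overlap:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BlockingEchoServer {
    public static void main(String[] args) throws IOException {
        ExecutorService pool = Executors.newFixedThreadPool(4); // 4 threads can wait on 4 clients at once
        try (ServerSocket server = new ServerSocket(8080)) {
            while (true) {
                Socket client = server.accept();      // blocks until a client connects
                pool.submit(() -> handle(client));    // hand the waiting off to a pool thread
            }
        }
    }

    private static void handle(Socket client) {
        try (Socket s = client;
             BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()));
             PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
            String line;
            while ((line = in.readLine()) != null) {  // blocks waiting for data, uses no CPU meanwhile
                out.println(line);
            }
        } catch (IOException ignored) {
            // client disconnected; nothing to do in this sketch
        }
    }
}

Each thread is idle most of the time, so all four together can still show very low CPU usage.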
A counterexample would be if your program is processing data from a single disk file, and it spends most of its time awaiting data from the disk. Since there's only one disk, and the disk can only go so fast, adding more threads would not help in that case.
I am observing strange effects with the CPU percentage as shown in e.g. top or htop on Linux (Ubuntu 16.04) for one particular application. The application uses many threads (around 1000). Each thread has one computational task. About half of these tasks need to be computed once per "trigger" - the trigger is an external event received exactly every 100ms. The other threads are mostly sleeping (waiting for user interaction) and hence do not play a big role here. So to summarise: many threads are waking up basically simultaneously within a short period of time, doing their (relatively short) computation and going back to sleep again.
Since the machine running this application has 8 virtual CPUs (4 cores with 2 threads each, it's an i7-3612QE), only 8 threads can really wake up at a time, so many threads will have to wait. Also, some of these tasks have interdependencies, so they have to wait anyway, but I think as an approximation one can think of this application as a bunch of threads going to the runnable state at the same time every 100ms and each doing only a short computation (way below 1ms of CPU time each).
Now coming to the strange effect: If I look at the CPU percentage in "top", it shows something like 250%. As far as I know, top looks at the CPU time (user + system) the kernel accounts for this process, so 250% would mean the process uses about 2.5 virtual CPUs on average. So far so good. Now, if I use taskset to force the entire process to use only a single virtual CPU, the CPU percentage drops to 80%. The application has internal accounting which tells me that still all data is being processed. So the application is doing the same amount of work, but it seemingly uses less CPU resources. How can that be? Can I really trust the kernel CPU time accounting, or is this an artefact of the measurement?
The CPU percentage also goes down if I start other processes which take a lot of CPU, even if they do nothing ("while(true);") and are running at low priority (nice). If I launch 8 of these CPU-eating processes, the application again reaches 80%. With fewer CPU-eaters, I get gradually higher CPU%.
Not sure if this plays a role: I have used the profiler vtune, which tells me my application is actually quite inefficient (only about 1 IPC), mostly because it's memory bound. This does not change if I restrict the process to a single virtual CPU, so I assume the effect is not caused by a huge increase in efficiency when running everything on the same core (which would be strange anyway).
My question was essentially already answered by myself in the last paragraph: the process is memory bound. Hence the CPU is not the limiting resource; the memory bandwidth is. Allowing such a process to run on multiple CPU cores in parallel mainly has the effect that more CPU cores sit waiting for data to arrive from RAM. This is counted as CPU load, since the CPU is executing the thread, just very slowly. All my other observations are consistent with this.
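If you want to see this effect yourself, a crude micro-benchmark along these lines (hypothetical code, not from the application above) shows that adding threads to a pure streaming workload gives far less than linear speedup once memory bandwidth saturates, even though every thread is accounted as busy CPU time:

import java.util.concurrent.CountDownLatch;

public class BandwidthSketch {
    static final int WORDS = 4 * 1024 * 1024;   // 32 MB of longs per thread
    static volatile long sink;                   // keeps the JIT from dropping the loop

    public static void main(String[] args) throws InterruptedException {
        int cores = Runtime.getRuntime().availableProcessors();
        for (int threads = 1; threads <= cores; threads *= 2) {
            long[][] data = new long[threads][WORDS];
            CountDownLatch done = new CountDownLatch(threads);
            long start = System.nanoTime();
            for (int t = 0; t < threads; t++) {
                final long[] a = data[t];
                new Thread(() -> {
                    long sum = 0;
                    for (long x : a) sum += x;   // pure streaming read: memory bound
                    sink += sum;                  // racy, but only here to defeat dead-code elimination
                    done.countDown();
                }).start();
            }
            done.await();
            System.out.printf("%d thread(s): %.1f ms%n", threads, (System.nanoTime() - start) / 1e6);
        }
    }
}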
I am looking to confirm my assumptions about threads and CPU cores.
All the threads are the same. No disk I/O is used, threads do not share memory, and each thread does CPU bound work only.
If I have CPU with 10 cores, and I spawn 10 threads, each thread will have its own core and run simultaneously.
If I launch 20 threads with a CPU that has 10 cores, then the 20 threads will "task switch" between the 10 cores, giving each thread approximately 50% of the CPU time per core.
If I have 20 threads but 10 of the threads are asleep, and 10 are active, then the 10 active threads will run at 100% of the CPU time on the 10 cores.
A thread that is asleep only costs memory, not CPU time, for as long as it stays asleep. For example, 10,000 threads that are all asleep use the same amount of CPU as 1 sleeping thread.
In general, if you have a set of threads that sleep frequently while working on a parallel task, you can add more threads than there are cores until you reach a state where all the cores are busy 100% of the time.
Are any of my assumptions incorrect? If so, why?
Edit
When I say the thread is asleep, I mean that the thread is blocked for a specific amount of time. In C++ I would use std::this_thread::sleep_for, which "blocks the execution of the current thread for at least the specified sleep_duration".
If we assume that you are talking about threads that are implemented using native thread support in a modern OS, then your statements are more or less correct.
There are a few factors that could cause the behavior to deviate from the "ideal".
If there are other user-space processes, they may compete for resources (CPU, memory, etcetera) with your application. That will reduce (for example) the CPU available to your application. Note that this will include things like the user-space processes responsible for running your desktop environment etc.
There are various overheads that will be incurred by the operating system kernel. There are many places where this happens including:
Managing the file system.
Managing physical / virtual memory system.
Dealing with network traffic.
Scheduling processes and threads.
That will reduce the CPU available to your application.
The thread scheduler typically doesn't do entirely fair scheduling. So one thread may get a larger percentage of the CPU than another.
There are some complicated interactions with the hardware when the application has a large memory footprint, and threads don't have good memory locality. For various reasons, memory intensive threads compete with each other and can slow each other down. These interactions are all accounted as "user process" time, but they result in threads being able to do less actual work.
So:
1) If I have CPU with 10 cores, and I spawn 10 threads, each thread will have its own core and run simultaneously.
Probably not all of the time, due to other user processes and OS overheads.
2) If I launch 20 threads with a CPU that has 10 cores, then the 20 threads will "task switch" between the 10 cores, giving each thread approximately 50% of the CPU time per core.
Approximately. There are the overheads (see above). There is also the issue that time slicing between different threads of the same priority is fairly coarse grained, and not necessarily fair.
3) If I have 20 threads but 10 of the threads are asleep, and 10 are active, then the 10 active threads will run at 100% of the CPU time on the 10 cores.
Approximately: see above.
4) A thread that is asleep only costs memory, not CPU time, for as long as it stays asleep. For example, 10,000 threads that are all asleep use the same amount of CPU as 1 sleeping thread.
There is also the issue that the OS consumes CPU to manage the sleeping threads; e.g. putting them to sleep, deciding when to wake them, rescheduling.
Another one is that the memory used by the threads may also come at a cost. For instance, if the sum of the memory used by all processes (including all of the 10,000 threads' stacks) is larger than the available physical RAM, then there is likely to be paging. And that also uses CPU resources.
5) In general, if you have a set of threads that sleep frequently while working on a parallel task, you can add more threads than there are cores until you reach a state where all the cores are busy 100% of the time.
Not necessarily. If the virtual memory usage is out of whack (i.e. you are paging heavily), the system may have to idle some of the CPU while waiting for memory pages to be read from and written to the paging device. In short, you need to take account of memory utilization, or it will impact the CPU utilization.
This also doesn't take account of thread scheduling and context switching between threads. Each time the OS switches a core from one thread to another it has to:
Save the old thread's registers.
Flush the processor's memory cache.
Invalidate the VM mapping registers, etcetera. This includes the TLBs that #bazza mentioned.
Load the new thread's registers.
Take performance hits due to having to do more main memory reads and VM page translations because of the previous cache invalidations.
These overheads can be significant. According to https://unix.stackexchange.com/questions/506564/ this is typically around 1.2 microseconds per context switch. That may not sound like much, but if your application is switching threads rapidly, that could amount to many milliseconds in each second.
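A crude way to get a feel for this number on your own machine (a sketch, not a rigorous benchmark) is to ping-pong a token between two threads, which forces a thread switch on every hand-off:

import java.util.concurrent.SynchronousQueue;

public class PingPong {
    public static void main(String[] args) throws InterruptedException {
        final int rounds = 1_000_000;
        SynchronousQueue<Integer> ping = new SynchronousQueue<>();
        SynchronousQueue<Integer> pong = new SynchronousQueue<>();

        Thread echo = new Thread(() -> {
            try {
                for (int i = 0; i < rounds; i++) {
                    pong.put(ping.take());   // receive the token, hand it straight back
                }
            } catch (InterruptedException ignored) {
            }
        });
        echo.start();

        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            ping.put(i);
            pong.take();
        }
        long elapsed = System.nanoTime() - start;
        echo.join();
        // Each round trip involves at least two thread hand-offs.
        System.out.printf("~%.2f us per hand-off%n", elapsed / 1e3 / (rounds * 2.0));
    }
}

The result includes scheduler and queue overhead as well as the raw switch cost, so treat it as a rough upper bound.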
As already mentioned in the comments, it depends on a number of factors. But in a general sense your assumptions are correct.
Sleep
In the bad old days a sleep() might have been implemented by the C library as a loop doing pointless work (e.g. multiplying 1 by 1 until the required time had elapsed). In that case, the CPU would still be 100% busy. Nowadays a sleep() will actually result in the thread being descheduled for the requisite time. Platforms such as MS-DOS worked this way, but any multitasking OS has had a proper implementation for decades.
10,000 sleeping threads will take up more CPU time, because the OS has to make scheduling judgements every timeslice tick (every 60ms, or thereabouts). The more threads it has to check for being ready to run, the more CPU time that checking takes.
Translation Lookaside Buffers
Adding more threads than cores is generally seen as OK. But you can run into a problem with Translation Lookaside Buffers (or their equivalents on other CPUs). These are part of the virtual memory management side of the CPU, and they are effectively content-addressable memory. This is really hard to implement, so there's never that much of it. Thus the more memory allocations there are (which there will be if you add more and more threads), the more this resource is eaten up, to the point where the OS may have to start swapping in and out different loadings of the TLB in order for all the virtual memory allocations to be accessible. If this starts happening, everything in the process becomes really, really slow. This is likely less of a problem these days than it was, say, 20 years ago.
Also, modern memory allocators in C libraries (and hence everything else built on top, e.g. Java, C#, the lot) are actually quite careful in how requests for virtual memory are managed, minimising the number of times they actually have to ask the OS for more virtual memory. Basically they seek to satisfy requested allocations out of pools they've already got, rather than each malloc() resulting in a call to the OS. This takes the pressure off the TLBs.
We have a mobile app API server written with Ratpack 1.5.1 about to go live soon, and we're currently profiling the application to catch any performance bottlenecks. The app is backed by an SQL database and we're careful to always run queries using the Blocking class. The code is written in Kotlin and we wrote some coroutine glue code to force blocking operations to be executed on Ratpack's blocking threads.
Since Ratpack's thread model is unique we'd like to make sure this situation is normal: we simulated 2500 concurrent users of the application and our thread count went up to 400 (and even 600 at one point), most of these being ratpack-blocking-x-yyy threads.
Sampling the CPU, we see 92% of the time spent in the ratpack.exec.internal.DefaultExecController$ExecControllerBindingThreadFactory.lambda$newThread$0 method, but this could be an artifact of sampling.
So, to ask concrete questions: given Ratpack's thread model, is the high blocking thread count normal and should we be worrying about the high CPU time spent in the above mentioned method?
Ratpack creates an unlimited(*) thread pool for blocking operations. It gets created in DefaultExecController:
public DefaultExecController(int numThreads) {
    this.numThreads = numThreads;
    this.eventLoopGroup = ChannelImplDetector.eventLoopGroup(numThreads, new ExecControllerBindingThreadFactory(true, "ratpack-compute", Thread.MAX_PRIORITY));
    this.blockingExecutor = Executors.newCachedThreadPool(new ExecControllerBindingThreadFactory(false, "ratpack-blocking", Thread.NORM_PRIORITY));
}
Threads that are created in this pool don't get killed right after a blocking operation is done - they idle in the pool and wait for the next job to do. The main reason behind this is that keeping a thread idle is cheaper than spawning a new thread whenever one is needed. That's why, when you simulate 2500 concurrent users calling an endpoint which executes a blocking operation, you will see up to 2500 threads in this pool. The cached thread pool that gets created uses the following ThreadPoolExecutor object:
public static ExecutorService newCachedThreadPool(ThreadFactory threadFactory) {
    return new ThreadPoolExecutor(0, 2147483647, 60L, TimeUnit.SECONDS, new SynchronousQueue<Runnable>(), threadFactory);
}
where 2147483647 is the maximum pool size and 60L is the TTL (idle timeout) expressed in seconds. It means that the executor service will keep an idle thread for 60 seconds, and if it doesn't get re-used within those 60 seconds, it will be cleaned up.
High CPU in this case is actually expected: 2500 threads are utilizing the few cores of your CPU. It also matters where your SQL database is running - if it runs on the same machine, your CPU has an even harder job to do. If the operations you run on the blocking thread pool consume significant CPU time, then you have to optimize those blocking operations. Ratpack's power comes from its async and non-blocking architecture - handlers use the ratpack-compute thread pool and delegate all blocking operations to ratpack-blocking, so your application is not blocked and can handle tons of requests.
(*) unlimited in this case means limited by available memory or, if you have enough memory, limited to 2147483647 threads (the value used in Executors.newCachedThreadPool(factory)).
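For reference, delegating a query through Ratpack's Blocking API looks roughly like this (UserDao and the path token are hypothetical; ratpack.exec.Blocking and the handler types are real Ratpack API):

import ratpack.exec.Blocking;
import ratpack.handling.Context;
import ratpack.handling.Handler;

public class UserHandler implements Handler {
    private final UserDao userDao; // hypothetical blocking DAO backed by SQL

    public UserHandler(UserDao userDao) {
        this.userDao = userDao;
    }

    @Override
    public void handle(Context ctx) {
        String id = ctx.getPathTokens().get("id");
        Blocking.get(() -> userDao.findNameById(id))   // runs on the ratpack-blocking pool
                .then(name -> ctx.render(name));       // resumes on ratpack-compute
    }
}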
Just to build on Szymon's answer…
Ratpack doesn't inherently throttle any operations. That's effectively up to you. One option you have is to use https://ratpack.io/manual/current/api/ratpack/exec/Throttle.html to constrain and queue access to a resource.
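A sketch of that (again, UserDao is a hypothetical blocking DAO): a shared Throttle caps how many blocking calls are in flight at once, so a burst of 2500 users queues behind it instead of each grabbing its own ratpack-blocking thread.

import ratpack.exec.Blocking;
import ratpack.exec.Throttle;
import ratpack.handling.Context;
import ratpack.handling.Handler;

public class ThrottledQueryHandler implements Handler {
    // Allow at most 20 concurrent blocking DB calls; the rest wait in line.
    private final Throttle dbThrottle = Throttle.ofSize(20);
    private final UserDao userDao; // hypothetical blocking DAO

    public ThrottledQueryHandler(UserDao userDao) {
        this.userDao = userDao;
    }

    @Override
    public void handle(Context ctx) {
        Blocking.get(() -> userDao.findNameById(ctx.getPathTokens().get("id")))
                .throttled(dbThrottle)            // queue here if 20 calls are already in flight
                .then(name -> ctx.render(name));
    }
}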
Let's say I have a high performance server application that's running 50 threads, where each of those threads is listening for data on 1000 sockets using epoll(). When some data comes in on a socket, the thread processes it. Assume that this processing is very fast, i.e. does not block much. As a result all 50 threads can scale pretty well. Given this situation, I've heard that if I were to go in and reduce the number of threads to 5 (i.e. make them 1/10th) and increase the sockets per thread to 10000 (i.e. make them 10 times as many), the system would be faster overall although the number of connections remains the same. I imagine this happens because when the number of threads is reduced to 1/10th, the number of context switches and the chance of CPU caches getting stale are significantly reduced, which results in the speed-up.
However, I'm wondering whether this will really help.
I'm concerned about how other threads in the system, outside of this particular application, will influence the picture. I'm assuming all threads, regardless of which process they are in, compete with each other for the CPUs from the scheduler's point of view. If the total number of threads in the system in the first scenario were 1050, and after my change it comes down to 1005, that doesn't change much system-wide. You still have roughly the same number of threads competing for the CPUs. So in this case my application won't benefit much. My question is: is my thinking correct? In other words, whenever you talk about the number of threads in an epoll() based application, shouldn't you be thinking of the total threads in the system, not just the application you're concerned with?
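For concreteness (my own sketch, using a Java NIO Selector as a portable stand-in for epoll; none of the names come from the question), the per-thread loop being described is essentially this:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class WorkerLoop implements Runnable {
    private final Selector selector;
    private final ByteBuffer buffer = ByteBuffer.allocateDirect(64 * 1024);

    public WorkerLoop(Selector selector) {
        this.selector = selector; // sockets are registered with OP_READ elsewhere
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                selector.select();                         // the epoll_wait() equivalent
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    if (key.isReadable()) {
                        buffer.clear();
                        int n = ((SocketChannel) key.channel()).read(buffer);
                        if (n < 0) {
                            key.cancel();
                            key.channel().close();
                        }
                        // ... fast, non-blocking processing of `buffer` goes here
                    }
                }
            }
        } catch (IOException e) {
            // a real server would log and recover; omitted in this sketch
        }
    }
}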