I am using the Throughput Shaping Timer in JMeter
I am occasionally seeing warning messages such as the following:
2020-01-21 17:02:01,007 WARN k.a.j.t.VariableThroughputTimer: No free threads left in worker pool, made 316/500.0 samples
2020-01-21 17:02:02,009 WARN k.a.j.t.VariableThroughputTimer: No free threads left in worker pool, made 164/500.0 samples
2020-01-21 17:02:03,016 WARN k.a.j.t.VariableThroughputTimer: No free threads left in worker pool, made 263/500.0 samples
2020-01-21 17:02:04,009 WARN k.a.j.t.VariableThroughputTimer: No free threads left in worker pool, made 311/500.0 samples
2020-01-21 17:02:05,009 WARN k.a.j.t.VariableThroughputTimer: No free threads left in worker pool, made 288/500.0 samples
I am using Target Concurrency: ${__tstFeedback(rate_profile,100,5000,500)}
I need to reach 500 tps, but I am not able to get more than 270 tps out of a JMeter server instance with this plugin.
Monitors indicate that CPU, disk, network, and memory resources are available; however, the test starts logging this warning.
Could you please help?
The warning means that you set the Spare threads ratio too low. Try increasing:
Starting concurrency
Spare threads ratio
by a factor of 2x.
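For reference, the arguments to __tstFeedback are, in order: the name of the Throughput Shaping Timer element, the starting concurrency, the maximum concurrency, and the spare threads parameter (if I read the plugin's documentation correctly, a value below 1 is treated as a ratio of the estimated thread count, and a value of 1 or more as an absolute thread count). A doubled configuration could therefore look something like this (the values are illustrative, not taken from the question):

${__tstFeedback(rate_profile,200,5000,1000)}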
Also consider correlating the number of threads with the application response time.
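To make that correlation concrete, Little's law gives a rough sizing rule: required threads ≈ target throughput × average response time. For example, sustaining 500 requests/s against an application that responds in roughly 600 ms needs on the order of 500 × 0.6 = 300 concurrent threads (the 600 ms figure here is purely illustrative).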
It is also worth checking JMeter's JVM metrics, such as heap space usage and garbage collection intervals; JMeter's JVM heap usage needs to stay between 40 and 70% for optimal performance.
JMeter's JVM can be monitored using, for example, JVisualVM.
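If you prefer the command line, heap occupancy and GC activity can also be sampled with jstat; a minimal example, where <jmeter_pid> is a placeholder for the PID of the JMeter process:

jstat -gcutil <jmeter_pid> 5000

This prints per-generation heap utilisation percentages and GC counts every 5 seconds.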
If a single JMeter instance cannot produce the required load, you will have to go for Distributed Testing.
Also remember that your application might simply not be capable of responding fast enough, so it is worth checking its health metrics, performance metrics, and logs as well.
With the config ${__tstFeedback(rate_profile,100,5000,500)}, the JMeter process used 120 threads in the VM.
I used another config, ${__tstFeedback(rate_profile,300,3000,0.75)}, so that the number of threads increased to 320,
and as a result the throughput increased to 500 TPS.
Related
I am running an HTTP server using the Netty I/O library on a quad-core Linux machine. Running with the default worker thread pool size (which Netty internally sets to 2 x the number of cores), performance analysis shows that throughput caps at 1k requests/second, and further increases in request rate cause latency to rise almost linearly.
As max CPU utilization shows 60%, I increased the number of worker threads as per the code below. However, there is hardly any change in performance, and CPU is still capped at 60-70%. The process is not bounded by memory, I/O, or network bandwidth. Why is there no change in performance when increasing worker threads? What else can I do to improve my server's performance and increase its capacity?
EventLoopGroup group = new NioEventLoopGroup(100); // raised from the default of 2 x cores
ServerBootstrap serverBootstrap = new ServerBootstrap();
serverBootstrap.group(group)
    .channel(NioServerSocketChannel.class)
    .localAddress(..)
    ...
If your code is using purely non-blocking I/O, you should be reaching more than 1k TPS on a quad-core. You should analyse what the Netty threads are doing, i.e. whether they are getting blocked by any call made within the event loop. VisualVM should already give you a good idea of what's happening; for example, it can show Vert.x threads (which use Netty behind the scenes) sleeping.
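If you cannot attach VisualVM, a plain thread dump gives similar information. Netty's default thread factory names its event-loop threads nioEventLoopGroup-N-M, so you can pull their stacks straight out of the dump (the PID below is a placeholder):

jstack <pid> | grep -A 20 'nioEventLoopGroup'

An event-loop thread repeatedly caught in a BLOCKED state inside handler code is a strong hint that something blocking is running on the event loop.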
Another thing you could try is to disable hyperthreading and check how CPU utilisation behaves: https://serverfault.com/questions/235825/disable-hyperthreading-from-within-linux-no-access-to-bios
I am trying to understand the CPU utilization report produced by running sar -u 1 3. On my system, it reports ~50% utilization. I am surprised, because ours is high-performance software that creates lots of threads, so I am thinking that context switches and cache misses could be the reason we're seeing low CPU utilization numbers.
The question is: if a thread tries to read data and that data has to be fetched from memory, is the thread considered to be "utilizing" the CPU while it waits for the data, or not?
We have a mobile app API server written with Ratpack 1.5.1 about to go live soon, and we're currently profiling the application to catch any performance bottlenecks. The app is backed by an SQL database and we're careful to always run queries using the Blocking class. The code is written in Kotlin and we wrote some coroutine glue code to force blocking operations to be executed on Ratpack's blocking threads.
Since Ratpack's thread model is unique we'd like to make sure this situation is normal: we simulated 2500 concurrent users of the application and our thread count went up to 400 (and even 600 at one point), most of these being ratpack-blocking-x-yyy threads.
Sampling the CPU we get 92% time spent in the ratpack.exec.internal.DefaultExecController$ExecControllerBindingThreadFactory.lambda$newThread$0 method, but this could be an artifact of sampling.
So, to ask concrete questions: given Ratpack's thread model, is the high blocking thread count normal and should we be worrying about the high CPU time spent in the above mentioned method?
Ratpack creates an unlimited(*) thread pool for blocking operations. It gets created in DefaultExecController:
public DefaultExecController(int numThreads) {
    this.numThreads = numThreads;
    this.eventLoopGroup = ChannelImplDetector.eventLoopGroup(numThreads, new ExecControllerBindingThreadFactory(true, "ratpack-compute", Thread.MAX_PRIORITY));
    this.blockingExecutor = Executors.newCachedThreadPool(new ExecControllerBindingThreadFactory(false, "ratpack-blocking", Thread.NORM_PRIORITY));
}
Threads created in this pool don't get killed right after a blocking operation is done; they idle in the pool, waiting for the next job. The main reason is that keeping a thread idle is cheaper than spawning a new one whenever it is needed. That's why, when you simulate 2500 concurrent users calling an endpoint that executes a blocking operation, you will see 2500 threads in this pool. The cached thread pool is created with the following ThreadPoolExecutor:
public static ExecutorService newCachedThreadPool(ThreadFactory threadFactory) {
    return new ThreadPoolExecutor(0, 2147483647, 60L, TimeUnit.SECONDS, new SynchronousQueue<Runnable>(), threadFactory);
}
where 2147483647 is the maximum pool size and 60L is the idle timeout (TTL) in seconds. This means the executor service keeps idle threads around for 60 seconds; any thread that is not re-used within 60 seconds gets cleaned up.
High CPU in this case is actually expected: 2500 threads are sharing a few CPU cores. It also matters where your SQL database is running; if it runs on the same machine, your CPU has an even harder job to do. If the operations you run on the blocking thread pool consume significant CPU time, then you have to optimize those blocking operations. Ratpack's power comes from its async, non-blocking architecture: handlers run on the ratpack-compute thread pool and delegate all blocking operations to ratpack-blocking, so your application is not blocked and can handle tons of requests.
(*) Unlimited in this case means limited by available memory, or, if you have enough memory, limited to 2147483647 threads (the value used in Executors.newCachedThreadPool(factory)).
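To illustrate that delegation, a minimal sketch of a handler (userDao is a hypothetical DAO, not part of Ratpack): the handler itself runs on a ratpack-compute thread, while the query is shifted to the ratpack-blocking pool via the Blocking class the question already mentions.

import ratpack.exec.Blocking;
import ratpack.handling.Context;

public void handle(Context ctx) {
    // The blocking JDBC call runs on a ratpack-blocking thread...
    Blocking.get(() -> userDao.findById(ctx.getPathTokens().get("id")))
        // ...and the continuation runs back on a compute thread.
        .then(user -> ctx.render(user.toString()));
}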
Just to build on Szymon's answer…
Ratpack doesn't inherently throttle any operations. That's effectively up to you. One option you have is to use Throttle (https://ratpack.io/manual/current/api/ratpack/exec/Throttle.html) to constrain and queue access to a resource.
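A minimal sketch of that option, building on the snippet above (the size of 20 is an arbitrary example):

import ratpack.exec.Blocking;
import ratpack.exec.Throttle;

// At most 20 blocking operations in flight; the rest queue up.
Throttle throttle = Throttle.ofSize(20);

Blocking.get(() -> userDao.findById(id))
    .throttle(throttle)   // constrain and queue access to the database
    .then(user -> ctx.render(user.toString()));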
Recently I tried to use G1GC from jdk1.7.0-17 in my Java processor, which handles a lot of similar messages received from an MQ (about 15-20 req/sec). Every message is processed in a separate thread (about 100 threads in the stable state), serviced by a bounded Java thread pool. Surprisingly, I detected strange behaviour: as soon as the GC starts a full GC cycle, it begins to consume significant processing time (up to 100% CPU and even more). I refactored the code several times with the goal of optimizing it and making it more lightweight, but without any significant result; the behaviour is the same. I use a 4-core, 64-bit machine with Debian (2.6.32-5 kernel). Can someone help me understand and resolve the situation?
Surprisingly, I detected strange behaviour: as soon as the GC starts
the full GC cycle...
Unfortunately, this is not a surprise: the G1 GC implemented in this JVM uses just one hardware thread (vCPU) to execute the Full GC, so the idea is to minimize the number of Full GCs. Keep in mind that this collector is recommended for configurations with several cores (which does not affect the Full GC, but does affect allocation and the parallel collections) and big heaps, I believe bigger than 8 GB.
According to Oracle:
https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/g1_gc.html
The Garbage-First (G1) garbage collector is a server-style garbage
collector, targeted for multiprocessor machines with large memories.
It attempts to meet garbage collection (GC) pause time goals with high
probability while achieving high throughput. Whole-heap operations,
such as global marking, are performed concurrently with the
application threads. This prevents interruptions proportional to heap
or live-data size.
This article explains the single-threaded Full GC in this collector:
https://www.redhat.com/en/blog/part-1-introduction-g1-garbage-collector
Finally and unfortunately, G1 also has to deal with the dreaded Full
GC. While G1 is ultimately trying to avoid Full GC’s, they are still a
harsh reality especially in improperly tuned environments. Given that
G1 is targeting larger heap sizes, the impact of a Full GC can be
catastrophic to in-flight processing and SLAs. One of the primary
reasons is that Full GCs are still a single-threaded operation in G1.
Looking at causes, the first, and most avoidable, is related to
Metaspace.
By the way, it seems the newest version of Java (10) will include a G1 capable of executing Full GCs in parallel.
https://www.opsian.com/blog/java-10-with-g1/
Java 10 reduces Full GC pause times by iteratively improving on its
existing algorithm. Until Java 10 G1 Full GCs ran in a single thread.
That’s right - your 32 core server and it’s 128GB will stop and pause
until a single thread takes out the garbage.
Perhaps you should tune the metaspace, increase the heap, or use another collector such as the parallel GC.
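A hedged starting point for that experiment on the asker's JDK 7, where the Metaspace equivalent is PermGen, sized with -XX:MaxPermSize (all values below are placeholders to be sized against the workload, and processor.jar is a stand-in name):

java -Xms4g -Xmx4g -XX:MaxPermSize=256m -XX:+UseParallelGC -verbose:gc -XX:+PrintGCDetails -Xloggc:gc.log -jar processor.jar

The GC log will also show what triggers each Full GC, which is worth checking before (or instead of) switching collectors.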
I need to change the thread pool size of the JVM. Is there any way to do this?
I am running a highly threaded jar on the JVM. That's why some threads go into sleeping or blocked states.
The JVM doesn't have a global thread pool per se. If you are using one of the java.util.concurrent.Executor implementations, read up on the javadoc for that class/method. Pool size is adjusted in Java code, per pool that you have created, and is not a JVM setting.
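For instance, if the pool comes from the Executors factory methods, a minimal sketch of resizing it at runtime (the sizes are arbitrary examples):

import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;

// The size is chosen when the pool is created...
ThreadPoolExecutor pool = (ThreadPoolExecutor) Executors.newFixedThreadPool(16);

// ...and can be changed later; raise the maximum first so the core size never exceeds it.
pool.setMaximumPoolSize(32);
pool.setCorePoolSize(32);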
That said, consider that each thread typically consumes 512 KB of virtual memory for its stack, which limits the maximum number of available threads on a 32-bit JVM (but it doesn't sound like this is your problem at all).
When your threads block a lot, you probably also have some kind of contention, meaning there is some common resource they are waiting for. Perhaps you are using "synchronized" a lot? More threads won't solve that problem; they will just consume more resources in the OS and JVM.
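As a tiny illustration of that kind of contention (a made-up class, not your code): every thread calling increment() below queues on the same intrinsic lock, so adding threads adds waiting time rather than throughput.

// All threads serialise on this object's single lock.
class Counter {
    private long value;
    synchronized void increment() { value++; }
}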
Please get back with a bit more detail about what your code is doing and how, and perhaps I can help a bit more.