I need to change the thread pool size of JVM . Is there any possibility to do this.
I am running a high threaded jar on JVM. Thats why the some threads goes under sleep or block stages.
The jvm doesn't have a global thread pool per se. If you are using one of the java.util.concurrent.Executor-implementations, read up on the javadoc for that class/method. This is adjusted in java-code per pool you have created (from your code) and is not related to the JVM.
That said, please consider that each thread (typically) consumes 512k of virtual memory for it's stack, which limits the number of maximum available threds for a 32-bit jvm (but it doesn't sound like this is your problem at all).
When your threads block a lot you probably also have some kind of contention, meaning that you have some common resource that they are waiting for. Perhaps you are you using "synchronized" a lot? More threads won't solve that problem, but rather just consume more resources in the OS and JVM.
Please get back with a bit more details of what your code is doing and how, and perhaps I can help a bit more.
Related
My understanding of threads is that you can only have one thread per core, two with hyper threading, before you start losing efficiency.
This computer has eight cores and so should work best with 8/16 threads then, yet many applications use several times that, especially Dropbox.
It also uses 95 threads while idling on my laptop, which only has 4 cores.
Why is this the case? Does it have so many threads for programming convenience, have I misunderstood threading efficiency or is it something else entirely?
I took a peek at the Mac version of the client, and it seems to be written in Python and it uses several frameworks.
A bunch of threads seem to be used in some in house actor system
They use nucleus for app analytics
There seems to be a p2p network
some networking threads (one per hype core)
a global pool (one per physical core)
many threads for file monitoring and thumbnail generation
task schedulers
logging
metrics
db checkpointing
something called infinite configuration
etc.
Most are idle.
It looks like a hodgepodge of subsystems, each starting their own threads, but they don't seem too expensive in terms of memory or CPU.
My understanding of threads is that you can only have one thread per core, two with hyper threading, before you start losing efficiency.
Nope, this is not true. I'm not sure why you think that, but it's not true.
As just the most obvious way to show that it's false, suppose you had that number of threads and one of them accessed a page of memory that wasn't in RAM and had to be loaded to disk. If you don't have any other threads that can run, then one core is wasted for the entire time it takes to read that page of memory from disk.
It's hard to address the misconception directly without knowing what flawed chain of reasoning led to it. But the most common one is that if you have more threads ready-to-run than you can execute at once, then you have lots of context switches and context switches are expensive.
But that is obviously wrong. If all the threads are ready-to-run, then no context switches are necessary. A context switch is only necessary if a running thread stops being ready-to-run.
If all context switches are voluntary, then the implementation can select the optimum number of context switches. And that's precisely what it does.
Having large numbers of threads causes you to lose efficiency if, and only if, lots of threads do a small amount of work and then become no longer ready-to-run while other waiting threads are ready-to-run. That forces the implementation to do a context even where it is not optimal.
Some applications that use lots of threads do in fact do this. And that does result in poor performance. But Dropbox doesn't.
Suppose I have a multi-threaded application (say ~40 threads) running on a multiprocessor system (say 8 cores) with Linux as the operating system where different threads are more essentially LWP (Light Weight Processes) being scheduled by the kernel.
What would be benefits/drawbacks of using the CPU affinity? Whether CPU affinity is going to help by localizing the threads to a subset of cores thus minimizing cache sharing/misses?
If you use strict affinity, then a particular thread MUST run on that processor (or set of processors). If you have many threads that work completely independently, and they work on larger chunks of memory than a few kilobytes, then it's unlikely you'll benefit much from running on one particular core - since it's quite possible the other threads running on this particular CPU would have thrown out any L1 cache, and quite possibly L2 caches too. Which is more important for performance - cahce content or "getting to run sooner"? Are some CPU's always idle, or is the CPU load 100% on every core?
However, only you know (until you tell us) what your threads are doing. How big is the "working set" (how much memory - code and data) are they touching each time they get to run? How long does each thread run when they are running? What is the interaction with other threads? Are other threads using shared data with "this" thread? How much and what is the pattern of sharing?
Finally, the ultimate answer is "What makes it run faster?" - an answer you can only find by having good (realistic) benchmarks and trying the different possible options. Even if you give us every single line of code, running time measurements for each thread, etc, etc, we could only make more or less sophisticated guesses - until these have been tried and tested (with VARYING usage patterns), it's almost impossible to know.
In general, I'd suggest that having many threads either suggest that each thread isn't very busy (CPU-wise), or you are "doing it wrong"... More threads aren't better if they are all running flat out - better to have fewer threads in that case, because they are just going to fight each other.
The scheduler already tries to keep threads on the same cores, and to avoid migrations. This suggests that there's probably not a lot of mileage in managing thread affinity manually, unless:
you can demonstrate that for some reason the kernel is doing a bad a job for your particular application; or
there's some specific knowledge about your application that you can exploit to good effect.
localizing the threads to a subset of cores thus minimizing cache
sharing/misses
Not necessarily, you have to consider cache coherence too, if two or more threads access a shared memory buffer and each one is bound to a different CPU core their caches have to be synchronized if one thread writes to a shared cache line there will be a significant overhead to invalidate other caches.
I'm currently learning about actors in Scala. The book recommends using the react method instead of receive, because it allows the system to use less threads.
I have read why creating a thread is expensive. But what are the reasons that, once you have the threads (which should hold for the actor system in Scala after the initialization), having them around is expensive?
Is it mainly the memory consumption? Or are there any other reasons?
Using many threads may be more expensive than you would expect because:
each thread consumes memory outside of heap which places restriction on how many threads can be created at all for JVM;
switch from one thread to another consumes some CPU time, so if you have activity which can be performed in a single thread, you will save CPU cycles;
there's JVM scheduler which has more work to do if there are more threads. Same applies to underlying OS scheduler;
at last, it makes little sense to use more threads than you have CPU cores for CPU-bound tasks and it makes little sense to use more I/O threads than you have I/O activities (e.g., network clients).
Besides the memory overhead of having a thread around (which may or may not be small), having more threads around will also usually mean that the schedule will have more elements to consider when the time comes for it to pick which thread will get the CPU next.
Some Operating Systems / JVMs may also have constraints on the amount of threads that can concurrently exist.
Eventually, it's an accumulation of small overheads that can eventually account to a lot. And none of this is actually specific to Java.
Having threads around is not "expensive". Of course, it kinda depends on how many we're talking about here. I'd suspect billions of threads would be a problem. I think generally speaking, having a lot of threads is considered expensive because you can do more parallel work so CPU goes up, memory goes up, etc... But if they are correctly managed (pooled for example to protect the system resources) then it's ok. The JVM does not necessarily use native threads so a Java thread is not necessarily mapped to an OS native threads (i.e. look at green threads for example, or lightweight threads). In my opinion, there's no implicit cost to threads in the JVM. The cost comes from poor thread management and overuse of the resources by carelessly assigning them work.
I saw in one application (java) which has irregular memory usage in a Solaris box. When I took a thread dump, I saw there were 31 "GC task thread"...
This is very strange as in other Solaris box, same application only had 2 "GC task thread".
Wondered if anybody know, under which circumstance, the jvm will create so many GC task threads and could this cause memory issue?
Any ideas is appreciated.
Some more background on my case:
Each time I will have two similar Java applications running at the same time in same box. I will keep sending requests to application A , and no request to application B. So, app B should be in-active. And it is alway "sleep" when using prstat.
Strange thing is, in one Solaris box, app B keeps consuming memory while app A is processing request. And in app B's thread dump, I can see 31 GC task threads.
And in another Solaris box, app B is normal, the memory is normal and only 2 GC task threads.
Thanks a lot.
The GC task thread is related with the parallel garbage collector behaviour.
The value of number of parallel GC threads are defined by the
-XX:ParallelGCThreads=n
command line parameter. In the documentation of Java hotspot VM, it says:
-XX:ParallelGCThreads=n
Sets the number of threads used during parallel phases of the garbage
collectors. The default value varies with the platform on which the
JVM is running.
(see: http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html)
The documentation does not explicitly state, but the default value varies with the number of processors/cores the platform has, and I think the number is equal to the number of CPU cores.
So far, this was the answer of "why JVM creates that many GC task threads.
Does this cause any memory issues? The answer is no. In fact, each of the GC threads will have very low memory footprints, and thus they are not expected to cause any considerable memory issues. However, you may want to finetune the number of garbage collection threads by using the parameter described above.
I've just started programming with POSIX threads on dual-core x86_64 Linux system. It seems that 256 threads is about the optimum for performance with the way I've done it. I'm wondering how this could be? And if it could mean that my approach is wrong and a better approach would require far fewer threads and be just as fast or faster?
For further background (the program in question is a skeleton for a multi-threaded M-set image generator) see the following questions I've asked already:
Using threads, how should I deal with something which ideally should happen in sequential order?
How can my threaded image generating app get it’s data to the gui?
Perhaps I should mention that the skeleton (in which I've reproduced minimal functionality for testing and comparison) is now displaying the image, and the actual calculations are done almost twice as fast as the non-threaded program.
So if 256 threads running faster than 8 threads is not indicative of a poor approach to threading, how come 256 threads does outperform 8 threads?
The speed test case is a portion of the Mandelbrot Set located at:
xmin -0.76243636067708333333333328
xmax -0.7624335575810185185185186
ymax 0.077996663411458333333333929
calculated to a maximum of 30000 iterations.
On the non-threaded version rendering time on my system is around 15 seconds. On the threaded version, averages speed for 8 threads is 7.8 seconds, while 256 threads is 7.6 seconds.
Well, probably yes, you're doing something wrong.
However, there are circumstances where 256 threads would run better than 8 without you necessarily having a bad threading model. One must remember that having 8 threads does not mean all 8 threads are actually running all the time. Anytime one thread makes a blocking syscall to the operating system, the thread will stop running and wait for the result. In the meantime, another thread can often do work.
There's this myth that one can't usefully use more threads than contexts on the CPU, but that's just not true. If your threads block on a syscall, it can be critical to have another thread available to do more work. (In practice when threads block there tends to be less work to do, but this is not always the case.)
It's all very dependent on work-load and there's no one right number of threads for any particular application. Generally you never want less threads available than the OS will run, and that's the only true rule. (Unfortunately this can be very hard to find out and so people tend to just fire up as many threads as contexts and then use non-blocking syscalls where possible.)
Could it be your app is io bound? How is the image data generated?
A performance improvement gained by allocating more threads than cores suggests that the CPU is not the bottleneck. If I/O access such as disk, memory or even network access are involved your results make perfect sense.
You are probably benefitting from Simultaneous Multithreading (SMT). Your operating system schedules more threads than cores available, and will swap in and out the threads that are not stalled waiting for resources (such as a memory load). This can very effectively hide the latencies of your memory system from your program and is the technique used to great effect for massive parallelization in CUDA for general purpose GPU programming.
If you are seeing a performance increase with the jump to 256 threads, then what you are probably dealing with is a resource bottleneck. At some point, your code is waiting for some slow device (a hard disk or a network connection, for example) in order to continue. With multiple threads, waiting on this slow device isn't a problem because instead of sitting idle and twiddling its electronic thumbs, the CPU can process another thread while the first thread is waiting on the slow device. The more parallel threads that are running, the more work the CPU can do while it is waiting on something else.
If you are seeing performance improve all the way up to 256 threads, I am tempted to say that you have a major performance bottleneck somewhere and it's not the CPU. To test this, try to see if you can measure the idle time of individual threads. I suspect that you will see your threads are stuck in a "blocked" or "waiting" state for a longer portion of their lifetime than they spend in the "running" or "active" state. Some debuggers or function profiling tools will let you do this, and I think there are also Linux tools to do this on the command line.