How can I limit the number of threads that are executing at the same time?
Here is a sample of my algorithm:
    for (i = 0; i < 100000; i++) {
        Thread.start {
            // Do some work
        }
    }
I would like to make sure that once the number of threads in my application hits 100, the algorithm will pause/wait until the number of threads in the app goes below 100.
Currently "some work" takes some time to do and I end up with a few thousand threads in my app. Eventually it runs out of threads and "some work" crashes. I would like to fix it by limiting the number of threads that it can use at one time.
Please let me know how to solve my issue.

I believe you are looking for a ThreadPoolExecutor in the Java Concurrency API. The idea here is that you can define a maximum number of threads in a pool and then instead of starting new Threads with a Runnable, just let the ThreadPoolExecutor take care of managing the upper limit for Threads.
Start here:
    import java.util.concurrent.*
    import java.util.*
    def queue = new ArrayBlockingQueue<Runnable>(50000)
    def tPool = new ThreadPoolExecutor(5, 500, 20, TimeUnit.SECONDS, queue)
    for (i = 0; i < 5000; i++) {
        tPool.execute {
            println "Blah"
        }
    }
    tPool.shutdown() // let the JVM exit once the queued tasks have run
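Parameters for the ThreadPoolExecutor constructor: corePoolSize (5) is the number of threads to create and keep alive even when the pool is idle; maximumPoolSize (500) is the maximum number of threads to create; the third and fourth arguments say that idle threads above the core size are terminated after 20 seconds; and the queue argument is a blocking queue that stores queued tasks. Note that the pool only grows beyond corePoolSize once the queue is full, so with a 50,000-slot queue and 5,000 tasks this example keeps running on 5 threads.
What you'll want to play around with is the queue size and how rejected tasks are handled. If you need to execute 100k tasks, you'll need either a queue that can hold 100k tasks or a strategy for handling rejected tasks, for example a RejectedExecutionHandler such as ThreadPoolExecutor.CallerRunsPolicy, which runs the rejected task on the submitting thread and thereby slows the submitting loop down.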


Increasing parallelism level of scala .par operations

When I call par on collections, it seems to create about 5-10 threads, which is fine for CPU bound tasks.
But sometimes I have tasks which are IO bound, in which case I'd like to have 500-1000 threads pulling from IO concurrently - doing 10-15 threads is very slow and I see my CPUs mostly sitting idle.
How can I achieve this?
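You could wrap your blocking IO operations in a blocking block (scala.concurrent.blocking):
    (0 to 1000).par.map { i =>
      blocking {
        Thread.sleep(100)    // stand-in for a blocking IO call (body elided in the post; assumed)
        Thread.activeCount() // assumed: report the number of live threads
      }
    }.max // yields 67 on my PC, while without blocking it's 10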
But you should ask yourself whether parallel collections are the right tool for IO operations. Their intended use case is CPU-heavy work.
I would suggest considering futures for IO calls.
You should also consider using a custom execution context for that task, because the global execution context is a public singleton and you have no control over what code uses it or for what purpose. You could easily starve parallel computations created by external libraries if you used up all of its threads.
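    import java.util.concurrent.Executors
    import scala.concurrent.{ExecutionContext, ExecutionContextExecutor, Future, blocking}
    import scala.util.{Failure, Success}
    // or just use ExecutionContext.global if you don't care; the cached pool is assumed from the comment below
    implicit val blockingIoEc: ExecutionContextExecutor = ExecutionContext.fromExecutor(Executors.newCachedThreadPool())
    def fetchData(index: Int): Future[Int] = Future {
      // if you use the global ec, the computation must be marked as blocking to increase the thread count;
      // if you use a custom cached thread pool, it should increase the thread count even without it
      blocking {
        Thread.sleep(100)    // stand-in for a blocking IO call (body elided in the post; assumed)
        Thread.activeCount() // assumed: report the number of live threads
      }
    }
    val futures = (0 to 1000).map(fetchData)
    Future.sequence(futures).onComplete {
      case Success(data) => println(data.max) // prints about 1000 on my PC
      case Failure(e)    => e.printStackTrace()
    }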
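The "dedicated pool for blocking calls" idea is not specific to Scala. As a rough, language-neutral illustration, here is a Python sketch that keeps blocking work on its own executor instead of a shared one. The names `blocking_io_pool`, `fetch_data` and `fetch_all` are hypothetical, the pool size of 64 is arbitrary, and the sleep merely stands in for a real IO call.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pool reserved for blocking calls; 64 is an arbitrary illustrative size.
blocking_io_pool = ThreadPoolExecutor(max_workers=64, thread_name_prefix="blocking-io")


def fetch_data(index, delay=0.05):
    """Stand-in for a blocking IO call: sleep, then report how many threads are alive."""
    time.sleep(delay)
    return threading.active_count()


def fetch_all(count):
    # map() submits every task up front; at most max_workers of them run at once
    return list(blocking_io_pool.map(fetch_data, range(count)))


if __name__ == "__main__":
    print(max(fetch_all(200)))  # peak number of live threads seen by the tasks
```

Sizing such a pool is workload dependent: it should reflect how many concurrent IO operations the remote side can sustain, not the number of CPU cores.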
You can also give a parallel collection a custom ForkJoinPool through ForkJoinTaskSupport:
import java.util.concurrent.ForkJoinPool //scala.concurrent.forkjoin.ForkJoinPool is deprecated
import scala.util.Random
import scala.collection.parallel
val fjpool = new ForkJoinPool(2)
val customTaskSupport = new parallel.ForkJoinTaskSupport(fjpool)
val numbers = List(1,2,3,4,5).par
numbers.tasksupport = customTaskSupport //assign customTaskSupport

Why does this program run faster when it's allocated fewer threads?

I have a fairly simple Go program designed to compute random Fibonacci numbers to test some strange behavior I observed in a worker pool I wrote. When I allocate one thread, the program finishes in 1.78s. When I allocate 4, it finishes in 9.88s.
The code is as follows:
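    var workerWG sync.WaitGroup
    func worker(fibNum chan int) {
        for {
            var tgt = <-fibNum
            workerWG.Add(1) // the WaitGroup calls are referenced in the answers; their placement here is assumed
            var a, b float64 = 0, 1
            for i := 0; i < tgt; i++ {
                a, b = a+b, a
            }
            workerWG.Done()
        }
    }
    func main() {
        runtime.GOMAXPROCS(1)
        var fibNum = make(chan int)
        for i := 0; i < 4; i++ {
            go worker(fibNum)
        }
        for i := 0; i < 500000; i++ {
            fibNum <- rand.Intn(1000)
        }
        workerWG.Wait()
    }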
If I replace runtime.GOMAXPROCS(1) with 4, the program takes four times as long to run.
What's going on here? Why does adding more available threads to a worker pool slow the entire pool down?
My personal theory is that it has to do with the processing time of the worker being less than the overhead of thread management, but I'm not sure. My reservation is caused by the following test:
When I replace the worker function with the following code:
    for {
        time.Sleep(500 * time.Millisecond)
    }
both one available thread and four available threads take the same amount of time.
I revised your program to look like the following:
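    package main
    import (
        "math/rand"
        "sync"
    )
    var workerWG sync.WaitGroup
    func worker(fibNum chan int) {
        for tgt := range fibNum {
            var a, b float64 = 0, 1
            for i := 0; i < tgt; i++ {
                a, b = a+b, a
            }
        }
        workerWG.Done() // assumed placement: once per worker, after the channel is closed
    }
    func main() {
        var fibNum = make(chan int)
        for i := 0; i < 4; i++ {
            workerWG.Add(1) // assumed placement: once per worker, before it starts
            go worker(fibNum)
        }
        for i := 0; i < 500000; i++ {
            fibNum <- rand.Intn(100000)
        }
        close(fibNum) // ends each worker's range loop
        workerWG.Wait()
    }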
I cleaned up the wait group usage.
I changed rand.Intn(1000) to rand.Intn(100000)
On my machine that produces:
$ time go run threading.go (GOMAXPROCS=1)
real 0m20.934s
user 0m20.932s
sys 0m0.012s
$ time go run threading.go (GOMAXPROCS=8)
real 0m10.634s
user 0m44.184s
sys 0m1.928s
This means that in your original code, the work performed was negligible compared to the synchronization (channel read/write). The slowdown came from having to synchronize across several threads instead of one while performing only a very small amount of work in between.
In essence, synchronization is expensive compared to calculating Fibonacci numbers up to 1000. This is why people tend to discourage micro-benchmarks. Upping that number gives a better perspective. But an even better idea is to benchmark the actual work being done, i.e. including IO, syscalls, processing, crunching, writing output, formatting, etc.
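Besides making each item more expensive, another common way to raise the ratio of work to synchronization is to hand workers a batch of items per queue operation. That is a swapped-in technique, not something this answer proposes. The Python sketch below only illustrates the structure: `run`, `fib_steps` and `chunk_size` are names invented for the example, and CPython threads will not show the multi-core speedup measured above because of the global interpreter lock.

```python
import queue
import threading


def fib_steps(n):
    """Iterate the Fibonacci recurrence n times (the cheap unit of work)."""
    a, b = 0, 1
    for _ in range(n):
        a, b = a + b, a
    return a


def run(targets, workers=4, chunk_size=1):
    """Process targets on worker threads; return (sorted results, number of queue handoffs)."""
    work = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            chunk = work.get()
            if chunk is None:  # sentinel: no more work
                return
            local = [fib_steps(n) for n in chunk]
            with lock:  # one lock acquisition per chunk, not per item
                results.extend(local)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    handoffs = 0
    for start in range(0, len(targets), chunk_size):
        work.put(targets[start:start + chunk_size])
        handoffs += 1
    for _ in threads:
        work.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return sorted(results), handoffs


if __name__ == "__main__":
    _, handoffs = run(list(range(1000)), chunk_size=100)
    print(handoffs)  # 10 queue handoffs instead of 1000
```

With `chunk_size=1` every item costs one queue operation and one lock acquisition; larger chunks amortize both over many items.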
Edit: As an experiment, I upped the number of workers to 8 with GOMAXPROCS set to 8 and the result was:
$ time go run threading.go
real 0m4.971s
user 0m35.692s
sys 0m0.044s
The code written by @thwd is correct and idiomatic Go.
Your code was being serialized by the atomic updates inside sync.WaitGroup. Both workerWG.Add(1) and workerWG.Done() have to update the internal counter atomically, so concurrent calls contend with each other.
Since the workload is between 0 and 1000 loop iterations, running on a single core was enough to keep contention on the waitgroup counter to a minimum.
On multiple cores, the processor spends a lot of time spinning to resolve the collisions between waitgroup calls. Add to that the fact that the waitgroup counter is kept on one core, and you now have added communication between cores (taking up even more cycles).
A couple hints for simplifying code:
For a small, fixed number of goroutines, a completion channel (chan struct{} to avoid allocations) is cheaper to use.
Use closing of the send channel as a kill signal for goroutines and have them signal that they've exited (waitgroup or channel). Then close the completion channel to free them up for the GC.
If you need a waitgroup, aggressively minimize the number of calls to it. Those calls must be internally serialized, so extra calls force added synchronization.
Your main computation routine in worker does not allow the scheduler to run.
Calling the scheduler manually, like
    for i := 0; i < tgt; i++ {
        a, b = a+b, a
        if i%300 == 0 {
            runtime.Gosched()
        }
    }
reduces wall clock time by 30% when switching from one to two threads.
Such artificial microbenchmarks are really hard to get right.

Address certain core for threads in Perl

I have a list of 40 files, which I want to modify through my script.
Since every file is processed in the same way, I want to use threads to speed it up.
So I have this construct:
    my $threads_ = sub
    {
        while (defined(my $taskRef = $q->dequeue()))
        {
            my $work = shift(@{$taskRef});
            my $open = $q->open() - 1;
        }
    };
    my @Working;
    for (my $i = 1; $i < 8; $i++)
    {
        push @Working, threads->new($threads_);
    }
And I have this code for starting a thread for every file:
    foreach my $File (@Filelist)
But it still takes way too long.
My question is: is there a way to assign each thread to a single core in order to speed it up?
I'd use Parallel::ForkManager for something like this; it works great. I'd recommend not brewing your own when an accepted standard solution exists. By "address certain core", I take it to mean your purpose is to limit the number of concurrent tasks to the number of available processors and ForkManager will do this for you -- just set the max number of processes when you initialize your ForkManager object.
The commenters above were absolutely correct to point out that I/O will eventually limit your throughput, but it's easy enough to determine when adding more processes fails to speed things up.

Design pattern for asynchronous while loop

I have a function that boils down to:
    while (doWork) {
        config = generateConfigurationForTesting();
        result = executeWork(config);
        doWork = !isDone(result);
    }
How can I rewrite this for efficient asynchronous execution, assuming all functions are thread safe, independent of previous iterations, and probably require more iterations than the maximum number of allowable threads?
The problem here is that we don't know in advance how many iterations are required, so we can't make a dispatch_group or use dispatch_apply.
This is my first attempt, but it looks a bit ugly to me because of the arbitrarily chosen values and the sleeping:
    int thread_count = 0;
    bool doWork = true;
    int max_threads = 20; // arbitrarily chosen number
    dispatch_queue_t queue =
        dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    while (doWork)
    {
        if (thread_count < max_threads)
        {
            thread_count++; // increment referenced in the answer below; placement assumed
            dispatch_async(queue, ^{ Config myconfig = generateConfigurationForTesting();
                                     Result myresult = executeWork(myconfig);
                                     dispatch_async(queue, ^{ checkResult(myresult); }); });
        }
        usleep(100); // don't consume too much CPU
    }
    void checkResult(Result value)
    {
        thread_count--; // decrement referenced in the answer below; placement assumed
        if (value == good) doWork = false;
    }
Based on your description, it looks like generateConfigurationForTesting is some kind of randomization technique or otherwise a generator that can produce a near-infinite number of configurations (hence your comment that you don't know ahead of time how many iterations you will need). With that as an assumption, you are basically stuck with the model that you've created: your executor needs to be limited by some reasonable assumptions about the queue, and you don't want to over-generate, as that would just extend the length of the run after you have succeeded in finding a value == good result.
I would suggest you consider using a queue (or OSAtomicIncrement* and OSAtomicDecrement*) to protect access to thread_count and doWork. As it stands, the thread_count increment and decrement will happen in two different queues (main_queue for the main thread and the default queue for the background task) and thus could simultaneously increment and decrement the thread count. This could lead to an undercount (which would cause more threads to be created than you expect) or an overcount (which would cause you to never complete your task).
Another option to making this look a little nicer would be to have checkResult add new elements into the queue if value!=good. This way, you load up the initial elements of the queue using dispatch_apply( 20, queue, ^{ ... }) and you don't need the thread_count at all. The first 20 will be added using dispatch_apply (or an amount that dispatch_apply feels is appropriate for your configuration) and then each time checkResult is called you can either set doWork=false or add another operation to queue.
dispatch_apply() works for this, just pass ncpu as the number of iterations (apply never uses more than ncpu worker threads) and keep each instance of your worker block running for as long as there is more work to do (i.e. loop back to generateConfigurationForTesting() unless !doWork).
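The "fixed number of workers, each looping until someone succeeds" shape described above can be sketched outside GCD as well. The Python version below is only an illustration under stated assumptions: `search`, `make_config`, `execute` and `is_good` are hypothetical stand-ins for the question's functions, and a `threading.Event` plays the role of the shared `doWork` flag, so no thread counter or sleeping is needed.

```python
import os
import random
import threading


def search(is_good, make_config, execute, workers=None):
    """Run `workers` loops; each keeps generating and executing configs until one result is good."""
    workers = workers or os.cpu_count() or 1
    done = threading.Event()  # set once a good result is found (doWork = false)
    found = []
    found_lock = threading.Lock()

    def loop():
        while not done.is_set():
            result = execute(make_config())
            if is_good(result):
                with found_lock:
                    found.append(result)
                done.set()  # tell every other worker to stop

    threads = [threading.Thread(target=loop) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return found[0]


if __name__ == "__main__":
    # Hypothetical stand-ins for generateConfigurationForTesting / executeWork / isDone
    hit = search(
        is_good=lambda r: r % 97 == 0,
        make_config=lambda: random.randint(1, 10_000),
        execute=lambda cfg: cfg * 3,
    )
    print(hit)
```

Workers that are mid-iteration when the event is set finish their current item and then exit, so at most one extra execution per worker is wasted after success.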

Limiting object allocation over multiple threads

I have an application which retrieves and caches the results of a client's query and sends the results out to the client from a cache.
I have a limit on the number of items which may be cached at any one time, and keeping track of this limit has drastically reduced the application's performance when processing a large number of concurrent requests. Is there a better way to solve this problem, without locking so often, that might improve performance?
Edit: I've gone with the CAS approach and it seems to work pretty well.
First, rather than using a lock, use atomic decrements and compare-and-exchange to manipulate your counter. The syntax for this varies with your compiler; in GCC you might do something like:
    long remaining_cache_slots;
    void release() {
        __sync_add_and_fetch(&remaining_cache_slots, 1);
    }
    // Returns false if we've hit our cache limit
    bool acquire() {
        long prev_value, new_value;
        do {
            prev_value = remaining_cache_slots;
            if (prev_value <= 0) return false;
            new_value = prev_value - 1;
        } while (!__sync_bool_compare_and_swap(&remaining_cache_slots, prev_value, new_value));
        return true;
    }
This should help reduce the window for contention. However, you'll still be bouncing that cache line all over the place, which at a high request rate can severely hurt your performance.
If you're willing to accept a certain amount of waste (ie, allowing the number of cached results - or rather, pending responses - to go slightly below the limit), you have some other options. One is to make the cache thread-local (if possible in your design). Another is to have each thread reserve a pool of 'cache tokens' to use.
What I mean by reserving a pool of cache tokens is that each thread can reserve ahead of time the right to insert N entries into the cache. When that thread removes an entry from the cache it adds it to its set of tokens; if it runs out of tokens, it tries to get them from a global pool, and if it has too many, it puts some of them back. The code might look a bit like this:
    #include <algorithm> // std::min
    long global_token_pool;
    __thread long thread_local_token_pool = 0;
    // Release 10 tokens to the global pool when we go over 20
    // The maximum waste for this scheme is 20 * nthreads
    const long THREAD_TOKEN_POOL_HIGHWATER = 20;
    const long THREAD_TOKEN_POOL_RELEASECT = 10;
    // If we run out, acquire 5 tokens from the global pool
    const long THREAD_TOKEN_POOL_ACQUIRECT = 5;
    void release() {
        thread_local_token_pool++; // the freed entry becomes a local token
        if (thread_local_token_pool > THREAD_TOKEN_POOL_HIGHWATER) {
            thread_local_token_pool -= THREAD_TOKEN_POOL_RELEASECT;
            __sync_fetch_and_add(&global_token_pool, THREAD_TOKEN_POOL_RELEASECT);
        }
    }
    bool acquire() {
        if (thread_local_token_pool > 0) {
            thread_local_token_pool--; // spend a local token, no shared access needed
            return true;
        }
        long prev_val, new_val, acquired;
        do {
            prev_val = global_token_pool;
            acquired = std::min(THREAD_TOKEN_POOL_ACQUIRECT, prev_val);
            if (acquired <= 0) return false;
            new_val = prev_val - acquired;
        } while (!__sync_bool_compare_and_swap(&global_token_pool, prev_val, new_val));
        thread_local_token_pool = acquired - 1; // one token is consumed by this call
        return true;
    }
Batching up requests like this reduces the frequency at which threads access shared data, and thus the amount of contention and cache churn. However, as mentioned before, it makes your limit a bit less precise, and so requires careful tuning to get the right balance.
In SendResults, try updating totalResultsCached only once after you process the results. This will minimize the time spent acquiring/releasing the lock.
    void SendResults( int resultsToSend, Request *request )
    {
        for (int i=0; i<resultsToSend; ++i)
        {
            // send one result (loop body not shown in the post)
        }
        lock totalResultsCached
        totalResultsCached -= resultsToSend;
        unlock totalResultsCached
    }
If resultsToSend is typically 1, then my suggestion will not make much of a difference.
Also, after hitting the cache limit, some extra requests may be dropped in ResultCallback, because SendResults is not updating totalResultsCached immediately after sending each request.
