Is there a Python pool for individual iteration in loops?

Is there any way to run each iteration of a loop in a Python pool?
For example, I tried the following:
def x():
    for i in range(10):
        pass  # work for iteration i goes here
There are 10 iterations (0 to 9). Can we create a pool that runs a separate process for each iteration, i=0, i=1, ... i=9?

The language does not have it in that simple form. But I remember once seeing a small package that provided an iterator for just that. I will try to find it, but I doubt it is still maintained.
Here it is:
And it is available on PyPI:
If it works, it may be a case of 'unmaintained because it is complete'. Just test your code thoroughly.
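Independently of that package, the standard library can get close if the loop body is moved into a function. The following is only a sketch under that assumption: work is a hypothetical stand-in for the loop body, and a pool of 10 workers distributes the iterations among its processes without strictly guaranteeing one process per iteration.

```python
from multiprocessing import Pool


def work(i):
    # Hypothetical loop body: whatever should happen for iteration i.
    return i * i


def x():
    # Ten worker processes share the ten iterations of range(10).
    with Pool(processes=10) as pool:
        return pool.map(work, range(10))


if __name__ == '__main__':
    # The guard keeps child processes from re-running this block
    # on platforms that start workers by re-importing the module.
    print(x())
```

pool.map returns the results in iteration order, even though the iterations themselves may finish in any order.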


Optimising python3 multiprocessing when each process requires multiple threads

I have a list with 12 items (each of which is a sample), and I need to analyse each of these samples using a function. The function is a wrapper for an external pipeline which needs four threads to run. Some pseudocode:
def my_wrapper(sample, other_arg):
    cmd = 'external_pipeline --threads 4 --sample {0} --other {1}'.format(sample, other_arg)
Previously I ran the function for each sample serially using a loop, which worked, but is relatively inefficient given that my CPU has 20 cores. Example code:
sample_list = ['sample_'+str(x) for x in range(1,13)]
for sample in sample_list:
    my_wrapper(sample, other_arg)
I've tried to use multiprocessing to up the efficiency. I've done this successfully in the past using the starmap function from multiprocessing. Example:
import multiprocessing as mp
with mp.Pool(mp.cpu_count()) as pool:
    results = pool.starmap(my_wrapper, [(sample, other_arg) for sample in sample_list])
This approach has worked well previously when the function I'm calling requires only 1 thread/core per process. However, it doesn't seem to work as I naively expect/hope in my current circumstance. There are 12 samples, each needing to be analysed with 4 threads, but I only have 20 threads in total. Accordingly, I'd expect/hope for 5 samples to be run at a time (5 samples * 4 threads for each = 20 threads total). Instead, all samples appear to be analysed simultaneously, with all 20 threads being used, despite 48 threads being required for this to be efficient.
How might I efficiently run these samples so that only 5 are run in parallel (with each of these processes/jobs using 4 threads)? Do I need to specify a chunk size, or am I barking up the wrong tree with this thought?
Apologies for the vague title and post content, I wasn't sure how to word any of it better!
Limiting the number of worker processes in your multiprocessing pool leaves cores free for the threads that each call to your wrapper starts. The pool then does the chunking for you. Note the integer division, since Pool needs a whole number of processes:
with mp.Pool(mp.cpu_count() // num_cores_used_in_wrapper) as pool:
    results = pool.starmap(my_wrapper, [(sample, other_arg) for sample in sample_list])
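To make the arithmetic concrete, here is a minimal sketch of sizing the pool. It assumes a helper named pool_size (not part of the original code), a placeholder value for other_arg, and a stand-in my_wrapper that only builds the command string instead of running the external pipeline.

```python
import multiprocessing as mp

THREADS_PER_JOB = 4  # threads the external pipeline uses per sample


def pool_size(total_cores, threads_per_job):
    """Number of jobs that fit on the machine at once (at least 1)."""
    return max(1, total_cores // threads_per_job)


def my_wrapper(sample, other_arg):
    # Stand-in for the real wrapper: builds the command but does not run it.
    return 'external_pipeline --threads {0} --sample {1} --other {2}'.format(
        THREADS_PER_JOB, sample, other_arg)


if __name__ == '__main__':
    sample_list = ['sample_' + str(x) for x in range(1, 13)]
    other_arg = 'placeholder'  # hypothetical value
    n_workers = pool_size(mp.cpu_count(), THREADS_PER_JOB)
    with mp.Pool(n_workers) as pool:
        results = pool.starmap(my_wrapper, [(s, other_arg) for s in sample_list])
    print(n_workers, len(results))
```

On a 20-core machine this gives 5 workers, so at most 5 samples run at a time and the remaining 7 are picked up as workers become free.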

Performance difference between multithread using queue and futures.ThreadPoolExecutor using list in python3?

I was trying various approaches with python multi-threading to see which one fits my requirements. To give an overview, I have a bunch of items that I need to send to an API. Then based on the response, some of the items will go to a database and all the items will be logged; e.g., for an item if the API returns success, that item will only be logged but when it returns failure, that item will be sent to database for future retry along with logging.
Now, based on the API response, I can separate the successful items from the failed ones and make a batch query with all the failed items, which will improve my database performance. To do that, I am accumulating all requests in one place and trying to perform multithreaded API calls (since this is an IO-bound task, I'm not even thinking about multiprocessing), but at the same time I need to keep track of which response belongs to which request.
Coming to the actual question, I tried two different approaches which I thought would give nearly identical performance, but there turned out to be a huge difference.
To simulate the API call, I created an API on my localhost with a 500ms sleep (for average processing time). Please note that I want to start logging and inserting into the database only after all API calls are complete.
Approach - 1 (with threading.Thread and queue.Queue())
import requests
import datetime
import threading
import queue
def target(data_q):
    while not data_q.empty():
        data = data_q.get()
        response = requests.get("")
if __name__ == "__main__":
    data_q = queue.Queue()
    for i in range(0, 20):
        data_q.put(i)
    start = datetime.datetime.now()
    num_thread = 5
    for _ in range(num_thread):
        worker = threading.Thread(target=target(data_q))
        worker.start()
    print('Time taken multi-threading: '+str(datetime.datetime.now() - start))
I tried this with 5, 10, 20 and 30 items, and the corresponding results are below:
Time taken multi-threading: 0:00:06.625710
Time taken multi-threading: 0:00:13.326969
Time taken multi-threading: 0:00:26.435534
Time taken multi-threading: 0:00:40.737406
What shocked me is that I tried the same thing without multi-threading and got almost the same performance.
Then, after some googling around, I was introduced to the futures module.
Approach - 2 (using concurrent.futures)
import datetime
import requests
from concurrent import futures
def fetch_url(im_url):
    try:
        response = requests.get(im_url)
        return response.status_code
    except Exception as e:
        print(e)
if __name__ == "__main__":
    data = []
    for i in range(0, 20):
        data.append(i)
    start = datetime.datetime.now()
    urls = ["" + str(item) for item in data]
    with futures.ThreadPoolExecutor(max_workers=5) as executor:
        responses = executor.map(fetch_url, urls)
        for ret in responses:
            print(ret)
    print('Time taken future concurrent: ' + str(datetime.datetime.now() - start))
Again I tried with 5, 10, 20 and 30 items, and the corresponding results are below:
Time taken future concurrent: 0:00:01.276891
Time taken future concurrent: 0:00:02.635949
Time taken future concurrent: 0:00:05.073299
Time taken future concurrent: 0:00:07.296873
Now I've heard about asyncio, but I've not used it yet. I've also read that it gives even better performance than futures.ThreadPoolExecutor().
Final question: if both approaches use threads (or so I think), why is there such a huge performance gap? Am I doing something terribly wrong? I looked around but was not able to find a satisfying answer. Any thoughts on this would be highly appreciated. Thanks for going through the question.
[Edit 1] The whole thing is running on Python 3.8.
[Edit 2] Updated code examples and execution times. Now they should run on anyone's system.
The documentation of ThreadPoolExecutor explains in detail how many threads are started when the max_workers parameter is not given, as in your example. The behaviour differs between Python versions, but the number of worker threads started is most probably more than 3, the number of threads in the first version using a queue. You should use futures.ThreadPoolExecutor(max_workers=3) to compare the two approaches.
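For a like-for-like comparison, the worker count can be fixed explicitly. The sketch below is only an illustration: fake_api_call is a hypothetical stand-in for the real request (it sleeps instead of calling an API, and pretends that odd items fail), and the dictionary of futures is one possible way to keep track of which response belongs to which request.

```python
import time
from concurrent import futures


def fake_api_call(item, delay=0.05):
    # Stand-in for requests.get(...): sleep, then pretend odd items fail.
    time.sleep(delay)
    return 200 if item % 2 == 0 else 500


def call_all(items, max_workers=5):
    with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which future was created for which item.
        future_to_item = {executor.submit(fake_api_call, item): item
                          for item in items}
        return {future_to_item[f]: f.result()
                for f in futures.as_completed(future_to_item)}


statuses = call_all(range(10))
failures = sorted(item for item, status in statuses.items() if status != 200)
print(failures)
```

Once all calls are complete, failures holds the items that could go into a single batch query, while every entry in statuses can be logged.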
For the updated Approach - 1, I suggest modifying the for loop a bit:
for _ in range(num_thread):
    target_to_run = target(data_q)
    print('target to run: {}'.format(target_to_run))
    worker = threading.Thread(target=target_to_run)
The output will be like this:
target to run: None
target to run: None
target to run: None
target to run: None
target to run: None
Time taken multi-threading: 0:00:10.846368
The problem is that the Thread constructor expects a callable object or None as its target. You do not give it a callable. Instead, the whole queue is processed by the main thread on the first invocation of target(data_q), and 5 threads are then started that do nothing because their target is None.
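A corrected version of the queue approach could look roughly like the sketch below. It passes the function itself as target and the queue through args. The names fake_api_call, worker and run_with_queue are made up for this illustration, and the API call is simulated with a short sleep so that it runs anywhere.

```python
import queue
import threading
import time


def fake_api_call(item, delay=0.05):
    # Stand-in for requests.get(...); sleeping releases the GIL like real I/O.
    time.sleep(delay)
    return item, 200


def worker(data_q, results):
    while True:
        try:
            item = data_q.get_nowait()
        except queue.Empty:
            return  # queue drained, let the thread finish
        results.append(fake_api_call(item))  # list.append is thread-safe in CPython


def run_with_queue(items, num_threads=5):
    data_q = queue.Queue()
    for item in items:
        data_q.put(item)
    results = []
    threads = [
        # Pass the function object plus its arguments; do not call it here.
        threading.Thread(target=worker, args=(data_q, results))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for all workers before reading the results
    return results


start = time.perf_counter()
results = run_with_queue(range(20))
elapsed = time.perf_counter() - start
print(len(results), round(elapsed, 2))
```

Using get_nowait() with queue.Empty, rather than checking empty() first, avoids a worker blocking forever when another thread takes the last item between the check and the get.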

Threading in Python 3

I am writing Python 3 code with two functions. The first function, insertBlock(), inserts data into MongoDB collection 1; the second, insertTransactionData(), takes data from collection 1 and inserts it into collection 2. The amount of data is very large, so I use threading to improve performance. But with threading, inserting the data takes more time than without it. I am confused about how threading works in my code and how to improve performance. Here is the main function:
if __name__ == '__main__':
    t1 = threading.Thread(target=insertBlock())
    t2 = threading.Thread(target=insertTransactionData())
From the Python documentation for threading:
target is the callable object to be invoked by the run() method. Defaults to None, meaning nothing is called.
So the correct usage is
t1 = threading.Thread(target=insertBlock)
(without the () after insertBlock), because otherwise insertBlock is called and executed normally (blocking the main thread), and target is set to its return value, None. As a result, t1.start() does nothing and you don't get any performance improvement.
Be aware that multithreading gives you no guarantee about the order of execution in different threads. Inside insertTransactionData you cannot rely on the data that insertBlock inserts into the database, because at the time insertTransactionData reads this data you cannot be sure that it has already been inserted. So multithreading may not work at all for this code, or you may need to restructure it and parallelize only those parts that do not depend on each other.
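The difference between passing the function and passing its return value can be seen in a small sketch. Here insert_block is a hypothetical stand-in that only records which thread it ran in; no database is involved.

```python
import threading

ran_in = []  # ids of the threads in which insert_block actually executed


def insert_block():
    ran_in.append(threading.get_ident())


caller = threading.get_ident()

# Wrong: insert_block() runs right here in the calling thread, and the
# Thread object receives its return value (None) as target.
t_wrong = threading.Thread(target=insert_block())
t_wrong.start()
t_wrong.join()

# Right: pass the function object; it then runs inside the new thread.
t_right = threading.Thread(target=insert_block)
t_right.start()
t_right.join()

print(ran_in[0] == caller, ran_in[1] == caller)  # True False
```

The first call executed in the calling thread, so nothing ran concurrently; only the second one was executed by a separate thread.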
I solved this problem by merging these two functionalities into one new function, insertBlockAndTransaction(startrange, endrange). Since the two depend on each other, I now insert the transaction information immediately after the block information is inserted (the block number is common to both and needed by both). I then used multithreading by creating 10 threads for the single function:
for i in range(10):
    t1 = threading.Thread(target=insertBlockAndTransaction, args=(5000000+i*10000, 5000000+(i+1)*10000))
    t1.start()
This helped me deal with the growing execution time for more than 1 lakh (100,000) records.
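A sketch of that range-splitting pattern, with the threads also joined at the end, might look as follows. The function insert_block_and_transaction here is only a stand-in that records the range it was given; the real MongoDB inserts are not shown, and run_in_threads is a hypothetical helper name.

```python
import threading

processed = []
lock = threading.Lock()


def insert_block_and_transaction(start_range, end_range):
    # Stand-in for the real inserts: just record the range handled.
    with lock:
        processed.append((start_range, end_range))


def run_in_threads(first_block, blocks_per_thread, num_threads):
    threads = []
    for i in range(num_threads):
        lo = first_block + i * blocks_per_thread
        hi = first_block + (i + 1) * blocks_per_thread
        t = threading.Thread(target=insert_block_and_transaction, args=(lo, hi))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()  # wait until every range has been handled


run_in_threads(5000000, 10000, 10)
print(sorted(processed)[0], sorted(processed)[-1])
```

Because each thread works on its own block range and handles both steps for that range, no thread depends on data written by another.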

Python multiprocessing: with and without pooling

I'm trying to understand Python's multiprocessing, and have devised the following code to test it:
import multiprocessing
def F(n):
    if n == 0: return 0
    elif n == 1: return 1
    else: return F(n-1)+F(n-2)
def G(n):
    print(f'Fibonacci of {n}: {F(n)}')
processes = []
for i in range(25, 35):
    processes.append(multiprocessing.Process(target=G, args=(i, )))
for pro in processes:
    pro.start()
When I run it, it tells me that the computing time was roughly 6.65s.
I then wrote the following code, which I thought to be functionally equivalent to the former:
from multiprocessing.dummy import Pool as ThreadPool
def F(n):
    if n == 0: return 0
    elif n == 1: return 1
    else: return F(n-1)+F(n-2)
def G(n):
    print(f'Fibonacci of {n}: {F(n)}')
in_data = [i for i in range(25, 35)]
pool = ThreadPool(10)
results = pool.map(G, in_data)
and its running time was almost 12s.
Why is it that the second takes almost twice as long as the first one? Aren't they supposed to be equivalent?
(NB. I'm running Python 3.6, but also tested a similar code on 3.52 with same results.)
The reason the second takes twice as long as the first is likely due to the CPython Global Interpreter Lock.
[...] the GIL effectively restricts bytecode execution to a single core, thus rendering pure Python threads an ineffective tool for distributing CPU bound work across multiple cores.
As you know, multiprocessing.dummy is a wrapper around the threading module, so you're creating threads, not processes. With a CPU-bound task like this one, running threads under the Global Interpreter Lock is not much different from simply executing your Fibonacci calculations sequentially in a single thread (except that you've added some thread-management/context-switching overhead).
With the "true multiprocessing" version, you only have a single thread in each process, each of which is using its own GIL. Hence, you can actually make use of multiple processors to improve the speed.
For this particular processing task, there is no significant advantage to using multiple threads over multiple processes. If you only have a single processor, there is no advantage to using either multiple processes or multiple threads over a single thread/process (in fact, both merely add context-switching overhead to your task).
(FWIW: A join in the true multiprocessing version is apparently being done automatically by the python runtime so adding an explicit join doesn't seem to make any difference in my tests using time(1). And, by the way, if you did want to add join, you should add a second loop for the join processing. Adding join to the existing loop will simply serialize your processes.)
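To compare the two pool types on the same CPU-bound work, something like the following sketch could be used. The helper timed_map is a made-up name for this illustration, the input range is smaller than in the question to keep the run short, and the timings will of course vary by machine, so none are shown.

```python
import multiprocessing
import time
from multiprocessing.dummy import Pool as ThreadPool


def F(n):
    if n < 2:
        return n
    return F(n - 1) + F(n - 2)


def timed_map(pool_cls, in_data, workers):
    # Run F over in_data in the given kind of pool and time it.
    start = time.perf_counter()
    with pool_cls(workers) as pool:
        results = pool.map(F, in_data)
    return results, time.perf_counter() - start


if __name__ == '__main__':
    in_data = list(range(20, 28))
    for label, pool_cls in (('threads', ThreadPool),
                            ('processes', multiprocessing.Pool)):
        results, elapsed = timed_map(pool_cls, in_data, workers=4)
        print(label, results[-1], round(elapsed, 3))
```

Both pools expose the same map interface, so the only thing that changes between the two runs is whether the workers are threads sharing one GIL or separate processes.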

Multithreading and recursion

So I've got this multithreaded, recursive application. It's coded in Pharo Smalltalk, but the logical solution to the issue is likely to be the same across most languages.
I have 4 of the same process running relatively simultaneously. It's the last iteration of a recursive call. I'd like to print the result calculated by my recursive function (it's a dictionary being modified in the argument of the recursive function/message). The issue I'm facing right now is that the print is called in the base case terminator of the recursion, so the result is printed 4 times.
I tried setting a global variable which allows for me to print the result of the process which finishes first, but of course that means that the result is wrong. It needs to print the result of the last process to execute of all the processes in that last iteration of the recursion.
How could I go about this without going too deep into the Process class? Thanks for any help.
Do you know the number of threads? (Supposedly, 4)
Then you can use an atomic long (in Java, for example):
AtomicLong myAtomicLong = new AtomicLong(0);
// do my work
if (myAtomicLong.incrementAndGet() == totalThreadCount)
    // my print
The increment-and-get is atomic, so only the last thread to reach the check sees the condition as true, after all other threads have finished their jobs. Note that it is important to place the increment and check after the job is done.
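The same idea can be sketched in Python, where a lock-protected counter plays the role of the atomic long. Everything here is illustrative: FinishCounter, job and the squared values are made up for the example and are not taken from the Smalltalk application.

```python
import threading


class FinishCounter:
    """Counts finished workers; only the last one is told it is last."""

    def __init__(self, total):
        self._total = total
        self._done = 0
        self._lock = threading.Lock()

    def finish(self):
        # Increment and compare under one lock so the pair is atomic.
        with self._lock:
            self._done += 1
            return self._done == self._total


result = {}
printed = []
counter = FinishCounter(4)


def job(worker_id):
    result[worker_id] = worker_id * worker_id  # the "work"
    # Check only after the work is done, as described above.
    if counter.finish():
        printed.append(dict(result))


threads = [threading.Thread(target=job, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(printed)
```

Whichever thread finishes last records the result exactly once, and by then the other three have already written their entries.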
