Let's consider a function whose elaboration time depends on one of its parameter, size, and depending on its value the running time can take from hours to (few) days. In a script I launch multiple instances of this function in parallel to use all the cores and save time.
Let's say that the total elaboration time is limited by the longest running time of function instance with biggest size, and that I can also parallelize some parts of the function: this would probably not be beneficial if all the function instances are running, but it could be beneficial when only one remains (or if I have tons of cores). How would you do this in Python, that is, organize a two-level multiprocessing (one at the script level, the other inside the function)? Or would you just parallelize the function and launch multiple scripts with different configurations?
I provide a MWE, but I understand the answer on the (absolute) fastest execution is probably problem dependent. Here the function consists of nested loops, but you can adapt it as long as it verifies the preceding assumptions, or change the parameters. I am not interested in this particular MWE, but on how to set an inner parallelization.
import numpy as np
import time
import timeit
import multiprocessing as mp
import copy
def function_wrapper(matrix):
"""Puts to zero the multiples of 3."""
side = matrix.shape[0]
# consider to parallelize the following
# NOTE that I am not interested in parallelizing this particular example
# (you can change this part as long as by parallelizing it you better use the cores)
# with list comprehensions or built-in functions, but on how to setup a second level of multiprocessing
for x in range(side):
for y in range(side):
for z in range(side):
matrix[x, y, z] = timeit.timeit("numpy.linalg.eig(numpy.random.randint(0, 10, (L, L)))", setup='import numpy; L='+str(matrix[x, y, z]), number=10)
return matrix
num_cores = mp.cpu_count()
if __name__ == "__main__":
matrices_number = 20 # depending on the values of matrices_number and max_side the
max_side = 10 # parallelization setup is more or less important
matrices = [np.random.randint(1, 101, (side, side, side))
for side in np.random.randint(2, max_side, matrices_number)]
# sorts matrices by decreasing shape
args_generator = sorted([(m,) for m in matrices], key=lambda x: x[0].shape[0], reverse=True)
iterations = 10 # iterations to reduce caching effects
start = time.time()
for k in range(iterations):
results = [function_wrapper(*args) for args in copy.deepcopy(args_generator)]
elapsed = time.time() - start
print(f"loop: elapsed={elapsed} sec.")
start = time.time()
for k in range(iterations):
with mp.Pool(num_cores) as pool:
results = pool.starmap_async(function_wrapper, copy.deepcopy(args_generator)).get()
elapsed = time.time() - start
print(f"mp.Pool: elapsed={elapsed} sec.")
You're not taking advantage of the built-in functions in the numpy library. You are iterating over the array instead of broadcasting your logic across the whole matrix at once. When you take advantage of the built-in numpy functions, you take advantage of the underlying code written in C. The wrapper function should be the following.
def broadcaster(matrix):
return np.where(matrix % 3 != 0, matrix, 0)
start = time.time()
for k in range(iterations):
results = [broadcaster(*args) for args in copy.deepcopy(args_generator)]
elapsed = time.time() - start
print(f"Broadcast: elapsed={elapsed} sec.")
When I add that snippet to your code I get the following.
loop: elapsed=15.675757884979248 sec.
mp.Pool: elapsed=14.439897060394287 sec.
Broadcast: elapsed=0.6325647830963135 sec.
As you can see, in terms of performance, it is not even close.
Related
I have the following snippets of code which is a subroutine of the K-means clustering algorithm; specifically, it tries to assign each point to the closest centroid.
import numpy as np
n = 20000
D = 30
K = 250
points = np.random.rand(n, D)
centroids = np.random.rand(K, D)
membership = np.zeros(shape=n, dtype=int)
for i in range(n):
distances = np.apply_along_axis(lambda x: np.linalg.norm(x, ord=2), 1, centroids - points[i])
membership[i] = np.argmin(distances)
The running time here should be O(NKD) where D is the dimension of the data points, so naturally I expect when D increases or decreases, the running time would change proportionally as well. To my surprise, I see very little time being changed when changing D, for example when testing on my local machine:
D = 1
python3 benchmark.py 12.10s user 0.39s system 118% cpu 10.564 total
D = 30
python3 benchmark.py 12.17s user 0.36s system 117% cpu 10.703 total
D = 300
python3 benchmark.py 13.30s user 0.31s system 115% cpu 11.784 total
D = 1000
python3 benchmark.py 16.51s user 1.76s system 110% cpu 16.524 total
Is there something that I'm missing here?
Edit: per #Warren's suggestion, I modified the code to use np.linalg.norm with axis parameter directly; the performance is following:
D = 1
python3 benchmark.py 1.45s user 0.37s system 634% cpu 0.287 total
D = 30
python3 benchmark.py 1.67s user 0.29s system 592% cpu 0.331 total
D = 300
python3 benchmark.py 3.03s user 0.32s system 234% cpu 1.428 total
D = 1000
python3 benchmark.py 6.32s user 2.73s system 126% cpu 7.177 total
so the performance was better.
This is due to the overhead of Numpy functions.
Indeed, np.apply_along_axis is called 20_000 times and each call to this function internally does a loop calling the target Python function 250 times (ie. it is not vectorized), and so np.linalg.norm. In the end, np.linalg.norm is called, 20_000 * 250 = 5000000 times. The thing is each call to a Numpy function takes typically about 1 µs. On my machine, np.linalg.norm takes 4-5 µs on an array of size 1. This time is due to many internal checks (types and values), allocations, functions calls, conversion, etc.
There are two simple ways to reduce this overhead: vectorization and using a JIT compiler like Numba. The later is often more efficient as it avoid creating expensive big temporary arrays.
Here is a much faster implementation:
import numpy as np
import numba as nb
#nb.njit('(float64[:,::1], float64[:,::1], int_[::1])')
def compute(points, centroids, membership):
n, K, D = points.shape[0], centroids.shape[0], points.shape[1]
assert centroids.shape[1] == D and membership.shape[0] == n
distances = np.empty(K, np.float64)
for i in range(n):
for j in range(K):
distances[j] = np.linalg.norm(centroids[j] - points[i], ord=2)
membership[i] = np.argmin(distances)
n = 20000
D = 30
K = 250
points = np.random.rand(n, D)
centroids = np.random.rand(K, D)
membership = np.zeros(shape=n, dtype=int)
compute(points, centroids, membership)
In fact, while this code is much faster, it still have a similar issue: the cost of allocating the temporary arrays centroids[j] - points[i] is significant compared to the actual time required to compute the norm. In fact, each allocations takes only few hundred of nanoseconds, but the number of loop iteration is huge. One solution is simply to compute the norm manually:
from math import sqrt
#nb.njit('(float64[:,::1], float64[:,::1], int_[::1])', fastmath=True)
def compute_fast(points, centroids, membership):
n, K, D = points.shape[0], centroids.shape[0], points.shape[1]
assert centroids.shape[1] == D and membership.shape[0] == n
distances = np.empty(K, np.float64)
for i in range(n):
for j in range(K):
s = 0.0
for k in range(D):
tmp = centroids[j,k] - points[i,k]
s += tmp * tmp
distances[j] = sqrt(s)
membership[i] = np.argmin(distances)
Here are results on my i5-9600KF processor:
D=1:
initial code: 26.56 seconds
compute: 1.44 seconds
compute_fast: 0.02 seconds (x1328)
D=30:
initial code: 27.09 seconds
compute: 1.65 seconds
compute_fast: 0.13 seconds (x208)
D=1000:
initial code: 39.34 seconds
compute: 3.74 seconds
compute_fast: 4.57 seconds (x8.6)
The last implementation is much faster for small values of D since the Numpy overhead are the main bottleneck in this case and the implementation can almost completely remove such overheads (thanks to the JIT compilation).
It is probably O(NKD).
But the thing is you are iterating 3 loops here. One explicitly. One semi-explicitly. And the last one implicitly, inside numpy functions.
The outer one is your explicit for loop, for N.
The middle one is the np.apply_along_axis one, which applies on the K rows of centroids-points[i] (btw, there is another one here, with some broadcasting. But we don't need to count all of them for big-O consideration)
And the inner one is the one on the D columns that occur inside norm.
The inner one is obviously the most important to optimized, and that's good, because it is the only one that is vectorized here.
But that means that for small enough value of D, what we really see is more some constant overhead (times N×K, since it is inside a double for loop). Your inefficient outer for loops drive most of the cost, which, then, looks like O(NK).
Note that np.apply_along_axis is just a for loop by another name. It is not as bad. But almost so. It is still calling several times some python code. It is not vectorization.
But, well, I bet that with D big enough, you'll see that it is O(NKD)
Edit:
Here is what I get when I increase D (with smaller n, so that it remains computable in realistic time)
You see that it looks really linear (affine, to be accurate, since it doesn't pass through 0, which is the reason why it doesn't look very linear to you; and which is explained by my previous comment: most of the inner cost inside the for/along_axis double loop is mainly constant overhead of those loops, when D is small. The "proportional to D" part begins to show when the overhead become negligible)
I have a piece of code which iterates over a three-dimensional array and writes into each cell a value based on the indices and the current value itself:
import numpy as np
nx = ny = nz = 100
array = np.zeros((nx, ny, nz))
def fun(val, k):
# Do something with the indices
return val + (k[0] * k[1] * k[2])
with np.nditer(array, flags=['multi_index'], op_flags=['readwrite']) as it:
for x in it:
x[...] = fun(x, it.multi_index)
Note, that fun might do something more sophisticated, which takes most of the total runtime, and that the input arrays might have different lengths per axis.
However, this code could run in multiple threads, as fun can be assumed to be threadsafe (Only the value and index of the current cell are required). But finding a method to iterate over all cells and have the current index available seems to be hard.
A possible solution might be https://stackoverflow.com/a/58012407/446140, where the array is split by the x-axis into chunks and passed to a Pool.
However, the solution is not universally applicable and I wonder if there is a more general solution for this problem (which could also work with nD arrays)?
The first issue is to split up the 3D array into equally sized chunks. np.array_split can be used, but the offset of each of the splits has to be stored to get the correct indices again.
An interesting question, with a few possible solutions. As you indicated, it is possible to use np.array_split, but since we are only interested in the indices, we can also use np.unravel_index, which would mean that we only have to loop over all the indices (the size) of the array to get the index.
Now there are two great ideas for multiprocessing:
Create a (thread safe) shared memory of the array and splitting the indices across the different processes.
Only update the array in a main thread, but provide a copy of the required data to the processes and let them return the value that has to be updated.
Both solutions will work for any np.ndarray, but have different advantages. Creating a shared memory doesn't create copies, but can have a large insertion penalty if it has to wait on other processes (the computational time, is small compared to the write time.)
There are probably many more solutions, but I will work out the first solution, where a Shared Memory object is created and a range of indices is provided to every process.
Required imports:
import itertools
import numpy as np
import multiprocessing as mp
from multiprocessing import shared_memory
Shared Numpy arrays
The main problem with applying multiprocessing on np.ndarray's is that memory sharing between processes can be difficult. For this the following class can be used:
class SharedNumpy:
__slots__ = ('arr', 'shm', 'name', 'shared',)
def __init__(self, arr: np.ndarray = None):
if arr is not None:
self.shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
self.arr = np.ndarray(arr.shape, dtype=arr.dtype, buffer=self.shm.buf)
self.name = self.shm.name
np.copyto(self.arr, arr)
def __getattr__(self, item):
if hasattr(self.arr, item):
return getattr(self.arr, item)
raise AttributeError(f"{self.__class__.__name__}, doesn't have attribute {item!r}")
def __str__(self):
return str(self.arr)
#classmethod
def from_name(cls, name, shape, dtype):
memory = cls(arr=None)
memory.shm = shared_memory.SharedMemory(name)
memory.arr = np.ndarray(shape, dtype=dtype, buffer=memory.shm.buf)
memory.name = name
return memory
#property
def dtype(self):
return self.arr.dtype
#property
def shape(self):
return self.arr.shape
This makes it possible to create a shared memory object in the main process and then use SharedNumpy.from_name to get it in other processes.
Simple test
A quick (non threaded) test would be:
def simple_test():
data = np.array(np.zeros((5,) * 2))
mem_primary = SharedNumpy(arr=data)
mem_second = SharedNumpy.from_name(name=mem_primary.name, shape=data.shape, dtype=data.dtype)
assert mem_primary.name == mem_second.name, "Different memory names"
assert np.array_equal(mem_primary.arr, mem_second.arr), "Different array values."
mem_primary.arr[2] = 5
assert np.array_equal(mem_primary.arr, mem_second.arr), "Different array values."
print("Completed 3/3 tests...")
A threaded test will follow later!
Distribution
The next part is focused on providing the processes with the necessary data. In this case we will provide every process with a range of indices that it has to calculate and all the data that is required to load the shared memory.
The input of this function is a dim the number of numpy axis, and the size, which are the number of elements per axis.
def distributed(size, dim):
memory = SharedNumpy(arr=np.zeros((size,) * dim))
split_size = np.int64(np.ceil(memory.arr.size / mp.cpu_count()))
settings = dict(
memory=itertools.repeat(memory.name),
shape=itertools.repeat(memory.arr.shape),
dtype=itertools.repeat(memory.arr.dtype),
start=np.arange(mp.cpu_count()),
num=itertools.repeat(split_size)
)
with mp.Pool(mp.cpu_count()) as pool:
pool.starmap(fun, zip(*settings.values()))
print(f"\n\nDone {dim}D, size: {size}, elements: {size ** dim}")
return memory
Notes:
By using starmap instead of map, it is possible to provide multiple input arguments (a list of arguments for every process).
(also see docs starmap)
itertools.repeat is used to add constants to the starmap
(also see: zip() in python, how to use static values)
By using np.unravel_index, we only need to have a start index and the chunk size per process.
The start and num tell the chunks of indices that have to be converted per process, by applying range(start * num, (start + 1) * num).
Testing
For the testing I am using different input sizes and dimensions. Since the data increases with the formula sizes ^ dimensions, I limited the test to a size of 128 and 3 dimensions (that is 2,097,152 points, and already start taking quit a bit of time.)
Code
fun
def fun(name, shape, dtype, start, num):
memory = SharedNumpy.from_name(name, shape=shape, dtype=dtype)
for idx in range(start * num, min((start + 1) * num, memory.arr.size)):
# Do something with the indices
indices = np.unravel_index([idx], shape)
memory.arr[indices] += np.product(indices)
memory.shm.close() # Closes the shared memory for this process.
Running the example
if __name__ == '__main__':
for size in [5, 10, 15]:
for dim in [1, 2, 3]:
memory = distributed(size, dim)
print(memory)
memory.shm.unlink()
For the OP's code, I used his code with a small addition that I allow the array to have different sizes and dimensions, in any case I use:
def sequential(size, dim):
array = np.zeros((size,) * dim)
...
And looking at the output array of both codes, will result in the same outcomes.
Plots
The code for the graphs have been taken from the reply in:
https://codereview.stackexchange.com/questions/165245/plot-timings-for-a-range-of-inputs
With the minor alteration that labels was changed to codes in
empty_multi_index = pd.MultiIndex(levels=[[], []], codes=[[], []], names=['func', 'result'])
Where the 1d, 2d and 3d reference the dimensions and the input is the size.
Sequentially (OP code):
Distributed (this code):
Results
This method works on an arbitrary sized numpy array, and is able to perform an operation on the indices of the array. It provides you with full access of the whole numpy array, so it can also be used to perform different kind of statistical analysis, which do not change the array.
From the timings it can be seen that for small data shapes the distributed version has no to little advantages, because of the extra complexity of creating the processes. However for larger amount of data it starts to become more effective.
I only timed it on short delays in the computational time (simple fun), but on more complex calculations, it should outperform the sequential version much sooner.
Extra
If you are only interested in operations that are performed over or along axis, these numpy functions might help to vectorize your solutions instead of using multiprocessing:
np.apply_over_axes
np.apply_along_axis
I wanted to see how drastic is the difference in time complexity between the iterative and recursive approaches to sum an array. So I plotted a 'time' versus 'size of the list' graph for a pretty decent range of values for size(995). What I got was pretty much what I wanted except something unexpected caught my eye.
The graph can be seen here 1
What's confusing me here are those bumps that the green line suddenly takes only for certain values and then comes back down. Why does that happen?
Here is the code I had written:
import matplotlib.pyplot as plt
from time import time
def sum_rec(lst): # Sums recursively
if len(lst) == 0:
return 0
return lst[0]+sum_rec(lst[1:])
def sum_iter(lst): # Sums iteratively
Sum = 0
for i in range(len(lst)):
Sum += i
return Sum
def check_time(lst): # Returns the time taken for both algorithms
start = time()
Sum = sum_iter(lst)
end = time()
t_iter = end - start
start = time()
Sum = sum_rec(lst)
end = time()
t_rec = end - start
return t_iter, t_rec
N = [n for n in range(995)]
T1 = [] # for iterative function
T2 = [] # for recursive function
for n in N: # values on the x-axis
lst = [i for i in range(n)]
t_iter, t_rec = check_time(lst)
T1.append(t_iter)
T2.append(t_rec)
plt.plot(N,T1)
plt.plot(N,T2) # Both plotted on graph
plt.show()
I'd say both the algorithms have a linear runtime but the recursive one has a higher constant factor, which causes the steeper slope.
Other than that:-
(1) You're mixing up the two plots.
The iterative one stays grounded while the recursive one increases.
One possible explanation may be that recursive calls make stack entries and require more computational time than iterative calls.
(2) You need to increase the size of the array as small sizes are more likely to cause spikes due to locality of reference.
(3) You need to repeat the experiment over multiple epochs to make sure random spikes due to some other process hogging the resource is distributed evenly.
Cython starter here. I am trying to speed up a calculation of a certain pairwise statistic (in several bins) by using multiple threads. In particular, I am using prange from cython.parallel, which internally uses openMP.
The following minimal example illustrates the problem (compilation via Jupyter notebook Cython magic).
Notebook setup:
%load_ext Cython
import numpy as np
Cython code:
%%cython --compile-args=-fopenmp --link-args=-fopenmp -a
from cython cimport boundscheck
import numpy as np
from cython.parallel cimport prange, parallel
#boundscheck(False)
def my_parallel_statistic(double[:] X, double[:,::1] bins, int num_threads):
cdef:
int N = X.shape[0]
int nbins = bins.shape[0]
double Xij,Yij
double[:] Z = np.zeros(nbins,dtype=np.float64)
int i,j,b
with nogil, parallel(num_threads=num_threads):
for i in prange(N,schedule='static',chunksize=1):
for j in range(i):
#some pairwise quantities
Xij = X[i]-X[j]
Yij = 0.5*(X[i]+X[j])
#check if in bin
for b in range(nbins):
if (Xij < bins[b,0]) or (Xij > bins[b,1]):
continue
Z[b] += Xij*Yij
return np.asarray(Z)
mock data and bins
X = np.random.rand(10000)
bin_edges = np.linspace(0.,1,11)
bins = np.array([bin_edges[:-1],bin_edges[1:]]).T
bins = bins.copy(order='C')
Timing via
%timeit my_parallel_statistic(X,bins,1)
%timeit my_parallel_statistic(X,bins,4)
yields
1 loop, best of 3: 728 ms per loop
1 loop, best of 3: 330 ms per loop
which is not a perfect scaling, but that is not the main point of the question. (But do let me know if you have suggestions beyond adding the usual decorators or fine-tuning the prange arguments.)
However, this calculation is apparently not thread-safe:
Z1 = my_parallel_statistic(X,bins,1)
Z4 = my_parallel_statistic(X,bins,4)
np.allclose(Z1,Z4)
reveals a significant difference between the two results (up to 20% in this example).
I strongly suspect that the problem is that multiple threads can do
Z[b] += Xij*Yij
at the same time. But what I don't know is how to fix this without sacrificing the speed-up.
In my actual use case, the calculation of Xij and Yij is more expensive, hence I would like to do them only once per pair. Also, pre-computing and storing Xij and Yij for all pairs and then simply looping through bins is not a good option either because N can get very large, and I can't store 100,000 x 100,000 numpy arrays in memory (this was actually the main motivation for rewriting it in Cython!).
System info (added following suggestion in comments):
CPU(s): 8
Model name: Intel(R) Core(TM) i7-4790K CPU # 4.00GHz
OS: Red Hat Linux v6.8
Memory: 16 GB
Yes, Z[b] += Xij*Yij is indeed a race condition.
There are a couple of options of making this atomic or critical. Implementation issues with Cython aside, you would in any case have bad performance due to false sharing on the shared Z vector.
So the better alternative is to reserve a private array for each thread. There are a couple of (non-)options again. One could use a private malloc'd pointer, but I wanted to stick with np. Memory slices cannot be assigned as private variables. A two dimensional (num_threads, nbins) array works, but for some reason generates very complicated inefficient array index code. This works but is slower and does not scale.
A flat numpy array with manual "2D" indexing works well. You get a little bit extra performance by avoiding padding the private parts of the array to 64 byte, which is a typical cache line size. This avoids false sharing between the cores. The private parts are simply summed up serially outside of the parallel region.
%%cython --compile-args=-fopenmp --link-args=-fopenmp -a
from cython cimport boundscheck
import numpy as np
from cython.parallel cimport prange, parallel
cimport openmp
#boundscheck(False)
def my_parallel_statistic(double[:] X, double[:,::1] bins, int num_threads):
cdef:
int N = X.shape[0]
int nbins = bins.shape[0]
double Xij,Yij
# pad local data to 64 byte avoid false sharing of cache-lines
int nbins_padded = (((nbins - 1) // 8) + 1) * 8
double[:] Z_local = np.zeros(nbins_padded * num_threads,dtype=np.float64)
double[:] Z = np.zeros(nbins)
int i,j,b, bb, tid
with nogil, parallel(num_threads=num_threads):
tid = openmp.omp_get_thread_num()
for i in prange(N,schedule='static',chunksize=1):
for j in range(i):
#some pairwise quantities
Xij = X[i]-X[j]
Yij = 0.5*(X[i]+X[j])
#check if in bin
for b in range(nbins):
if (Xij < bins[b,0]) or (Xij > bins[b,1]):
continue
Z_local[tid * nbins_padded + b] += Xij*Yij
for tid in range(num_threads):
for bb in range(nbins):
Z[bb] += Z_local[tid * nbins_padded + bb]
return np.asarray(Z)
This performs quite well on my 4 core machine, with 720 ms / 191 ms, a speedup of 3.6. The remaining gap may be due to turbo mode. I don't have access to a proper machine for testing right now.
You are right that the access to Z is under a race condition.
You might be better off defining num_threads copies of Z, as cdef double[:] Z = np.zeros((num_threads, nbins), dtype=np.float64) and perform a sum along axis 0 after the prange loop.
return np.sum(Z, axis=0)
Cython code can have a with gil statement in a parallel region but it is only documented for error handling. You could have a look at the general C code to see whether that would trigger an atomic OpenMP operation but I doubt it.
I use a multiprocessing.Pool().imap_unordered(...) to perform some tasks in parallel and measure the time it takes by calculating the difference of time.time() before and after starting the pool tasks.
However, it returns wrong results! When I watch my wall clock while the program runs, it tells me a run time of around 5 seconds. But the program itself outputs a run time of only 0.1 seconds.
I also have a variant of this code without any multiprocessing which takes double the time, but outputs the correct run times.
Here is my code:
if __name__ == "__main__":
n = int(input("How many grids to create? "))
use_multiprocessing = None
while use_multiprocessing is None:
answer = input("Use multiprocessing to speed things up? (Y/n) ").strip().lower()
if len(answer) == 1 and answer in "yn":
use_multiprocessing = True if answer == "y" else False
t0 = time.time()
if use_multiprocessing:
processes = cpu_count()
worker_pool = Pool(processes)
print("Creating {} sudokus using {} processes. Please wait...".format(n, processes))
sudokus = worker_pool.imap_unordered(create_sudoku, range(n), n // processes + 1)
else:
progress_bar, progress_bar_length = 0, 10
sudokus = []
print("Creating {} sudokus".format(n), end="", flush=True)
for i in range(n):
p = int((i / n) * progress_bar_length)
if p > progress_bar:
print("." * (p-progress_bar), end="", flush=True)
progress_bar = p
new_sudoku = create_sudoku()
sudokus.append(new_sudoku)
t = time.time() - t0
l = len(list(sudokus))
print("\nSuccessfully created {} grids in {:.6f}s (average {:.3f}ms per grid)!".format(
l, t, 1000*t/l
))
And here an example run, which took around 5-6 seconds in reality (after entering the number of grids to create and whether to use multiprocessing, of course):
How many grids to create? 100000
Use multiprocessing to speed things up? (Y/n) y
Creating 100000 sudokus using 4 processes. Please wait...
Successfully created 100000 grids in 0.122141s (average 0.001ms per grid)!
Process finished with exit code 0
Are multiprocessing and time.time() incompatible? I've heard that time.clock() can make problems under these circumstances, but I thought time.time() should be safe. Or is there any other problem?
I figured it out.
Pool.imap_unordered(...) returns a generator and no list. That means, its elements are not already created when the method finishes, but only as soon as I access them.
I did this in the line l = len(list(sudokus)), where I converted the generator into a list to get the length. And the finish time got measured one line before that, so it correctly reported the time it took to initialize the generator. This was not what I want, so swapping those two lines results in correct times.
I know I may not convert a generator into a list just to find out the length and then discard the list again. I must either rely on the saved requested length if I want a generator, or I must use Pool.map(...) instead which produces a list and blocks until it's ready.