Having thread-local arrays in Cython so that I can resize them?

I have an interval-tree-ish algorithm I would like to run in parallel for many queries using threads. The problem is that each thread would then need its own array, since I cannot know in advance how many hits there will be.
There are other questions like this, and the suggested solution is always to have an array of size (K, t), where K is the output length and t is the number of threads. This does not work for me, as K might differ per thread, and each thread might need to resize its array to fit all the results it gets.
Pseudocode:
for i in prange(len(starts)):
    qs, qe, qx = starts[i], ends[i], index[i]
    results = t.search(qs, qe)
    if len(results) + nfound < len(output):
        # add result to output
    else:
        # resize array
        # then add results

A common pattern is that every thread gets its own container, which is a trade-off between speed/complexity and memory overhead:
there is no need to lock for access to this container, because only one thread accesses it.
there is much less overhead compared to "own container for every task (i.e. every i-value)".
After the parallel section, the data must be either collected in a final container in a post processing step (which also could happen in parallel) or the subsequent algorithms should be able to handle a collection of containers.
Here is an example using C++'s std::vector (which already has memory management and the ability to grow built in):
%%cython -+ -c=/openmp --link-args=/openmp

from cython.parallel import prange, threadid
from libcpp.vector cimport vector
cimport openmp

def calc_in_parallel(N):
    cdef int i, k, tid
    cdef int n = N
    cdef vector[vector[int]] vecs
    # every thread gets its own container
    vecs.resize(openmp.omp_get_max_threads())
    for i in prange(n, nogil=True):
        tid = threadid()
        for k in range(i):
            # use the container of this thread
            vecs[tid].push_back(k)  # dummy for calculation
    return vecs
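After the call, the per-thread containers still have to be combined (the post-processing step mentioned above). A minimal sketch of that step in plain Python, assuming calc_in_parallel as defined above (the vector[vector[int]] return value is converted to a list of lists automatically):
def merge(per_thread):
    # flatten the per-thread containers into one list
    # (the order of results across threads is arbitrary)
    merged = []
    for chunk in per_thread:
        merged.extend(chunk)
    return merged

result = merge(calc_in_parallel(1000))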
Using omp_get_max_threads() for the number of threads will overestimate the real number of threads in many cases. It is probably more robust to set the number of threads explicitly in prange, i.e.
...
NUM_THREADS = 2
vecs.resize(NUM_THREADS)
for i in prange(n, nogil=True, num_threads = NUM_THREADS):
...
A similar approach can be applied using plain C arrays, but more boilerplate code (manual memory management) will be needed in that case.
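For illustration, a rough sketch of that C-style variant (my own sketch, not from the answer above: calc_in_parallel_c is a made-up name, error handling for malloc/realloc is omitted, and each thread only ever grows its own buffer, so no locking is needed):
%%cython -+ -c=/openmp --link-args=/openmp

from cython.parallel import prange, threadid
from libc.stdlib cimport malloc, realloc, free
cimport openmp

def calc_in_parallel_c(int n):
    cdef int num_threads = openmp.omp_get_max_threads()
    cdef int i, k, tid
    # one growable buffer per thread, plus per-thread size/capacity bookkeeping
    cdef int **bufs = <int **>malloc(num_threads * sizeof(int *))
    cdef int *sizes = <int *>malloc(num_threads * sizeof(int))
    cdef int *caps = <int *>malloc(num_threads * sizeof(int))
    for tid in range(num_threads):
        caps[tid] = 16
        sizes[tid] = 0
        bufs[tid] = <int *>malloc(caps[tid] * sizeof(int))
    for i in prange(n, nogil=True):
        tid = threadid()
        for k in range(i):
            if sizes[tid] == caps[tid]:
                # grow this thread's own buffer; no other thread touches it
                caps[tid] = caps[tid] * 2
                bufs[tid] = <int *>realloc(bufs[tid], caps[tid] * sizeof(int))
            bufs[tid][sizes[tid]] = k  # dummy for calculation
            sizes[tid] = sizes[tid] + 1
    # collect the results into Python lists, then release the C buffers
    out = []
    for tid in range(num_threads):
        lst = []
        for k in range(sizes[tid]):
            lst.append(bufs[tid][k])
        out.append(lst)
        free(bufs[tid])
    free(bufs)
    free(sizes)
    free(caps)
    return out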

Related

Two level multiprocessing in Python

Let's consider a function whose running time depends on one of its parameters, size; depending on its value, the running time can range from hours to a few days. In a script I launch multiple instances of this function in parallel to use all the cores and save time.
Let's say that the total running time is limited by the instance with the biggest size, and that I can also parallelize some parts of the function: this would probably not be beneficial while all the function instances are running, but it could be beneficial when only one remains (or if I had tons of cores). How would you do this in Python, that is, organize a two-level multiprocessing (one at the script level, the other inside the function)? Or would you just parallelize the function and launch multiple scripts with different configurations?
I provide an MWE, but I understand that the answer about the (absolute) fastest execution is probably problem-dependent. Here the function consists of nested loops, but you can adapt it as long as it satisfies the preceding assumptions, or change the parameters. I am not interested in this particular MWE, but in how to set up an inner level of parallelization.
import numpy as np
import time
import timeit
import multiprocessing as mp
import copy

def function_wrapper(matrix):
    """Puts to zero the multiples of 3."""
    side = matrix.shape[0]
    # consider to parallelize the following
    # NOTE that I am not interested in parallelizing this particular example
    # (you can change this part as long as by parallelizing it you better use the cores)
    # with list comprehensions or built-in functions, but on how to setup a second level of multiprocessing
    for x in range(side):
        for y in range(side):
            for z in range(side):
                matrix[x, y, z] = timeit.timeit("numpy.linalg.eig(numpy.random.randint(0, 10, (L, L)))", setup='import numpy; L='+str(matrix[x, y, z]), number=10)
    return matrix

num_cores = mp.cpu_count()

if __name__ == "__main__":
    matrices_number = 20  # depending on the values of matrices_number and max_side the
    max_side = 10         # parallelization setup is more or less important
    matrices = [np.random.randint(1, 101, (side, side, side))
                for side in np.random.randint(2, max_side, matrices_number)]
    # sorts matrices by decreasing shape
    args_generator = sorted([(m,) for m in matrices], key=lambda x: x[0].shape[0], reverse=True)

    iterations = 10  # iterations to reduce caching effects

    start = time.time()
    for k in range(iterations):
        results = [function_wrapper(*args) for args in copy.deepcopy(args_generator)]
    elapsed = time.time() - start
    print(f"loop: elapsed={elapsed} sec.")

    start = time.time()
    for k in range(iterations):
        with mp.Pool(num_cores) as pool:
            results = pool.starmap_async(function_wrapper, copy.deepcopy(args_generator)).get()
    elapsed = time.time() - start
    print(f"mp.Pool: elapsed={elapsed} sec.")
You're not taking advantage of the built-in functions in the numpy library. You are iterating over the array instead of broadcasting your logic across the whole matrix at once. When you take advantage of the built-in numpy functions, you take advantage of the underlying code written in C. The wrapper function should be the following.
def broadcaster(matrix):
    return np.where(matrix % 3 != 0, matrix, 0)

start = time.time()
for k in range(iterations):
    results = [broadcaster(*args) for args in copy.deepcopy(args_generator)]
elapsed = time.time() - start
print(f"Broadcast: elapsed={elapsed} sec.")
When I add that snippet to your code I get the following.
loop: elapsed=15.675757884979248 sec.
mp.Pool: elapsed=14.439897060394287 sec.
Broadcast: elapsed=0.6325647830963135 sec.
As you can see, in terms of performance, it is not even close.
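As an aside (not part of the answer above): if the real workload cannot be vectorized away like this, one way to get the effect of two-level parallelism without nesting pools is to split each call into finer-grained tasks and feed them all to a single pool, so that the largest matrix no longer serializes the tail of the run. A rough sketch under that assumption (process_slab and run_flat are made-up names, and the per-slab work is a dummy stand-in):
import numpy as np
import multiprocessing as mp

def process_slab(matrix, x):
    # work on a single x-slab of one matrix (stand-in for the expensive inner loops)
    # note: each task pickles its whole matrix; for big arrays pass only the slab instead
    side = matrix.shape[0]
    out = np.empty((side, side))
    for y in range(side):
        for z in range(side):
            out[y, z] = matrix[x, y, z] ** 2  # dummy computation
    return out

def run_flat(matrices, num_cores):
    # one task per (matrix, x-slab): the pool load-balances the fine-grained work,
    # so the biggest matrix is handled by many workers instead of one
    tasks = [(m, x) for m in matrices for x in range(m.shape[0])]
    with mp.Pool(num_cores) as pool:
        slabs = pool.starmap(process_slab, tasks)
    # reassemble: consecutive slabs belong to consecutive matrices
    results, pos = [], 0
    for m in matrices:
        side = m.shape[0]
        results.append(np.stack(slabs[pos:pos + side]))
        pos += side
    return results

if __name__ == "__main__":
    mats = [np.random.randint(1, 101, (s, s, s)) for s in (4, 6, 8)]
    print([r.shape for r in run_flat(mats, mp.cpu_count())])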

multithreaded iteration over numpy array indices

I have a piece of code which iterates over a three-dimensional array and writes into each cell a value based on the indices and the current value itself:
import numpy as np

nx = ny = nz = 100
array = np.zeros((nx, ny, nz))

def fun(val, k):
    # Do something with the indices
    return val + (k[0] * k[1] * k[2])

with np.nditer(array, flags=['multi_index'], op_flags=['readwrite']) as it:
    for x in it:
        x[...] = fun(x, it.multi_index)
Note that fun might do something more sophisticated, which takes most of the total runtime, and that the input arrays might have different lengths per axis.
However, this code could run in multiple threads, as fun can be assumed to be thread-safe (only the value and index of the current cell are required). But finding a method to iterate over all cells and have the current index available seems to be hard.
A possible solution might be https://stackoverflow.com/a/58012407/446140, where the array is split along the x-axis into chunks and passed to a Pool.
However, that solution is not universally applicable, and I wonder whether there is a more general solution to this problem (one that would also work with nD arrays)?
The first issue is to split up the 3D array into equally sized chunks. np.array_split can be used, but the offset of each of the splits has to be stored to get the correct indices again.
An interesting question, with a few possible solutions. As you indicated, it is possible to use np.array_split, but since we are only interested in the indices, we can also use np.unravel_index, which means we only have to loop over a flat range of indices (the size of the array) and convert each one back to a multi-index.
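For instance (a small illustration, not part of the original answer), a flat index maps back to a multi-index like this:
import numpy as np

shape = (3, 4)
# flat index 7 corresponds to row 1, column 3 in a 3x4 array (7 == 1*4 + 3)
print(np.unravel_index(7, shape))
# and the round trip back to a flat index
print(np.ravel_multi_index((1, 3), shape))  # 7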
Now there are two great ideas for multiprocessing:
Create a (thread-safe) shared memory for the array and split the indices across the different processes.
Only update the array in the main process, but provide a copy of the required data to the worker processes and let them return the values that have to be written back.
Both solutions will work for any np.ndarray, but have different advantages. Creating a shared memory doesn't create copies, but can have a large insertion penalty if it has to wait on other processes (the computational time is small compared to the write time).
There are probably many more solutions, but I will work out the first solution, where a Shared Memory object is created and a range of indices is provided to every process.
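(For contrast, the second idea could look roughly like the sketch below; this is my own illustration rather than part of the worked-out answer, and compute / update_in_main are made-up names.)
import numpy as np
import multiprocessing as mp

def compute(args):
    # worker: receives only the index and the current value, returns (index, new value)
    idx, val = args
    return idx, val + (idx[0] * idx[1] * idx[2])  # same dummy fun as in the question

def update_in_main(array):
    # the main process keeps sole ownership of the array and applies the returned updates
    tasks = [(idx, array[idx]) for idx in np.ndindex(array.shape)]
    with mp.Pool() as pool:
        for idx, new_val in pool.imap_unordered(compute, tasks, chunksize=1024):
            array[idx] = new_val
    return array

if __name__ == "__main__":
    arr = update_in_main(np.zeros((20, 20, 20)))
    print(arr[3, 4, 5])  # 3 * 4 * 5 = 60.0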
Required imports:
import itertools
import numpy as np
import multiprocessing as mp
from multiprocessing import shared_memory
Shared Numpy arrays
The main problem with applying multiprocessing on np.ndarray's is that memory sharing between processes can be difficult. For this the following class can be used:
class SharedNumpy:
    __slots__ = ('arr', 'shm', 'name', 'shared',)

    def __init__(self, arr: np.ndarray = None):
        if arr is not None:
            self.shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
            self.arr = np.ndarray(arr.shape, dtype=arr.dtype, buffer=self.shm.buf)
            self.name = self.shm.name
            np.copyto(self.arr, arr)

    def __getattr__(self, item):
        if hasattr(self.arr, item):
            return getattr(self.arr, item)
        raise AttributeError(f"{self.__class__.__name__}, doesn't have attribute {item!r}")

    def __str__(self):
        return str(self.arr)

    @classmethod
    def from_name(cls, name, shape, dtype):
        memory = cls(arr=None)
        memory.shm = shared_memory.SharedMemory(name)
        memory.arr = np.ndarray(shape, dtype=dtype, buffer=memory.shm.buf)
        memory.name = name
        return memory

    @property
    def dtype(self):
        return self.arr.dtype

    @property
    def shape(self):
        return self.arr.shape
This makes it possible to create a shared memory object in the main process and then use SharedNumpy.from_name to get it in other processes.
Simple test
A quick (non threaded) test would be:
def simple_test():
    data = np.array(np.zeros((5,) * 2))
    mem_primary = SharedNumpy(arr=data)
    mem_second = SharedNumpy.from_name(name=mem_primary.name, shape=data.shape, dtype=data.dtype)

    assert mem_primary.name == mem_second.name, "Different memory names"
    assert np.array_equal(mem_primary.arr, mem_second.arr), "Different array values."

    mem_primary.arr[2] = 5
    assert np.array_equal(mem_primary.arr, mem_second.arr), "Different array values."

    print("Completed 3/3 tests...")
A threaded test will follow later!
Distribution
The next part is focused on providing the processes with the necessary data. In this case we will provide every process with a range of indices that it has to calculate and all the data that is required to load the shared memory.
The inputs of this function are dim, the number of numpy axes, and size, the number of elements per axis.
def distributed(size, dim):
    memory = SharedNumpy(arr=np.zeros((size,) * dim))
    split_size = np.int64(np.ceil(memory.arr.size / mp.cpu_count()))

    settings = dict(
        memory=itertools.repeat(memory.name),
        shape=itertools.repeat(memory.arr.shape),
        dtype=itertools.repeat(memory.arr.dtype),
        start=np.arange(mp.cpu_count()),
        num=itertools.repeat(split_size)
    )

    with mp.Pool(mp.cpu_count()) as pool:
        pool.starmap(fun, zip(*settings.values()))

    print(f"\n\nDone {dim}D, size: {size}, elements: {size ** dim}")
    return memory
Notes:
By using starmap instead of map, it is possible to provide multiple input arguments (a list of arguments for every process).
(also see docs starmap)
itertools.repeat is used to add constants to the starmap
(also see: zip() in python, how to use static values)
By using np.unravel_index, we only need to have a start index and the chunk size per process.
start and num define the chunk of indices that each process has to convert, via range(start * num, (start + 1) * num).
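To make that concrete, this is roughly what zip(*settings.values()) hands to starmap (a stripped-down illustration with made-up values):
import itertools

settings = dict(
    memory=itertools.repeat("shm_name"),  # constant for every process
    start=range(4),                       # one start index per process
    num=itertools.repeat(25),             # constant chunk size
)
print(list(zip(*settings.values())))
# [('shm_name', 0, 25), ('shm_name', 1, 25), ('shm_name', 2, 25), ('shm_name', 3, 25)]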
Testing
For the testing I am using different input sizes and dimensions. Since the data grows as size ^ dimensions, I limited the test to a size of 128 and 3 dimensions (that is 2,097,152 points, which already starts taking quite a bit of time).
Code
fun
def fun(name, shape, dtype, start, num):
    memory = SharedNumpy.from_name(name, shape=shape, dtype=dtype)
    for idx in range(start * num, min((start + 1) * num, memory.arr.size)):
        # Do something with the indices
        indices = np.unravel_index([idx], shape)
        memory.arr[indices] += np.product(indices)
    memory.shm.close()  # Closes the shared memory for this process.
Running the example
if __name__ == '__main__':
    for size in [5, 10, 15]:
        for dim in [1, 2, 3]:
            memory = distributed(size, dim)
            print(memory)
            memory.shm.unlink()
For the OP's code, I used it with a small addition: I allow the array to have different sizes and dimensions. In any case I use:
def sequential(size, dim):
    array = np.zeros((size,) * dim)
    ...
Looking at the output arrays of both codes, they produce the same results.
Plots
The code for the graphs has been taken from the reply in:
https://codereview.stackexchange.com/questions/165245/plot-timings-for-a-range-of-inputs
With the minor alteration that labels was changed to codes in
empty_multi_index = pd.MultiIndex(levels=[[], []], codes=[[], []], names=['func', 'result'])
Here the 1d, 2d and 3d labels refer to the dimensions and the input is the size.
[Timing plot: Sequentially (OP code)]
[Timing plot: Distributed (this code)]
Results
This method works on an arbitrarily sized numpy array and is able to perform an operation on the indices of the array. It gives you full access to the whole numpy array, so it can also be used to perform different kinds of statistical analysis that do not change the array.
From the timings it can be seen that for small data shapes the distributed version has little to no advantage, because of the extra complexity of creating the processes. However, for larger amounts of data it starts to become more effective.
I only timed it with short delays in the computational time (a simple fun), but with more complex calculations it should outperform the sequential version much sooner.
Extra
If you are only interested in operations that are performed over or along axis, these numpy functions might help to vectorize your solutions instead of using multiprocessing:
np.apply_over_axes
np.apply_along_axis
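As a small illustration of the second of these (not from the original answer), np.apply_along_axis applies a 1-D function along one axis of an nD array:
import numpy as np

a = np.arange(24).reshape(2, 3, 4)
# range (max - min) of each length-4 vector along the last axis -> result has shape (2, 3)
ranges = np.apply_along_axis(lambda v: v.max() - v.min(), -1, a)
print(ranges.shape)  # (2, 3)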

Usage of threadpoolexecutor in conjunction with cython's nogil

I have read this question and answer - Cython nogil with ThreadPoolExecutor not giving speedups - and I have a similar problem: my Cython code is not getting the expected speedup in spite of my system having multiple cores. I have 4 physical cores on an Ubuntu 18.04 instance, and if I set the number of jobs to 1 in the code below it runs faster than when I set it to 4. Looking at the CPU usage using top, I see the CPU usage go up to 300%. I am doing lookups on a data structure in a C++ class that does not get modified, i.e. I am only doing read-only queries on the C++ data structure via Cython. There are no mutex locks whatsoever on the C++ side.
This is my first experience with the GIL and I am wondering whether I have used it incorrectly. Also, the output of the timing is a bit confusing, as I do not think it correctly profiles the actual time taken by each of the worker threads.
I appear to have missed something crucial but I cannot figure out what it is as I have pretty much used the same template for the usage of the GIL as seen in the linked SO answer.
import time
import psutil
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from functools import partial

cdef extern from "Rectangle.h" namespace "shapes":
    cdef cppclass Rectangle:
        Rectangle(int, int, int, int)
        int x0, y0, x1, y1
        int getArea() nogil

cdef class PyRectangle:
    cdef Rectangle *rect

    def __cinit__(self, int x0, int y0, int x1, int y1):
        self.rect = new Rectangle(x0, y0, x1, y1)

    def __dealloc__(self):
        del self.rect

    def testThread(self):
        latGrid = np.arange(minLat, maxLat, 0.05)
        lonGrid = np.arange(minLon, maxLon, 0.05)
        gridLon, gridLat = np.meshgrid(latGrid, lonGrid)
        grid_points = np.c_[gridLon.ravel(), gridLat.ravel()]
        n_jobs = psutil.cpu_count(logical=False)
        chunk = np.array_split(grid_points, n_jobs, axis=0)
        x = ThreadPoolExecutor(max_workers=n_jobs)
        t0 = time.time()
        func = partial(self.performCalc, maxDistance)
        results = x.map(func, chunk)
        results = np.vstack(list(results))
        t1 = time.time()
        print(t1 - t0)

    def performCalc(self, maxDistance, chunk):
        cdef int area
        cdef double[:,:] gPoints
        gPoints = memoryview(chunk)
        for i in range(0, len(gPoints)):
            with nogil:
                area = self.getArea2(gPoints[i])
        return area

    cdef int getArea2(self, double[:] p) nogil:
        cdef int area
        area = self.rect.getArea()
        return area
My suggestion (in the comments) was to ensure that the entire performCalc loop was nogil. To do this a few changes were needed:
cdef Py_ssize_t i  # set type of "i" (although Cython can possibly deduce this anyway)

with nogil:
    for i in range(0, gPoints.shape[0]):
        area = self.getArea2(gPoints[i])
The most important of these is swapping len(gPoints) for gPoints.shape[0], which replaces a call to a Python function with an array attribute lookup (also, I personally don't think len makes much sense for a 2D array).
Essentially there's a cost to acquiring and releasing the GIL. You want to make sure that the work done without the GIL is worth the time spent handling it. Simply calculating the area of a rectangle is pretty trivial (two subtractions and a multiplication) and so doesn't really justify the time spent coordinating the GIL between threads - remember that once per loop iteration each thread must (briefly) hold the GIL, during which time no other thread can hold it. With the whole loop as nogil, however, the time spent on this administration becomes tiny.
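Put together, a performCalc along these lines might look as follows (a sketch based on the snippet above, not the answer's verbatim code):
def performCalc(self, maxDistance, chunk):
    cdef int area = 0
    cdef double[:,:] gPoints = memoryview(chunk)
    cdef Py_ssize_t i
    # release the GIL once around the whole loop instead of once per iteration
    with nogil:
        for i in range(gPoints.shape[0]):
            area = self.getArea2(gPoints[i])
    return area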

Cython: make prange parallelization thread-safe

Cython starter here. I am trying to speed up a calculation of a certain pairwise statistic (in several bins) by using multiple threads. In particular, I am using prange from cython.parallel, which internally uses openMP.
The following minimal example illustrates the problem (compilation via Jupyter notebook Cython magic).
Notebook setup:
%load_ext Cython
import numpy as np
Cython code:
%%cython --compile-args=-fopenmp --link-args=-fopenmp -a
from cython cimport boundscheck
import numpy as np
from cython.parallel cimport prange, parallel
#boundscheck(False)
def my_parallel_statistic(double[:] X, double[:,::1] bins, int num_threads):
cdef:
int N = X.shape[0]
int nbins = bins.shape[0]
double Xij,Yij
double[:] Z = np.zeros(nbins,dtype=np.float64)
int i,j,b
with nogil, parallel(num_threads=num_threads):
for i in prange(N,schedule='static',chunksize=1):
for j in range(i):
#some pairwise quantities
Xij = X[i]-X[j]
Yij = 0.5*(X[i]+X[j])
#check if in bin
for b in range(nbins):
if (Xij < bins[b,0]) or (Xij > bins[b,1]):
continue
Z[b] += Xij*Yij
return np.asarray(Z)
mock data and bins
X = np.random.rand(10000)
bin_edges = np.linspace(0.,1,11)
bins = np.array([bin_edges[:-1],bin_edges[1:]]).T
bins = bins.copy(order='C')
Timing via
%timeit my_parallel_statistic(X,bins,1)
%timeit my_parallel_statistic(X,bins,4)
yields
1 loop, best of 3: 728 ms per loop
1 loop, best of 3: 330 ms per loop
which is not a perfect scaling, but that is not the main point of the question. (But do let me know if you have suggestions beyond adding the usual decorators or fine-tuning the prange arguments.)
However, this calculation is apparently not thread-safe:
Z1 = my_parallel_statistic(X,bins,1)
Z4 = my_parallel_statistic(X,bins,4)
np.allclose(Z1,Z4)
reveals a significant difference between the two results (up to 20% in this example).
I strongly suspect that the problem is that multiple threads can do
Z[b] += Xij*Yij
at the same time. But what I don't know is how to fix this without sacrificing the speed-up.
In my actual use case, the calculation of Xij and Yij is more expensive, hence I would like to do them only once per pair. Also, pre-computing and storing Xij and Yij for all pairs and then simply looping through bins is not a good option either because N can get very large, and I can't store 100,000 x 100,000 numpy arrays in memory (this was actually the main motivation for rewriting it in Cython!).
System info (added following suggestion in comments):
CPU(s): 8
Model name: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
OS: Red Hat Linux v6.8
Memory: 16 GB
Yes, Z[b] += Xij*Yij is indeed a race condition.
There are a couple of options for making this atomic or critical. Implementation issues with Cython aside, you would in any case have poor performance due to false sharing on the shared Z vector.
So the better alternative is to reserve a private array for each thread. There are a couple of (non-)options again. One could use a private malloc'd pointer, but I wanted to stick with numpy. Memoryview slices cannot be assigned as private variables. A two-dimensional (num_threads, nbins) array works, but for some reason generates very complicated, inefficient array-index code; it works, but it is slower and does not scale.
A flat numpy array with manual "2D" indexing works well. You get a little extra performance by padding each thread's private part of the array to 64 bytes, a typical cache-line size. This avoids false sharing between the cores. The private parts are simply summed up serially outside of the parallel region.
%%cython --compile-args=-fopenmp --link-args=-fopenmp -a

from cython cimport boundscheck
import numpy as np
from cython.parallel cimport prange, parallel
cimport openmp

@boundscheck(False)
def my_parallel_statistic(double[:] X, double[:,::1] bins, int num_threads):

    cdef:
        int N = X.shape[0]
        int nbins = bins.shape[0]
        double Xij, Yij
        # pad local data to 64 bytes to avoid false sharing of cache-lines
        int nbins_padded = (((nbins - 1) // 8) + 1) * 8
        double[:] Z_local = np.zeros(nbins_padded * num_threads, dtype=np.float64)
        double[:] Z = np.zeros(nbins)
        int i, j, b, bb, tid

    with nogil, parallel(num_threads=num_threads):
        tid = openmp.omp_get_thread_num()
        for i in prange(N, schedule='static', chunksize=1):
            for j in range(i):

                #some pairwise quantities
                Xij = X[i] - X[j]
                Yij = 0.5 * (X[i] + X[j])

                #check if in bin
                for b in range(nbins):
                    if (Xij < bins[b,0]) or (Xij > bins[b,1]):
                        continue
                    Z_local[tid * nbins_padded + b] += Xij * Yij

    for tid in range(num_threads):
        for bb in range(nbins):
            Z[bb] += Z_local[tid * nbins_padded + bb]

    return np.asarray(Z)
This performs quite well on my 4 core machine, with 720 ms / 191 ms, a speedup of 3.6. The remaining gap may be due to turbo mode. I don't have access to a proper machine for testing right now.
You are right that the access to Z is subject to a race condition.
You might be better off defining num_threads copies of Z, as cdef double[:,:] Z = np.zeros((num_threads, nbins), dtype=np.float64), and performing a sum along axis 0 after the prange loop:
return np.sum(Z, axis=0)
Cython code can have a with gil statement in a parallel region, but it is only documented for error handling. You could have a look at the generated C code to see whether that would trigger an atomic OpenMP operation, but I doubt it.
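For reference, a minimal sketch of that per-thread-row variant (my reading of the suggestion above; note that the first answer found this kind of 2D indexing slower and less scalable than the padded flat array):
%%cython --compile-args=-fopenmp --link-args=-fopenmp

from cython cimport boundscheck
import numpy as np
from cython.parallel cimport prange, parallel
cimport openmp

@boundscheck(False)
def my_parallel_statistic_2d(double[:] X, double[:,::1] bins, int num_threads):

    cdef:
        int N = X.shape[0]
        int nbins = bins.shape[0]
        double Xij, Yij
        # one row of bins per thread, summed over axis 0 at the end
        double[:,::1] Z = np.zeros((num_threads, nbins), dtype=np.float64)
        int i, j, b, tid

    with nogil, parallel(num_threads=num_threads):
        tid = openmp.omp_get_thread_num()
        for i in prange(N, schedule='static', chunksize=1):
            for j in range(i):
                Xij = X[i] - X[j]
                Yij = 0.5 * (X[i] + X[j])
                for b in range(nbins):
                    if (Xij < bins[b,0]) or (Xij > bins[b,1]):
                        continue
                    Z[tid, b] += Xij * Yij

    return np.asarray(Z).sum(axis=0)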

How can I determine if this write access is coalesced?

How can I determine if the following memory access is coalesced or not:
// Thread-ID
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// Offset:
int offset = gridDim.x * blockDim.x;
while ( idx < NUMELEMENTS )
{
    // Do Something
    // ....
    // Write to Array which contains results of calculations
    results[ idx ] = df2;
    // Next Element
    idx += offset;
}
NUMELEMENTS is the total number of individual data elements to process. The array results is passed as a pointer to the kernel function and allocated beforehand in global memory.
My Question: Is the write access in the line results[ idx ] = df2; coalesced?
I believe it is, as each thread processes consecutively indexed items, but I'm not completely sure about it and I don't know how to tell.
Thanks!
It depends on whether the length of the lines of your matrix is a multiple of half the warp size for devices of compute capability 1.x, or a multiple of the warp size for devices of compute capability 2.x. If it is not, you can use padding to make the access fully coalesced. The function cudaMallocPitch can be used for this purpose.
edit:
Sorry for the confusion. You write 'offset' elements at a time, which I interpreted as lines of a matrix.
What I mean is: after each iteration of your loop you increase idx by offset. If offset is a multiple of half the warp size for devices of compute capability 1.x, or a multiple of the warp size for devices of compute capability 2.x, then the access is coalesced; if not, you need padding to make it so.
It is probably already coalesced, because you should choose the number of threads per block, and thus blockDim, as a multiple of the warp size.
