Cython prange with function reading from numpy array (memoryview) freezes, why? - multithreading

I have to use the rows of a numpy array sequentially so I am doing that with a loop. I tried adding prange for speed but it ends up freezing. I am most likely doing something wrong, I may be misunderstanding the meaning of a race condition in this context. What can be done to fix the problem or to treat the array rows in parallel fashion correctly? Reproducible code below for IPython cells:
%load_ext cython
import numpy as np
%%cython --compile-args=-fopenmp --link-args=-fopenmp --force
# if on Windows, replace line above with: %%cython --compile-args=/openmp --link-args=/openmp --force
cimport cython
from cython.parallel cimport prange

@cython.boundscheck(False)
@cython.wraparound(False)
cdef double sum_row(double [:] arr) nogil:
    cdef int i
    cdef int size = arr.shape[0]
    cdef double s = 0
    for i in range(size):
        s += arr[i]
    return s

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef double sum_all_rows(double [:,:] arr) nogil:
    cdef int i
    cdef int n_rows = arr.shape[1] + 1
    cdef double s = 0
    for i in range(n_rows):
        s += sum_row(arr[i])
    return s

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef double sum_all_rows_parallel(double [:,:] arr) nogil:
    cdef int i
    cdef int n_rows = arr.shape[1] + 1
    cdef double s = 0
    for i in prange(n_rows):
        s += sum_row(arr[i])
    return s
X = np.array([[3.14, 2.71, 0.002],
              [0.5, 1, 4.21],
              [0.001, 0.002, 0.003],
              [-0.1, -0.11, -0.12]])
%timeit sum_all_rows(X)
754 ns ± 38.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
sum_all_rows_parallel(X)
And it hangs... I stopped it after roughly 5 minutes. I did not observe any increase in memory or CPU usage.
I have used toy examples of prange before and seen the expected speedups, so I am able to compile properly.
Also, any suggested readings to better understand this kind of issue?
Thank you for your help.

This is because nogil means two slightly different things in different contexts. In
cpdef double sum_all_rows_parallel(double [:,:] arr) nogil:
it means "this function does not require the GIL and can be called without it", while in:
with nogil:
# or
prange(..., nogil=True)
it means "release the GIL".
Therefore the GIL is never actually released in your example (since sum_all_rows_parallel is presumably called from Python, which holds the GIL), but prange is working under the assumption that it has been released.
What you should do is change prange(...) to prange(...,nogil=True). You possibly also need to remove the nogil from the sum_all_rows_parallel function definition.
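For reference, a corrected version might look like the sketch below. It is untested here and keeps the structure of the question's code; note also, as an aside, that the row count should probably come from arr.shape[0] rather than arr.shape[1] + 1, which only happens to match for the 4x3 example array.
@cython.boundscheck(False)
@cython.wraparound(False)
cpdef double sum_all_rows_parallel(double [:,:] arr):
    cdef int i
    cdef int n_rows = arr.shape[0]   # number of rows, not columns + 1
    cdef double s = 0
    # release the GIL for the parallel loop itself
    for i in prange(n_rows, nogil=True):
        s += sum_row(arr[i])
    return s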
Arguably it's a bug in Cython that it doesn't warn you about the issue, and should be reported. However, I suspect it may not be easily fixed.

Related

Multiprocessing pool map for a BIG array computation goes much slower than expected

I've experienced some difficulties when using a multiprocessing Pool in Python 3. I want to do a BIG array calculation using pool.map. Basically, I have a 3D array on which I need to do the computation 10 times, which generates 10 output files sequentially. This task can be done 3 times, i.e. in the end we get 3*10 = 30 output files (*.txt). To do this, I've prepared the following script for a small array calculation (a sample problem). However, when I use this script for a BIG array calculation, or for arrays coming from a series of files, then this piece of code (maybe the pool) eats up memory and does not save any .txt file to the destination directory. There is no error message when I run the file with the command mpirun python3 sample_prob_func.py
Can anybody suggest what the problem in the sample script is, and how to write the code so that it does not get stuck? I haven't received any error message and don't know where the problem occurs. Any help is appreciated. Thanks!
import numpy as np
import multiprocessing as mp
from scipy import signal
import matplotlib.pyplot as plt
import contextlib
import os, glob, re
import random
import cmath, math
import time
import pdb

# File storing path
save_results_to = 'File saving path'

arr_x = [0, 8.49, 0.0, -8.49, -12.0, -8.49, -0.0, 8.49, 12.0]
arr_y = [0, 8.49, 12.0, 8.49, 0.0, -8.49, -12.0, -8.49, -0.0]
N = len(arr_x)
np.random.seed(12345)
total_rows = 5000
arr = np.reshape(np.random.rand(total_rows*N), (total_rows, N))
arr1 = np.reshape(np.random.rand(total_rows*N), (total_rows, N))
arr2 = np.reshape(np.random.rand(total_rows*N), (total_rows, N))

# Finding cross spectral density (CSD)
def my_func1(data):
    # Do something here
    return array1

t0 = time.time()
my_data1 = my_func1(arr)
my_data2 = my_func1(arr1)
my_data3 = my_func1(arr2)
print('Time required {} seconds to execute CSD--For loop'.format(time.time()-t0))

mydata_list = [my_data1, my_data3, my_data3]

def my_func2(data2):
    # Do something here
    return from_data2

start_freq = 100
stop_freq = 110
freq_range = np.around(np.linspace(start_freq, stop_freq, 11)/10, decimals=2)
no_of_freq = len(freq_range)

list_arr = []

def my_func3(csd):
    list_csd = []
    for fr_count in range(start_freq, stop_freq):
        csd_single = csd[:, :, fr_count]
        list_csd.append(csd_single)
    print('Shape of list is :', np.array(list_csd).shape)
    return list_csd

def parallel_function(BIG_list_data):
    with contextlib.closing(mp.Pool(processes=10)) as pool:
        dft = pool.map(my_func2, BIG_list_data)
        pool.close()
        pool.join()
    data_arr = np.array(dft)
    print('shape of data :', data_arr.shape)
    return data_arr

count_day = 1
count_hour = 0
for count in range(3):
    count_hour += 1
    list_arr = my_func3(mydata_list[count])  # Load Numpy files
    print('Array shape is :', np.array(arr).shape)
    t0 = time.time()
    data_dft = parallel_function(list_arr)
    print('The hour number={} data is processing... '.format(count_hour))
    print('Time in parallel:', time.time() - t0)
    for i in range(no_of_freq-1):  # (11-1=10)
        jj = freq_range[i]
        #print('The hour_number {} and frequency number {} data is processing... '.format(count_hour, jj))
        dft_1hr_complx = data_dft[i, :, :]
        np.savetxt(save_results_to + f'csd_Day_{count_day}_Hour_{count_hour}_f_{jj}_hz.txt', dft_1hr_complx.view(float))
As #JérômeRichard suggested, you need to tell your script how many processors the job scheduler has actually assigned to the task. The following line helps with that: ncpus = int(os.getenv('SLURM_CPUS_PER_TASK', 1))
Put this line inside your Python script, and inside parallel_function use with contextlib.closing(mp.Pool(processes=ncpus)) as pool: instead of the hard-coded with contextlib.closing(mp.Pool(processes=10)) as pool:. Thanks
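A minimal sketch of how that could fit together (assuming my_func2 and numpy are defined as in the question, and that the script runs under SLURM so the environment variable is set):
import os
import contextlib
import multiprocessing as mp
import numpy as np

# number of CPUs the scheduler granted to this task (defaults to 1 outside SLURM)
ncpus = int(os.getenv('SLURM_CPUS_PER_TASK', 1))

def parallel_function(BIG_list_data):
    # size the pool from the scheduler instead of hard-coding 10 workers
    with contextlib.closing(mp.Pool(processes=ncpus)) as pool:
        dft = pool.map(my_func2, BIG_list_data)
    pool.join()
    return np.array(dft)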

read-only numpy array in threading python

I have multiple threads that use a numpy array.
import threading
import numpy as np
import time

shared_array = np.ones((5, 5))

def run(shared_array, nb_iters):
    k = shared_array**2
    for i in range(nb_iters):
        k += 2

def multi_thread():
    jobs = []
    for _ in range(5):
        thread = threading.Thread(target=run, args=(shared_array, 1000000))
        jobs.append(thread)
    for j in jobs:
        j.start()
    for j in jobs:
        j.join()

t0 = time.time()
multi_thread()
print(time.time() - t0)
# result: 6.502177000045776

t0 = time.time()
# we used 1000000 iterations for each thread => total nb of iterations = 5 * 1000000
run(shared_array, 1000000 * 5)
print(time.time() - t0)
# result: 6.6372435092926025
The problem is that, even after passing the numpy array as an argument, the execution time of 5 parallel threads is equal to that of a sequential execution! So I want to know how to make a program (similar to this one) actually run in parallel.
That's a poor example. Python has an internal lock (the global interpreter lock, GIL) that means only one thread at a time can be executing Python code. When you go into numpy, that can run in parallel, but because your array is so small, you are spending almost no time in numpy, so you aren't getting any parallelism to speak of.
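To illustrate that last point, here is a rough sketch (the array shape and iteration counts are made up, not from the question): with a large array, most of the time is spent inside numpy's element-wise loops, which generally release the GIL, so the threads can genuinely overlap.
import threading
import time
import numpy as np

big_array = np.ones((2000, 2000))

def run(arr, nb_iters):
    k = arr ** 2
    for _ in range(nb_iters):
        k += 2  # element-wise numpy op on a large array; the GIL is released while it runs

def multi_thread(n_threads, iters_per_thread):
    jobs = [threading.Thread(target=run, args=(big_array, iters_per_thread))
            for _ in range(n_threads)]
    for j in jobs:
        j.start()
    for j in jobs:
        j.join()

t0 = time.time()
multi_thread(5, 200)
print('threaded:', time.time() - t0)

t0 = time.time()
run(big_array, 5 * 200)      # same total work, done sequentially
print('sequential:', time.time() - t0)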

Fast way to apply a function on each pixel of a PIL Image

I need to apply a function to each pixel in large PIL Images. I found similar questions here, but somehow the answers never worked for me (mostly, because they were specific to the function).
Going through every pixel with two for-loops works, but is insanely slow. So I thought, there may be a faster way like numpy.apply_along_axis or vectorization. However, the first one is slow, too, and I cannot get the second one to work. I appreciate any suggestions!
from PIL import Image
import numpy as np

# example function to be applied on each pixel RGB value
# (let's just imagine for now that l and r cannot be 0)
def blueness(rgb):
    r, g, b = int(rgb[0]), int(rgb[1]), int(rgb[2])
    l = (r+g+b)/3
    return (b/r)*(1/l)*100

input_path = 'input.png'
img = Image.open(input_path)
img_np = np.asarray(img)
h, w, z = img_np.shape
converted_img = np.zeros((h, w))

# apply blueness on img_np
for x in range(w):
    for y in range(h):
        converted_img[y, x] = blueness(img_np[y, x])
In general, you have probably already gone wrong if you think about converting images to lists and using for loops in Python. You really need to be vectorising with Numpy or Numba or numexpr or somesuch.
Here is a way to do that on your function:
#!/usr/bin/env python3
import numpy as np

def loopy(na):
    # Create output image in greyscale
    res = np.zeros_like(na[..., 0])
    # apply blueness on na
    for x in range(w):
        for y in range(h):
            res[y, x] = blueness(na[y, x])
    return res

def blueness(rgb):
    r, g, b = int(rgb[0]), int(rgb[1]), int(rgb[2])
    l = (r+g+b)/3
    return (b/r)*(1/l)*100

def me(na):
    # Take mean of RGB values
    l = np.mean(na, axis=2)
    res = (na[..., 2] / na[..., 0]) * 100/l
    return res.astype(np.uint8)

# Guess the height and width
h, w = 1080, 1920

# Create a random image in Numpy array
na = np.random.randint(1, 256, (h, w, 3), np.uint8)

# Double for loop method
res1 = loopy(na)

# Numpy method
res2 = me(na)
Here are the results. Numpy is around 65x faster:
%timeit me(na)
37 ms ± 523 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit loopy(na)
2.36 s ± 11.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can try this:
from PIL import Image
import numpy as np
from pathlib import Path

def blueness(*rgb):
    r, g, b = int(rgb[0]), int(rgb[1]), int(rgb[2])
    l = (r+g+b)/3
    return (b/r)*(1/l)*100

def test():
    fname = Path().absolute() / "test.jpg"
    pil_img = Image.open(fname)
    print(pil_img.size)
    img = np.array(pil_img)
    print(img.shape)
    result = [blueness(r, g, b) for r, g, b in zip(np.nditer(img[..., 0]), np.nditer(img[..., 1]), np.nditer(img[..., 2]))]
    print(len(result))

if __name__ == "__main__":
    test()
np.nditer(img[...,0]) will give you all the red values, np.nditer(img[...,1]) will give all the green values and np.nditer(img[...,2]) will give all the blue values.
But do cross-check your blueness function for the divide-by-zero case: if the r, g, b values are all zero, the 1/l term becomes infinite (and b/r fails whenever r is zero).
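If you want to stay vectorized and still guard against those zeros, one possibility (a sketch, not part of either answer) is to mask the unsafe pixels before dividing:
import numpy as np

def blueness_safe(img):
    na = img.astype(np.float64)
    r, g, b = na[..., 0], na[..., 1], na[..., 2]
    l = (r + g + b) / 3
    out = np.zeros_like(r)
    ok = (r != 0) & (l != 0)                       # only divide where it is safe
    out[ok] = (b[ok] / r[ok]) * (1 / l[ok]) * 100  # zero elsewhere
    return out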

Vectorized shapely operations using Cython : ImportError GEOSPreparedContains_r [duplicate]

I ran into an "undefined symbol" problem when using mlpack from Cython. Here is my test case:
cdef extern from "<mlpack/core.hpp>" namespace "arma":
    ctypedef unsigned uword
    cdef cppclass vec:
        vec()
        vec(uword)
    cdef cppclass mat:
        mat()
        mat(uword, uword)
        void matprint "print" ()
        double& operator() (const uword, const uword)

cdef extern from "<mlpack/methods/pca/pca.hpp>" namespace "mlpack::pca":
    cdef cppclass ExactSVDPolicy:
        ExactSVDPolicy()
    cdef cppclass PCA[ExactSVDPolicy]:
        PCA()
        void Apply(const mat&, mat&, vec&, mat&)

cdef mat m = mat(4, 2)
(<double*>&m(0, 0))[0] = 1.2
(<double*>&m(1, 0))[0] = 1.0
(<double*>&m(2, 0))[0] = 0.8
(<double*>&m(3, 0))[0] = 0.6
(<double*>&m(0, 1))[0] = 0.6
(<double*>&m(1, 1))[0] = 0.8
(<double*>&m(2, 1))[0] = 1.0
(<double*>&m(3, 1))[0] = 1.2

cdef vec eig = vec(2)
cdef mat coeff = mat(4, 2)
cdef PCA[ExactSVDPolicy] pca

m.matprint()
pca.Apply(m, m, eig, coeff)
m.matprint()
Here is the setup file:
from distutils.core import setup
from Cython.Build import cythonize
from distutils.extension import Extension
setup(ext_modules = cythonize([Extension("pca", ["pca.pyx"], language='c++')]))
Compilation was OK, but when I import the module, python complains that:
undefined symbol: _ZN6mlpack5Timer5StartERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
I looked for the symbol; it is defined in libmlpack.so. I put the library in /usr/local/lib, which is included in LD_LIBRARY_PATH, but it seems Python does not find the symbol at runtime. Can anyone help? Thanks.
The extension must be linked to the library it is using.
setup(ext_modules=cythonize([Extension(
    "pca", ["pca.pyx"],
    language='c++',
    libraries=['mlpack'],
)]))
That all symbols can be found and the libraries are linked correctly can be checked with ldd <module>.so.
See Compiling and Linking Cython documentation.
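If libmlpack.so sits somewhere the linker does not search by default (the question mentions /usr/local/lib), the extension can also be pointed at it explicitly; a sketch under that assumption:
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize

setup(ext_modules=cythonize([Extension(
    "pca", ["pca.pyx"],
    language='c++',
    libraries=['mlpack'],
    library_dirs=['/usr/local/lib'],          # search path at link time
    runtime_library_dirs=['/usr/local/lib'],  # rpath so the library is also found at import time
)]))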

Some Python 3 behavior I am unable to understand

I have used the following code.
from collections import defaultdict
from random import randint, randrange, choice, shuffle

def random_array(low, high, step, size):
    lst = []
    while len(lst) < size:
        nexts = randrange(low, high, step)
        if nexts in lst: continue
        lst.append(nexts)
    return lst

def find_pair_from_two_list(a, b, val):
    b_dict = defaultdict(int)
    for i, v in enumerate(b): b_dict[v] = i
    for v in a:
        if (val - v) in b_dict:
            return v, val-v
    return -1, -1

arr1 = random_array(1, 100, 1, 99)
arr2 = random_array(1, 100, 1, 99)
val1 = choice(arr1)
val2 = choice(arr2)
val = val1 + val2
print(find_pair_from_two_list(arr1, arr2, val))
However, if I change the size value in
arr1 = random_array(1, 100, 1, 99)
arr2 = random_array(1, 100, 1, 99)
it works instantly for sizes up to 99, but if I change either size value to 100 or more it just seems to hang.
I am curious to know why this is happening. It works well up to 99, so what causes it to hang at even 100?
Why is yours slow:
With arr1 = random_array(1, 100, 1, 100), your method can take a very long time to draw the last missing numbers, because you draw new random values over and over and discard any that are already in your result list:
while len(lst) < size:
    nexts = randrange(low, high, step)
    if nexts in lst: continue  # discard numbers that are already in the list
    lst.append(nexts)
return lst
With inputs like this you essentially draw "all" possible numbers until done, and the more your result already contains, the longer it takes to draw another "fitting" one.
You can even produce an endless loop if your range(low, high, step) has fewer total values than your size demands; that is exactly what happens here, since range(1, 100, 1) only contains 99 distinct values, so asking for 100 unique ones can never finish. For example:
(1, 100, 5, 100)  # => only 20 values in this range with this step -> endless loop
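A quick way to fail fast instead of looping forever is to check the size of the candidate range up front (a sketch, not from the original answer; the random.sample version shown further down raises a ValueError in this situation anyway):
from random import randrange

def random_array_checked(low, high, step, size):
    candidates = range(low, high, step)
    if len(candidates) < size:
        raise ValueError(
            f"cannot draw {size} unique values from only {len(candidates)} candidates")
    lst = []
    while len(lst) < size:
        nexts = randrange(low, high, step)
        if nexts not in lst:
            lst.append(nexts)
    return lst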
Possible simplification (not optimal)
You could simplify and speed up the code like this:
import random

def random_array(low, high, step, size):
    poss = list(range(low, high, step))  # this does not contain duplicates
    random.shuffle(poss)                 # shuffle it
    return poss[:size]                   # return size (or all) elements from it

print(random_array(1, 100, 1, 10))
This code will still return if you pass it a "wrong" combination, but the resulting list is then shorter than the size you asked for.
Even better
jonsharpe's suggestion to use
random.sample(range(low, high, step), size)
like so:
def ra(low, high, step, size):
    return random.sample(range(low, high, step), size)
Performance test
Performance-wise, the random.sample version easily outperforms mine for big ranges:
import random

def random_array(low, high, step, size):
    poss = list(range(low, high, step))
    random.shuffle(poss)
    return poss[:size]

def ra(low, high, step, size):
    return random.sample(range(low, high, step), size)

if __name__ == '__main__':
    import timeit
    # draw 495 randoms from range(1, 1000000, 22), 10000 times each
    print(timeit.timeit("ra(1,1000000,22,495)", setup="from __main__ import ra", number=10000))
    print(timeit.timeit("random_array(1,1000000,22,495)", setup="from __main__ import random_array", number=10000))
Output:
1.1825043768664596 # random.sample(...) of range(...)
92.12594874871951 # mine
The reason is probably that I create actual lists from the ranges, whereas random.sample works with the range object directly and samples from it lazily.
Docs:
https://docs.python.org/3.1/library/random.html
https://docs.python.org/3/library/timeit.html
