I am trying to get the prange function of Cython's parallel package to work, and it seems like there is no parallelism in effect. To have an MWE, I have taken the example code from the book Cython: A Guide for Python Programmers and modified it a little by adding a few print statements. The example code is freely available on GitHub, and the code I'm referring to is at: examples/12-parallel-cython/02-prange-parallel-loops/.
The following is my modification of the julia.pyx file.
    # distutils: extra_compile_args = -fopenmp
    # distutils: extra_link_args = -fopenmp

    from cython cimport boundscheck, wraparound
    from cython cimport parallel

    import numpy as np

    cdef inline double norm2(double complex z) nogil:
        return z.real * z.real + z.imag * z.imag

    cdef int escape(double complex z,
                    double complex c,
                    double z_max,
                    int n_max) nogil:
        cdef:
            int i = 0
            double z_max2 = z_max * z_max
        while norm2(z) < z_max2 and i < n_max:
            z = z * z + c
            i += 1
        return i

    @boundscheck(False)
    @wraparound(False)
    def calc_julia(int resolution, double complex c,
                   double bound=1.5, double z_max=4.0, int n_max=1000):
        cdef:
            double step = 2.0 * bound / resolution
            int i, j
            double complex z
            double real, imag
            int[:, ::1] counts
        counts = np.zeros((resolution+1, resolution+1), dtype=np.int32)
        for i in parallel.prange(resolution + 1, nogil=True,
                                 schedule='static', chunksize=1):
            real = -bound + i * step
            for j in range(resolution + 1):
                imag = -bound + j * step
                z = real + imag * 1j
                counts[i,j] = escape(z, c, z_max, n_max)
        return np.asarray(counts)
    @boundscheck(False)
    @wraparound(False)
    def julia_fraction(int[:,::1] counts, int maxval=1000):
        cdef:
            unsigned int thread_id
            int total = 0
            int i, j, N, M
        N = counts.shape[0]; M = counts.shape[1]
        print("N = %d" % N)
        with nogil:
            for i in parallel.prange(N, schedule="static", chunksize=10):
                thread_id = parallel.threadid()
                with gil:
                    print("Thread %d." % (thread_id))
                for j in range(M):
                    if counts[i,j] == maxval:
                        total += 1
        return total / float(counts.size)
When I compile using the setup_julia.py given by
    from distutils.core import setup
    from Cython.Build import cythonize
    from distutils.extension import Extension

    setup(name="julia",
          ext_modules=cythonize(Extension('julia', ['julia.pyx'],
                                          extra_compile_args=['-fopenmp'],
                                          extra_link_args=['-fopenmp'])))
with the command
    python setup_julia.py build_ext --inplace
and run the run_julia.py file, I see that all iterations of the for loop use only one thread -- Thread 0. The terminal output looks like this:
    poulin8:02-prange-parallel-loops poulingroup$ python run_julia.py
    time: 0.892143
    julia fraction: N = 1001
    Thread 0.
    Thread 0.
    Thread 0.
    Thread 0.
    .
    .
    .
    .
    Thread 0.
    0.236994773458
As I understand it, the for loop is simply not running in parallel. Could someone guide me on what I must do to get the for loop to distribute the load amongst multiple threads?

I have also tried setting the environment variable OMP_NUM_THREADS to a number greater than 1, and it has no effect.

I am running the tests on OSX 10.11.6, with Python 2.7.10 and gcc 5.2.0.
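For reference, here is a quick way to check whether the extension was actually built with OpenMP support (a small sketch using Cython's bundled openmp declarations, not part of the book's example; without a working -fopenmp the call below returns 1):

    # check_openmp.pyx -- hypothetical diagnostic module, compiled like julia.pyx
    cimport openmp

    # Size of the OpenMP thread pool; 1 means OpenMP is absent or disabled.
    print(openmp.omp_get_max_threads())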
I've got the same problem on Windows 7. It was running serially. I noticed this compilation message:

    python setup_julia.py build_ext --inplace
    cl : Command line warning D9002 : ignoring unknown option '-fopenmp'

Apparently with Visual Studio it has to be -openmp:

    # distutils: extra_compile_args = -openmp
    # distutils: extra_link_args = -openmp

Now it runs in parallel.
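A minimal sketch of how setup_julia.py could pick the flag per platform instead of hard-coding it in the .pyx header (assuming MSVC on Windows and GCC/Clang elsewhere):

    import sys
    from distutils.core import setup
    from distutils.extension import Extension
    from Cython.Build import cythonize

    # MSVC spells the option -openmp (or /openmp) and needs no link flag;
    # GCC/Clang want -fopenmp at both compile and link time.
    if sys.platform == 'win32':
        compile_args, link_args = ['-openmp'], []
    else:
        compile_args, link_args = ['-fopenmp'], ['-fopenmp']

    setup(name="julia",
          ext_modules=cythonize(Extension('julia', ['julia.pyx'],
                                          extra_compile_args=compile_args,
                                          extra_link_args=link_args)))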
As @danny noted, you can use fprintf instead of print; it writes through C stdio, so the threads never have to re-acquire the GIL (the with gil: block in the question serializes the loop body):
    from cython.parallel cimport prange, threadid
    from libc.stdio cimport stdout, fprintf

    def julia_fraction(int[:,::1] counts, int maxval=1000):
        ...
        thread_id = threadid()
        fprintf(stdout, "%d\n", <int>thread_id)
        ...
Related
I'm trying to apply a method for baselining vibrational spectra, which is presented as an improvement over asymmetric and iteratively reweighted least-squares algorithms in the 2015 paper (doi:10.1039/c4an01061b), where the following Matlab code was provided:
    function z = baseline(y, lambda, ratio)
    % Estimate baseline with arPLS in Matlab
    N = length(y);
    D = diff(speye(N), 2);
    H = lambda*D'*D;
    w = ones(N, 1);
    while true
        W = spdiags(w, 0, N, N);
        % Cholesky decomposition
        C = chol(W + H);
        z = C \ (C' \ (w.*y) );
        d = y - z;
        % make d-, and get w^t with m and s
        dn = d(d<0);
        m = mean(dn);
        s = std(dn);
        wt = 1./ (1 + exp( 2* (d-(2*s-m))/s ) );
        % check exit condition and backup
        if norm(w-wt)/norm(w) < ratio, break; end
        w = wt;
    end
which I rewrote in Python:
    def baseline_arPLS(y, lam, ratio):
        # Estimate baseline with arPLS
        N = len(y)
        k = [numpy.ones(N), -2*numpy.ones(N-1), numpy.ones(N-2)]
        offset = [0, 1, 2]
        D = diags(k, offset).toarray()
        H = lam * numpy.matmul(D.T, D)
        w_ = numpy.ones(N)
        while True:
            W = spdiags(w_, 0, N, N, format='csr')
            # Cholesky decomposition
            C = cholesky(W + H)
            z_ = spsolve(C.T, w_ * y)
            z = spsolve(C, z_)
            d = y - z
            # make d- and get w^t with m and s
            dn = d[d<0]
            m = numpy.mean(dn)
            s = numpy.std(dn)
            wt = 1. / (1 + numpy.exp(2 * (d - (2*s-m)) / s))
            # check exit condition and backup
            norm_wt, norm_w = norm(w_-wt), norm(w_)
            if (norm_wt / norm_w) < ratio:
                break
            w_ = wt
        return z
Except for the input vector y, the method requires the parameters lam and ratio. It runs OK for values lam < 1e+07 and ratio > 1e-01, but outputs poor results. When the values are changed outside this range, for example to lam=1e+07, ratio=1e-02, the CPU starts heating up and the job never finishes (I interrupted it after 1 min). In both cases the following warning also shows up:
    /usr/local/lib/python3.9/site-packages/scipy/sparse/linalg/dsolve/linsolve.py:144: SparseEfficiencyWarning: spsolve requires A be CSC or CSR matrix format
      warn('spsolve requires A be CSC or CSR matrix format',
even though I added the recommended format='csr' option to the spdiags call.
And here's some synthetic data (similar to that in the paper) for testing purposes. The noise was added along with a 3rd-degree polynomial baseline. The method works well for parameters bl_1 and fails to converge for bl_2:
    import numpy
    from matplotlib import pyplot
    from scipy.sparse import spdiags, diags, identity
    from scipy.sparse.linalg import spsolve
    from numpy.linalg import cholesky, norm
    import sys

    x = numpy.arange(0, 1000)
    noise = numpy.random.uniform(low=0, high=10, size=len(x))
    poly_3rd_degree = numpy.poly1d([1.2e-06, -1.23e-03, .36, -4.e-04])
    poly_baseline = poly_3rd_degree(x)
    y = 100 * numpy.exp(-((x-300)/15)**2) + \
        200 * numpy.exp(-((x-750)/30)**2) + \
        100 * numpy.exp(-((x-800)/15)**2) + noise + poly_baseline

    bl_1 = baseline_arPLS(y, 1e+07, 1e-01)
    bl_2 = baseline_arPLS(y, 1e+07, 1e-02)

    pyplot.figure(1)
    pyplot.plot(x, y, 'C0')
    pyplot.plot(x, poly_baseline, 'C1')
    pyplot.plot(x, bl_1, 'k')
    pyplot.show()
    sys.exit(0)
All this tells me that I'm doing something very non-optimal in my Python implementation. Since I'm not knowledgeable enough about the intricacies of scipy computations, I'm kindly asking for suggestions on how to achieve convergence in these calculations.

(I encountered an issue running the "straight" Matlab version of the code, because the line D = diff(speye(N), 2); truncates the last two rows of the matrix, creating a dimension mismatch later in the function. Following the description of matrix D's appearance, I substituted that line by directly creating a banded matrix with the diags function.)
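For illustration, a minimal sketch of the two constructions (same scipy.sparse imports as above; N kept small just to make the shapes visible):

    import numpy
    from scipy.sparse import diags

    N = 8
    # Matlab's diff(speye(N), 2): a rectangular (N-2) x N second-difference matrix
    D_rect = diags([1.0, -2.0, 1.0], [0, 1, 2], shape=(N - 2, N))
    # The substitute used above: a square N x N band matrix with the same stencil
    D_sq = diags([numpy.ones(N), -2*numpy.ones(N-1), numpy.ones(N-2)], [0, 1, 2])
    # Either way, H = lam * D.T @ D acts as the smoothness penalty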
Guided by the comment @hpaulj made, and suspecting that the loop exit wasn't coded properly, I revisited the paper and found out that the authors actually implemented an exit condition that is not featured in their Matlab script. Changing the while-loop condition provides an exit for any set of parameters; my understanding is that the algorithm is not guaranteed to converge in all cases, which is why this condition is necessary but was omitted by error. Here's the edited version of my Python code:
    def baseline_arPLS(y, lam, ratio):
        # Estimate baseline with arPLS
        N = len(y)
        k = [numpy.ones(N), -2*numpy.ones(N-1), numpy.ones(N-2)]
        offset = [0, 1, 2]
        D = diags(k, offset).toarray()
        H = lam * numpy.matmul(D.T, D)
        w_ = numpy.ones(N)
        i = 0
        N_iterations = 100
        while i < N_iterations:
            W = spdiags(w_, 0, N, N, format='csr')
            # Cholesky decomposition
            C = cholesky(W + H)
            z_ = spsolve(C.T, w_ * y)
            z = spsolve(C, z_)
            d = y - z
            # make d- and get w^t with m and s
            dn = d[d<0]
            m = numpy.mean(dn)
            s = numpy.std(dn)
            wt = 1. / (1 + numpy.exp(2 * (d - (2*s-m)) / s))
            # check exit condition and backup
            norm_wt, norm_w = norm(w_-wt), norm(w_)
            if (norm_wt / norm_w) < ratio:
                break
            w_ = wt
            i += 1
        return z
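With the iteration cap in place, any parameter set now returns; the call that previously never finished, for example:

    # Same values as the question's bl_2 case; returns after at most 100 passes
    bl_2 = baseline_arPLS(y, 1e+07, 1e-02)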
My goal is to get 100,000 or 200,000 correct decimals of Pi in Python. For this, I have tried using the Chudnovsky algorithm, but I've run into some issues along the way.

First, the program only gives me 29 digits, instead of the 50 I want to test correctness against. I know this is a small issue, but I don't understand what I've done wrong.

Second, only the first 14 decimals are correct. After those, I start getting decimals that disagree with every reference value of Pi I can find online. How do I get many more correct decimals?

And last, how do I make my code run on all 4 of the threads I have? I've tried using Pool, but it doesn't seem to work. (I checked with the Windows Task Manager.)
This is my code:
    from math import *
    from decimal import Decimal, localcontext
    from multiprocessing import Pool
    import time

    k = 0
    s = 0
    c = Decimal(426880*sqrt(10005))

    if __name__ == '__main__':
        start = time.time()
        pi = 0
        with localcontext() as ctx:
            ctx.prec = 50
            with Pool(None) as pool:
                for k in range(0,500):
                    m = Decimal((factorial(6 * k)) / (factorial(3 * k) * Decimal((factorial(k) ** 3))))
                    l = Decimal((545140134 * k) + 13591409)
                    x = Decimal((-262537412640768000) ** k)
                    subPi = Decimal(((m*l)/x))
                    s = s + subPi
        print(c*(s**-1))
        print(time.time() - start)
In addition to the small details discussed in the comments and proposed by @mark-dickinson, I think I've fixed the multithreading, but I haven't had a chance to test it; let me know if it works properly.

UPDATE: The problems after the 28th digit were due to the assignment of sq and c before the decimal context change. Reassigning their values after changing the context precision solved the problem.
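The pitfall in miniature (a standalone sketch, separate from the code below): Decimal operations round to the precision in effect when they execute, so a value computed before raising prec keeps its old accuracy.

    from decimal import Decimal, getcontext

    sq = Decimal(10005).sqrt()   # computed at the default 28-digit precision
    getcontext().prec = 100
    # sq still carries only ~28 correct digits; recompute under the new precision
    sq = Decimal(10005).sqrt()

The updated code: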
    from math import *
    import decimal
    from decimal import Decimal, localcontext
    from multiprocessing import Pool
    import time

    k = 0
    s = 0
    sq = Decimal(10005).sqrt() #useless here
    c = Decimal(426880*sq)     #useless here

    def calculate():
        global s, k
        for k in range(0,500):
            m = Decimal((factorial(6 * k)) / (factorial(3 * k) * Decimal((factorial(k) ** 3))))
            l = Decimal((545140134 * k) + 13591409)
            x = Decimal((-262537412640768000) ** k)
            subPi = Decimal((m*l)/x)
            s = s + subPi
        print(c*(s**-1))

    if __name__ == '__main__':
        start = time.time()
        pi = 0
        decimal.getcontext().prec = 100 #change the precision to increase the result digits
        sq = Decimal(10005).sqrt()
        c = Decimal(426880*sq)
        pool = Pool()
        result = pool.apply_async(calculate)
        result.get()
        print(time.time() - start)
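For the original goal of 100,000 or 200,000 digits, note that each Chudnovsky term contributes roughly 14.18 correct digits, so both the precision and the number of terms must scale with the target. A rough sizing sketch (my own rule of thumb, not part of the code above):

    import decimal

    def chudnovsky_params(digits):
        terms = digits // 14 + 1   # ~14.18 digits of pi per series term
        precision = digits + 10    # a few guard digits against rounding
        return terms, precision

    terms, prec = chudnovsky_params(100000)
    decimal.getcontext().prec = prec
    # ...then run the series for `terms` iterations instead of a fixed 500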
I'm new to C integration in Python. I'm currently wrapping a .dll library into my Python code using ctypes and I'm having issues passing a pointer to save the output of a particular function.
My C function has the following structure:
    function (int* w, int* h, unsigned short* data_output)
Where h and w are inputs and data_output is an array of size (w x h, 1).
I was able to successfully integrate and retrieve results from the function in Matlab by creating a zeros(w x h,1) array and passing it as a pointer using libpointer('uint16Ptr', zeros(w x h, 1)).
How can I do that in Python?
For other functions, where the output was of int* type, I was able to retrieve the values successfully using create_string_buffer, but I haven't managed to make it work for this one.
Thank you.
According to [SO]: How to create a Minimal, Complete, and Verifiable example (mcve), your question doesn't contain basic info (e.g. your attempts). Make sure to correct that in future ones.
Also, [Python.Docs]: ctypes - A foreign function library for Python.
Here's a dummy example that illustrates the concept.
dll00.c:
    #if defined(_WIN32)
    #  define DLL_EXPORT __declspec(dllexport)
    #else
    #  define DLL_EXPORT
    #endif

    DLL_EXPORT int function(int w, int h, unsigned short *pData) {
        int k = 1;
        for (int i = 0; i < h; i++)
            for (int j = 0; j < w; j++, k++)
                pData[i * w + j] = k;
        return w * h;
    }
code00.py:
    #!/usr/bin/env python3

    import sys
    import ctypes as ct

    DLL = "./dll00.so"


    def print_array(data, h, w):
        for i in range(h):
            for j in range(w):
                print("{:2d}".format(data[i * w + j]), end=" ")
            print()


    def main(*argv):
        dll_dll = ct.CDLL(DLL)
        function = dll_dll.function
        function.argtypes = [ct.c_int, ct.c_int, ct.POINTER(ct.c_ushort)]
        function.restype = ct.c_int

        h = 3
        w = 5
        ArrayType = ct.c_ushort * (h * w)  # Dynamically declare the array type: `unsigned short[15]` in our case
        array = ArrayType()  # The array type instance

        print_array(array, h, w)
        res = function(w, h, ct.cast(array, ct.POINTER(ct.c_ushort)))
        print("{:} returned: {:d}".format(function.__name__, res))
        print_array(array, h, w)


    if __name__ == "__main__":
        print("Python {:s} {:03d}bit on {:s}\n".format(" ".join(elem.strip() for elem in sys.version.split("\n")),
                                                       64 if sys.maxsize > 0x100000000 else 32, sys.platform))
        rc = main(*sys.argv[1:])
        print("\nDone.")
        sys.exit(rc)
Output:
    [cfati@cfati-5510-0:/cygdrive/e/Work/Dev/StackOverflow/q054753828]> ~/sopr.sh
    ### Set shorter prompt to better fit when pasted in StackOverflow (or other) pages ###

    [064bit prompt]> ls
    code00.py  dll00.c
    [064bit prompt]> gcc -fPIC -shared -o dll00.so dll00.c
    [064bit prompt]> ls
    code00.py  dll00.c  dll00.so
    [064bit prompt]> python3 code00.py
    Python 3.6.4 (default, Jan 7 2018, 15:53:53) [GCC 6.4.0] 064bit on cygwin

     0  0  0  0  0
     0  0  0  0  0
     0  0  0  0  0
    function returned: 15
     1  2  3  4  5
     6  7  8  9 10
    11 12 13 14 15

    Done.
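As a closer analogue of the Matlab zeros(w*h, 1) + libpointer approach from the question, the buffer can also be a NumPy array whose data pointer is passed to the same function (a sketch assuming NumPy is installed; not part of the example above):

    import ctypes as ct
    import numpy as np

    h, w = 3, 5
    data = np.zeros(h * w, dtype=np.uint16)  # same layout as unsigned short[h * w]
    res = function(w, h, data.ctypes.data_as(ct.POINTER(ct.c_ushort)))
    # `data` now holds the values the C code wrote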
As recommended by DavidW in this topic, I'm trying to make a C wrapper function using OpenMP in order to multithread Cython code. Here is what I have:
The C file "paral.h":
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    void paral(void (*func)(int,int), int nthreads){
        int t;
        #pragma omp parallel for
        for (t = 0; t < nthreads; t++){
            (*func)(t, nthreads);
        }
    }
The test.pyx file:
    import time
    import random

    cimport cython
    from libc.stdlib cimport malloc, realloc, free

    ctypedef void (*func)(int,int)

    cdef extern from "paral.h":
        void paral(func function, int nthreads) nogil

    cdef double *a = <double *> malloc ( 1000000 * sizeof(double) )
    cdef double *b = <double *> malloc ( 1000000 * sizeof(double) )
    cdef double *c = <double *> malloc ( 1000000 * sizeof(double) )

    cdef int i
    for i in range(1000000):
        a[i] = random.random()
        b[i] = random.random()

    cdef void sum_ab(int thread, int nthreads):
        cdef int start, stop, i
        start = thread * (1000000 / nthreads)
        stop = start + (1000000 / nthreads)
        for i in range(start, stop):
            c[i] = a[i] + b[i]

    t0 = time.clock()
    with nogil:
        paral(sum_ab,4)
    print(time.clock()-t0)

    t0 = time.clock()
    with nogil:
        paral(sum_ab,1)
    print(time.clock()-t0)
I have Visual Studio, so in the setup.py I have added:

    extra_compile_args=["/openmp"],
    extra_link_args=["/openmp"]
Results:
The 4-threaded version is slightly slower than the 1-threaded one. Does someone know what I'm doing wrong here?
Edit:

In response to Zultan: to make sure the time measured by time.clock() is correct, I make the execution last a few seconds, so that I can compare the time I get from time.clock() with the time I measure with a stopwatch. Something like this:
print("start timer 1")
t1 = time.clock()
for i in range(10000):
with nogil:
paral(sum_ab,4)
t2 = time.clock()
print(t2-t1)
print("strart timer 2")
t1 = time.clock()
for i in range(10000):
with nogil:
paral(sum_ab,1)
t2 = time.clock()
print(t2-t1)
print("stop")
Results with time.clock() are 15.0 s 4-threaded and 14.5 s 1-threaded, and I see no noticeable difference from what I measure with the stopwatch.
Edit 2:

I think I've figured out what is happening here. I read that in some cases the memory bandwidth can become saturated.

If I replace:

    c[i] = a[i] + b[i]

with a more complex operation, for example:

    c[i] = a[i]**b[i]

then I get a significant speedup between the single- and the multi-threaded versions (nearly 2x).

However, I'm still 2x slower than a classic prange loop! I see no reason why prange would be that much faster. Maybe I need to change the C code...
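For reference, a minimal sketch of the "classic prange loop" being compared against (assuming the same module-level a, b, c arrays; scheduling left to the Cython/OpenMP defaults):

    from cython.parallel cimport prange

    def sum_ab_prange():
        cdef int i
        # prange releases the GIL and lets OpenMP split the iterations itself,
        # instead of the manual thread/nthreads chunking done in sum_ab above
        for i in prange(1000000, nogil=True):
            c[i] = a[i] + b[i]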
I am writing a simple function in Cython using numpy, but it seems that Cython is producing a ton of Python API calls while converting to C++. Could anyone help me find the error? I did not find anything more in the Cython docs.
operations.pyx:
    import numpy as np
    cimport numpy as np
    import cython
    cimport cython

    @cython.boundscheck(False)
    @cython.wraparound(False)
    @cython.nonecheck(False)
    def diff(np.ndarray[np.float64_t, ndim=2] a,
             np.ndarray[np.float64_t, ndim=2] b):
        cdef int cols = 100
        cdef int rows = 100
        for _ in range(1000):
            for i in range(rows):
                b[i, 0] = (a[i, 1] - a[i, cols - 1]) / 2
            for i in range(1, cols - 1):
                b[:, i] = (a[:, i + 1] - a[:, i - 1]) / 2
            for i in range(rows):
                b[i, cols - 1] = (a[i, 0] - a[i, cols - 2]) / 2
        return
I get almost the same speed in Python and Cython, and if I change the column selection (the loop using :), it becomes much worse (5x slower). Could someone show me where the error might be?

HTML output from the Cython annotation (screenshot omitted):
The loops use i and j (and _) as Python objects; try cdef-ing them, for example like this:
    cdef int cols = 100
    cdef int rows = 100
    cdef int i = 0
    cdef int j = 0
Since you do not do operations on _, I think Cython handles it right and it doesn't need to be cdef'd, but you could try (it is just one more line anyway).
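To verify, regenerate the annotated HTML after typing the loop variables; the loop lines should go from yellow (Python interaction) to white (plain C):

    cython -a operations.pyx
    # then open operations.html -- yellow lines mark remaining Python API calls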