Thread-local arrays in Cython's prange without huge memory allocation

I have some independent computations I would like to do in parallel using Cython.
Right now I'm using this approach:
import numpy as np
cimport numpy as cnp
from cython.parallel import prange
[...]
cdef cnp.ndarray[cnp.float64_t, ndim=2] temporary_variable = \
    np.zeros((INPUT_SIZE, RESULT_SIZE), np.float64)
cdef cnp.ndarray[cnp.float64_t, ndim=2] result = \
    np.zeros((INPUT_SIZE, RESULT_SIZE), np.float64)

for i in prange(INPUT_SIZE, nogil=True):
    for j in range(RESULT_SIZE):
        [...]
        temporary_variable[i, j] = some_very_heavy_mathematics(my_input_array)
        result[i, j] = some_more_maths(temporary_variable[i, j])
This approach works, but the problem is that I actually need several temporary_variables, which results in huge memory usage as INPUT_SIZE grows. What I believe is really needed is one temporary variable per thread instead.
Am I facing a limitation of Cython's prange and do I need to learn proper C, or am I doing/understanding something terribly wrong?
EDIT: The functions I was looking for were openmp.omp_get_max_threads() and openmp.omp_get_thread_num() to create a reasonably sized temporary array. I had to cimport openmp first.
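For example, something along these lines (a sketch only, reusing the placeholder names from the snippet above; result, my_input_array and the maths functions are as before):
cimport openmp
from cython.parallel import prange
import numpy as np

# one scratch row per thread instead of one per input element
cdef double[:, :] scratch = np.zeros((openmp.omp_get_max_threads(), RESULT_SIZE))
cdef int tid

for i in prange(INPUT_SIZE, nogil=True):
    tid = openmp.omp_get_thread_num()
    for j in range(RESULT_SIZE):
        scratch[tid, j] = some_very_heavy_mathematics(my_input_array)
        result[i, j] = some_more_maths(scratch[tid, j])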

This is something that Cython tries to detect, and actually gets right most of the time. If we take a more complete example code:
import numpy as np
from cython.parallel import prange

cdef double f1(double[:,:] x, int i, int j) nogil:
    return 2*x[i,j]

cdef double f2(double y) nogil:
    return y+10

def example_function(double[:,:] arr_in):
    cdef double[:,:] result = np.zeros((arr_in.shape[0], arr_in.shape[1]))
    cdef double temporary_variable
    cdef int i, j
    for i in prange(arr_in.shape[0], nogil=True):
        for j in range(arr_in.shape[1]):
            temporary_variable = f1(arr_in, i, j)
            result[i,j] = f2(temporary_variable)
    return result
(This is basically the same as yours, but compilable.) It compiles to C code containing:
#ifdef _OPENMP
#pragma omp for firstprivate(__pyx_v_i) lastprivate(__pyx_v_i) lastprivate(__pyx_v_j) lastprivate(__pyx_v_temporary_variable)
#endif /* _OPENMP */
for (__pyx_t_8 = 0; __pyx_t_8 < __pyx_t_9; __pyx_t_8++){
You can see that temporary_variable is set to be thread-local. If Cython does not detect this correctly (I find it's often too keen to make variables a reduction) then my suggestion is to encapsulate (some of) the contents of the loop in a function:
cdef double loop_contents(double[:,:] arr_in, int i, int j) nogil:
    cdef double temporary_variable
    temporary_variable = f1(arr_in, i, j)
    return f2(temporary_variable)
Doing so forces temporary_variable to be local to the function (and hence to the thread).
With respect to creating a thread-local array: I'm not 100% clear exactly what you want to do but I'll try to take a guess...
I don't believe it's possible to create a thread-local memoryview.
You could create a thread-local C array with malloc and free but unless you have a good understanding of C then I would not recommend it.
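For completeness, that malloc/free version would look roughly like this (a sketch following the standard parallel() idiom; do_work stands in for whatever nogil function processes one row using the scratch space):
from cython.parallel import parallel, prange
from libc.stdlib cimport malloc, free, abort

def example_with_malloc(double[:,:] arr_in):
    cdef Py_ssize_t i
    cdef double* scratch
    with nogil, parallel():
        # assigned inside parallel(), so every thread gets its own pointer
        scratch = <double*> malloc(arr_in.shape[1] * sizeof(double))
        if scratch is NULL:
            abort()
        for i in prange(arr_in.shape[0]):
            do_work(arr_in, i, scratch)   # hypothetical nogil worker function
        free(scratch)                     # each thread frees its own buffer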
The easiest way is to allocate a 2D array where you have one column for each thread. The array is shared, but since each thread only touches its own column that doesn't matter. A simple example:
import numpy as np
cimport openmp
from cython.parallel import prange

cdef double[:] f1(double[:,:] x, int i) nogil:
    return x[i,:]

def example_function(double[:,:] arr_in):
    cdef double[:,:] temporary_variable = np.zeros((arr_in.shape[1], openmp.omp_get_max_threads()))
    cdef int i
    for i in prange(arr_in.shape[0], nogil=True):
        temporary_variable[:, openmp.omp_get_thread_num()] = f1(arr_in, i)

Related

Passing python objects to C Gstreamer functions, using Cython

I'm using Python 3.6 with GStreamer-1.0 and PyGObject (for Python access) to read video frames from a camera (tiscamera).
The frames are retrieved via Python code and eventually I end up with a GstBuffer:
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst
# Set up
Gst.init([])
pipeline = Gst.parse_launch("tcambin serial=12345678 name=source ! video/x-raw,format=GRAY8,width=1920,height=1080,framerate=18/1 ! appsink name=sink")
sink = pipeline.get_by_name("sink")
sink.set_property("max-buffers", 10)
sink.set_property("drop", 1)
sink.set_property("emit-signals", 1)
pipeline.set_state(Gst.State.PLAYING)
# Get sample
sample = sink.emit("pull-sample")
buffer = sample.get_buffer()
meta = buffer.get_meta("TcamStatisticsMetaApi")
The type of meta is gi.repository.Gst.Meta, but in C it is actually a TcamStatisticsMeta*, as can be seen in tiscamera's C code example 10-metadata.c. The C code there is:
GstBuffer* buffer = gst_sample_get_buffer(sample);
GstMeta* meta = gst_buffer_get_meta(buffer, g_type_from_name("TcamStatisticsMetaApi"));
GstStructure* struc = ((TcamStatisticsMeta*)meta)->structure;
My problem is that in Python, I can't access struct attributes defined in TcamStatisticsMeta. I'm simply missing the casting bit from GstMeta* to TcamStatisticsMeta* and the translation of TcamStatisticsMeta* into a PyObject.
Does anyone have a direction of how this can be done without needing to modify/recompile the gstreamer-1.0 C code? Using Cython perhaps?
I've started using Cython to try and call a C function with data I get from Python. The python object is of type gi.repository.Gst.Buffer and the function should get a GstBuffer*, but I can't find a way to get the struct pointer from the Python object.
Here is my .pxd file:
cdef extern from "gstreamer-1.0/gstmetatcamstatistics.h":
ctypedef unsigned long GType
ctypedef struct GstBuffer:
pass
ctypedef struct GstMeta:
pass
GstMeta* gst_buffer_get_meta(GstBuffer* buffer, GType api)
GType g_type_from_name(const char* name)
And my .pyx file:
from my_pxd_file cimport GType, g_type_from_name, GstMeta, gst_buffer_get_meta

cdef void c_a(buffer):
    cdef char* tcam_statistics_meta_api = "TcamStatisticsMetaApi"
    cdef GType gt = g_type_from_name(tcam_statistics_meta_api)
    cdef GstMeta* meta = gst_buffer_get_meta(buffer, gt)

def a(buffer):
    c_a(buffer)
And my python file:
# No need to use pyximport as I've cythonized the code in setup.py
from . import my_pyx_file
...
buffer = sample.get_buffer()
my_pyx_file.a(buffer)
This results in a SIGSEGV error:
Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
The problem is that I can't cast buffer into a GstBuffer*. Does anyone know how to do that?
Debugging in PyCharm, I can actually see the GstBuffer* address:
<Gst.Buffer object at 0x7f3a0b4fc348 (GstBuffer at 0x7f39ec007060)>
But how do I get this address so I can pass it to gst_buffer_get_meta?
Is there a canonical Cython way to do that?
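(For the casting part specifically: if the raw pointer value can be obtained as a Python integer by some means, turning it into a usable GstBuffer* in Cython is just a cast, as in the sketch below. Getting hold of that integer in the first place is the part this sketch does not answer; a_from_address and its address parameter are purely illustrative.)
from libc.stdint cimport uintptr_t
from my_pxd_file cimport GstBuffer, GstMeta, GType, g_type_from_name, gst_buffer_get_meta

def a_from_address(unsigned long long address):
    # 'address' is assumed to hold the integer value of the GstBuffer* pointer
    cdef GstBuffer* buf = <GstBuffer*> <uintptr_t> address
    cdef GType gt = g_type_from_name(b"TcamStatisticsMetaApi")
    cdef GstMeta* meta = gst_buffer_get_meta(buf, gt)
    # accessing the TcamStatisticsMeta fields would still need that struct declared in the .pxd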

How to print the progress of cython's prange?

I'm using prange to parallelize a loop in my Cython code. As this loop takes a long time, I want to print its progress as it goes. A progress bar would be nice but anything that shows the progress would do. I tried to add some form of logging in the following fashion:
for index in prange(n, nogil = True, num_threads = 5):
    if index == (index//1000)*1000:
        printf("%d percent done", index*100/n)
This should print the progress whenever index % 1000 == 0.
The output of this comes out in a seemingly random order:
0 percent done
80 percent done
20 percent done
100 percent done
60 percent done
I assume that this is because prange does not assign threads to indexes starting from 0 and going up.
Is there any way to implement such a thing in Cython?
Thanks!
There are probably ways in Cython to achieve what you want, however I find it easier to use OpenMP from C/C++ and to call that functionality from Cython (thanks to C-verbatim-code support since Cython 0.28 this is quite convenient).
One thing is clear: to achieve your goal you need some kind of synchronization between the threads, which might impact performance. My strategy is simple: every thread reports "done", all completed tasks are counted, and the count is reported from time to time. The counter is protected with a mutex/lock:
%%cython -c=/openmp --link-args=/openmp

cdef extern from * nogil:
    r"""
    #include <omp.h>
    #include <stdio.h>

    static omp_lock_t cnt_lock;
    static int cnt = 0;

    void reset(){
        omp_init_lock(&cnt_lock);
        cnt = 0;
    }

    void destroy(){
        omp_destroy_lock(&cnt_lock);
    }

    void report(int mod){
        omp_set_lock(&cnt_lock);
        // start protected code:
        cnt++;
        if(cnt%mod == 0){
            printf("done: %d\n", cnt);
        }
        // end protected code block
        omp_unset_lock(&cnt_lock);
    }
    """
    void reset()
    void destroy()
    void report(int mod)

from cython.parallel import prange

def prange_with_report(int n):
    reset()  # reset counter and init lock
    cdef int index
    for index in prange(n, nogil = True, num_threads = 5):
        report(n//10)
    destroy()  # release lock
Now calling prange_with_report(100) would print done: 10\n, done: 20\n, ..., done: 100\n.
As a slight optimization, report could be called not for every index but only, for example, when index%100==0 - there will be less impact on performance, but the reporting will also be less precise.
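For example (a sketch; it assumes n is large enough that n//1000 is at least 1, otherwise the modulo in report would divide by zero):
def prange_with_sparse_report(int n):
    reset()
    cdef int index
    for index in prange(n, nogil = True, num_threads = 5):
        if index % 100 == 0:
            report(n // 1000)   # ~n/100 calls in total, so printing every n/1000 of them gives ~10 reports
    destroy()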

__sync_bool_compare_and_swap with different parameter types in Cython

I am using Cython for fast parallel processing of data, adding items to a shared-memory linked list from multiple threads. I use __sync_bool_compare_and_swap, which provides an atomic compare-and-swap (CAS): it checks that the value has not been modified (by another thread) before replacing it with a new value.
cdef extern int __sync_bool_compare_and_swap (void **ptr, void *oldval, void *newval) nogil

cdef bint firstAttempt = 1
cdef type *next = NULL
cdef type *newlink = ....
while firstAttempt or not __sync_bool_compare_and_swap( <void**> c, <void*>next, <void*>newlink):
    firstAttempt = 0
    next = c[0]
    newlink.next = next
This works very well. However, now I also want to keep track of the size of the linked list and want to use the same CAS function for those updates. This time, however, it is not pointers that need to be updated but an int. How can I use the same external function twice in Cython, once with void** parameters and once with an int* parameter?
EDIT
What I have in mind is two separate atomic operations: in one I want to update the linked list, in the other I want to update the size. You can do this in C, but for Cython it means you have to reference the same external function twice with different parameter types - can that be done?
CONCLUSION
The answer suggested by DavidW works. In case anyone is thinking of using a similar construction, be aware that when using two separate update functions there is no guarantee they are processed in sequence (i.e. another thread can update in between). However, if the objective is to update a cumulative value, for instance to monitor progress while multithreading or to build an aggregated result that is not used until all threads are finished, CAS does guarantee that every update is applied exactly once. Unexpectedly, gcc refuses to compile without casting to void*, so either define separate hard-typed versions or cast. A snippet from my code:
in some_header.h:
#define sync_bool_compare_and_swap_int __sync_bool_compare_and_swap
#define sync_bool_compare_and_swap_vp __sync_bool_compare_and_swap
in some_prog.pxd:
cdef extern from "some_header.h":
cdef extern int sync_bool_compare_and_swap_vp (void **ptr, void *oldval, void *newval) nogil
cdef extern int sync_bool_compare_and_swap_int (int *ptr, int oldval, int newval) nogil
in some_prog.pyx:
cdef void updateInt(int *value, int incr) nogil:
    cdef cINT previous = value[0]
    cdef cINT updated = previous + incr
    while not sync_bool_compare_and_swap_int(value, previous, updated):
        previous = value[0]
        updated = previous + incr
So the issue (as I understand it) is that __sync_bool_compare_and_swap is a compiler intrinsic rather than a function, so it doesn't really have a fixed signature; the compiler just figures it out. However, Cython demands to know the types, and because you want to use it with two different types, you have a problem.
I can't see a simpler way than resorting to a (very) small amount of C to "help" Cython. Create a header file with a bunch of #defines
/* compare_swap.h */
#define sync_bool_compare_swap_voidp __sync_bool_compare_and_swap
#define sync_bool_compare_swap_int __sync_bool_compare_and_swap
You can then tell Cython that each of these is a separate function
cdef extern from "compare_swap.h":
int sync_bool_compare_swap_voidp(void**, void*, void*)
int sync_bool_compare_swap_int(int*, int, int)
At this stage you should be able to use them naturally as plain functions without any type casting (i.e. no <void**> in your code, since this tends to hide real errors). The C preprocessor will generate the code you want and all is well.
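For example, the int version can then be used for the size counter along these lines (a sketch; nogil is added to the declaration here so it can be called inside prange):
cdef extern from "compare_swap.h":
    int sync_bool_compare_swap_int(int*, int, int) nogil

cdef void atomic_add(int* counter, int incr) nogil:
    cdef int old = counter[0]
    # retry until no other thread has modified the counter between the read and the swap
    while not sync_bool_compare_swap_int(counter, old, old + incr):
        old = counter[0]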
Edit: Looking at this a few years later I can see a couple of simpler ways you could probably use (untested, but I don't see why they shouldn't work). First you could use Cython's ability to map a name to a "cname" to avoid the need for an extra header:
cdef extern from *:
    int sync_bool_compare_swap_voidp(void**, void*, void*) "__sync_bool_compare_and_swap"
    int sync_bool_compare_swap_int(int*, int, int) "__sync_bool_compare_and_swap"
Second (and probably best) you could use a single generic definition (just telling Cython that it's a varargs function):
cdef extern from "compare_swap.h":
int __sync_bool_compare_and_swap(...)
This way Cython won't try to understand the types used, and will just defer it entirely to C (which is what you want).
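For example, with that single declaration both call sites compile without any casts (a sketch with illustrative names; node and the arguments are not from the original code):
cdef struct node:
    node* next

cdef void example_usage(node** head, node* expected, node* new_node,
                        int* counter, int old_count) nogil:
    __sync_bool_compare_and_swap(head, expected, new_node)           # pointer CAS
    __sync_bool_compare_and_swap(counter, old_count, old_count + 1)  # int CAS, same declaration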
I wouldn't like to comment on whether it's safe for you to use two atomic operations like this, or whether that will pass through a state with dangerously inconsistent data....

PyBuffer_New: Do I need to free manually?

I am searching for a memory leak in someone else's code. I found:
def current(self):
    ...
    data = PyBuffer_New(buflen)
    PyObject_AsCharBuffer(data, &data_ptr, &buflen)
    ...
    return VideoFrame(data, self.frame_size, self.frame_mode,
                      timestamp=<double>self.frame.pts/<double>AV_TIME_BASE,
                      frameno=self.frame.display_picture_number)

cdef class VideoFrame:
    def __init__(self, data, size, mode, timestamp=0, frameno=0):
        self.data = data
        ...
In the function current() there is no free or anything similar, nor is there in VideoFrame. Is the PyBuffer automatically freed when the VideoFrame object gets deleted?
The answer is: "it depends; we don't have enough code to tell from your question." It's governed by what type you've told Cython that PyBuffer_New returns. I'll give two simplified illustrating cases and hopefully you should be able to work it out for your more complicated case.
If you tell Cython that it's a PyObject* it has no innate knowledge of that type, and doesn't do anything to keep track of the memory:
# BAD - memory leak!
cdef extern from "Python.h":
    ctypedef struct PyObject
    PyObject* PyBuffer_New(int size)

def test():
    cdef int i
    for i in range(100000): # call lots of times to allocate lots of memory
        # (type of a is automatically inferred to be PyObject*
        # to match the function definition)
        a = PyBuffer_New(1000)
and the generated code for the loop pretty much looks like:
for (__pyx_t_1 = 0; __pyx_t_1 < 100000; __pyx_t_1+=1) {
  __pyx_v_i = __pyx_t_1;
  __pyx_v_a = PyBuffer_New(1000);
}
i.e. memory is being allocated but never freed. If you run test() and look at a task manager you can see the memory usage jump up and not return.
Alternatively, if you tell Cython that it returns an object, then Cython deals with it like any other Python object and manages the reference count correctly:
# Good - no memory leak
cdef extern from "Python.h":
    object PyBuffer_New(int size)

def test():
    cdef int i
    for i in range(100000): # call lots of times to allocate lots of memory
        a = PyBuffer_New(1000)
The generated code for the loop is then
for (__pyx_t_1 = 0; __pyx_t_1 < 100000; __pyx_t_1+=1) {
  __pyx_v_i = __pyx_t_1;
  __pyx_t_2 = PyBuffer_New(1000); if (unlikely(!__pyx_t_2)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 7; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  __Pyx_GOTREF(__pyx_t_2);
  __Pyx_XDECREF_SET(__pyx_v_a, __pyx_t_2);
  __pyx_t_2 = 0;
}
Note the DECREF (inside __Pyx_XDECREF_SET), which is where the object gets deallocated. If you run this test() you see no long-term jump in memory usage.
Which of these two cases you end up in is governed by how the variable is declared with cdef (for example in the definition of VideoFrame). If they use PyObject* without careful DECREFs then they're probably leaking memory...
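For what it's worth, if the code really does keep the PyObject* declaration, the first example could in principle be fixed with a manual decref rather than switching to object (a sketch only; whether this fits the real VideoFrame code depends on details we can't see here):
from cpython.ref cimport PyObject, Py_XDECREF

cdef extern from "Python.h":
    PyObject* PyBuffer_New(int size)

def test_manual():
    cdef int i
    cdef PyObject* a
    for i in range(100000):
        a = PyBuffer_New(1000)
        # ... use the buffer here ...
        Py_XDECREF(a)   # manual decref; forgetting this is exactly the leak described above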

cython.parallel: how to initialise thread-local ndarray buffer?

I am struggling to initialise thread-local ndarrays with cython.parallel:
Pseudo-code:
cdef:
    ndarray buffer

with nogil, parallel():
    buffer = np.empty(...)
    for i in prange(n):
        with gil:
            print "Thread %d: data address: 0x%x" % (threadid(), <uintptr_t>buffer.data)
        some_func(buffer.data)  # use thread-local buffer

cdef void some_func(char * buffer_ptr) nogil:
    (... works on buffer contents ...)
My problem is that in all threads buffer.data points to the same address, namely whatever address was set by the thread that assigned buffer last.
Despite buffer being assigned within the parallel() (or alternatively prange) block, Cython does not make buffer a private or thread-local variable but keeps it as a shared one.
As a result, buffer.data points to the same memory region in every thread, wreaking havoc on my algorithm.
This is not a problem exclusively with ndarray objects but seemingly with all cdef class defined objects.
How do I solve this problem?
I think I have finally found a solution to this problem that I like.
The short version is that you create an array that has shape:
(number_of_threads, ...<whatever shape you need in the thread>...)
Then, call openmp.omp_get_thread_num and use that to index the array to get a "thread-local" sub-array. This avoids having a separate array for every loop index (which could be enormous) but also prevents threads overwriting each other.
Here's a rough version of what I did:
import numpy as np
import multiprocessing
from cython.parallel cimport parallel
from cython.parallel import prange
cimport openmp

cdef extern from "stdlib.h":
    void free(void* ptr)
    void* malloc(size_t size)
    void* realloc(void* ptr, size_t size)

...

cdef int num_items = ...
num_threads = multiprocessing.cpu_count()

result_array = np.zeros((num_threads, num_items), dtype=DTYPE) # Make sure each thread uses separate memory
cdef c_numpy.ndarray result_cn
cdef CDTYPE ** result_pointer_arr
result_pointer_arr = <CDTYPE **> malloc(num_threads * sizeof(CDTYPE *))
for i in range(num_threads):
    result_cn = result_array[i]
    result_pointer_arr[i] = <CDTYPE*> result_cn.data

cdef int thread_number
for i in prange(num_items, nogil=True, chunksize=1, num_threads=num_threads, schedule='static'):
    thread_number = openmp.omp_get_thread_num()
    some_function(result_pointer_arr[thread_number])
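A slightly simpler variant of the same idea (a sketch; it assumes DTYPE is float64 and that some_function can accept a double*) indexes a typed memoryview of result_array by thread number, avoiding the manual pointer table:
cdef double[:, ::1] result_view = result_array
cdef int tid

for i in prange(num_items, nogil=True, chunksize=1, num_threads=num_threads, schedule='static'):
    tid = openmp.omp_get_thread_num()
    some_function(&result_view[tid, 0])   # pointer to this thread's private row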
