I am struggling to initialise thread-local ndarrays with cython.parallel:
Pseudo-code:
cdef:
    ndarray buffer

with nogil, parallel():
    buffer = np.empty(...)

    for i in prange(n):
        with gil:
            print("Thread %d: data address: 0x%x" % (threadid(), <uintptr_t>buffer.data))
        some_func(buffer.data)  # use thread-local buffer
cdef void some_func(char * buffer_ptr) nogil:
    (... works on buffer contents ...)
My problem is that in every thread buffer.data points to the same address, namely the one from the thread that last assigned buffer.
Despite buffer being assigned inside the parallel() (or alternatively prange) block, Cython does not make buffer a private or thread-local variable but keeps it shared.
As a result, buffer.data points to the same memory region in all threads, wreaking havoc on my algorithm.
This is not a problem exclusive to ndarray objects but seemingly applies to all cdef class typed objects.
How do I solve this problem?
I think I have finally found a solution to this problem that I like.
The short version is that you create an array that has shape:
(number_of_threads, ...<whatever shape you need in the thread>...)
Then, call openmp.omp_get_thread_num and use that to index the array to get a "thread-local" sub-array. This avoids having a separate array for every loop index (which could be enormous) but also prevents threads overwriting each other.
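In plain Python terms, the layout is just a 2-D scratch array with one row per thread. A rough sketch (ThreadPoolExecutor and a thread-ident map stand in for OpenMP's omp_get_thread_num; all names are illustrative, not from the original code):

```python
import threading
import numpy as np
from concurrent.futures import ThreadPoolExecutor

num_threads = 4
scratch = np.zeros((num_threads, 8))   # one scratch row per worker thread
row_of = {}                            # thread ident -> row index
lock = threading.Lock()

def my_row():
    # Hand each pool thread its own private row of the shared array.
    ident = threading.get_ident()
    with lock:
        if ident not in row_of:
            row_of[ident] = len(row_of)
    return scratch[row_of[ident]]

def work(i):
    row = my_row()   # "thread-local" view: only this thread writes it
    row[:] = i       # scribble freely without racing other threads
    return i, row.sum()

with ThreadPoolExecutor(num_threads) as ex:
    results = dict(ex.map(work, range(16)))

print(results[3])  # 24.0 (row of 8 elements, each set to 3)
```

The array is shared, but because every thread only ever touches its own row, no synchronization is needed for the scratch data itself.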
Here's a rough version of what I did:
import numpy as np
import multiprocessing
from cython.parallel cimport parallel
from cython.parallel import prange
cimport openmp

cdef extern from "stdlib.h":
    void free(void* ptr)
    void* malloc(size_t size)
    void* realloc(void* ptr, size_t size)

...

cdef int num_items = ...
num_threads = multiprocessing.cpu_count()
# Make sure each thread uses separate memory
result_array = np.zeros((num_threads, num_items), dtype=DTYPE)

cdef c_numpy.ndarray result_cn
cdef CDTYPE ** result_pointer_arr
result_pointer_arr = <CDTYPE **> malloc(num_threads * sizeof(CDTYPE *))
for i in range(num_threads):
    result_cn = result_array[i]
    result_pointer_arr[i] = <CDTYPE*> result_cn.data

cdef int thread_number
for i in prange(num_items, nogil=True, chunksize=1, num_threads=num_threads, schedule='static'):
    thread_number = openmp.omp_get_thread_num()
    some_function(result_pointer_arr[thread_number])
I'm using Python 3.6 with GStreamer-1.0 and PyGObject (for Python access) to read video frames from a camera (tiscamera).
The frames are obtained via Python code, and eventually I get a GstBuffer:
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst
# Set up
Gst.init([])
pipeline = Gst.parse_launch("tcambin serial=12345678 name=source ! video/x-raw,format=GRAY8,width=1920,height=1080,framerate=18/1 ! appsink name=sink")
sink = pipeline.get_by_name("sink")
sink.set_property("max-buffers", 10)
sink.set_property("drop", 1)
sink.set_property("emit-signals", 1)
pipeline.set_state(Gst.State.PLAYING)
# Get sample
sample = sink.emit("pull-sample")
buffer = sample.get_buffer()
meta = buffer.get_meta("TcamStatisticsMetaApi")
The type of meta is gi.repository.Gst.Meta, but in C it is actually a TcamStatisticsMeta*, as can be seen in tiscamera's C example 10-metadata.c. The C code there is:
GstBuffer* buffer = gst_sample_get_buffer(sample);
GstMeta* meta = gst_buffer_get_meta(buffer, g_type_from_name("TcamStatisticsMetaApi"));
GstStructure* struc = ((TcamStatisticsMeta*)meta)->structure;
My problem is that in Python, I can't access struct attributes defined in TcamStatisticsMeta. I'm simply missing the casting bit from GstMeta* to TcamStatisticsMeta* and the translation of TcamStatisticsMeta* into a PyObject.
Does anyone have a direction of how this can be done without needing to modify/recompile the gstreamer-1.0 C code? Using Cython perhaps?
I've started using Cython to try and call a C function with data I get from Python. The python object is of type gi.repository.Gst.Buffer and the function should get a GstBuffer*, but I can't find a way to get the struct pointer from the Python object.
Here is my .pxd file:
cdef extern from "gstreamer-1.0/gstmetatcamstatistics.h":
    ctypedef unsigned long GType
    ctypedef struct GstBuffer:
        pass
    ctypedef struct GstMeta:
        pass
    GstMeta* gst_buffer_get_meta(GstBuffer* buffer, GType api)
    GType g_type_from_name(const char* name)
And my .pyx file:
from my_pxd_file cimport GType, g_type_from_name, GstMeta, gst_buffer_get_meta

cdef void c_a(buffer):
    cdef char* tcam_statistics_meta_api = "TcamStatisticsMetaApi"
    cdef GType gt = g_type_from_name(tcam_statistics_meta_api)
    cdef GstMeta* meta = gst_buffer_get_meta(buffer, gt)

def a(buffer):
    c_a(buffer)
And my python file:
# No need to use pyximport as I've cythonized the code in setup.py
from . import my_pyx_file
...
buffer = sample.get_buffer()
my_pyx_file.a(buffer)
This results in a SIGSEGV error:
Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
The problem is that I can't cast buffer into a GstBuffer*. Does anyone know how to do that?
Debugging in Pycharm, I can actually see the GstBuffer* address:
<Gst.Buffer object at 0x7f3a0b4fc348 (GstBuffer at 0x7f39ec007060)>
But how do I get this address so I can pass it to gst_buffer_get_meta?
Is there a canonical Cython way to do that?
Given a ctypes-pointer, for example double**:
import ctypes
data=(ctypes.POINTER(ctypes.c_double)*4)() # results in [NULL, NULL, NULL, NULL]
is it possible to obtain a format string, which describes the memory layout of the data?
Right now, I create a memoryview to get this information, which feels somewhat silly:
view=memoryview(data)
print(view.format) # prints: &<d
Is there a more direct way with less overhead? Maybe through using the C-API?
One could fill data with meaningful values, if this is of any help:
import ctypes
data = (ctypes.POINTER(ctypes.c_double)*2)(
    (ctypes.c_double*2)(1.0, 2.0),
    (ctypes.c_double*1)(3.0),
)
# results in [
#   ptr0 -> [1.0, 2.0],
#   ptr1 -> [3.0]
# ]
print(data[1][0])  # prints 3.0
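For reference, the memoryview route from the question generalizes to other ctypes layouts (the exact strings shown are what CPython's ctypes reports on a little-endian build; the '&' prefix marks one level of pointer indirection per PEP 3118):

```python
import ctypes

arr = (ctypes.c_double * 4)()                      # plain array of doubles
ptr_arr = (ctypes.POINTER(ctypes.c_double) * 4)()  # array of double pointers

# ctypes objects implement the buffer protocol, so memoryview
# exposes their PEP 3118 format string directly
print(memoryview(arr).format)      # '<d' on little-endian builds
print(memoryview(ptr_arr).format)  # '&<d'
```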
It seems as if there is nothing fundamentally better than memoryview(data).format. However, one can speed this up a little by using the C-API.
The format string (which extends the struct format-string syntax described in PEP 3118) is calculated recursively and stored in the format member of the StgDictObject, which can be found in the tp_dict field of ctypes arrays/pointers:
typedef struct {
    PyDictObject dict;   /* first part identical to PyDictObject */
    ...
    /* pep3118 fields, pointers need PyMem_Free */
    char *format;
    int ndim;
    Py_ssize_t *shape;
    ...
} StgDictObject;
This format field is accessed only during the recursive calculation and when a buffer is exported; that is how memoryview gets this info:
static int PyCData_NewGetBuffer(PyObject *myself, Py_buffer *view, int flags)
{
    ...
    /* use default format character if not set */
    view->format = dict->format ? dict->format : "B";
    ...
    return 0;
}
Now we can use the C-API to fill a Py_buffer directly (without creating an actual memoryview), here implemented in Cython:
%%cython
from cpython cimport buffer

def get_format_via_buffer(obj):
    cdef buffer.Py_buffer view
    buffer.PyObject_GetBuffer(obj, &view, buffer.PyBUF_FORMAT | buffer.PyBUF_ANY_CONTIGUOUS)
    cdef bytes format = view.format
    buffer.PyBuffer_Release(&view)
    return format
This version is about 3 times faster than via memoryview:
import ctypes
c=(ctypes.c_int*3)()
%timeit get_format_via_buffer(c) # 295 ns ± 10.3
%timeit memoryview(c).format # 936 ns ± 7.43 ns
On my machine, there are about 160 ns of overhead for calling a def function and about 50 ns for creating the bytes object.
Even if it doesn't make much sense to optimize further given the unavoidable overhead, there is still at least theoretical interest in how it could be sped up.
If one really wants to shave off the cost of filling out the Py_buffer struct as well, there is no clean way: the ctypes module isn't part of the Python C-API (it isn't in the include directory), so the way forward is to repeat the solution Cython uses with array.array, i.e. to hardcode the memory layout of the object (which makes this solution brittle, because the memory layout of StgDictObject can get out of sync).
Here with Cython and without error-checking:
%%cython -a
from cpython cimport PyObject

# emulate memory-layout (i.e. copy definitions from ctypes.h)
cdef extern from *:
    """
    #include <Python.h>

    typedef struct _ffi_type
    {
        size_t size;
        unsigned short mem[2];
        struct _ffi_type **elements;
    } ffi_type;

    typedef struct {
        PyDictObject dict;    /* first part identical to PyDictObject */
        Py_ssize_t size[3];   /* number of bytes, alignment requirements, number of fields */
        ffi_type ffi_type_pointer;
        PyObject *proto;      /* Only for Pointer/ArrayObject */
        void *setfunc[3];
        /* Following fields only used by PyCFuncPtrType_Type instances */
        PyObject *argtypes[4];
        int flags;            /* calling convention and such */
        /* pep3118 fields, pointers need PyMem_Free */
        char *format;
        int ndim;
    } StgDictObject;
    """
    ctypedef struct StgDictObject:
        char *format

def get_format_via_hack(obj):
    cdef PyObject *p = <PyObject *>obj
    cdef StgDictObject *dict = <StgDictObject *>(p.ob_type.tp_dict)
    return dict.format
And it is as fast as it gets:
%timeit get_format_via_hack(c) # 243 ns ± 14.5 ns
I have some independent computations I would like to do in parallel using Cython.
Right now I'm using this approach:
import numpy as np
cimport numpy as cnp
from cython.parallel import prange

[...]

cdef cnp.ndarray[cnp.float64_t, ndim=2] temporary_variable = \
    np.zeros((INPUT_SIZE, RESULT_SIZE), np.float64)
cdef cnp.ndarray[cnp.float64_t, ndim=2] result = \
    np.zeros((INPUT_SIZE, RESULT_SIZE), np.float64)

for i in prange(INPUT_SIZE, nogil=True):
    for j in range(RESULT_SIZE):
        [...]
        temporary_variable[i, j] = some_very_heavy_mathematics(my_input_array)
        result[i, j] = some_more_maths(temporary_variable[i, j])
This approach works, but my problem is that I actually need several temporary_variables, which leads to huge memory usage as INPUT_SIZE grows. What is really needed, I believe, is one temporary variable per thread.
Am I facing a limitation of Cython's prange and do I need to learn proper C, or am I doing/understanding something terribly wrong?
EDIT: The functions I was looking for were openmp.omp_get_max_threads() and openmp.omp_get_thread_num() to create a reasonably sized temporary array. I had to cimport openmp first.
This is something that Cython tries to detect, and actually gets right most of the time. If we take a more complete example code:
import numpy as np
from cython.parallel import prange

cdef double f1(double[:,:] x, int i, int j) nogil:
    return 2*x[i,j]

cdef double f2(double y) nogil:
    return y+10

def example_function(double[:,:] arr_in):
    cdef double[:,:] result = np.zeros((arr_in.shape[0], arr_in.shape[1]))
    cdef double temporary_variable
    cdef int i,j
    for i in prange(arr_in.shape[0], nogil=True):
        for j in range(arr_in.shape[1]):
            temporary_variable = f1(arr_in,i,j)
            result[i,j] = f2(temporary_variable)
    return result
(this is basically the same as yours, but compilable). This compiles to the C code:
#pragma omp for firstprivate(__pyx_v_i) lastprivate(__pyx_v_i) lastprivate(__pyx_v_j) lastprivate(__pyx_v_temporary_variable)
#endif /* _OPENMP */
for (__pyx_t_8 = 0; __pyx_t_8 < __pyx_t_9; __pyx_t_8++){
You can see that temporary_variable is set to be thread-local. If Cython does not detect this correctly (I find it's often too keen to make variables a reduction) then my suggestion is to encapsulate (some of) the contents of the loop in a function:
cdef double loop_contents(double[:,:] arr_in, int i, int j) nogil:
    cdef double temporary_variable
    temporary_variable = f1(arr_in,i,j)
    return f2(temporary_variable)
Doing so forces temporary_variable to be local to the function (and hence to the thread).
With respect to creating a thread-local array: I'm not 100% clear exactly what you want to do but I'll try to take a guess...
I don't believe it's possible to create a thread-local memoryview.
You could create a thread-local C array with malloc and free but unless you have a good understanding of C then I would not recommend it.
The easiest way is to allocate a 2D array where you have one column for each thread. The array is shared, but since each thread only touches its own column that doesn't matter. A simple example:
cdef double[:] f1(double[:,:] x, int i) nogil:
    return x[i,:]

def example_function(double[:,:] arr_in):
    cdef double[:,:] temporary_variable = np.zeros((arr_in.shape[1], openmp.omp_get_max_threads()))
    cdef int i
    for i in prange(arr_in.shape[0], nogil=True):
        temporary_variable[:, openmp.omp_get_thread_num()] = f1(arr_in, i)
I am using Cython for fast parallel processing of data, adding items to a shared-memory linked list from multiple threads. I use __sync_bool_compare_and_swap, which provides an atomic compare-and-swap (CAS) operation: it checks that the value was not modified (by another thread) before replacing it with a new value.
cdef extern int __sync_bool_compare_and_swap(void **ptr, void *oldval, void *newval) nogil

cdef bint firstAttempt = 1
cdef type *next = NULL
cdef type *newlink = ....

while firstAttempt or not __sync_bool_compare_and_swap(<void**> c, <void*>next, <void*>newlink):
    firstAttempt = 0
    next = c[0]
    newlink.next = next
This works very well. However, now I also want to keep track of the size of the linked list, using the same CAS function for the updates; this time, though, it is not a pointer that needs to be updated but an int. How can I use the same external function twice in Cython, once with void** parameters and once with int* parameters?
EDIT
What I have in mind is two separate atomic operations, in one atomic operation I want to update the linked list, in the other I want to update the size. You can do it in C, but for Cython it means you have to reference the same external function twice with different parameters, can that be done?
CONCLUSION
The answer suggested by DavidW works. In case anyone is thinking of using a similar construction, be aware that with two separate update functions there is no guarantee they are processed in sequence (i.e. another thread can update in between). However, if the objective is to update a cumulative value, for instance to monitor progress while multithreading or to build an aggregated result that is not used until all threads have finished, CAS does guarantee that all updates are done exactly once. Unexpectedly, gcc refuses to compile without a cast to void*, so either define separate hard-typed versions or cast. A snippet from my code:
in some_header.h:
#define sync_bool_compare_and_swap_int __sync_bool_compare_and_swap
#define sync_bool_compare_and_swap_vp __sync_bool_compare_and_swap
in some_prog.pxd:
cdef extern from "some_header.h":
    cdef extern int sync_bool_compare_and_swap_vp(void **ptr, void *oldval, void *newval) nogil
    cdef extern int sync_bool_compare_and_swap_int(int *ptr, int oldval, int newval) nogil
in some_prog.pyx:
cdef void updateInt(int *value, int incr) nogil:
    cdef cINT previous = value[0]
    cdef cINT updated = previous + incr
    while not sync_bool_compare_and_swap_int(value, previous, updated):
        previous = value[0]
        updated = previous + incr
So the issue (as I understand it) is that __sync_bool_compare_and_swap is a compiler intrinsic rather than a function, so it doesn't really have a fixed signature; the compiler just figures the types out. However, Cython demands to know the types, and because you want to use the intrinsic with two different types, you have a problem.
I can't see a simpler way than resorting to a (very) small amount of C to "help" Cython. Create a header file with a bunch of #defines
/* compare_swap.h */
#define sync_bool_compare_swap_voidp __sync_bool_compare_and_swap
#define sync_bool_compare_swap_int __sync_bool_compare_and_swap
You can then tell Cython that each of these is a separate function
cdef extern from "compare_swap.h":
    int sync_bool_compare_swap_voidp(void**, void*, void*)
    int sync_bool_compare_swap_int(int*, int, int)
At this stage you should be able to use them naturally as plain functions without any type casting (i.e. no <void**> in your code, since this tends to hide real errors). The C preprocessor will generate the code you want and all is well.
Edit: Looking at this a few years later I can see a couple of simpler ways you could probably use (untested, but I don't see why they shouldn't work). First you could use Cython's ability to map a name to a "cname" to avoid the need for an extra header:
cdef extern from *:
    int sync_bool_compare_swap_voidp(void**, void*, void*) "__sync_bool_compare_and_swap"
    int sync_bool_compare_swap_int(int*, int, int) "__sync_bool_compare_and_swap"
Second (and probably best) you could use a single generic definition (just telling Cython that it's a varargs function):
cdef extern from "compare_swap.h":
    int __sync_bool_compare_and_swap(...)
This way Cython won't try to understand the types used, and will just defer it entirely to C (which is what you want).
I wouldn't like to comment on whether it's safe for you to use two atomic operations like this, or whether that will pass through a state with dangerously inconsistent data....
I am searching for a memory leak in code of someone else. I found:
def current(self):
    ...
    data = PyBuffer_New(buflen)
    PyObject_AsCharBuffer(data, &data_ptr, &buflen)
    ...
    return VideoFrame(data, self.frame_size, self.frame_mode,
                      timestamp=<double>self.frame.pts/<double>AV_TIME_BASE,
                      frameno=self.frame.display_picture_number)

cdef class VideoFrame:
    def __init__(self, data, size, mode, timestamp=0, frameno=0):
        self.data = data
        ...
In current() there is no free or similar, nor is there one in VideoFrame. Is the PyBuffer automatically freed when the VideoFrame object gets deleted?
The answer is: "it depends; we don't have enough code to tell from your question." It's governed by what type you've told Cython that PyBuffer_New returns. I'll give two simplified illustrating cases and hopefully you should be able to work it out for your more complicated case.
If you tell Cython that it's a PyObject* it has no innate knowledge of that type, and doesn't do anything to keep track of the memory:
# BAD - memory leak!
cdef extern from "Python.h":
    ctypedef struct PyObject
    PyObject* PyBuffer_New(int size)

def test():
    cdef int i
    for i in range(100000):  # call lots of times to allocate lots of memory
        # (the type of a is automatically inferred to be PyObject*
        # to match the function definition)
        a = PyBuffer_New(1000)
and the generated code for the loop pretty much looks like:
for (__pyx_t_1 = 0; __pyx_t_1 < 1000; __pyx_t_1+=1) {
  __pyx_v_i = __pyx_t_1;
  __pyx_v_a = PyBuffer_New(1000);
}
i.e. memory is being allocated but never freed. If you run test() and look at a task manager you can see the memory usage jump up and not return.
Alternatively, if you tell Cython it's an object, Cython deals with it like any other Python object and manages the reference count correctly:
# Good - no memory leak
cdef extern from "Python.h":
    object PyBuffer_New(int size)

def test():
    cdef int i
    for i in range(100000):  # call lots of times to allocate lots of memory
        a = PyBuffer_New(1000)
The generated code for the loop is then
for (__pyx_t_1 = 0; __pyx_t_1 < 100000; __pyx_t_1+=1) {
  __pyx_v_i = __pyx_t_1;
  __pyx_t_2 = PyBuffer_New(1000); if (unlikely(!__pyx_t_2)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 7; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  __Pyx_GOTREF(__pyx_t_2);
  __Pyx_XDECREF_SET(__pyx_v_a, __pyx_t_2);
  __pyx_t_2 = 0;
}
Note the DECREF, which will be where the object is deallocated. If you run test() here you see no long-term jump in memory usage.
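The same lifetime behaviour can be observed from pure Python, using weakref.finalize as a deallocation probe (a minimal sketch; it relies on CPython's eager refcount-based deallocation, and the Buf class is just a stand-in for any refcounted object):

```python
import weakref

class Buf:
    """Stand-in for an object whose lifetime is governed by refcounting."""

freed = []
b = Buf()
weakref.finalize(b, freed.append, "freed")  # runs when the object is destroyed
b = None          # drop the last strong reference
print(freed)      # ['freed'] - CPython deallocates as soon as the refcount hits 0
```

This is exactly what the DECREF in the generated code achieves: once the last reference is dropped, the buffer is freed without any explicit call.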
It may be possible to jump between these two cases by using cdef for a variable (for example in the definition of VideoFrame). If they use PyObject* without careful DECREFs then they're probably leaking memory...