I am searching for a memory leak in someone else's code. I found:
def current(self):
    ...
    data = PyBuffer_New(buflen)
    PyObject_AsCharBuffer(data, &data_ptr, &buflen)
    ...
    return VideoFrame(data, self.frame_size, self.frame_mode,
                      timestamp=<double>self.frame.pts/<double>AV_TIME_BASE,
                      frameno=self.frame.display_picture_number)
cdef class VideoFrame:
    def __init__(self, data, size, mode, timestamp=0, frameno=0):
        self.data = data
        ...
In the function current() there is no free or similar, and none in VideoFrame either. Is the PyBuffer automatically freed when the VideoFrame object gets deleted?
The answer is: "it depends; we don't have enough code to tell from your question." It is governed by the type you have told Cython that PyBuffer_New returns. I'll give two simplified illustrative cases, and hopefully you can work out your more complicated case from them.
If you tell Cython that it returns a PyObject*, Cython has no innate knowledge of that type and does nothing to keep track of the memory:
# BAD - memory leak!
cdef extern from "Python.h":
    ctypedef struct PyObject
    PyObject* PyBuffer_New(int size)

def test():
    cdef int i
    for i in range(100000): # call lots of times to allocate lots of memory
        # (type of a is automatically inferred to be PyObject*
        #  to match the function definition)
        a = PyBuffer_New(1000)
and the generated code for the loop pretty much looks like:
for (__pyx_t_1 = 0; __pyx_t_1 < 100000; __pyx_t_1+=1) {
  __pyx_v_i = __pyx_t_1;
  __pyx_v_a = PyBuffer_New(1000);
}
i.e. memory is being allocated but never freed. If you run test() and look at a task manager you can see the memory usage jump up and not return.
Alternatively, if you tell Cython that it returns an object, Cython deals with it like any other Python object and manages the reference count correctly:
# Good - no memory leak
cdef extern from "Python.h":
    object PyBuffer_New(int size)

def test():
    cdef int i
    for i in range(100000):  # call lots of times to allocate lots of memory
        a = PyBuffer_New(1000)
The generated code for the loop is then
for (__pyx_t_1 = 0; __pyx_t_1 < 100000; __pyx_t_1+=1) {
  __pyx_v_i = __pyx_t_1;
  __pyx_t_2 = PyBuffer_New(1000); if (unlikely(!__pyx_t_2)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 7; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
  __Pyx_GOTREF(__pyx_t_2);
  __Pyx_XDECREF_SET(__pyx_v_a, __pyx_t_2);
  __pyx_t_2 = 0;
}
Note the DECREF, which will be where the object is deallocated. If you run test() here you see no long-term jump in memory usage.
It may be possible to switch between these two cases depending on the cdef type used for the variable (for example in the definition of VideoFrame). If they use PyObject* without careful DECREFs then they're probably leaking memory...
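As an illustration (just a sketch, since the real VideoFrame definition isn't shown in the question), the attribute declaration is what decides which of the two cases above applies:
cdef class VideoFrame:
    # Declared as a generic Python object: Cython INCREFs/DECREFs it,
    # so the buffer is released when the VideoFrame is deallocated.
    cdef object data

    # If it were instead declared as a raw pointer, e.g.
    #     cdef PyObject* data
    # Cython would do no reference counting, and the buffer would leak
    # unless the code DECREFs it manually.

    def __init__(self, data, size, mode, timestamp=0, frameno=0):
        self.data = data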
I have a simple program below. My question is: where is "temp" actually stored? Is it in global or local memory? I need an array temp for each idx so that every thread has its own individual array temp. In this case, it works properly. But in my actual program, when I tried to fill temp[0] from test2, the program stopped. With 1024 threads it only ran the kernel for around 200 threads. So I am wondering whether temp is shared or not; if it is, maybe there is a collision there. I also did not get any error message. Could someone please explain this?
__device__ void test2(int temp[], int idx) {
    temp[0] = idx;
    printf("%d ", temp[0]);
}

__global__ void test() {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int *temp = (int *) malloc(100 * sizeof(int));
    test2(temp, idx);
}

int main() {
    test<<<1, 1024>>>();
    return 0;
}
My question is that where is "temp" actually stored?
The allocation for temp is stored in a place called the device heap. It is a form of global memory. However the temp variable itself (i.e. the pointer value) is in local memory - not shared or visible to other threads.
I need array temp for each idx so that every thread has individual array temp.
You will get that, subject to caveats below. Each thread will have its own individual array, referenced by its local variable temp. Each thread will have a separate allocation for storage on the device heap.
People commonly have problems with in-kernel new or malloc. One of the main reasons is that the device heap is initially limited to 8MB across all of your device heap allocations. So if enough threads make enough allocation requests with new or malloc, you will run out of space.
When you run out of space, the API way to signal that is to return a zero pointer value for the allocation (a NULL pointer). If you then attempt to use this NULL pointer, you will have trouble.
For debugging purposes (i.e. to prove this is happening), test the pointer for NULL (i.e. == 0) before using it. If it is NULL, don't use it (perhaps print an error message instead).
You can read more about this in the documentation or in many questions here on the SO cuda tag. If you read any of these sources, you will discover that you can increase the size of the device heap.
I'm using prange for parallelizing a loop in my cython code. As this loop takes a long time, I want to print its progress as it goes. A progress bar would be nice but anything that shows the progress would do. I tried to add some form of logging in the following fashion:
for index in prange(n, nogil = True, num_threads = 5):
    if index == (index//1000)*1000:
        printf("%d percent done", index*100/n)
This should print the progress whenever index % 1000 == 0.
The output of this is kind of random.
0 percent done
80 percent done
20 percent done
100 percent done
60 percent done
I assume that this is because prange does not assign threads to indexes starting from 0 and going up.
Is there any way to implement such a thing in Cython?
Thanks!
There are probably ways to achieve what you want in pure Cython; however, I find it easier to use OpenMP from C/C++ and call the functionality from Cython (thanks to verbatim C code, supported since Cython 0.28, this is quite convenient).
One thing is clear: to achieve your goal you need some kind of synchronization between threads, which might impact performance. My strategy is simple: every thread reports "done", all completed tasks are counted, and the count is reported from time to time. The counter is protected with a mutex/lock:
%%cython -c=/openmp --link-args=/openmp

cdef extern from * nogil:
    r"""
    #include <omp.h>
    #include <stdio.h>

    static omp_lock_t cnt_lock;
    static int cnt = 0;

    void reset(){
        omp_init_lock(&cnt_lock);
        cnt = 0;
    }

    void destroy(){
        omp_destroy_lock(&cnt_lock);
    }

    void report(int mod){
        omp_set_lock(&cnt_lock);
        // start protected code:
        cnt++;
        if(cnt%mod == 0){
            printf("done: %d\n", cnt);
        }
        // end protected code block
        omp_unset_lock(&cnt_lock);
    }
    """
    void reset()
    void destroy()
    void report(int mod)

from cython.parallel import prange

def prange_with_report(int n):
    reset()  # reset counter and init lock
    cdef int index
    for index in prange(n, nogil=True, num_threads=5):
        report(n//10)
    destroy()  # release lock
Now calling prange_with_report(100) would print done: 10\n, done: 20\n, ..., done: 100\n.
As a slight optimization, report could be called not for every index but only, for example, when index%100==0 - there would be less impact on performance, but the reporting would also be less precise.
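A minimal sketch of that variant (hypothetical, reusing reset/report/destroy from above; the mod argument passed to report has to be scaled down because report is now only called once per 100 indices):
def prange_with_sparse_report(int n):
    # assumes n is a multiple of 1000 so the arithmetic below stays meaningful
    reset()
    cdef int index
    for index in prange(n, nogil=True, num_threads=5):
        if index % 100 == 0:    # only every 100th index takes the lock
            report(n // 1000)   # the printed count is now in units of ~100 tasks
    destroy()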
I am trying to use Collections-C in Cython.
I noticed that some structures are defined in the .c file, and an alias for them is in the .h file. When I try to define those structures in a .pxd file and use them in a .pyx file, gcc throws an error: storage size of ‘[...]’ isn’t known.
I was able to reproduce my issue to a minimum setup that replicates the external library and my application:
testdef.c
/* Note: I can't change this */
struct bogus_s {
    int x;
    int y;
};
testdef.h
/* Note: I can't change this */
typedef struct bogus_s Bogus;
cytestdef.pxd
# This is my code
cdef extern from 'testdef.h':
    struct bogus_s:
        int x
        int y

    ctypedef bogus_s Bogus
cytestdef.pyx
# This is my code
def fn():
    cdef Bogus n
    n.x = 12
    n.y = 23
    print(n.x)
If I run cythonize, I get
In function ‘__pyx_pf_7sandbox_9cytestdef_fn’:
cytestdef.c:1106:9: error: storage size of ‘__pyx_v_n’ isn’t known
Bogus __pyx_v_n;
^~~~~~~~~
I also get the same error if I use ctypedef Bogus: [...] notation as indicated in the Cython manual.
What am I doing wrong?
Thanks.
Looking at the documentation for your Collections-C library, these are opaque structures that you're supposed to use purely through pointers (you don't need to know a struct's size to hold a pointer to it, but you do need it to allocate one on the stack). Allocation of these structures is done inside the library functions.
To change your example to match this case:
// C file
int bogus_s_new(struct bogus_s** v) {
    *v = malloc(sizeof(struct bogus_s));
    return (*v != NULL);
}

void free_bogus_s(struct bogus_s* v) {
    free(v);
}
Your H file would contain the declarations for those and your pxd file would contain wrappers for the declarations. Then in Cython:
def fn():
    cdef Bogus* n
    if not bogus_s_new(&n):
        return
    try:
        # You CANNOT access x and y since the type is
        # designed to be opaque. Instead you should use
        # the accessor functions defined in the header:
        # n.x = 12
        # n.y = 23
        pass
    finally:
        free_bogus_s(n)
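For completeness, the matching .pxd might then look roughly like this (a sketch, assuming the functions above are declared in testdef.h):
# cytestdef.pxd (sketch)
cdef extern from 'testdef.h':
    struct bogus_s:
        pass                  # opaque: no members exposed
    ctypedef bogus_s Bogus

    int bogus_s_new(Bogus** v)
    void free_bogus_s(Bogus* v)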
I have some independent computations I would like to do in parallel using Cython.
Right now I'm using this approach:
import numpy as np
cimport numpy as cnp
from cython.parallel import prange

[...]

cdef cnp.ndarray[cnp.float64_t, ndim=2] temporary_variable = \
    np.zeros((INPUT_SIZE, RESULT_SIZE), np.float64)
cdef cnp.ndarray[cnp.float64_t, ndim=2] result = \
    np.zeros((INPUT_SIZE, RESULT_SIZE), np.float64)

for i in prange(INPUT_SIZE, nogil=True):
    for j in range(RESULT_SIZE):
        [...]
        temporary_variable[i, j] = some_very_heavy_mathematics(my_input_array)
        result[i, j] = some_more_maths(temporary_variable[i, j])
This approach works, but my problem is that I actually need several temporary_variables, which results in huge memory usage as INPUT_SIZE grows. I believe what is really needed is one temporary variable per thread instead.
Am I facing a limitation of Cython's prange and do I need to learn proper C or am I doing/understanding something terribly wrong?
EDIT: The functions I was looking for were openmp.omp_get_max_threads() and openmp.omp_get_thread_num() to create a reasonably sized temporary array. I had to cimport openmp first.
This is something that Cython tries to detect, and actually gets right most of the time. If we take a more complete example code:
import numpy as np
from cython.parallel import prange

cdef double f1(double[:,:] x, int i, int j) nogil:
    return 2*x[i,j]

cdef double f2(double y) nogil:
    return y+10

def example_function(double[:,:] arr_in):
    cdef double[:,:] result = np.zeros((arr_in.shape[0], arr_in.shape[1]))
    cdef double temporary_variable
    cdef int i, j
    for i in prange(arr_in.shape[0], nogil=True):
        for j in range(arr_in.shape[1]):
            temporary_variable = f1(arr_in, i, j)
            result[i, j] = f2(temporary_variable)
    return result
(this is basically the same as yours, but compilable). This compiles to the C code:
#pragma omp for firstprivate(__pyx_v_i) lastprivate(__pyx_v_i) lastprivate(__pyx_v_j) lastprivate(__pyx_v_temporary_variable)
#endif /* _OPENMP */
for (__pyx_t_8 = 0; __pyx_t_8 < __pyx_t_9; __pyx_t_8++){
You can see that temporary_variable is set to be thread-local. If Cython does not detect this correctly (I find it's often too keen to make variables a reduction) then my suggestion is to encapsulate (some of) the contents of the loop in a function:
cdef double loop_contents(double[:,:] arr_in, int i, int j) nogil:
    cdef double temporary_variable
    temporary_variable = f1(arr_in, i, j)
    return f2(temporary_variable)

Doing so forces temporary_variable to be local to the function (and hence to the thread).
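For example, the loop body of example_function then reduces to a single call (a sketch reusing the imports and definitions above):
def example_function_v2(double[:,:] arr_in):
    cdef double[:,:] result = np.zeros((arr_in.shape[0], arr_in.shape[1]))
    cdef int i, j
    for i in prange(arr_in.shape[0], nogil=True):
        for j in range(arr_in.shape[1]):
            result[i, j] = loop_contents(arr_in, i, j)
    return result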
With respect to creating a thread-local array: I'm not 100% clear exactly what you want to do but I'll try to take a guess...
I don't believe it's possible to create a thread-local memoryview.
You could create a thread-local C array with malloc and free, but unless you have a good understanding of C, I would not recommend it.
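For reference, a rough sketch of that malloc/free pattern (closely following the parallel() example in the Cython documentation; names here are hypothetical):
from cython.parallel import parallel, prange
from libc.stdlib cimport malloc, free, abort

def with_local_buffer(double[:,:] arr_in):
    cdef Py_ssize_t i, j
    cdef double* scratch
    with nogil, parallel():
        # assigned inside parallel(), so scratch is private to each thread
        scratch = <double*> malloc(arr_in.shape[1] * sizeof(double))
        if scratch is NULL:
            abort()
        for i in prange(arr_in.shape[0]):
            for j in range(arr_in.shape[1]):
                scratch[j] = 2 * arr_in[i, j]
            # ... use scratch here ...
        free(scratch)  # each thread frees its own buffer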
The easiest way is to allocate a 2D array where you have one column for each thread. The array is shared, but since each thread only touches its own column that doesn't matter. A simple example:
cimport openmp
import numpy as np
from cython.parallel import prange

cdef double[:] f1(double[:,:] x, int i) nogil:
    return x[i,:]

def example_function(double[:,:] arr_in):
    cdef double[:,:] temporary_variable = np.zeros((arr_in.shape[1], openmp.omp_get_max_threads()))
    cdef int i
    for i in prange(arr_in.shape[0], nogil=True):
        temporary_variable[:, openmp.omp_get_thread_num()] = f1(arr_in, i)
I am struggling to initialise thread-local ndarrays with cython.parallel:
Pseudo-code:
cdef:
    ndarray buffer

with nogil, parallel():
    buffer = np.empty(...)

    for i in prange(n):
        with gil:
            print "Thread %d: data address: 0x%x" % (threadid(), <uintptr_t>buffer.data)
        some_func(buffer.data)  # use thread-local buffer

cdef void some_func(char * buffer_ptr) nogil:
    (... works on buffer contents...)
My problem is that in all threads buffer.data points to the same address. Namely the address of the thread that last assigned buffer.
Despite buffer being assigned within the parallel() (or alternatively prange) block, Cython does not make buffer a private or thread-local variable, but keeps it as a shared variable.
As a result, buffer.data points to the same memory region wreaking havoc on my algorithm.
This is not a problem exclusively with ndarray objects but seemingly with all cdef class defined objects.
How do I solve this problem?
I think I have finally found a solution to this problem that I like.
The short version is that you create an array that has shape:
(number_of_threads, ...<whatever shape you need in the thread>...)
Then call openmp.omp_get_thread_num and use that to index the array to get a "thread-local" sub-array. This avoids having a separate array for every loop index (which could be enormous) while still preventing threads from overwriting each other.
Here's a rough version of what I did:
import numpy as np
import multiprocessing
from cython.parallel cimport parallel
from cython.parallel import prange
cimport openmp

cdef extern from "stdlib.h":
    void free(void* ptr)
    void* malloc(size_t size)
    void* realloc(void* ptr, size_t size)

...

cdef int num_items = ...
num_threads = multiprocessing.cpu_count()

result_array = np.zeros((num_threads, num_items), dtype=DTYPE)  # Make sure each thread uses separate memory
cdef c_numpy.ndarray result_cn
cdef CDTYPE ** result_pointer_arr
result_pointer_arr = <CDTYPE **> malloc(num_threads * sizeof(CDTYPE *))

for i in range(num_threads):
    result_cn = result_array[i]
    result_pointer_arr[i] = <CDTYPE*> result_cn.data

cdef int thread_number
for i in prange(num_items, nogil=True, chunksize=1, num_threads=num_threads, schedule='static'):
    thread_number = openmp.omp_get_thread_num()
    some_function(result_pointer_arr[thread_number])