How to free memory from Gst.Buffer.extract_dup in Python using GLib.free()

I am running object detection on buffers from GStreamer and am using gst_buffer_extract_dup to create an image array from a GStreamer buffer. Here is a code snippet:
gstbuffer = gstsample.get_buffer()
caps_format = gstsample.get_caps().get_structure(0)  # Gst.Structure
frmt_str = caps_format.get_value('format')
video_format = GstVideo.VideoFormat.from_string(frmt_str)
p, q = caps_format.get_value('width'), caps_format.get_value('height')
buf = gstbuffer.extract_dup(0, gstbuffer.get_size())
array = np.ndarray(shape=(q, p, 3), buffer=buf, dtype='uint8')
svg = self.user_function(gstbuffer, array, self.src_size, self.get_box())
I have discovered a substantial memory leak that crashes the program within 10 minutes, and I have identified extract_dup as the likely cause, since the GStreamer documentation says the duplicated memory must be freed with g_free. The (potential) problem is that I cannot figure out the syntax for doing this. Trying GLib.free(buf) results in the error:
ValueError: Pointer arguments are restricted to integers, capsules, and None. See: https://bugzilla.gnome.org/show_bug.cgi?id=683599
How would I free this memory? Furthermore, how can I confirm that this memory isn't being freed and is the cause of my leak?

Related

Local memory for each CUDA thread

I have a simple program below. My question is: where is "temp" actually stored? Is it in global or local memory? I need the array temp for each idx so that every thread has an individual temp array. In this case, it works properly. But in my actual program, when I tried to fill temp[0] from test2, the program stopped. Suppose we have 1024 threads; then it only runs the kernel for around 200 threads. So, I am wondering whether temp is shared or not. If yes, maybe there is a collision there. I also did not get any error message. Please, someone explain this.
#include <cstdio>

__device__ void test2(int temp[], int idx) {
    temp[0] = idx;
    printf("%d ", temp[0]);
}

__global__ void test() {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int *temp = (int *) malloc(100 * sizeof(int));  // per-thread device-heap allocation
    test2(temp, idx);
}

int main() {
    test<<<1, 1024>>>();
    cudaDeviceSynchronize();  // wait for the kernel so its printf output is flushed
    return 0;
}
My question is that where is "temp" actually stored?
The allocation for temp is stored in a place called the device heap. It is a form of global memory. However the temp variable itself (i.e. the pointer value) is in local memory - not shared or visible to other threads.
I need array temp for each idx so that every thread has individual array temp.
You will get that, subject to caveats below. Each thread will have its own individual array, referenced by its local variable temp. Each thread will have a separate allocation for storage on the device heap.
People commonly have problems with in-kernel new or malloc. One of the main reasons is that the device heap is initially limited to 8MB, across all of your device heap allocations. So if enough threads make enough allocation requests via new or malloc, you will run out of space.
When you run out of space, the API way to signal that is to return a zero pointer value for the allocation (a NULL pointer). If you then attempt to use this NULL pointer, you will have trouble.
For debugging purposes (i.e. to prove this is happening), test the pointer for NULL (i.e. == 0) before using it. If it is NULL, don't use it (perhaps print an error message instead).
You can read more about this in the documentation or in many questions here on the SO cuda tag. If you read any of these sources, you will discover that you can increase the size of the device heap.
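As a minimal sketch of both suggestions (the NULL check, and raising the device heap limit with cudaDeviceSetLimit; the 128MB figure here is just an illustrative choice):

#include <cstdio>

__global__ void test() {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int *temp = (int *) malloc(100 * sizeof(int));
    if (temp == NULL) {
        // Device heap exhausted: report and bail out instead of dereferencing.
        printf("thread %d: malloc returned NULL\n", idx);
        return;
    }
    temp[0] = idx;
    free(temp);  // return the allocation to the device heap
}

int main() {
    // Raise the device heap above the default 8MB before any kernel launch.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128 * 1024 * 1024);
    test<<<1, 1024>>>();
    cudaDeviceSynchronize();
    return 0;
}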

ESP8266 Fails to add char to a very long String (>8000 chars)

After correctly getting the payload from an HTTPS request, adding the client's chars to a String stops after about 8000 characters, then resumes and stops again a few times.
Here's a snippet of my code:
long streamSize = 0;
Serial.println("Now reading payload...");
while (stream.connected()) {
    while (stream.available() > 0) {
        char ch = (char)stream.read();
        Serial.println((String)"Reading [" + ++streamSize + "] " + ch);
        ret += ch;
        Serial.println(ret.length());
    }
}
Which works fine, until:
Reading [8685] t
8685
Reading [8686] r
8686
Reading [8687] u
8687
Reading [8688] m
8687
Reading [8689]
8687
Reading [8690] e
8687
[Resumes correctly appending chars]
Reading [9226] i
8748
Reading [9227] p
8749
Reading [9228] t
8750
Reading [9229] i
8751
Reading [9230] o
8751
Reading [9231] n
8751
And so on for quite a few times.
Memory Heap size doesn't seem to be the issue, as I'm getting 14128 free bytes from system_get_free_heap_size() after appending everything.
I'm using a Wemos D1 R1, and this is the file I'm trying to fully read, for testing, using the Github API
I've found that Arduino can fail to concatenate strings when free memory is low. Moreover, the String class in Arduino does not seem to have an error handler, so it can fail silently when, for example, memory is too fragmented.
See this thread from the Arduino forum and this discussion on Stack Overflow.
In many cases they suggest pre-allocating the buffer with a call to String reserve().
You may not know in advance how big your string will grow, but you may be able to manage that: for example, by calling your HTTPS target twice. The first time just to learn how big the answer will be (so you can reserve the exact amount of memory); the second time to actually read it.
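As a sketch of the reserve() approach (assuming the ESP8266 HTTPClient API, where getSize() reports the Content-Length, or -1 if unknown):

#include <ESP8266WiFi.h>
#include <ESP8266HTTPClient.h>

// Read the whole payload into a String, reserving capacity up front so
// that += never has to reallocate (and silently fail) mid-stream.
String readPayload(HTTPClient& https, WiFiClient& stream) {
    String ret;
    int expected = https.getSize();  // Content-Length, or -1 if unknown
    if (expected > 0 && !ret.reserve(expected)) {
        Serial.println("reserve() failed: not enough contiguous heap");
        return ret;
    }
    while (stream.connected() || stream.available() > 0) {
        while (stream.available() > 0) {
            ret += (char)stream.read();
        }
    }
    return ret;
}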

Why is it slow when I parse Google protocol buffer messages in multiple threads?

I am trying to parse many Google protocol buffer messages from a binary file generated by calling SerializeToString. I first load all the bytes into heap memory by calling new. I also have two arrays storing, for each message, its starting offset in the heap memory and its byte count.
Then I parse the messages by calling ParseFromString. I want to speed up the procedure by using multiple threads.
To each thread, I pass a start index and an end index into the offset array and the byte-count array.
In the parent process, the main code is:
struct ParsePara
{
    char* str_buffer;                      // heap memory holding the raw message bytes
    size_t* buffer_offset;                 // begin offset of each message
    size_t* binary_string_length_array;    // byte count of each message
    size_t start_idx;
    size_t end_idx;
    Flight_Ticket_Info* ticket_info_buffer_array;
};

// Flight_Ticket_Info is the message class
// offset_size is the number of messages
ticket_array = new Flight_Ticket_Info[offset_size];

const int max_thread_count = 6;
pthread_t pthread_id_vec[max_thread_count];

CTimer thread_cost;
thread_cost.start();

vector<ParsePara*> para_vec;
const size_t each_count = ceil(float(offset_size) / max_thread_count);
for (size_t k = 0; k < max_thread_count; k++)
{
    size_t start_idx = each_count * k;
    size_t end_idx = each_count * (k + 1);
    if (start_idx >= offset_size)
        break;
    if (end_idx >= offset_size)
        end_idx = offset_size;
    ParsePara* cand_para_ptr = new ParsePara();
    if (!cand_para_ptr)
    {
        _ERROR_EXIT(0, "[Malloc memory fail.]");
    }
    cand_para_ptr->str_buffer = m_valdata;                            // heap memory storing the message bytes
    cand_para_ptr->buffer_offset = offset_array;                      // begin offset of each message
    cand_para_ptr->start_idx = start_idx;
    cand_para_ptr->end_idx = end_idx;
    cand_para_ptr->ticket_info_buffer_array = ticket_array;           // array receiving the parsed messages
    cand_para_ptr->binary_string_length_array = binary_length_array;  // byte count of each message
    para_vec.push_back(cand_para_ptr);
}

for (size_t k = 0; k < para_vec.size(); k++)
{
    int ret = pthread_create(&pthread_id_vec[k], NULL, parserFlightTicketForMultiThread, para_vec[k]);
    if (0 != ret)
    {
        _ERROR_EXIT(0, "[Error] [create thread fail]");
    }
}

for (size_t k = 0; k < para_vec.size(); k++)
{
    pthread_join(pthread_id_vec[k], NULL);
}
In each thread the thread function is:
void* parserFlightTicketForMultiThread(void* void_para_ptr)
{
    ParsePara* para_ptr = (ParsePara*) void_para_ptr;
    parserFlightTicketForMany(para_ptr->str_buffer, para_ptr->ticket_info_buffer_array, para_ptr->buffer_offset,
        para_ptr->start_idx, para_ptr->end_idx, para_ptr->binary_string_length_array);
    return NULL;  // a pthread start routine must return a value
}

void parserFlightTicketForMany(const char* str_buffer, Flight_Ticket_Info* ticket_info_buffer_array,
    size_t* buffer_offset, const size_t start_idx, const size_t end_idx, size_t* binary_string_length_array)
{
    printf("start_idx:%zu,end_idx:%zu\n", start_idx, end_idx);
    for (size_t k = start_idx; k < end_idx; k++)
    {
        if (k % 100000 == 0)
            cout << k << endl;
        size_t cand_offset = buffer_offset[k];
        size_t binary_length = binary_string_length_array[k];
        ticket_info_buffer_array[k].ParseFromString(string(&str_buffer[cand_offset], binary_length - 1));
    }
    printf("done %zu %zu\n", start_idx, end_idx);
}
But the multi-thread version costs more than one thread: one thread costs 40455623 ms, while on my 8-core computer six threads cost 131586865 ms.
Can anyone help me? Thank you!
Some possible problems -- you'll have to experiment to determine which:
Protobuf parsing speed is often limited by memory bandwidth rather than CPU time, especially with a large input data set. In that case, more threads won't help, since all the cores are sharing bandwidth to main memory. Indeed, having multiple cores fighting over memory bandwidth could make the overall operation slower. Note that the biggest consumer of memory is not the input bytes but rather the parsed data objects -- that is, the output of parsing -- which are many times larger than the encoded data. To improve this problem, consider writing the parsing loop so that it fully processes each message immediately after parsing, before moving on to the next message. That way, instead of allocating k protobuf objects, you only need to allocate one protobuf object per thread, and repeatedly reuse the same object for parsing. This way the object will (probably) stay in the core's private L1 cache and avoid consuming memory bandwidth; only the input bytes will be read over the main bus. (A sketch of this parse-consume loop appears at the end of this answer.)
How are you loading data into RAM? Did you read() into a large array or did you mmap()? In the latter case the data is read from disk lazily -- it won't happen until you actually attempt to parse it. Even in the read() case, it could be that the data has been swapped out, creating similar effects. Either way, your threads are now not just fighting for memory bandwidth, but disk bandwidth, which is of course much slower. Having six threads reading separate parts of a big file will definitely be slower overall than having one thread read the whole file, because the operating system optimizes for sequential access.
Protobuf allocates memory during parsing. Many memory allocators take a lock while allocating new memory. Since all your threads are allocating tons and tons of objects in a tight loop, they will contend for this lock. Make sure you are using a thread-friendly memory allocator, such as Google's tcmalloc. Note that repeatedly reusing the same protobuf object in a parse-consume loop rather than allocating lots of different objects will also help immensely here, because the protobuf object will automatically reuse memory for sub-objects.
There may be a bug in your code and it might not be doing what you expect at all when multithreaded. For example, a bug might be causing all the threads to process the same data, rather than different data, and it could be that the data they're choosing happens to be bigger. Make sure you are testing that the results of your code are exactly the same when you run single-threaded vs. multi-threaded.
In short, if you want multiple cores to make your code faster, you have to think about not just what each core is doing, but what data is going in and out of each core, and how much the cores have to talk to each other. Ideally you want each core to operate all on its own without talking to anyone or anything; then you get maximum parallelism. That's not usually possible, of course, but the closer you can get to that, the better.
BTW, a random optimization for you:
ParseFromString(string(&str_buffer[cand_offset],binary_length-1))
Replace that with:
ParseFromArray(&str_buffer[cand_offset],binary_length-1)
Creating a std::string makes a copy of the data, which wastes time (and memory bandwidth). (This doesn't explain why threading is slow, though.)
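Putting the two suggestions together, a minimal sketch of the parse-consume loop per thread (consume() is a hypothetical stand-in for whatever per-message processing you do; the other names come from the question):

// One reusable message object per thread; each message is fully processed
// right after parsing instead of being stored in a big output array.
Flight_Ticket_Info info;
for (size_t k = start_idx; k < end_idx; k++)
{
    // ParseFromArray clears the message first and reuses its memory,
    // so sub-object allocations are amortized across iterations.
    if (!info.ParseFromArray(&str_buffer[buffer_offset[k]],
                             binary_string_length_array[k] - 1))
        continue;   // handle/report the parse error as appropriate
    consume(info);  // hypothetical: fully process the message here
}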

Accessing GPU memory in OpenCL/C++Amp

I need to find information about how the unified shader array accesses GPU memory, to get an idea of how to use it effectively. The image of the architecture of my graphics card doesn't show it clearly.
I need to load a big image into GPU memory using C++ AMP and divide it into small pieces (like 4x4 pixels). Every piece should be computed with a different thread. I don't know how the threads share access to the image.
Is there any way of doing it such that the threads aren't blocking each other while accessing the image? Maybe they have their own memory that can be accessed exclusively?
Or maybe the access to the unified memory is so fast that I shouldn't care about it (however, I don't believe that)? This is really important, because I need to compute about 10k subsets for every image.
For C++ AMP you want to load the data that each thread within a tile uses into tile_static memory before starting your convolution calculation. Because each thread accesses pixels which are also read by other threads, this allows you to do a single read for each pixel from (slow) global memory and cache it in (fast) tile_static memory, so that all of the subsequent reads are faster.
You can see an example of tiling for convolution here. The DetectEdgeTiled method loads all the data that it requires and then calls idx.barrier.wait() to ensure all the threads have finished writing data into tile_static memory. Then it executes the edge detection code, taking advantage of tile_static memory. There are many other examples of this pattern in the samples. Note that the loading code in DetectEdgeTiled is complex only because it must account for the additional pixels around the edge of the pixels being written in the current tile; it is essentially an unrolled loop, hence its length.
I'm not sure you are thinking about the problem in quite the right way. There are two levels of partitioning here. To calculate the new value for each pixel, the thread doing this work reads the block of surrounding pixels. In addition, blocks (tiles) of threads load larger blocks of pixel data into tile_static memory. Each thread on the tile then calculates the result for one pixel within the block.
void ApplyEdgeDetectionTiledHelper(const array<ArgbPackedPixel, 2>& srcFrame,
                                   array<ArgbPackedPixel, 2>& destFrame)
{
    tiled_extent<tileSize, tileSize> computeDomain = GetTiledExtent(srcFrame.extent);
    parallel_for_each(computeDomain.tile<tileSize, tileSize>(),
        [=, &srcFrame, &destFrame](tiled_index<tileSize, tileSize> idx) restrict(amp)
    {
        DetectEdgeTiled(idx, srcFrame, destFrame);
    });
}

void DetectEdgeTiled(
    tiled_index<tileSize, tileSize> idx,
    const array<ArgbPackedPixel, 2>& srcFrame,
    array<ArgbPackedPixel, 2>& destFrame) restrict(amp)
{
    const UINT shift = imageBorderWidth / 2;
    const UINT startHeight = 0;
    const UINT startWidth = 0;
    const UINT endHeight = srcFrame.extent[0];
    const UINT endWidth = srcFrame.extent[1];

    tile_static RgbPixel localSrc[tileSize + imageBorderWidth]
                                 [tileSize + imageBorderWidth];

    const UINT global_idxY = idx.global[0];
    const UINT global_idxX = idx.global[1];
    const UINT local_idxY = idx.local[0];
    const UINT local_idxX = idx.local[1];
    const UINT local_idx_tsY = local_idxY + shift;
    const UINT local_idx_tsX = local_idxX + shift;

    // Copy image data to tile_static memory. The if clauses are required to deal with threads that own a
    // pixel close to the edge of the tile and need to copy additional halo data.

    // This pixel
    index<2> gNew = index<2>(global_idxY, global_idxX);
    localSrc[local_idx_tsY][local_idx_tsX] = UnpackPixel(srcFrame[gNew]);

    // Left edge
    if (local_idxX < shift)
    {
        index<2> gNew = index<2>(global_idxY, global_idxX - shift);
        localSrc[local_idx_tsY][local_idx_tsX - shift] = UnpackPixel(srcFrame[gNew]);
    }
    // Right edge
    // Top edge
    // Bottom edge
    // Top Left corner
    // Bottom Left corner
    // Bottom Right corner
    // Top Right corner

    // Synchronize all threads so that none of them start calculation before
    // all data is copied onto the current tile.
    idx.barrier.wait();

    // Make sure that the thread is not referring to a border pixel
    // for which the filter cannot be applied.
    if ((global_idxY >= startHeight + 1 && global_idxY <= endHeight - 1) &&
        (global_idxX >= startWidth + 1 && global_idxX <= endWidth - 1))
    {
        RgbPixel result = Convolution(localSrc, index<2>(local_idx_tsY, local_idx_tsX));
        destFrame[index<2>(global_idxY, global_idxX)] = result;
    }
}
This code was taken from CodePlex and I stripped out a lot of the real implementation to make it clearer.
Regarding sharpneli's answer: you can use texture<> in C++ AMP to achieve the same result as OpenCL images. There is also an example of this on CodePlex.
In this particular case you do not have to worry. Just use OpenCL images. GPUs are extremely good at simply reading images (due to texturing). However, this method requires writing the result into a separate image, because you cannot read from and write to the same image in a single kernel. You should use this if you can perform the computation as a single pass (no need to iterate).
Another way is to access it as a normal memory buffer, load the parts within a wavefront (a group of threads running in sync) into local memory (this memory is blazingly fast), perform the computation, and write the complete end result back into unified memory afterwards. You should use this approach if you need to read and write values to the same image while computing. If you are not memory bound, you can still read the original values from a texture, then iterate in local memory and write the end results into a separate image.
Reads from unified memory are slow only if the pointer is not const * restrict and multiple threads read the same location. In general, if subsequent thread IDs read subsequent locations, it's rather fast. However, if your threads both write and read to unified memory, then it's going to be slow.
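A minimal sketch of that local-memory pattern as an OpenCL kernel (assuming a 16x16 work-group, no halo pixels, and a copy-through computation as a placeholder):

// Each work-item loads one pixel into fast local memory; the barrier
// ensures the whole tile is loaded before any work-item computes.
__kernel void process_tile(__global const uchar4* src,
                           __global uchar4* dst,
                           int width)
{
    __local uchar4 tile[16][16];
    int gx = get_global_id(0), gy = get_global_id(1);
    int lx = get_local_id(0),  ly = get_local_id(1);

    tile[ly][lx] = src[gy * width + gx];  // one global read per pixel
    barrier(CLK_LOCAL_MEM_FENCE);         // wait for the whole tile

    // ... compute here using tile[][] instead of re-reading global memory ...
    dst[gy * width + gx] = tile[ly][lx];  // write results to a separate buffer
}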

What is the OpenCL equivalent of this CUDA cudaMallocPitch code?

My PC has an AMD processor with an ATI 3200 GPU, which doesn't support OpenCL; the rest of the code runs by "falling back to the CPU itself".
I am converting some code from CUDA to OpenCL but am stuck on one particular part for which there is no exact equivalent in OpenCL. Since I have little experience with OpenCL I can't work this out, so please suggest a solution if you think one will work.
The CUDA code is,
size_t pitch = 0;
cudaError error = cudaMallocPitch((void**)&gpu_data, (size_t*)&pitch,
                                  instances->cols * sizeof(float), instances->rows);
for (int i = 0; i < instances->rows; i++) {
    error = cudaMemcpy((void*)(gpu_data + (pitch / sizeof(float)) * i),
                       (void*)(instances->data + (instances->cols * i)),
                       instances->cols * sizeof(float), cudaMemcpyHostToDevice);
}
If I remove the pitch value from the above, I end up with a problem: nothing is written to the device memory gpu_data.
Can somebody please convert this code to OpenCL and reply? I have converted it to OpenCL, but it's not working and the data is not written to gpu_data. My converted OpenCL code is:
gpu_data = clCreateBuffer(context, CL_MEM_READ_WRITE,
                          ((instances->cols) * (instances->rows)) * sizeof(float), NULL, &ret);
for (int i = 0; i < instances->rows; i++) {
    ret = clEnqueueWriteBuffer(command_queue, gpu_data, CL_TRUE, 0,
                               ((instances->cols) * (instances->rows)) * sizeof(float),
                               (void*)(instances->data + (instances->cols * i)), 0, NULL, NULL);
}
Sometimes it runs well with this code, then gets stuck in the reading part, i.e.
ret = clEnqueueReadBuffer(command_queue, gpu_data, CL_TRUE, 0, sizeof(float) * instances->cols * 1, instances->data, 0, NULL, NULL);
over here, and it gives an error like:
Unhandled exception at 0x10001098 in CL_kmeans.exe: 0xC000001D: Illegal Instruction.
When Break is pressed, it gives:
No symbols are loaded for any call stack frame. The source code cannot be displayed.
while debugging. In the call stack it is displaying:
OCL8CA9.tmp.dll!10001098()
[Frames below may be incorrect and/or missing, no symbols loaded for OCL8CA9.tmp.dll]
amdocl.dll!5c39de16()
I really don't know what it means. Can someone please help me get rid of this problem?
First of all, in the CUDA code you're doing a horribly inefficient thing to copy the data. The CUDA runtime has the function cudaMemcpy2D that does exactly what you are trying to do by looping over different rows.
What cudaMallocPitch does is to compute an optimal pitch (= distance in bytes between rows in a 2D array) such that each new row begins at an address that is optimal for coalescing, and then allocates a memory area as large as pitch times the number of rows you specify. You can emulate the same thing in OpenCL by first computing the optimal pitch and then doing the allocation of the correct size.
The optimal pitch is computed by (1) getting the base address alignment preference for your card (the CL_DEVICE_MEM_BASE_ADDR_ALIGN property from clGetDeviceInfo: note that the returned value is in bits, so you have to divide by 8 to get it in bytes); let's call this base; and (2) finding the smallest multiple of base that is no less than your natural data pitch (sizeof(type) times the number of columns); this will be your pitch.
You then allocate pitch times number of rows bytes, and pass the pitch information to kernels.
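For instance, a minimal sketch of steps (1) and (2) plus the allocation, reusing the question's variable names (a device handle is assumed to exist):

// (1) Base address alignment preference, returned in bits -> bytes.
cl_uint align_bits = 0;
clGetDeviceInfo(device, CL_DEVICE_MEM_BASE_ADDR_ALIGN,
                sizeof(align_bits), &align_bits, NULL);
size_t base = align_bits / 8;

// (2) Round the natural pitch up to the next multiple of base.
size_t natural_pitch = instances->cols * sizeof(float);
size_t pitch = ((natural_pitch + base - 1) / base) * base;

// Allocate pitch * rows bytes; pass `pitch` to your kernels as well.
cl_int ret;
cl_mem gpu_data = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                 pitch * instances->rows, NULL, &ret);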
Also, when copying data from the host to the device and conversely, you want to use clEnqueue{Read,Write}BufferRect, which are specifically designed to copy 2D data (they are the counterparts of cudaMemcpy2D).
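A sketch of the 2D copy with clEnqueueWriteBufferRect (OpenCL 1.1+), continuing the snippet above:

// Copy rows x natural_pitch bytes in one call -- the counterpart of cudaMemcpy2D.
size_t buffer_origin[3] = {0, 0, 0};
size_t host_origin[3]   = {0, 0, 0};
size_t region[3]        = {natural_pitch, (size_t)instances->rows, 1};
ret = clEnqueueWriteBufferRect(command_queue, gpu_data, CL_TRUE,
                               buffer_origin, host_origin, region,
                               pitch, 0,          // buffer row pitch, slice pitch
                               natural_pitch, 0,  // host row pitch, slice pitch
                               instances->data, 0, NULL, NULL);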
