'PTX JIT compilation failed' from cuModuleLoadData - multithreading

Below is the code:
#define FILENAME "kernel.code"
#define kernel_name "hello_world"
#define THREADS 4
std::vector<char> load_file()
{
std::ifstream file(FILENAME, std::ios::binary | std::ios::ate);
std::streamsize fsize = file.tellg();
file.seekg(0, std::ios::beg);
std::vector<char> buffer(fsize);
if (!file.read(buffer.data(), fsize)) {
failed("could not open code object '%s'\n", FILENAME);
}
return buffer;
}
struct joinable_thread : std::thread
{
template <class... Xs>
joinable_thread(Xs&&... xs) : std::thread(std::forward<Xs>(xs)...) // NOLINT
{
}
joinable_thread& operator=(joinable_thread&& other) = default;
joinable_thread(joinable_thread&& other) = default;
~joinable_thread()
{
if(this->joinable())
this->join();
}
};
void run(const std::vector<char>& buffer) {
CUdevice device;
CUDACHECK(cuDeviceGet(&device, 0));
CUcontext context;
CUDACHECK(cuCtxCreate(&context, 0, device));
CUmodule Module;
CUDACHECK(cuModuleLoadData(&Module, &buffer[0]));
...
}
void run_multi_threads(uint32_t n) {
{
auto buffer = load_file();
std::vector<joinable_thread> threads;
for (uint32_t i = 0; i < n; i++) {
threads.emplace_back(std::thread{[&, i, buffer] {
run(buffer);
}});
}
}
}
int main() {
CUDACHECK(cuInit(0));
run_multi_threads(THREADS);
}
And the code kernel.cu used for ptx is as follows:
#include "cuda_runtime.h"
extern "C" __global__ void hello_world(float* a, float* b) {
int tx = threadIdx.x;
b[tx] = a[tx];
}
I m generating the ptx in this way
nvcc --ptx kernel.cu -o kernel.code
Im using a machine with GeForce GTX TITAN X.
And Im facing this "PTX JIT compilation failed" from cuModuleLoadData error, only when I m trying to use this with multiple threads. If i remove the multi-threading part and run normally, this error doesn't occur.
Can anyone please tell me what is going wrong and how to overcome this.

As mentioned in the comments, I was able to get it to work by moving the load_file() call to the main, so that the buffer read from the file is valid, and then pass only the buffer to all the threads.

Actually in the original code, the buffer will be deconstructed once it leaves the '{...}' scope. So when thread starts, you may read the invalid buffer.
If you put your buffer in the main, it will not be deconstructed or freed until the program exits.
So yes, it's because you pass the invalid buffer (which may have already been freed) to the cu code.

Related

C++ multi-thread memory leak issue

Once the following code is running, it will eat all my memory and cause OOM issue on my ARM 64-bit processor. it seems that some 'delete' operations do not work... it confused me. Could any experts help to analysis what is the root cause? Ttoolchain or libc?
I run the same code in another two platforms, one is x64, one is another ARM 32-bit chip; they all fine except my IPQ ARM chip.
environment:
libc version: musl libc 1.1.16
gcc version: 5.2.0
cpu chip: IPQ ARM 64bit
os: linux 4.1
#include <iostream>
#include <cstring>
#include <list>
#include <memory>
#include <pthread.h>
void *h(void *param)
{
while(1)
{
char *p = new char[4096];
delete[] p;
}
return NULL;
}
int main(int argc, char *argv)
{
pthread_t th[50];
int thread_cnt = 2;
for(int i = 0; i < thread_cnt; i++)
pthread_create(th+i, NULL, h, NULL);
for(int i = 0; i < thread_cnt; i++)
pthread_join(th+i, NULL);
return 0;
}
If I use a mutex in the header-file thread callback, the issue will disappear; there is no memory leak.
std::mutex thlock;
void *h(void *param)
{
while(1)
{
std::lock_guard<std::mutex> lk(thlock);
char *p = new char[4096];
delete[] p;
}
return NULL;
}
If I change the memory allocation size to 1024, then a weird thing happed. 2 threads do not leak memory, but 20 threads will cause memory leak. Whatever the allocation size, there's no issue with only 1 thread.
void *h(void *param)
{
while(1)
{
//char *p = new char[4096];
char *p = new char[1024];
delete[] p;
}
return NULL;
}
If I replace new/delete to std::malloc/std::free, the issue will disappear; no memory leak.
void *h(void *param)
{
while(1)
{
//char *p = new char[4096];
char *p = (char *)std::malloc(4096);
std::free(p);
//delete[] p;
}
return NULL;
}
So I thought if I overload operator new/delete and my "new/delete" allocate memory by using std::malloc/free, maybe the issue could be solved. But
the weird thing happened again; the memory leak still exists.
void* operator new(std::size_t sz) // no inline, required by [replacement.functions]/3
{
std::printf("global op new called, size = %zu\n", sz);
if (sz == 0)
++sz; // avoid std::malloc(0) which may return nullptr on success
if (void *ptr = std::malloc(sz))
return ptr;
throw std::bad_alloc{}; // required by [new.delete.single]/3
}
void operator delete(void* ptr) noexcept
{
std::puts("global op delete called");
std::free(ptr);
}
void *h(void *param)
{
while(1)
{
char *p = new char[4096];
delete[] p;
}
return NULL;
}
So I thought the memory leak may be caused by my operator new/delete rather than malloc/free, so I changed the code to look like what's shown below. I don't really call malloc/free but return a pre-allocated buffer. But the memory leak issue disappeared, so it seems the issue not only comes from new/delete...
char pp[4096];
void* operator new(std::size_t sz) // no inline, required by [replacement.functions]/3
{
std::printf("global op new called, size = %zu\n", sz);
if (sz == 0)
++sz; // avoid std::malloc(0) which may return nullptr on success
return (void *)pp;
//if (void *ptr = std::malloc(sz))
//return ptr;
throw std::bad_alloc{}; // required by [new.delete.single]/3
}
void operator delete(void* ptr) noexcept
{
std::puts("global op delete called");
//std::free(ptr);
}
So, what happened? Could any experts give me some idea? I guess it maybe caused musl libc thread lib; it seems that thread synchronization does not work fine... or is there a patch that already fixed this issue?
this issue is involved by musl libc, and has been fixed by the following patch.
https://git.musl-libc.org/cgit/musl/commit/src/malloc?id=3e16313f8fe2ed143ae0267fd79d63014c24779f
above patch involved a buffur overflow issue of realloc and has been fixed by below 2 patches:
https://git.musl-libc.org/cgit/musl/commit/src/malloc?id=cb5babdc8d624a3e3e7bea0b4e28a677a2f2fc46
https://git.musl-libc.org/cgit/musl/commit/src/malloc?id=fca7428c096066482d8c3f52450810288e27515c

custom memory allocator - segfault

me and my friend are trying to develop custom memory allocator in linux ubuntu 16.04.
We got stuck because of an error, btw its our first time
that we are trying to code something like that so we are not the best debuggers the error is : Segmentation fault (core dumped)
and here is the code.
can anybody help us understand whats wrong ?
Thank you!
#include <unistd.h>
#include <string.h>
#include <pthread.h>
#include <stdio.h>
struct header_t {
size_t size;
unsigned is_free;
struct header_t *next; };
struct header_t *head = NULL, *tail = NULL;
pthread_mutex_t global_malloc_lock;
struct header_t *get_free_block(size_t size)
{
struct header_t *curr = head;
while(curr) {
/* see if there's a free block that can accomodate requested size */
if (curr->is_free && curr->size >= size)
return curr;
curr = curr->next;
}
return NULL;
}
void free(void *block)
{
struct header_t *header, *tmp;
/* program break is the end of the
process's data segment */
void *programbreak;
if (!block)
return;
pthread_mutex_lock(&global_malloc_lock);
header = (struct header_t*)block - 1;
/* sbrk(0) gives the current program break address */
programbreak = sbrk(0);
/*
Check if the block to be freed is the last one in the
linked list. If it is, then we could shrink the size of the
heap and release memory to OS. Else, we will keep the block
but mark it as free. */
if ((char*)block + header->size == programbreak) {
if (head == tail) {
head = tail = NULL;
} else {
tmp = head;
while (tmp) {
if(tmp->next == tail) {
tmp->next = NULL;
tail = tmp;
}
tmp = tmp->next;
}
}
/* sbrk() with a negative argument decrements the program break.
So memory is released by the program to OS. */
sbrk(0 - header->size - sizeof(struct header_t));
/* Note: This lock does not really assure thread
safety, because sbrk() itself is not really
thread safe. Suppose there occurs a foregin sbrk(N)
after we find the program break and before we decrement
it, then we end up realeasing the memory obtained by
the foreign sbrk(). */
pthread_mutex_unlock(&global_malloc_lock);
return;
}
header->is_free = 1;
pthread_mutex_unlock(&global_malloc_lock);
}
void *malloc(size_t size)
{
size_t total_size;
void *block;
struct header_t *header;
if (!size)
return NULL;
pthread_mutex_lock(&global_malloc_lock);
header = get_free_block(size);
if (header) {
/* Woah, found a free block to accomodate requested memory. */
header->is_free = 0;
pthread_mutex_unlock(&global_malloc_lock);
return (void*)(header + 1);
}
/* We need to get memory to fit in the requested block and header
from OS. */
total_size = sizeof(struct header_t) + size;
block = sbrk(total_size);
if (block == (void*) -1) {
pthread_mutex_unlock(&global_malloc_lock);
return NULL;
}
header = block;
header->size = size;
header->is_free = 0;
header->next = NULL;
if (!head)
head = header;
if (tail)
tail->next = header;
tail = header;
pthread_mutex_unlock(&global_malloc_lock);
return (void*)(header + 1);
}
void *calloc(size_t num, size_t nsize)
{
size_t size;
void *block;
if (!num || !nsize)
return NULL;
size = num * nsize;
/* check mul overflow */
if (nsize != size / num)
return NULL;
block = malloc(size);
if (!block)
return NULL;
memset(block, 0, size);
return block;
}
void *realloc(void *block, size_t size)
{
struct header_t *header;
void *ret;
if (!block || !size)
return malloc(size);
header = (struct header_t*)block - 1;
if (header->size >= size)
return block;
ret = malloc(size);
if (ret) {
/* Relocate contents to the new bigger block */
memcpy(ret, block, header->size);
/* Free the old memory block */
free(block);
}
return ret;
}
The problem occurred because the functions were not prototyped [decalred].
Once I added functions prototype. The code worked.
For more information about prototyping: http://www.trytoprogram.com/c-programming/function-prototype-in-c/
mutex variable should be initialized before using it for applying lock. your global_malloc_lock is not initialized.
you can't initialize mutex variable as of normal variable.
pthread_mutex_t global_malloc_lock = 0 ;// invalid .. you may thinking since it's it declared as global it's initialized with 0 which is wrong
Initialize the mutex variable by calling pthread_mutex_init() or using PTHREAD_MUTEX_INITIALIZER ;
for your code add this
pthread_mutex_t global_malloc_lock = pthread_mutex_t global_malloc_lock;

Linux - Control Flow in a linux kernel module

I am learning to write kernel modules and in one of the examples I had to make sure that a thread executed 10 times and exits, so I wrote this according to what I have studied:
#include <linux/module.h>
#include <linux/kthread.h>
struct task_struct *ts;
int flag = 0;
int id = 10;
int function(void *data) {
int n = *(int*)data;
set_current_state(TASK_INTERRUPTIBLE);
schedule_timeout(n*HZ); // after doing this it executed infinitely and i had to reboot
while(!kthread_should_stop()) {
printk(KERN_EMERG "Ding");
}
flag = 1;
return 0;
}
int init_module (void) {
ts = kthread_run(function, (void *)&id, "spawn");
return 0;
}
void cleanup_module(void) {
if (flag==1) { return; }
else { if (ts != NULL) kthread_stop(ts);
}
return;
}
MODULE_LICENSE("GPL");
What I want to know is :
a) How to make thread execute 10 times like a loop
b) How does the control flows in these kind of processes that is if we make it to execute 10 times then does it go back and forth between function and cleanup_module or init_module or what exactly happens?
If you control kthread with kthread_stop, the kthread shouldn't exit until be ing stopped (see also that answer). So, after executing all operations, kthread should wait until stopped.
Kernel already implements kthread_worker mechanism, when kthread just executes works, added to it.
DEFINE_KTHREAD_WORKER(worker);
struct my_work
{
struct kthread_work *work; // 'Base' class
int n;
};
void do_work(struct kthread_work *work)
{
struct my_work* w = container_of(work, struct my_work, work);
printk(KERN_EMERG "Ding %d", w->n);
// And free work struct at the end
kfree(w);
}
int init_module (void) {
int i;
for(i = 0; i < 10; i++)
{
struct my_work* w = kmalloc(sizeof(struct my_work), GFP_KERNEL);
init_kthread_work(&w->work, &do_work);
w->n = i + 1;
queue_kthread_work(&worker, &w->work);
}
ts = kthread_run(&kthread_worker_fn, &worker, "spawn");
return 0;
}
void cleanup_module(void) {
kthread_stop(ts);
}

memory corruption while executing my code

# include "stdafx.h"
# include <iostream>
#include <ctype.h>
using namespace std;
class a
{
protected:
int d;
public:
virtual void assign(int A) = 0;
int get();
};
class b : a
{
char* n;
public:
b()
{
n=NULL;
}
virtual ~b()
{
delete n;
}
void assign(int A)
{
d=A;
}
void assignchar(char *c)
{
n=c;
}
int get()
{
return d;
}
char* getchart()
{
return n;
}
};
class c : b
{
b *pB;
int e;
public:
c()
{
pB=new b();
}
~c()
{
delete pB;
}
void assign(int A)
{
e=A;
pB->assign(A);
}
int get()
{
return e;
}
b* getp()
{
return pB;
}
};
int _tmain(int argc, _TCHAR* argv[])
{
c *pC=new c();
pC->assign(10);
b *p=pC->getp();
p->assignchar("a");
char *abc=p->getchart();
delete pC;
cout<<*abc<<endl;
getchar();
}
i'm a noob at c++ and was experimenting when i got to this point. I don't understand why i keep getting a memory corruption message from VS2010. I am trying to replicate a problem which is at a higher level by breaking it down into smaller bits, any help would be appreciated.
From a cursory glance, you are passing a static char array to AssignChar that cannot be deleted (ie when you type "A" into your code, its a special block of memory the compiler allocates for you).
You need to understand what assignment of a char* does (or any pointer to type). When you call n=c you are just assigning the pointer, the memory that pointer points to remains where it is. So, unless this is exactly what you meant to do, you will have 2 pointers pointing to the same block of memory.. and you need to decide which to delete (you can't delete it twice, that'd be bad).
My advice here is to start using C++, so no more char* types, use std::string instead. Using char* is C programming. Note that if you did use a std::string, and passed one to assignChars, it would copy as you expected (and there is no need to free std::string objects in your destructor, they handle all that for you).
The problem occurs when you're trying to delete pC.
When ~c() destructor calls ~b() destructor - you're trying to delete n;.
The problem is that after assignchar(), n points to a string literal which was given to it as an argument ("a").
That string is not dynamically allocated, and should not be freed, meaning you should either remove the 'delete n;' line, or give a dynamically-allocated string to assignchar() as an argument.

Why this boost thread creation does't compile?

I wrote some multithreading code using Boost thread library. I initialized two threads in the constructor using the placeholder _1 as the argument required by member function fillSample(int num). But this doesn't compile in my Visual Studio 2010. Following is the code:
#include<boost/thread.hpp>
#include<boost/thread/condition.hpp>
#include<boost/bind/placeholders.hpp>
#define SAMPLING_FREQ 250
#define MAX_NUM_SAMPLES 5*60*SAMPLING_FREQ
#define BUFFER_SIZE 8
class ECG
{
private:
int sample[BUFFER_SIZE];
int sampleIdx;
int readIdx, writeIdx;
boost::thread m_ThreadWrite;
boost::thread m_ThreadRead;
boost::mutex m_Mutex;
boost::condition bufferNotFull, bufferNotEmpty;
public:
ECG();
void fillSample(int num); //get sample from the data stream
void processSample(); //process ECG sample, return the last processed
};
ECG::ECG() : readyFlag(false), sampleIdx(0), readIdx(0), writeIdx(0)
{
m_ThreadWrite=boost::thread((boost::bind(&ECG::fillSample, this, _1)));
m_ThreadRead=boost::thread((boost::bind(&ECG::processSample, this)));
}
void ECG::fillSample(int num)
{
boost::mutex::scoped_lock lock(m_Mutex);
while( (writeIdx-readIdx)%BUFFER_SIZE == BUFFER_SIZE-1 )
{
bufferNotFull.wait(lock);
}
sample[writeIdx] = num;
writeIdx = (writeIdx+1) % BUFFER_SIZE;
bufferNotEmpty.notify_one();
}
void ECG::processSample()
{
boost::mutex::scoped_lock lock(m_Mutex);
while( readIdx == writeIdx )
{
bufferNotEmpty.wait(lock);
}
sample[readIdx] *= 2;
readIdx = (readIdx+1) % BUFFER_SIZE;
++sampleIdx;
bufferNotFull.notify_one();
}
I already included the placeholders.hpp header file but it still doesn't compile. If I replace the _1 with 0, then it will work. But this will initialize the thread function with 0, which is not what I want. Any ideas on how to make this work?
Move the creation to the initialization list:
m_ThreadWrite(boost::bind(&ECG::fillSample, this, _1)), ...
thread object is not copyable, and your compiler doesn't support its move constructor.

Resources