Local memory for each CUDA thread


I have a simple program below. My question is: where is "temp" actually stored? Is it in global or local memory? I need an array temp for each idx, so that every thread has its own array. In this case it works properly. But in my actual program, when I tried to fill temp[0] from test2, the program stopped. With 1024 threads, the kernel ran for only about 200 of them. So I am wondering whether temp is shared or not. If it is, maybe there is a collision. I also did not get any error message. Could someone please explain this?
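#include <cstdio>
#include <cstdlib>
__device__ void test2(int temp[], int idx) {
    temp[0] = idx;
    printf("%d ", temp[0]);
}
__global__ void test() {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // In-kernel malloc: every thread requests its own block.
    int *temp = (int *) malloc(100 * sizeof (int));
    test2(temp, idx);
    free(temp); // release this thread's block again
}
int main() {
    test<<<1, 1024>>>();
    cudaDeviceSynchronize(); // wait for the kernel so its printf output is flushed
    return 0;
}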

My question is that where is "temp" actually stored?
The allocation for temp is stored in a place called the device heap. It is a form of global memory. However the temp variable itself (i.e. the pointer value) is in local memory - not shared or visible to other threads.
I need array temp for each idx so that every thread has individual array temp.
You will get that, subject to caveats below. Each thread will have its own individual array, referenced by its local variable temp. Each thread will have a separate allocation for storage on the device heap.
People commonly have problems with in-kernel new or malloc. One of the main reasons is that the device heap is initially limited to 8MB, across all of your device heap allocations. So if enough threads do a new or malloc of enough allocation requests, you will run out of space.
When you run out of space, the allocation signals this by returning a zero pointer value (a NULL pointer). If you then attempt to use this NULL pointer, you will have trouble.
For debugging purposes (i.e. to prove this is happening), test the pointer for NULL (i.e. == 0) before using it. If it is NULL, don't use it (perhaps print an error message instead).
You can read more about this in the documentation or in many questions here on the SO cuda tag. If you read any of these sources, you will discover that you can increase the size of the device heap: call cudaDeviceSetLimit() with cudaLimitMallocHeapSize from host code before launching any kernel that uses in-kernel malloc.
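To get a feel for when the 8MB default runs out, the arithmetic can be sketched in plain C on the host. This is only a rough estimate under stated assumptions: the function names are made up for illustration, and the device allocator's own bookkeeping overhead is ignored, so real allocations will fail somewhat earlier than this suggests.

```c
#include <stddef.h>

/* Default device heap size quoted in the answer above: 8 MB. */
#define DEFAULT_DEVICE_HEAP_BYTES ((size_t)8 * 1024 * 1024)

/* Payload bytes requested if every thread mallocs `ints_per_thread` ints.
   Allocator bookkeeping is ignored (assumption), so real usage is higher. */
size_t heap_demand_bytes(size_t threads, size_t ints_per_thread)
{
    return threads * ints_per_thread * sizeof(int);
}

/* Returns 1 if the payload alone fits in a heap of `heap_bytes`, else 0. */
int fits_in_heap(size_t threads, size_t ints_per_thread, size_t heap_bytes)
{
    return heap_demand_bytes(threads, ints_per_thread) <= heap_bytes;
}
```

With a 4-byte int, the toy kernel above asks for 1024 * 100 * 4 = 409,600 bytes, far below 8MB, which is consistent with it working. A hypothetical kernel asking for 4096 ints per thread would need 16MB, so some of its malloc calls would return NULL unless the heap limit is raised.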

Related

Why don't multiple threads have to share a lock to call mmap like they do malloc/calloc/sbrk?

I'm working with ptmalloc, and something interesting I came across is when an arena runs out of available chunks (and the top chunk is not large enough) and has to either extend the arena using sbrk() or allocate a non-contiguous region using mmap(). What particularly stood out to me is that in order to allocate more memory using sbrk(), a lock had to be acquired before being able to call it (in addition to the lock previously obtained to be in sole possession of the current arena). However, no lock needs to be acquired before calling mmap(). I have included the specific parts of the sys_alloc() function from the malloc.c file included in the ptmalloc implementation (for reference) below:
Call to extend arena using sbrk():
if (HAVE_MORECORE && tbase == CMFAIL) { /* Try noncontiguous MORECORE */
    size_t asize = granularity_align(nb + TOP_FOOT_SIZE + SIZE_T_ONE);
    if (asize < HALF_MAX_SIZE_T) {
        char* br = CMFAIL;
        char* end = CMFAIL;
        ACQUIRE_MORECORE_LOCK(); /* LOCK */
        br = (char*)(CALL_MORECORE(asize));
        end = (char*)(CALL_MORECORE(0));
        RELEASE_MORECORE_LOCK(); /* UNLOCK */
        if (br != CMFAIL && end != CMFAIL && br < end) {
            size_t ssize = end - br;
            if (ssize > nb + TOP_FOOT_SIZE) {
                tbase = br;
                tsize = ssize;
            }
        }
    }
}
Call to extend arena using mmap():
if (HAVE_MMAP && tbase == CMFAIL) { /* Try MMAP */
    size_t req = nb + TOP_FOOT_SIZE + SIZE_T_ONE;
    size_t rsize = granularity_align(req);
    if (rsize > nb) { /* Fail if wraps around zero */
        char* mp = (char*)(CALL_MMAP(rsize));
        if (mp != CMFAIL) {
            tbase = mp;
            tsize = rsize;
            mmap_flag = IS_MMAPPED_BIT;
        }
    }
}
Any help understanding why this is able to work even with multiple threads that have the exact same memory pattern (and thus have to extend their arenas at the same time) without having to use locks (i.e., how mmap() is guaranteed to return distinct addresses, even if called simultaneously with a NULL suggested address) would be greatly appreciated.

In the code snippet using sbrk(), the call is used to increase the process-wide heap area. Two calls are issued: the first extends the heap area by asize bytes and the second gets the resulting address of the new top of the heap (the so-called program break). The heap area is shared by all the threads of the process, and the current top is a single global value for all of them. Hence, it is protected by a mutex whenever a thread modifies it (shrink/grow operations). The lock also keeps the two calls together, so that no other thread can move the break between them.
In the code snippet using mmap(), the current thread is allocating a single memory-mapped area for itself. The resulting address is only for the calling thread. Hence, no mutual exclusion is necessary from the point of view of ptmalloc's global data structures, as they are not modified. A flag IS_MMAPPED_BIT is set in the internal allocation header to tell ptmalloc that this is a memory-mapped region when it is later asked to free it. As for mmap() internals, the mutual exclusion is managed inside the kernel.
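The claim that concurrent mmap() calls hand out distinct regions without any user-space lock can be checked with a small experiment. This is a minimal sketch, assuming a POSIX system with pthreads and MAP_ANONYMOUS; the names map_region and concurrent_mmap_regions_distinct are invented for the illustration and are not part of ptmalloc.

```c
#define _DEFAULT_SOURCE
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>

#define MAP_THREADS 8
#define MAP_BYTES   ((size_t)1 << 16)

/* Each thread maps its own anonymous region; no user-space lock is taken. */
static void *map_region(void *unused)
{
    (void)unused;
    return mmap(NULL, MAP_BYTES, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); /* MAP_FAILED on error */
}

/* Starts MAP_THREADS threads that call mmap() concurrently. Returns 1 if
   every mapping succeeded and no two live mappings share a base address. */
int concurrent_mmap_regions_distinct(void)
{
    pthread_t tid[MAP_THREADS];
    void *base[MAP_THREADS];
    int started = 0, ok = 1;

    for (int i = 0; i < MAP_THREADS; i++) {
        if (pthread_create(&tid[i], NULL, map_region, NULL) != 0) {
            ok = 0;
            break;
        }
        started++;
    }
    for (int i = 0; i < started; i++) {
        if (pthread_join(tid[i], &base[i]) != 0)
            base[i] = MAP_FAILED;
        if (base[i] == MAP_FAILED)
            ok = 0;
    }
    /* All regions are still mapped here, so equal bases would be a clash. */
    for (int i = 0; i < started; i++)
        for (int j = i + 1; j < started; j++)
            if (base[i] != MAP_FAILED && base[i] == base[j])
                ok = 0;
    for (int i = 0; i < started; i++)
        if (base[i] != MAP_FAILED)
            munmap(base[i], MAP_BYTES);
    return ok;
}
```

Calling concurrent_mmap_regions_distinct() from main() should return 1: the kernel serializes updates to the process address space internally, so the caller needs no lock of its own.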

register_kprobe() returns EINVAL without additional memory on containing struct

I've written a kernel module (a character device) that registers new KProbes whenever I write to the module.
I have a structure that contains struct kprobe. When I call register_kprobe(), it returns -EINVAL. But when I add a dummy character array to the struct (possibly some other data types would work as well), the KProbe registration succeeds.
Probe Registration
struct my_struct *container = kmalloc(sizeof(*container), GFP_KERNEL);
(container->probe).addr = (kprobe_opcode_t *) kallsyms_lookup_name("my_exported_fn"); /* my_exported_fn is in code section */
(container->probe).pre_handler = Pre_Handler;
(container->probe).post_handler = Post_Handler;
register_kprobe(&container->probe);
/* Returns -EINVAL if my_struct contains only `struct kprobe`. */
Not working:
struct my_struct {
    struct kprobe probe;
};
Working:
struct my_struct {
    char dummy[512]; /* At 512, it gets consistently registered. At 256, sometimes (maybe one out of 5 - 10 times get registered) */
    struct kprobe probe;
};
Why does it need this extra bit of memory to be present in the struct?

This could be an unaligned memory access, but in this particular case (I mean your original code before the edit) I suspect that the data is not properly initialised. Namely, register_kprobe() calls the kprobe_addr() function, which in turn performs the following check:
    if ((symbol_name && addr) || (!symbol_name && !addr))
        goto invalid;
    ...
invalid:
    return ERR_PTR(-EINVAL);
So, if you initialise addr but not symbol_name, the latter can be a garbage pointer. kmalloc() doesn't zero the memory it allocates and, depending on the requested size, it takes the object from a different pool (there are separate pools for objects of different sizes). When you artificially increase the size of the struct, kmalloc() has to allocate a larger object from a different pool. Since larger chunks are requested less often, such an object is more likely to happen to contain zeros rather than garbage, which would explain why registration succeeds more often with the bigger struct.
All in all, I suggest zeroing the memory chunk or using kzalloc().
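The effect can be reproduced in user space. The sketch below is an analogue, not kernel code: struct probe_like is a hypothetical stand-in for the two struct kprobe fields involved, malloc() followed by a 0xAA fill simulates kmalloc() returning a recycled object, and calloc() plays the role of kzalloc().

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the two struct kprobe fields the check reads. */
struct probe_like {
    const char *symbol_name;
    void *addr;
};

static int hypothetical_probe_target; /* something for addr to point at */

/* Mirrors the quoted condition: exactly one of the two fields must be set. */
int probe_fields_valid(const struct probe_like *p)
{
    if ((p->symbol_name && p->addr) || (!p->symbol_name && !p->addr))
        return 0; /* the kernel would return -EINVAL here */
    return 1;
}

/* Simulates a recycled kmalloc() object: stale bytes, then only addr set.
   Returns 1 if the check passes, 0 if it fails, -1 if allocation failed. */
int check_after_dirty_alloc(void)
{
    struct probe_like *p = malloc(sizeof *p);
    if (!p)
        return -1;
    memset(p, 0xAA, sizeof *p);          /* stand-in for leftover garbage */
    p->addr = &hypothetical_probe_target;
    int ok = probe_fields_valid(p);
    free(p);
    return ok;
}

/* Same steps with zeroed memory, as kzalloc() would provide. */
int check_after_zeroed_alloc(void)
{
    struct probe_like *p = calloc(1, sizeof *p);
    if (!p)
        return -1;
    p->addr = &hypothetical_probe_target;
    int ok = probe_fields_valid(p);
    free(p);
    return ok;
}
```

The dirty allocation fails the check because the stale symbol_name is non-NULL, while the zeroed one passes. With real kmalloc() the leftover bytes are unpredictable, which fits the intermittent behaviour described in the question.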

Why is parsing Google protocol buffer messages slower with multiple threads?

I am trying to parse many Google protocol buffer messages from a binary file generated by calling SerializeToString. I first load all the bytes into heap memory allocated with new. I also have two arrays: one stores the start offset of each message in that buffer, and the other stores the byte count of each message.
Then I parse the messages by calling ParseFromString. I want to speed this up by using multiple threads.
Each thread receives the start index and end index of its slice of the offset array and the byte-count array.
In the parent, the main code is:
struct ParsePara
{
    char* str_buffer;
    size_t* buffer_offset;
    size_t* binary_string_length_array;
    size_t start_idx;
    size_t end_idx;
    Flight_Ticket_Info* ticket_info_buffer_array;
};
//Flight_Ticket_Info is class of message
//offset_size is the count of message
ticket_array = new Flight_Ticket_Info[offset_size];
const int max_thread_count = 6;
pthread_t pthread_id_vec[max_thread_count];
CTimer thread_cost;
thread_cost.start();
vector<ParsePara*> para_vec;
const size_t each_count = ceil(float(offset_size) / max_thread_count);
for (size_t k = 0;k < max_thread_count;k++)
{
    size_t start_idx = each_count * k;
    size_t end_idx = each_count * (k+1);
    if (start_idx >= offset_size)
        break;
    if (end_idx >= offset_size)
        end_idx = offset_size;
    ParsePara* cand_para_ptr = new ParsePara();
    if (!cand_para_ptr)
    {
        _ERROR_EXIT(0,"[Malloc memory fail.]");
    }
    cand_para_ptr->str_buffer = m_valdata;//heap memory for storing Bytes of message
    cand_para_ptr->buffer_offset = offset_array;//begin address of each message
    cand_para_ptr->start_idx = start_idx;
    cand_para_ptr->end_idx = end_idx;
    cand_para_ptr->ticket_info_buffer_array = ticket_array;//array to store message
    cand_para_ptr->binary_string_length_array = binary_length_array;//Bytes count of each message
    para_vec.push_back(cand_para_ptr);
}
for(size_t k = 0 ;k < para_vec.size();k++)
{
    int ret = pthread_create(&pthread_id_vec[k],NULL,parserFlightTicketForMultiThread,para_vec[k]);
    if (0 != ret)
    {
        _ERROR_EXIT(0,"[Error] [create thread fail]");
    }
}
for (size_t k = 0;k < para_vec.size();k++)
{
    pthread_join(pthread_id_vec[k],NULL);
}
In each thread the thread function is:
void* parserFlightTicketForMultiThread(void* void_para_ptr)
{
    ParsePara* para_ptr = (ParsePara*) void_para_ptr;
    parserFlightTicketForMany(para_ptr->str_buffer,para_ptr->ticket_info_buffer_array,para_ptr->buffer_offset,
        para_ptr->start_idx,para_ptr->end_idx,para_ptr->binary_string_length_array);
    return NULL; // a pthread start routine must return a value
}
void parserFlightTicketForMany(const char* str_buffer,Flight_Ticket_Info* ticket_info_buffer_array,
    size_t* buffer_offset,const size_t start_idx,const size_t end_idx,size_t* binary_string_length_array)
{
    printf("start_idx:%zu,end_idx:%zu\n",start_idx,end_idx); // %zu is the size_t format
    for (size_t k = start_idx;k < end_idx;k++)
    {
        if (k % 100000 == 0)
            cout << k << endl;
        size_t cand_offset = buffer_offset[k];
        size_t binary_length = binary_string_length_array[k];
        ticket_info_buffer_array[k].ParseFromString(string(&str_buffer[cand_offset],binary_length-1));
    }
    printf("done %zu %zu\n",start_idx,end_idx);
}
But the multi-threaded run takes longer than the single-threaded one.
One thread takes: 40455623ms
My computer has 8 cores, and six threads take: 131586865ms
Can anyone help me? Thank you!

Some possible problems -- you'll have to experiment to determine which:
Protobuf parsing speed is often limited by memory bandwidth rather than CPU time, especially with a large input data set. In that case, more threads won't help, since all the cores are sharing bandwidth to main memory. Indeed, having multiple cores fighting over memory bandwidth could make the overall operation slower. Note that the biggest consumer of memory is not the input bytes but rather the parsed data objects -- that is, the output of parsing -- which are many times larger than the encoded data. To mitigate this problem, consider writing the parsing loop so that it fully processes each message immediately after parsing, before moving on to the next message. That way, instead of allocating k protobuf objects, you only need to allocate one protobuf object per thread, and repeatedly reuse the same object for parsing. This way the object will (probably) stay in the core's private L1 cache and avoid consuming memory bandwidth; only the input bytes will be read over the main bus.
How are you loading data into RAM? Did you read() into a large array or did you mmap()? In the latter case the data is read from disk lazily -- it won't happen until you actually attempt to parse it. Even in the read() case, it could be that the data has been swapped out, creating similar effects. Either way, your threads are now not just fighting for memory bandwidth, but disk bandwidth, which is of course much slower. Having six threads reading separate parts of a big file will definitely be slower overall than having one thread read the whole file, because the operating system optimizes for sequential access.
Protobuf allocates memory during parsing. Many memory allocators take a lock while allocating new memory. Since all your threads are allocating tons and tons of objects in a tight loop, they will contend for this lock. Make sure you are using a thread-friendly memory allocator, such as Google's tcmalloc. Note that repeatedly reusing the same protobuf object in a parse-consume loop rather than allocating lots of different objects will also help immensely here, because the protobuf object will automatically reuse memory for sub-objects.
There may be a bug in your code and it might not be doing what you expect at all when multithreaded. For example, a bug might be causing all the threads to process the same data, rather than different data, and it could be that the data they're choosing happens to be bigger. Make sure you are testing that the results of your code are exactly the same when you run single-threaded vs. multi-threaded.
In short, if you want multiple cores to make your code faster, you have to think about not just what each core is doing, but what data is going in and out of each core, and how much the cores have to talk to each other. Ideally you want each core to operate all on its own without talking to anyone or anything; then you get maximum parallelism. That's not usually possible, of course, but the closer you can get to that, the better.
BTW, a random optimization for you:
ParseFromString(string(&str_buffer[cand_offset],binary_length-1))
Replace that with:
ParseFromArray(&str_buffer[cand_offset],binary_length-1)
Creating a std::string makes a copy of the data, which wastes time (and memory bandwidth). (This doesn't explain why threading is slow, though.)
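One way to rule out the "threads are processing the same data" possibility raised above is to check the index partitioning on its own, separately from protobuf. The following is a rough sketch in plain C with invented names (index_range, range_for_worker, ranges_tile_input). It uses integer ceiling division instead of the float-based ceil() in the question, which is a substitution made for the example, and it verifies that the per-worker ranges cover every index exactly once.

```c
#include <stddef.h>

/* Half-open index range [begin, end) handled by one worker. */
struct index_range {
    size_t begin;
    size_t end;
};

/* Range of worker k when n items are split across `workers` threads.
   Uses integer ceiling division; trailing workers may get an empty range. */
struct index_range range_for_worker(size_t n, size_t workers, size_t k)
{
    struct index_range r = { n, n };          /* empty range by default */
    if (workers == 0 || k >= workers)
        return r;
    size_t each = n / workers + (n % workers != 0);
    size_t begin = each * k;
    if (begin > n)
        begin = n;
    r.begin = begin;
    r.end = (n - begin < each) ? n : begin + each;
    return r;
}

/* Returns 1 if the ranges of workers 0..workers-1 tile [0, n) with no
   gap and no overlap, 0 otherwise. */
int ranges_tile_input(size_t n, size_t workers)
{
    size_t expected = 0;
    for (size_t k = 0; k < workers; k++) {
        struct index_range r = range_for_worker(n, workers, k);
        if (r.begin != expected || r.end < r.begin)
            return 0;
        expected = r.end;
    }
    return expected == n;
}
```

A check like ranges_tile_input(offset_size, 6) before starting the threads costs almost nothing. It does not replace comparing the parsed results of the single-threaded and multi-threaded runs.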

issue with copy_from_user in kernel

I'm trying to use this function to copy a buffer from user space to one in the kernel.
Both buffers were allocated. I'm using a while loop in case not all the bytes were copied on the first try. But for some reason, nothing is copied and the program is stuck in the while loop.
What could be the reasons for that?
void my_copy_from_user(const char* source_buff, char* dest_buff, int size_to_copy){
    int not_copied = size_to_copy;
    int left = size_to_copy;
    while( not_copied ){
        not_copied = copy_from_user(dest_buff, source_buff, left);
        dest_buff += (left - not_copied);
        source_buff += (left - not_copied);
        left = not_copied;
    }
}

It is possible that it is legitimately failing for reasons that you cannot recover from.
Please look at: http://lxr.free-electrons.com/source/arch/x86/lib/usercopy_32.c#L681
unsigned long _copy_from_user(void *to, const void __user *from, unsigned n)
{
    if (access_ok(VERIFY_READ, from, n))
        n = __copy_from_user(to, from, n);
    else
        memset(to, 0, n);
    return n;
}
This is the underlying implementation for copy_from_user for Linux on x86 processors. It first checks access_ok. If access is not allowed, it will fail and return with n (the number of bytes you requested to copy) immediately. This would cause an infinite loop.
Two points:
I do not think you should invoke copy_from_user in a loop like that. If it fails to copy in kernel mode, there is a reason why. This is a different beast from read() functions when reading from sockets, etc, where you are encouraged to read() in a loop.
Are you sure that you are passing in the correct dest_buff to copy_from_user?
Tips:
Printk all the values and see what's happening. Is left being changed or not? It is likely not.
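The usual alternative to retrying is to treat any non-zero return as a failure and bail out with -EFAULT. The sketch below shows that shape in user space so it can be compiled anywhere: stub_copy is a hypothetical stand-in that only imitates the return convention of copy_from_user() (the number of bytes NOT copied), and its `readable` flag simulates the access_ok() outcome.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space stand-in for copy_from_user(): returns the number
   of bytes that could NOT be copied. `readable` simulates access_ok(). */
static size_t stub_copy(void *to, const void *from, size_t n, int readable)
{
    if (!readable) {
        memset(to, 0, n);   /* same zero-fill as the kernel code above */
        return n;           /* nothing copied */
    }
    memcpy(to, from, n);
    return 0;
}

/* Single attempt, no retry loop: 0 on success, -EFAULT if any byte is left. */
int copy_or_fail(char *dest, const char *src, size_t n, int readable)
{
    if (stub_copy(dest, src, n, readable) != 0)
        return -EFAULT;
    return 0;
}
```

In a real driver the body would reduce to a single `if (copy_from_user(dest, src, n)) return -EFAULT;`, and the error would propagate to the caller. A bad user pointer then yields an error code instead of an endless loop.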

Segmentation Fault With Multiple Threads

I get a segmentation fault because of the free() at the end of this function.
Don't I have to free the temporary variable *stck? Or, since it's a local pointer that was never assigned memory via malloc, does the compiler clean it up for me?
void * push(void * _stck)
{
    stack * stck = (stack*)_stck;//temp stack
    int task_per_thread = 0; //number of push per thread
    pthread_mutex_lock(stck->mutex);
    while(stck->head == MAX_STACK -1 )
    {
        pthread_cond_wait(stck->has_space,stck->mutex);
    }
    while(task_per_thread <= (MAX_STACK/MAX_THREADS)&&
          (stck->head < MAX_STACK) &&
          (stck->item < MAX_STACK)//this is the amount of pushes
                                  //we want to execute
         )
    { //store actual value into stack
        stck->list[stck->head]=stck->item+1;
        stck->head = stck->head + 1;
        stck->item = stck->item + 1;
        task_per_thread = task_per_thread+1;
    }
    pthread_mutex_unlock(stck->mutex);
    pthread_cond_signal(stck->has_element);
    free(stck);
    return NULL;
}

Edit: You totally changed the question so my old answer doesn't really make sense anymore. I'll try to answer the new one (old answer still below) but for reference, next time please just ask a new question instead of changing an old one.
stck is a pointer that you set to point to the same memory as _stck points to. A pointer does not imply allocating memory, it just points to memory that is already (hopefully) allocated. When you do for example
char* a = malloc(10); // Allocate memory and save the pointer in a.
char* b = a; // Just make b point to the same memory block too.
free(a); // Free the malloc'd memory block.
free(b); // Free the same memory block again.
you free the same memory twice.
-- old answer
In push, you're setting stck to point to the same memory block as _stck, and at the end of the call you free stck (thereby calling free() on your common stack once from each thread).
Remove the free() call and, at least for me, it does not crash anymore. Deallocating the stack should probably be done in main() after joining all the threads.
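The ownership rule suggested here (workers only borrow the pointer, and the creator frees it once after joining) might look like the sketch below. It is a simplified illustration rather than the asker's program: struct shared_counter, bump and run_workers_free_once are invented names, and a plain counter stands in for the stack.

```c
#include <pthread.h>
#include <stdlib.h>

#define WORKERS 4

/* Hypothetical shared state, much simpler than the stack in the question. */
struct shared_counter {
    pthread_mutex_t mutex;
    int value;
};

/* Worker: borrows the shared pointer and must NOT free it. */
static void *bump(void *arg)
{
    struct shared_counter *c = arg;
    pthread_mutex_lock(&c->mutex);
    c->value++;
    pthread_mutex_unlock(&c->mutex);
    return NULL;
}

/* Allocates once, runs the workers, frees once after every join.
   Returns the final counter value, or -1 on any failure. */
int run_workers_free_once(void)
{
    struct shared_counter *c = malloc(sizeof *c);
    if (!c)
        return -1;
    if (pthread_mutex_init(&c->mutex, NULL) != 0) {
        free(c);
        return -1;
    }
    c->value = 0;

    pthread_t tid[WORKERS];
    int started = 0;
    for (int i = 0; i < WORKERS; i++) {
        if (pthread_create(&tid[i], NULL, bump, c) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);

    int result = (started == WORKERS) ? c->value : -1;
    pthread_mutex_destroy(&c->mutex);
    free(c);    /* the single free, done by the owner after all joins */
    return result;
}
```

Because the block is released only after every pthread_join() has returned, no worker can touch freed memory and free() runs exactly once.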
