Segmentation Fault With Multiple Threads - multithreading

I get a segmentation fault because of the free() at the end of this function.
Don't I have to free the temporary variable *stck? Or, since it's a local pointer that
was never assigned memory via malloc, does the compiler clean it up for me?
void * push(void * _stck)
{
    stack * stck = (stack*)_stck; //temp stack
    int task_per_thread = 0; //number of push per thread
    pthread_mutex_lock(stck->mutex);
    while(stck->head == MAX_STACK -1 )
    {
        pthread_cond_wait(stck->has_space,stck->mutex);
    }
    while(task_per_thread <= (MAX_STACK/MAX_THREADS)&&
          (stck->head < MAX_STACK) &&
          (stck->item < MAX_STACK) //this is the amount of pushes
                                   //we want to execute
         )
    { //store actual value into stack
        stck->list[stck->head]=stck->item+1;
        stck->head = stck->head + 1;
        stck->item = stck->item + 1;
        task_per_thread = task_per_thread+1;
    }
    pthread_mutex_unlock(stck->mutex);
    pthread_cond_signal(stck->has_element);
    free(stck);
    return NULL;
}

Edit: You totally changed the question so my old answer doesn't really make sense anymore. I'll try to answer the new one (old answer still below) but for reference, next time please just ask a new question instead of changing an old one.
stck is a pointer that you set to point to the same memory as _stck points to. A pointer does not imply allocating memory, it just points to memory that is already (hopefully) allocated. When you do for example
char* a = malloc(10); // Allocate memory and save the pointer in a.
char* b = a; // Just make b point to the same memory block too.
free(a); // Free the malloc'd memory block.
free(b); // Free the same memory block again.
you free the same memory twice.
-- old answer
In push, you're setting stck to point to the same memory block as _stck, and at the end of the call you free stck (thereby calling free() on your common stack once from each thread).
Remove the free() call and, at least for me, it does not crash anymore. Deallocating the stack should probably be done in main() after joining all the threads.
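To make the "free it in main() after joining" idea concrete, here is a minimal sketch. It assumes a made-up shared_stack struct and made-up names (push_worker, run_workers, NUM_THREADS, PUSHES_PER_THREAD), not the struct from the question. The point is only that the workers alias the pointer and never free it, while the code that called malloc frees it exactly once after every thread has been joined.

```c
#include <pthread.h>
#include <stdlib.h>

#define NUM_THREADS 4          /* hypothetical thread count */
#define PUSHES_PER_THREAD 100  /* hypothetical work per thread */

/* Hypothetical shared state, standing in for the question's stack struct. */
typedef struct {
    pthread_mutex_t mutex;
    int item;                  /* number of pushes performed so far */
} shared_stack;

/* Worker: uses the shared block but never frees it. */
static void *push_worker(void *arg)
{
    shared_stack *stck = arg;  /* alias of the caller's pointer, no new allocation */
    for (int i = 0; i < PUSHES_PER_THREAD; i++) {
        pthread_mutex_lock(&stck->mutex);
        stck->item++;
        pthread_mutex_unlock(&stck->mutex);
    }
    return NULL;               /* no free() here */
}

/* Allocates once, runs the workers, joins them, frees once.
   Returns the final counter, or -1 if allocation or thread creation failed. */
int run_workers(void)
{
    shared_stack *stck = malloc(sizeof *stck);
    if (stck == NULL)
        return -1;
    pthread_mutex_init(&stck->mutex, NULL);
    stck->item = 0;

    pthread_t threads[NUM_THREADS];
    int started = 0;
    for (; started < NUM_THREADS; started++) {
        if (pthread_create(&threads[started], NULL, push_worker, stck) != 0)
            break;
    }
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);

    int result = (started == NUM_THREADS) ? stck->item : -1;
    pthread_mutex_destroy(&stck->mutex);
    free(stck);                /* the only free(), after every thread is done */
    return result;
}
```

Calling run_workers() from main() should give NUM_THREADS * PUSHES_PER_THREAD, and the block is released exactly once no matter how many threads used it.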

Related

Why don't multiple threads have to share a lock to call mmap like they do malloc/calloc/sbrk?

I'm working with ptmalloc, and something interesting I came across is when an arena runs out of available chunks (and the top chunk is not large enough) and has to either extend the arena using sbrk() or allocate a non-contiguous region using mmap(). What particularly stood out to me is that in order to allocate more memory using sbrk(), a lock had to be acquired before being able to call it (in addition to the lock previously obtained to be in sole possession of the current arena). However, no lock needs to be acquired before calling mmap(). I have included the specific parts of the sys_alloc() function from the malloc.c file included in the ptmalloc implementation (for reference) below:
Call to extend arena using sbrk():
if (HAVE_MORECORE && tbase == CMFAIL) { /* Try noncontiguous MORECORE */
    size_t asize = granularity_align(nb + TOP_FOOT_SIZE + SIZE_T_ONE);
    if (asize < HALF_MAX_SIZE_T) {
        char* br = CMFAIL;
        char* end = CMFAIL;
        ACQUIRE_MORECORE_LOCK(); /* LOCK */
        br = (char*)(CALL_MORECORE(asize));
        end = (char*)(CALL_MORECORE(0));
        RELEASE_MORECORE_LOCK(); /* UNLOCK */
        if (br != CMFAIL && end != CMFAIL && br < end) {
            size_t ssize = end - br;
            if (ssize > nb + TOP_FOOT_SIZE) {
                tbase = br;
                tsize = ssize;
            }
        }
    }
}
Call to extend arena using mmap():
if (HAVE_MMAP && tbase == CMFAIL) { /* Try MMAP */
    size_t req = nb + TOP_FOOT_SIZE + SIZE_T_ONE;
    size_t rsize = granularity_align(req);
    if (rsize > nb) { /* Fail if wraps around zero */
        char* mp = (char*)(CALL_MMAP(rsize));
        if (mp != CMFAIL) {
            tbase = mp;
            tsize = rsize;
            mmap_flag = IS_MMAPPED_BIT;
        }
    }
}
Any help understanding why this is able to work even with multiple threads that have the exact same memory pattern (and thus have to extend their arenas at the same time) without having to use locks (i.e., how mmap() is guaranteed to return distinct addresses, even if called simultaneously with a NULL suggested address) would be greatly appreciated.
The snippet using sbrk() grows the process-wide heap area. Two calls are issued: the first extends the heap by asize bytes, and the second returns the resulting address of the new top of the heap (the so-called program break). The heap area is shared by all the threads of the process, and its current top is global to all of them. Hence it is protected by a mutex whenever a thread modifies it (shrink/grow operations); holding the lock across both calls also keeps another thread from moving the break between them.
In the code snippet using mmap(), the current thread is allocating a single memory mapped area for itself. The resulting address is only for the calling thread. Hence, no mutual exclusion is necessary from the ptmalloc global data structures point of view as the latter are not modified. A flag IS_MMAPPED_BIT is set in the internal allocated header to indicate to ptmalloc that this is a memory mapped region when it is requested to free it. Concerning mmap() internals, the mutual exclusion is managed inside the kernel.
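If you want to see this for yourself, here is a small sketch, assuming a Linux-like system with MAP_ANONYMOUS. The names map_region, mappings_are_distinct, NUM_MAPPERS and REGION_SIZE are made up for illustration. Several threads call mmap() with a NULL hint and no user-space lock, and the caller then checks that no two of the live mappings share a start address.

```c
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS under strict -std= modes */
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>

#define NUM_MAPPERS 8            /* hypothetical thread count */
#define REGION_SIZE (64 * 1024)  /* hypothetical mapping size */

/* Each thread maps one anonymous region with a NULL address hint. */
static void *map_region(void *arg)
{
    (void)arg;
    return mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);  /* MAP_FAILED on error */
}

/* Returns 1 if every thread got a successful mapping and all start
   addresses differ, 0 otherwise. No user-space lock is taken. */
int mappings_are_distinct(void)
{
    pthread_t threads[NUM_MAPPERS];
    void *regions[NUM_MAPPERS];
    int ok = 1;
    int started = 0;

    for (; started < NUM_MAPPERS; started++) {
        if (pthread_create(&threads[started], NULL, map_region, NULL) != 0) {
            ok = 0;
            break;
        }
    }
    for (int i = 0; i < started; i++) {
        pthread_join(threads[i], &regions[i]);
        if (regions[i] == MAP_FAILED)
            ok = 0;
    }
    /* All regions are still mapped here, so any equal pair would be a clash. */
    for (int i = 0; i < started; i++)
        for (int j = i + 1; j < started; j++)
            if (regions[i] == regions[j])
                ok = 0;
    for (int i = 0; i < started; i++)
        if (regions[i] != MAP_FAILED)
            munmap(regions[i], REGION_SIZE);
    return ok;
}
```

This only demonstrates the observable behavior. The serialization that makes it safe happens inside the kernel, as described above.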

Local memory for each CUDA thread

I have a simple program below. My question is: where is "temp" actually stored? Is it in global or local memory? I need an array temp for each idx so that every thread has its own array. In this case it works properly. But in my actual program, when I tried to fill temp[0] from test2, the program stopped. With 1024 threads, the kernel only ran for around 200 of them. So I am wondering whether temp is shared or not; if it is, maybe there is a collision. I also did not get any error message. Could someone please explain this?
__device__ void test2(int temp[], int idx) {
    temp[0] = idx;
    printf("%d ", temp[0]);
}
__global__ void test() {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int *temp = (int *) malloc(100 * sizeof (int));
    test2(temp, idx);
}
int main() {
    test<<<1, 1024>>>();
    return 0;
}
My question is that where is "temp" actually stored?
The allocation for temp is stored in a place called the device heap. It is a form of global memory. However the temp variable itself (i.e. the pointer value) is in local memory - not shared or visible to other threads.
I need array temp for each idx so that every thread has individual array temp.
You will get that, subject to caveats below. Each thread will have its own individual array, referenced by its local variable temp. Each thread will have a separate allocation for storage on the device heap.
People commonly have problems with in-kernel new or malloc. One of the main reasons is that the device heap is initially limited to 8MB, across all of your device heap allocations. So if enough threads do a new or malloc of enough allocation requests, you will run out of space.
When you run out of space, the API way to signal that is to return a zero pointer value for the allocation (a NULL pointer). If you then attempt to use this NULL pointer, you will have trouble.
For debugging purposes (i.e. to prove this is happening), test the pointer for NULL (i.e. == 0) before using it. If it is NULL, don't use it (perhaps print an error message instead).
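You can read more about this in the documentation or in many questions here on the SO cuda tag. If you read any of these sources, you will discover that you can increase the size of the device heap (cudaDeviceSetLimit with cudaLimitMallocHeapSize, called before launching any kernel that uses in-kernel malloc).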

How does linux know when to allocate more pages to a call stack?

Given the program below, segfault() will (as the name suggests) segfault the program by accessing 256k below the stack. nofault(), however, gradually pushes below the stack all the way to 1m below, but never segfaults.
Additionally, running segfault() after nofault() doesn't result in an error either.
If I put sleep()s in nofault() and use the time to cat /proc/$pid/maps, I see that the allocated stack space grows between the first and second call. This explains why segfault() doesn't crash afterwards: there's plenty of memory.
But the disassembly shows there's no change to %rsp. This makes sense since that would screw up the call stack.
I presumed that the maximum stack size would be baked into the binary at compile time (In retrospect that would be very hard for a compiler to do) or that it would just periodically check %rsp and add a buffer after that.
How does the kernel know when to increase the stack memory?
#include <stdio.h>
#include <unistd.h>
void segfault(){
    char * x;
    int a;
    for( x = (char *)&x-1024*256; x<(char *)(&x+1); x++){
        a = *x & 0xFF;
        printf("%p = 0x%02x\n",x,a);
    }
}
void nofault(){
    char * x;
    int a;
    sleep(20);
    for( x = (char *)(&x); x>(char *)&x-1024*1024; x--){
        a = *x & 0xFF;
        printf("%p = 0x%02x\n",x,a);
    }
    sleep(20);
}
int main(){
    nofault();
    segfault();
}
The processor raises a page fault when you access an unmapped page. The kernel's page fault handler checks whether the address is reasonably close to the process's %rsp and if so, it allocates some memory and resumes the process. If you are too far below %rsp, the kernel passes the fault along to the process as a signal.
I tried to find the precise definition of what addresses are close enough to %rsp to trigger stack growth, and came up with this from arch/x86/mm/fault.c in the Linux source:
/*
 * Accessing the stack below %sp is always a bug.
 * The large cushion allows instructions like enter
 * and pusha to work. ("enter $65535, $31" pushes
 * 32 pointers and then decrements %sp by 65535.)
 */
if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
    bad_area(regs, error_code, address);
    return;
}
But experimenting with your program I found that 65536+32*sizeof(unsigned long) isn't the actual cutoff point between segfault and no segfault. It seems to be about twice that value. So I'll just stick with the vague "reasonably close" as my official answer.
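For reference, the arithmetic of the quoted check can be written out as a standalone sketch. The names too_far_below_sp and STACK_CUSHION are made up here; this mirrors only the comparison in the snippet above, not the whole fault handler, so it will not reproduce the larger cutoff I saw when experimenting.

```c
#include <stdbool.h>
#include <stdint.h>

/* Cushion from the quoted kernel snippet: 64 KiB plus 32 pointer-sized slots. */
#define STACK_CUSHION (65536u + 32u * sizeof(unsigned long))

/* Hypothetical helper: true when the faulting address is so far below the
   stack pointer that the quoted check would treat the access as a bad area. */
bool too_far_below_sp(uintptr_t address, uintptr_t sp)
{
    return address + STACK_CUSHION < sp;
}
```

By this check alone, a read 256k below the stack pointer (as in segfault()) is rejected, while the one-byte-at-a-time walk in nofault() never gets more than a page or so ahead of what is already mapped.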

Memory release while reassign char * to null

I'm a little bit confused regarding string memory usage in C++.
Is it OK to assign NULL to *PChar before assigning to it a second time? Will the memory of the string first assigned to PChar be released?
char * fnc(int g)
{
...
}
char *PChar = NULL;
PChar=fnc(1);
if (PChar) { sprintf(s,"%s",PChar); } ;
*PChar = NULL;
PChar=fnc(2);
if (PChar) { sprintf(s,"%s",PChar); } ;
First things first. The following statement is not what you intend:
*PChar = NULL;
PChar=fnc(2);
You are NOT assigning null to the pointer; you are writing the value zero (0) to the first character of the buffer it points to. You probably meant:
PChar = NULL;
PChar=fnc(2);
As good programming practice, yes, you should set a pointer to null after it is used (and possibly after its memory has been deallocated). But assigning null to a pointer will not free the memory: the pointer simply stops pointing to the allocated block, which stays allocated. You need to call delete if it was allocated using new, or free if it was allocated by malloc.
As for the given code, the compiler would likely remove the first assignment anyway as part of optimization, since the value is overwritten immediately:
// PChar = NULL;
PChar=fnc(2);
You need to be very careful when using pointers, and when assigning them statically allocated data or a dynamically allocated buffer!
I would suggest declaring a char buffer and passing a pointer to that buffer in the function call.
Good programming practice also calls for passing the allowed length of the buffer, which should be checked in the function.
#define MAX_PCHAR_LEN 1024 // or a constant: const DWORD . . .
char PCharbuf[MAX_PCHAR_LEN] = {0}; // initialize array with 0s
// make the call
fnc(PCharbuf, MAX_PCHAR_LEN, 2); // whatever 2 means
This way you do not have to worry about who allocates and who releases the memory, since release is automatic when PCharbuf goes out of scope.
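Here is a minimal sketch of what such a function could look like on the callee side. The name fnc_into and the "item N" text it writes are made up, since the body of fnc() was not shown. The caller owns the buffer, and the function checks the length before writing.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical stand-in for fnc(): writes "item <g>" into the caller's buffer.
   Returns 0 on success, -1 if the buffer is missing or too small. */
int fnc_into(char *buf, size_t len, int g)
{
    if (buf == NULL || len == 0)
        return -1;
    int n = snprintf(buf, len, "item %d", g);
    if (n < 0 || (size_t)n >= len) {
        buf[0] = '\0';   /* leave an empty string rather than a truncated one */
        return -1;
    }
    return 0;
}
```

A caller would declare char buf[MAX_PCHAR_LEN] and call fnc_into(buf, sizeof buf, 2). Nothing is allocated with new or malloc, so there is nothing to delete or free.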

How to use lua_pop() function correctly?

Can anyone please tell me how to use the lua_pop() function correctly in C++?
Should I call it whenever I use a lua_get*() function? For example:
lua_getglobal(L, "something");
lua_pop(L, 1);
Or how should it be used? Will the garbage collector clear that stuff after the threshold? Thanks.
You call lua_pop() to remove items from the Lua stack. For simple functions, this can be entirely unnecessary since the core will clean up the stack as part of handling the return values.
For more complex functions, and especially for C code that is calling into Lua, you will often need to pop things from the stack to prevent the stack from growing indefinitely.
The lua_getglobal() function adds one item to the stack when called, which is either nil if the global doesn't exist or the value of the named global variable. Having a copy of that value on the stack protects it from the garbage collector as long as it is there. That value needs to remain on the stack as long as it is in use by the C code that retrieved it, because if the global were modified, the copy on the stack might be the only remaining reference.
So the general patterns for using a global are something like these:
void doMyEvent(lua_State *L) {
    lua_getglobal(L, "MyEvent");
    lua_call(L, 0, 0); /* pops the function and 0 parameters, pushes 0 results */
}
double getGlobalDouble(lua_State *L, const char *name) {
    double d;
    lua_getglobal(L,name);
    d = lua_tonumber(L,-1); /* reads the value at the top of the stack, leaves stack unchanged */
    lua_pop(L,1); /* pop the value to leave stack balanced */
    return d;
}
char *copyGlobalString(lua_State *L, const char *name) {
    char *s = NULL;
    lua_getglobal(L,name);
    if (!lua_isnil(L,-1))
        s = strdup(lua_tostring(L,-1));
    lua_pop(L,1);
    return s;
}
In the last example, I am careful to copy the content of the string because the pointer returned by lua_tostring() is only guaranteed to be valid as long as the value remains on the stack. This means that a caller of copyGlobalString() is responsible for calling free() later.
Note too that recent editions of the Lua manual include a notation along with each function that identifies the number of stack entries consumed, and the number pushed. This helps avoid unexpected stack growth.
