Is it possible to store pointers in shared memory without using offsets?

When using shared memory, each process may mmap the shared region into a different area of its respective address space. This means that pointers stored within the shared region must be stored as offsets from the start of the region. Unfortunately, this complicates the use of atomic instructions (e.g. if you're trying to write a lock-free algorithm).

For example, say you have a bunch of reference-counted nodes in shared memory, created by a single writer. The writer periodically, atomically updates a pointer 'p' to point to a valid node with a positive reference count. Readers want to atomically increment the reference count of the node 'p' points to: since 'p' points to the beginning of a node (a struct) whose first element is the reference count, and 'p' always points to a valid node, incrementing the count is safe, and makes it safe to dereference 'p' and access other members. However, this all only works when everything is in the same address space. If the nodes and the pointer 'p' are stored in shared memory, then clients suffer a race condition:
1. x = read p
2. y = x + offset
3. increment refcount at y
During step 2, p may change and x may no longer point to a valid node. The only workaround I can think of is somehow forcing all processes to agree on where to map the shared memory, so that real pointers rather than offsets can be stored in the mmap'd region. Is there any way to do that? I see MAP_FIXED in the mmap documentation, but I don't know how I could pick an address that would be safe.
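For concreteness, here is a minimal sketch of the layout described above (all names are invented for illustration); the final comment marks where the window for the race sits:

#include <stdatomic.h>
#include <stdint.h>

struct node {
    atomic_int refcount;  /* first member, so a node pointer points at it directly */
    /* ... other members ... */
};

/* 'p' lives in shared memory, so it is stored as an offset from the region base. */
static struct node *acquire(char *base, _Atomic uint64_t *p)
{
    uint64_t x = atomic_load(p);                 /* 1. x = read p     */
    struct node *y = (struct node *)(base + x);  /* 2. y = x + offset */
    atomic_fetch_add(&y->refcount, 1);           /* 3. increment at y */
    return y;  /* p may have changed between steps 1 and 3 */
}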
Edit: Using inline assembly and the 'lock' prefix on x86, maybe it's possible to build an "increment pointer X at offset Y by value Z" operation? Are there equivalent options on other architectures? I haven't written a lot of assembly, so I don't know if the needed instructions exist.

At a low level, an x86 atomic instruction can do all three of these steps at once:
1. x = read p
2. y = x + offset
3. increment refcount at y
    mov  edi, Destination      ; base address (e.g. start of the shared region)
    mov  edx, DataOffset       ; offset of the refcount within the region
Repeat:
    mov  eax, [edi + edx]      ; load OldData (the current refcount)
    lea  ecx, [eax + 1]        ; NewData = OldData + 1
    lock cmpxchg dword ptr [edi + edx], ecx  ; store NewData only if [edi+edx] still holds OldData
    jnz  Repeat                ; another thread intervened; retry
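In C you can get the same effect without inline assembly. A minimal sketch using C11 atomics (the function name is invented); for a plain increment, a fetch-add suffices and the compiler emits the lock-prefixed instruction for you:

#include <stdatomic.h>
#include <stdint.h>

/* Atomically increment the 32-bit refcount found at base + offset. */
static void increment_at_offset(char *base, uint64_t offset)
{
    atomic_int *refcount = (atomic_int *)(base + offset);
    atomic_fetch_add(refcount, 1);  /* typically a lock add/xadd on x86 */
}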

This is trivial on a UNIX system; just use the shared memory functions:
shmget, shmat, shmctl, shmdt
void *shmat(int shmid, const void *shmaddr, int shmflg);
shmat() attaches the shared memory segment identified by shmid to the address space of the calling process. The attaching address is specified by shmaddr with one of the following criteria:
If shmaddr is NULL, the system chooses a suitable (unused) address at which to attach the segment.
Just specify your own address here; e.g. 0x20000000000
If you shmget() using the same key and size in every process, you will get the same shared memory segment. If you shmat() at the same address, the virtual addresses will be the same in all processes. The kernel doesn't care what address range you use, as long as it doesn't conflict with wherever it normally assigns things. (If you leave out the address, you can see the general region where it likes to put things; also, check addresses on the stack and those returned from malloc() / new[].)
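As a minimal sketch of that recipe (the key and the fixed address are arbitrary example values that every participating process must agree on):

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define SHM_KEY   0x1234                      /* same key in every process */
#define SHM_SIZE  (1024 * 1024)               /* same size in every process */
#define SHM_ADDR  ((void *)0x20000000000ULL)  /* agreed-upon fixed address */

int main(void)
{
    int shmid = shmget(SHM_KEY, SHM_SIZE, IPC_CREAT | 0600);
    if (shmid == -1) { perror("shmget"); return 1; }

    /* Attach at the same address everywhere, so raw pointers stored
       inside the segment are valid in every process. */
    void *base = shmat(shmid, SHM_ADDR, 0);
    if (base == (void *)-1) { perror("shmat"); return 1; }

    printf("attached at %p\n", base);
    shmdt(base);
    return 0;
}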
On Linux, make sure root sets SHMMAX in /proc/sys/kernel/shmmax to a large enough number to accommodate your shared memory segments (default is 32MB).
As for atomic operations, you can get them all from the Linux kernel source, e.g.
include/asm-x86/atomic_64.h
/*
 * Make sure gcc doesn't try to be clever and move things around
 * on us. We need to use _exactly_ the address the user gave us,
 * not some alias that contains the same information.
 */
typedef struct {
    int counter;
} atomic_t;

/**
 * atomic_read - read atomic variable
 * @v: pointer of type atomic_t
 *
 * Atomically reads the value of @v.
 */
#define atomic_read(v) ((v)->counter)

/**
 * atomic_set - set atomic variable
 * @v: pointer of type atomic_t
 * @i: required value
 *
 * Atomically sets the value of @v to @i.
 */
#define atomic_set(v, i) (((v)->counter) = (i))

/**
 * atomic_add - add integer to atomic variable
 * @i: integer value to add
 * @v: pointer of type atomic_t
 *
 * Atomically adds @i to @v.
 */
static inline void atomic_add(int i, atomic_t *v)
{
    asm volatile(LOCK_PREFIX "addl %1,%0"
                 : "=m" (v->counter)
                 : "ir" (i), "m" (v->counter));
}
64-bit version:
typedef struct {
    long counter;
} atomic64_t;

/**
 * atomic64_add - add integer to atomic64 variable
 * @i: integer value to add
 * @v: pointer to type atomic64_t
 *
 * Atomically adds @i to @v.
 */
static inline void atomic64_add(long i, atomic64_t *v)
{
    asm volatile(LOCK_PREFIX "addq %1,%0"
                 : "=m" (v->counter)
                 : "er" (i), "m" (v->counter));
}
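In userspace you don't need to copy the kernel's headers; C11 atomics produce the same lock-prefixed instructions. A minimal sketch (the type and function names here are invented):

#include <stdatomic.h>

/* Userspace counterpart of the kernel's atomic_t / atomic_add(). */
typedef struct {
    atomic_int counter;
} my_atomic_t;

static inline void my_atomic_add(int i, my_atomic_t *v)
{
    atomic_fetch_add(&v->counter, i);  /* typically a lock addl/xadd on x86 */
}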

We have code that's similar to your problem description. We use a memory-mapped file, offsets, and file locking. We haven't found an alternative.

You shouldn't be afraid to make up an address at random, because the kernel will just reject addresses it doesn't like (ones that conflict). See my shmat() answer above, which uses 0x20000000000.
With mmap:
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
If addr is not NULL, then the kernel takes it as a hint about where to place the mapping; on Linux, the mapping will be created at the next higher page boundary. The address of the new mapping is returned as the result of the call.
The flags argument determines whether updates to the mapping are visible to other processes mapping the same region, and whether updates are carried through to the underlying file. This behavior is determined by including exactly one of the following values in flags:
MAP_SHARED Share this mapping. Updates to the mapping are visible to other processes that map this file, and are carried through to the underlying file. The file may not actually be updated until msync(2) or munmap() is called.
ERRORS
EINVAL We don't like addr, length, or offset (e.g., they are too large, or not aligned on a page boundary).
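Since the address is only a hint without MAP_FIXED, a cautious sketch is to pass the agreed-upon address and verify what comes back (the file path and address here are arbitrary examples):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define REGION_ADDR ((void *)0x20000000000ULL)  /* address all processes agree on */
#define REGION_SIZE (1024 * 1024)

int main(void)
{
    int fd = open("/dev/shm/example-region", O_RDWR | O_CREAT, 0600);
    if (fd == -1) { perror("open"); return 1; }
    if (ftruncate(fd, REGION_SIZE) == -1) { perror("ftruncate"); return 1; }

    /* A hint, not MAP_FIXED: the kernel may place the mapping elsewhere,
       so check the result before storing raw pointers in the region. */
    void *base = mmap(REGION_ADDR, REGION_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }
    if (base != REGION_ADDR) {
        fprintf(stderr, "mapped at %p, not %p; fall back to offsets\n",
                base, REGION_ADDR);
        munmap(base, REGION_SIZE);
        return 1;
    }
    printf("mapped at agreed address %p\n", base);
    return 0;
}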

Adding the offset to the pointer does not create the potential for a race; it already exists. Since neither x86 nor ARM can atomically read a pointer and then access the memory it refers to, you need to protect the pointer access with a lock regardless of whether you add an offset.

Related

Why don't multiple threads have to share a lock to call mmap like they do malloc/calloc/sbrk?

I'm working with ptmalloc, and something interesting I came across is when an arena runs out of available chunks (and the top chunk is not large enough) and has to either extend the arena using sbrk() or allocate a non-contiguous region using mmap(). What particularly stood out to me is that in order to allocate more memory using sbrk(), a lock had to be acquired before being able to call it (in addition to the lock previously obtained to gain sole possession of the current arena). However, no lock needs to be acquired before calling mmap(). I have included the relevant parts of the sys_alloc() function from ptmalloc's malloc.c below for reference:
Call to extend arena using sbrk():
if (HAVE_MORECORE && tbase == CMFAIL) { /* Try noncontiguous MORECORE */
    size_t asize = granularity_align(nb + TOP_FOOT_SIZE + SIZE_T_ONE);
    if (asize < HALF_MAX_SIZE_T) {
        char* br = CMFAIL;
        char* end = CMFAIL;
        ACQUIRE_MORECORE_LOCK(); /* LOCK */
        br = (char*)(CALL_MORECORE(asize));
        end = (char*)(CALL_MORECORE(0));
        RELEASE_MORECORE_LOCK(); /* UNLOCK */
        if (br != CMFAIL && end != CMFAIL && br < end) {
            size_t ssize = end - br;
            if (ssize > nb + TOP_FOOT_SIZE) {
                tbase = br;
                tsize = ssize;
            }
        }
    }
}
Call to extend arena using mmap():
if (HAVE_MMAP && tbase == CMFAIL) { /* Try MMAP */
    size_t req = nb + TOP_FOOT_SIZE + SIZE_T_ONE;
    size_t rsize = granularity_align(req);
    if (rsize > nb) { /* Fail if wraps around zero */
        char* mp = (char*)(CALL_MMAP(rsize));
        if (mp != CMFAIL) {
            tbase = mp;
            tsize = rsize;
            mmap_flag = IS_MMAPPED_BIT;
        }
    }
}
Any help understanding why this is able to work even with multiple threads that have the exact same memory pattern (and thus have to extend their arenas at the same time) without having to use locks (i.e., how mmap() is guaranteed to return distinct addresses, even if called simultaneously with a NULL suggested address) would be greatly appreciated.
In the code snippet using sbrk(): sbrk() is used to grow the process's global heap area. Two calls are issued: the first extends the heap area by asize bytes, and the second gets the resulting address of the new top of the heap (the so-called program break). The heap area is shared by all the threads of the process, and the current top is a global variable for all of them. Hence, it is protected by a mutex whenever a thread modifies it (shrink/grow operations).
In the code snippet using mmap(), the current thread allocates a memory-mapped area for itself; the resulting address is only for the calling thread. Hence, no mutual exclusion is necessary from the point of view of ptmalloc's global data structures, as the latter are not modified. The IS_MMAPPED_BIT flag is set in the internal allocation header to tell ptmalloc that this is a memory-mapped region when it is asked to free it. As for mmap() internals, the mutual exclusion is managed inside the kernel.

Local memory for each CUDA thread

I have a simple program below. My question is: where is "temp" actually stored? Is it in global or local memory? I need an array temp for each idx, so that every thread has an individual array temp. In this case, it is working properly. But in my actual program, when I tried to fill temp[0] from test2, the program stopped. Suppose we have 1024 threads; then it only runs the kernel for around 200 threads. So I am wondering whether temp is shared or not. If yes, maybe there is a collision there. I also did not get any error message. Please, someone explain this.
#include <cstdio>

__device__ void test2(int temp[], int idx) {
    temp[0] = idx;
    printf("%d ", temp[0]);
}

__global__ void test() {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int *temp = (int *) malloc(100 * sizeof(int));
    test2(temp, idx);
}

int main() {
    test<<<1, 1024>>>();
    cudaDeviceSynchronize(); // wait for the kernel so the printf output appears
    return 0;
}
My question is that where is "temp" actually stored?
The allocation for temp is stored in a place called the device heap. It is a form of global memory. However the temp variable itself (i.e. the pointer value) is in local memory - not shared or visible to other threads.
I need array temp for each idx so that every thread has individual array temp.
You will get that, subject to caveats below. Each thread will have its own individual array, referenced by its local variable temp. Each thread will have a separate allocation for storage on the device heap.
People commonly have problems with in-kernel new or malloc. One of the main reasons is that the device heap is initially limited to 8MB, across all of your device heap allocations. So if enough threads make enough allocation requests with new or malloc, you will run out of space.
When you run out of space, the API way to signal that is to return a zero pointer value for the allocation (a NULL pointer). If you then attempt to use this NULL pointer, you will have trouble.
For debugging purposes (i.e. to prove this is happening), test the pointer for NULL (i.e. == 0) before using it. If it is NULL, don't use it (perhaps print an error message instead).
You can read more about this in the documentation or in many questions here on the SO cuda tag. If you read any of these sources, you will discover that you can increase the size of the device heap.
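A minimal sketch of both mitigations, the NULL check and a larger heap (the 64 MB figure is an arbitrary example; cudaDeviceSetLimit with cudaLimitMallocHeapSize is the documented knob):

#include <cstdio>

__global__ void test() {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int *temp = (int *) malloc(100 * sizeof(int));
    if (temp == NULL) { // device heap exhausted
        printf("thread %d: malloc failed\n", idx);
        return;
    }
    temp[0] = idx;
    free(temp); // give the space back to the device heap
}

int main() {
    // Grow the device heap from its 8 MB default before any kernel runs.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);
    test<<<1, 1024>>>();
    cudaDeviceSynchronize();
    return 0;
}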

Multi-threaded reference counting

I was just thinking about multi-threaded reference counting, searched for it, and found many posts that basically only mention the problem of atomicity; many answers, even here on Stack Overflow, miss the actual problems involved in multi-threaded reference counting. So what's the fundamental problem?
Let's assume an object type with a reference counter like
struct SomethingRefcounted {
    int refcount;
    // other stuff
} * sr;
Reference counting means, our reference counter shall equal the number of references to the object at all times, possibly being slightly higher temporarily during pointer assignments.
For all further code, I assume volatile and atomic operations.
Now, each time we create a new reference, we do an implicit ++sr->refcount; each time we remove a reference, we do an implicit if (!--sr->refcount) free(sr);.
However, if one thread does the decrement while another thread attempts the increment at the same time, we get a race, which can only be understood by considering CPU registers.
SomethingRefcounted * sr1, * sr2;
int thread_1() {
    for (;;) sr1 = NULL;
}
int thread_2() {
    for (;;) sr2 = sr1;
}
int thread_3() {
    for (;;) sr1 = new SomethingRefcounted;
}
int main() {
    // create the threads and let them run
}
The problem here is thread_2: the moment it reads the pointer sr1 into a CPU register, it violates the assumption that the refcounter correctly counts the number of references. The refcounter is 1 after the new assignment to sr1, but even ignoring thread_1 for a moment, once sr1 is read into a CPU register by thread_2, there are two references to the object: the variable sr1 and the CPU register of thread_2. Yet the refcounter is still only 1, violating our constraint from above. The following increment will fix it, but this fix will come too late if thread_1 decrements it to 0 first.
There are solutions involving locks (global locks, maybe hashed so that many objects share one lock out of a pool of locks; then not every object needs its own lock, but there also doesn't need to be a single application-wide lock causing lock contention). One solution I came up with, however, is to xchg the pointer on read, so that the constraint refcounter >= number of references is enforced. thread_2 in assembly could then look like this:
LOAD 0xdeadbeef -> reg1
L1:
XCHG [sr1] <-> reg1
CMP reg1 <-> 0xdeadbeef
JUMPIFEQUAL L1
INC [reg1]
STORE reg1 -> [sr1]
STORE reg1 -> [sr2] ; Actually, here, we have to do the
; refcount decrement on sr2 first, but you get the point
So, this is a simple spinlock, waiting for other concurrently running accesses like this to complete. Once we have successfully XCHGed the pointer, no one else can get it, so we can be sure the refcounter is at least 1 (it can't be 0, because we just found a reference to it) and that it can't be decremented to zero (even if there are more references, the one we now hold exclusively in our CPU register contributes 1 to the refcounter, preventing it from reaching zero).
Similarly, thread_1 would look like this:
LOAD 0xdeadbeef -> reg1
L1:
XCHG [sr1] <-> reg1
CMP reg1 <-> 0xdeadbeef
JUMPIFEQUAL L1
LOAD 0 -> reg2
STORE reg2 -> [sr1]
DEC [reg1]
JUMPIFZERO free_it
Now, I am wondering whether there are any more efficient solutions to this problem (or whether I am missing something here).
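For reference, a rough C11 rendering of the XCHG idea sketched above (the sentinel value and the function name are invented; this is a sketch, not a vetted implementation):

#include <stdatomic.h>

struct SomethingRefcounted {
    atomic_int refcount;
    // other stuff
};

#define BUSY ((struct SomethingRefcounted *)0xdeadbeef) // sentinel marking the slot as locked

// Atomically take the pointer out of the shared slot, bump the refcount
// while holding it exclusively, then put the pointer back.
static struct SomethingRefcounted *
acquire_ref(struct SomethingRefcounted *_Atomic *slot)
{
    struct SomethingRefcounted *p;
    do {
        p = atomic_exchange(slot, BUSY); // spin until we get a real value
    } while (p == BUSY);
    if (p != NULL)
        atomic_fetch_add(&p->refcount, 1);
    atomic_store(slot, p); // release the slot
    return p;
}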

Confusing result from counting page fault in linux

I was writing programs to count page faults in a Linux system; more precisely, the number of times the kernel executes the function __do_page_fault.
So I added two global variables, named pfcount_at_beg and pfcount_at_end, which are each incremented once when __do_page_fault executes, at different locations within the function.
To illustrate, the modified function goes as:
unsigned long pfcount_at_beg = 0;
unsigned long pfcount_at_end = 0;

static void __kprobes
__do_page_fault(...)
{
    struct vm_area_struct *vma;
    ... // VARIABLES DEFINITION
    unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
    pfcount_at_beg++; // I add THIS
    ...
    ...
    // ORIGINAL CODE OF THE FUNCTION
    ...
    pfcount_at_end++; // I add THIS
}
I expected the value of pfcount_at_end to be smaller than the value of pfcount_at_beg, because, I think, every time the kernel executes pfcount_at_end++ it must already have executed pfcount_at_beg++ (every function starts at the very beginning of its code), while on the other hand there are many conditional returns between these two lines of code.
However, the result turns out to be the opposite: the value of pfcount_at_end is larger than the value of pfcount_at_beg.
I use printk to print these kernel variables through a self-defined syscall, and I wrote a user-level program to call the system call.
Here is my simple syscall and user-level program:
// syscall
asmlinkage int sys_mysyscall(void)
{
    printk(KERN_INFO "total pf_at_beg %lu\ntotal pf_at_end %lu\n",
           pfcount_at_beg, pfcount_at_end);
    return 0;
}

// user-level program
#include <linux/unistd.h>
#include <sys/syscall.h>

#define __NR_mysyscall 223

int main()
{
    syscall(__NR_mysyscall);
    return 0;
}
Is there anybody who knows what exactly happened during this?
Just now I modified the code to make pfcount_at_beg and pfcount_at_end static. However, the result did not change: the value of pfcount_at_end is still larger than the value of pfcount_at_beg.
So possibly it is caused by the non-atomic increment operation. Would it be better if I used a read-write lock?
The ++ operator is not guaranteed to be atomic, so your counters may suffer concurrent access and end up with incorrect values. You should protect the increment with a critical section, or use the atomic_t type defined in <asm/atomic.h> and its related atomic_set() and atomic_add() functions (and many more).
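A minimal sketch of the counters rewritten with atomic_t (shown as fragments to drop into the code above; header location varies by kernel version):

#include <asm/atomic.h> /* <linux/atomic.h> on newer kernels */

static atomic_t pfcount_at_beg = ATOMIC_INIT(0);
static atomic_t pfcount_at_end = ATOMIC_INIT(0);

/* inside __do_page_fault(): */
atomic_inc(&pfcount_at_beg);
/* ... original code ... */
atomic_inc(&pfcount_at_end);

/* inside the syscall: */
printk(KERN_INFO "total pf_at_beg %d\ntotal pf_at_end %d\n",
       atomic_read(&pfcount_at_beg), atomic_read(&pfcount_at_end));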
Not directly connected to your issue, but using a specific syscall is overkill (but maybe it is an exercise). A lighter solution could be to use a /proc entry (also an interesting exercise).

Can I use a single call to munmap() to unmap two memory mappings located in a contiguous range of virtual memory?

For example, if I do this:
char *pMap1;   /* First mapping */
char *pReq;    /* Address we would like the second mapping at */
char *pMap2;   /* Second mapping */

/* Map the first 1 MB of the file. */
pMap1 = (char *)mmap(0, 1024*1024, PROT_READ, MAP_SHARED, fd, 0);
assert( pMap1!=MAP_FAILED );

/* Now map the second MB of the file. Request that the OS positions the
** second mapping immediately after the first in virtual memory. */
pReq = pMap1 + 1024*1024;
pMap2 = (char *)mmap(pReq, 1024*1024, PROT_READ, MAP_SHARED, fd, 1024*1024);
assert( pMap2!=MAP_FAILED );

/* Unmap the mappings created above. */
if( pMap2==pReq ){
    munmap(pMap1, 2 * 1024*1024);
}else{
    munmap(pMap1, 1 * 1024*1024);
    munmap(pMap2, 1 * 1024*1024);
}
And the OS does place my second mapping in the requested position (so that the (pMap2==pReq) condition is true), will the single call to munmap() serve to release all resources allocated?
The Linux man page says "The munmap() system call deletes the mappings for the specified address range...", which suggests that this will work, but I'm still a little nervous about it. Even if it does work on Linux, does anybody know how portable this is likely to be?
Thanks very much in advance.
The glibc manual says it's fine:
munmap removes any memory maps from (addr) to (addr + length). length should be the length of the mapping. It is safe to unmap multiple mappings in one command, or include unmapped space in the range. It is also possible to unmap only part of an existing mapping. However, only entire pages can be removed.
The POSIX spec says:
int munmap(void *addr, size_t len);
The munmap() function shall remove any mappings for those entire pages containing any part of the address space of the process starting at addr and continuing for len bytes.
To me, the wording clearly reads as if removing multiple mappings with a single munmap() is fine, and should be supported by any compliant implementation.
I think it should work. The POSIX spec says that it removes
any mappings for those entire pages containing any part of the address space of the process starting at addr and continuing for len bytes
The only unspecified behavior it describes is:
The behavior of this function is unspecified if the mapping was not established by a call to mmap().
