Multi-threaded reference counting - multithreading

I was thinking about multi-threaded reference counting, searched for it, and found many posts that basically only mention the problem of atomicity; many answers, even here on Stack Overflow, miss the actual problems involved in multi-threaded reference counting. So what's the fundamental problem?
Let's assume an object type with a reference counter like
struct SomethingRefcounted {
    int refcount;
    // other stuff
} * sr;
Reference counting means, our reference counter shall equal the number of references to the object at all times, possibly being slightly higher temporarily during pointer assignments.
For all further code, I assume volatile and atomic operations.
Now, each time we create a new reference, we do an implicit ++sr->refcount; each time we remove a reference, we do an implicit if (!--sr->refcount) free(sr);.
However, if one thread does the decrement and another thread tries the increment at the same time, we get a race, which can only be understood by considering CPU registers.
SomethingRefcounted * sr1, * sr2;

int thread_1() {
    for (;;) sr1 = NULL;
}
int thread_2() {
    for (;;) sr2 = sr1;
}
int thread_3() {
    for (;;) sr1 = new SomethingRefcounted;
}
int main() {
    // create the threads and let them run
}
The problem here is thread_2: The moment it reads the pointer sr1 into a CPU register, it violates the assumption that the refcounter correctly counts the number of references. The refcounter is 1 after the new assignment to sr1, but even ignoring thread_1 for a moment, once sr1 is read into a CPU register by thread_2, there are 2 references to the object - the variable sr1 and the CPU register of thread_2 - yet the refcounter is still only 1, violating our constraint from above. The following increment will fix it, but this fix comes too late if thread_1 decrements the count to 0 first.
There are solutions involving locks (global locks, perhaps hashed so that many objects share one lock out of a pool of locks - then not every object needs its own lock, but there also doesn't need to be a single application-wide lock causing lock contention). One solution I came up with, however, is to XCHG the pointer on read, so that the constraint refcounter >= number of references is enforced. thread_2 in assembly could then look like this:
    LOAD 0xdeadbeef -> reg1
L1:
    XCHG [sr1] <-> reg1
    CMP reg1 <-> 0xdeadbeef
    JUMPIFEQUAL L1
    INC [reg1]
    STORE reg1 -> [sr1]
    STORE reg1 -> [sr2] ; actually, here we would have to do the
                        ; refcount decrement on sr2 first, but you get the point
So, this is a simple spinlock, waiting for other concurrently running accesses like this to complete. Once we have successfully XCHGed the pointer, no one else can get it, so we can be sure the refcounter is at least 1 (it can't be 0, because we just found a reference to the object) and that it can't be decremented down to zero (even if there are more references, the one we now hold exclusively in our CPU register contributes 1 to the refcounter, preventing it from reaching zero).
Similarly, thread_1 would look like this:
    LOAD 0xdeadbeef -> reg1
L1:
    XCHG [sr1] <-> reg1
    CMP reg1 <-> 0xdeadbeef
    JUMPIFEQUAL L1
    LOAD 0 -> reg2
    STORE reg2 -> [sr1]
    DEC [reg1]
    JUMPIFZERO free_it
Now, I am wondering if there are any more efficient solutions to this problem (or whether I am missing something here).

Related

Concurrent array access with std::atomic

I know how mutex synchronization works, but I have trouble deciding how synchronization needs to be done in the following oversimplified case:
We have an array with 10 elements.
Thread 1 accesses the array in a read-only way - reading elements, e.g. something like this:
// const int *my_array;
int something = my_array[5];
Thread 2 does unrelated stuff, but from time to time it might decide to update all 10 elements at once, e.g. something like:
// const int *my_array;
const int *my_temp_array = load_new_data();
// suppose the pointed-to memory is valid.
// because these are pointers, the following operation is instant
my_array = my_temp_array;
Both threads would need to use a primitive such as std::unique_lock.
But is there a way this to be done with std::atomic?
Note:
as Igor mentioned:
if thread 1 loops over the array, and thread 2 flips it in the middle
of the loop - is it OK to process half old elements and half new ones?
Who allocates memory for the array, and when and by whom should it be
deallocated?
The example is oversimplified and I am interested only in general thread synchronization. This is why the number of elements is fixed at 10. Also, let's suppose there is no memory allocation and that it is OK to process half old elements and half new ones.
If the old memory does not need to be freed, std::atomic<int*> works like this:
//global atomic ptr
std::atomic<int*> myarray = nullptr;
// Thread 1
const int * current_array = myarray;
// do something with current_array (10 elements)
int something = current_array[5];
// Thread 2
int * tmp_array = new int[10];
myarray = tmp_array;

Confusing result from counting page fault in linux

I was writing programs to count page faults in a Linux system - more precisely, the number of times the kernel executes the function __do_page_fault.
To do this, I added two global variables, pfcount_at_beg and pfcount_at_end, which are each incremented once when __do_page_fault executes, at different locations in the function.
To illustrate, the modified function goes as:
unsigned long pfcount_at_beg = 0;
unsigned long pfcount_at_end = 0;

static void __kprobes
__do_page_fault(...)
{
    struct vm_area_struct *vma;
    ... // VARIABLES DEFINITION
    unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

    pfcount_at_beg++; // I add THIS

    ...
    // ORIGINAL CODE OF THE FUNCTION
    ...

    pfcount_at_end++; // I add THIS
}
I expected the value of pfcount_at_end to be smaller than the value of pfcount_at_beg: every time the kernel executes pfcount_at_end++, it must already have executed pfcount_at_beg++ (every call enters the function at the very beginning), while the many conditional returns between these two lines of code can skip pfcount_at_end++ entirely.
However, the result turns out to be the opposite: the value of pfcount_at_end is larger than the value of pfcount_at_beg.
I use printk to print these kernel variables via a self-defined syscall, and I wrote a user-level program to invoke the system call.
Here is my simple syscall and user-level program:
// syscall
asmlinkage int sys_mysyscall(void)
{
    printk(KERN_INFO "total pf_at_beg %lu\ntotal pf_at_end %lu\n",
           pfcount_at_beg, pfcount_at_end);
    return 0;
}
// user-level program
#include <linux/unistd.h>
#include <sys/syscall.h>

#define __NR_mysyscall 223

int main()
{
    syscall(__NR_mysyscall);
    return 0;
}
Does anybody know what exactly is happening here?
Just now I modified the code to make pfcount_at_beg and pfcount_at_end static. However, the result did not change, i.e. the value of pfcount_at_end is still larger than the value of pfcount_at_beg.
So possibly it is caused by the non-atomic increment operation. Would it be better if I used a read-write lock?
The ++ operator is not guaranteed to be atomic, so your counters may suffer from concurrent access and hold incorrect values. You should protect your increments with a critical section, or use the atomic_t type defined in <asm/atomic.h> and its related atomic_set() and atomic_add() functions (and a lot more).
Not directly connected to your issue, but using a specific syscall is overkill (but maybe it is an exercise). A lighter solution could be to use a /proc entry (also an interesting exercise).

Atomicity of a read on SPARC

I'm writing a multithreaded application and having a problem on the SPARC platform. Ultimately my question comes down to the atomicity of this platform and how I could be obtaining this result.
Some pseudocode to help clarify my question:
// Global variable
typedef struct pkd_struct {
    uint16_t a;
    uint16_t b;
} __attribute__((packed)) pkd_struct_t;

pkd_struct_t shared;
Thread 1:
swap_value() {
    pkd_struct_t prev = shared;
    printf("%d%d\n", prev.a, prev.b);
    ...
}
Thread 2:
use_value() {
    pkd_struct_t next;
    next.a = 0; next.b = 0;
    shared = next;
    printf("%d%d\n", shared.a, shared.b);
    ...
}
Thread 1 and 2 are accessing the shared variable "shared" - one is setting it, the other is getting it. If Thread 2 is setting "shared" to zero, I'd expect Thread 1 to read the value either from before OR after the store, since "shared" is aligned on a 4-byte boundary. However, I will occasionally see Thread 1 read a value of the form 0xFFFFFF00; that is, the high-order 24 bits are OLD, but the low-order byte is NEW. It appears I've gotten an intermediate value.
Looking at the disassembly, the use_value function simply does an "ST" instruction. Given that the data is aligned and isn't crossing a word boundary, is there any explanation for this behavior? If ST is indeed NOT atomic when used this way, does that explain the result I see (only 1 byte changed?!?)? There is no problem on x86.
UPDATE 1:
I've found the problem, but not the cause. GCC appears to be generating assembly that reads the shared variable byte-by-byte (thus allowing a partial update to be observed). Comments added, but I am not terribly comfortable with SPARC assembly. %i0 is a pointer to the shared variable.
xxx+0xc: ldub [%i0], %g1 // ld unsigned byte g1 = [i0] -- 0 padded
xxx+0x10: ...
xxx+0x14: ldub [%i0 + 0x1], %g5 // ld unsigned byte g5 = [i0+1] -- 0 padded
xxx+0x18: sllx %g1, 0x18, %g1 // g1 = [i0+0] left shifted by 24
xxx+0x1c: ldub [%i0 + 0x2], %g4 // ld unsigned byte g4 = [i0+2] -- 0 padded
xxx+0x20: sllx %g5, 0x10, %g5 // g5 = [i0+1] left shifted by 16
xxx+0x24: or %g5, %g1, %g5 // g5 = g5 OR g1
xxx+0x28: sllx %g4, 0x8, %g4 // g4 = [i0+2] left shifted by 8
xxx+0x2c: or %g4, %g5, %g4 // g4 = g4 OR g5
xxx+0x30: ldub [%i0 + 0x3], %g1 // ld unsigned byte g1 = [i0+3] -- 0 padded
xxx+0x34: or %g1, %g4, %g1 // g1 = g4 OR g1
xxx+0x38: ...
xxx+0x3c: st %g1, [%fp + 0x7df] // store g1 on the stack
Any idea why GCC is generating code like this?
UPDATE 2: Adding more info to the example code. Apologies - I'm working with a mix of new and legacy code and it's difficult to separate what's relevant. Also, I understand that sharing a variable like this is highly discouraged in general. However, this is actually part of a lock implementation where higher-level code will use it to provide atomicity, and using pthreads or platform-specific locking is not an option here.
Because you've declared the type as packed, it gets one byte alignment, which means it must be read and written one byte at a time, as SPARC does not allow unaligned loads/stores. You need to give it 4-byte alignment if you want the compiler to use word load/store instructions:
typedef struct pkd_struct {
    uint16_t a;
    uint16_t b;
} __attribute__((packed, aligned(4))) pkd_struct_t;
Note that packed is essentially meaningless for this struct, so you could leave that out.
Answering my own question here -- this has bugged me for too long and hopefully I can save someone a bit of frustration at some point.
The problem is that although the shared data is aligned, because it is packed GCC reads it byte-by-byte.
There is some discussion here on how packing leads to load/store bloat on SPARC (and, I'd assume, other RISC platforms), but in my case it has led to a race.

Is there a way in C++11 to prevent "normal" operations from slipping before or after an atomic operation

I'm interested in doing something like this (a single thread updates, multiple threads read bannedURLs):
atomic<bannedURLList*> bannedURLs; // global variable pointing to the currently used instance of the struct

void updateList()
{
    // no need for a mutex because only 1 thread updates
    bannedURLList* newList = new bannedURLList();
    bannedURLList* oldList = bannedURLs;
    newList->initialize();
    bannedURLs = newList; // this line must come after the previous one, because the list must be initialized before it is ready to be used
    // while refcnt on the oldList > 0 wait, then delete oldList
}
reader threads do something like this:
{
    bannedURLs->refCnt++;
    // use bannedURLs
    bannedURLs->refCnt--;
}
The struct member refCnt is also an atomic integer.
My question is how to prevent reordering of these 2 lines:
newList->initialize();
bannedURLs=newList;
Can it be done in std:: way?
Use bannedURLs.store(newList); instead of bannedURLs = newList;. Since you didn't pass a weaker ordering specifier, this defaults to memory_order_seq_cst, which forces full ordering on the store. (Note that plain assignment to a std::atomic performs exactly the same sequentially consistent store, so the two forms are equivalent; either way, initialize() cannot be reordered past the publication.)

Is it possible to store pointers in shared memory without using offsets?

When using shared memory, each process may mmap the shared region into a different area of its respective address space. This means that when storing pointers within the shared region, you need to store them as offsets from the start of the shared region. Unfortunately, this complicates the use of atomic instructions (e.g. if you're trying to write a lock-free algorithm).

For example, say you have a bunch of reference-counted nodes in shared memory, created by a single writer. The writer periodically atomically updates a pointer 'p' to point to a valid node with a positive reference count. Readers want to atomically increment the reference count through 'p', which points to the beginning of a node (a struct) whose first element is the reference count. Since p always points to a valid node, incrementing the ref count is safe, and makes it safe to dereference 'p' and access other members. However, this all only works when everything is in the same address space. If the nodes and the 'p' pointer are stored in shared memory, then clients suffer a race condition:
1. x = read p
2. y = x + offset
3. Increment refcount at y
During step 2, p may change and x may no longer point to a valid node. The only workaround I can think of is somehow forcing all processes to agree on where to map the shared memory, so that real pointers rather than offsets can be stored in the mmap'd region. Is there any way to do that? I see MAP_FIXED in the mmap documentation, but I don't know how I could pick an address that would be safe.
Edit: Using inline assembly and the 'lock' prefix on x86, maybe it's possible to build an "increment pointer X with offset Y by value Z" operation? Equivalent options on other architectures? I haven't written a lot of assembly, so I don't know whether the needed instructions exist.
At a low level, an x86 atomic instruction can do all three steps at once:
1. x = read p
2. y = x + offset
3. Increment refcount at y
    mov edi, Destination
    mov edx, DataOffset
    mov ecx, NewData
#Repeat:
    mov eax, [edi + edx]                      // load OldData
    // here you could also increment eax and store it back to [edi + edx]
    lock cmpxchg dword ptr [edi + edx], ecx
    jnz #Repeat
This is trivial on a UNIX system; just use the shared memory functions:
shmget, shmat, shmctl, shmdt
void *shmat(int shmid, const void *shmaddr, int shmflg);
shmat() attaches the shared memory segment identified by shmid to the address space of the calling process. The attaching address is specified by shmaddr with one of the following criteria:

If shmaddr is NULL, the system chooses a suitable (unused) address at which to attach the segment.
Just specify your own address here; e.g. 0x20000000000
If you shmget() using the same key and size in every process, you will get the same shared memory segment. If you shmat() at the same address, the virtual addresses will be the same in all processes. The kernel doesn't care what address range you use, as long as it doesn't conflict with wherever it normally assigns things. (If you leave out the address, you can see the general region that it likes to put things; also, check addresses on the stack and returned from malloc() / new[] .)
On Linux, make sure root sets SHMMAX in /proc/sys/kernel/shmmax to a large enough number to accommodate your shared memory segments (default is 32MB).
As for atomic operations, you can get them all from the Linux kernel source, e.g.
include/asm-x86/atomic_64.h
/*
 * Make sure gcc doesn't try to be clever and move things around
 * on us. We need to use _exactly_ the address the user gave us,
 * not some alias that contains the same information.
 */
typedef struct {
    int counter;
} atomic_t;

/**
 * atomic_read - read atomic variable
 * @v: pointer of type atomic_t
 *
 * Atomically reads the value of @v.
 */
#define atomic_read(v) ((v)->counter)

/**
 * atomic_set - set atomic variable
 * @v: pointer of type atomic_t
 * @i: required value
 *
 * Atomically sets the value of @v to @i.
 */
#define atomic_set(v, i) (((v)->counter) = (i))

/**
 * atomic_add - add integer to atomic variable
 * @i: integer value to add
 * @v: pointer of type atomic_t
 *
 * Atomically adds @i to @v.
 */
static inline void atomic_add(int i, atomic_t *v)
{
    asm volatile(LOCK_PREFIX "addl %1,%0"
                 : "=m" (v->counter)
                 : "ir" (i), "m" (v->counter));
}
64-bit version:
typedef struct {
    long counter;
} atomic64_t;

/**
 * atomic64_add - add integer to atomic64 variable
 * @i: integer value to add
 * @v: pointer to type atomic64_t
 *
 * Atomically adds @i to @v.
 */
static inline void atomic64_add(long i, atomic64_t *v)
{
    asm volatile(LOCK_PREFIX "addq %1,%0"
                 : "=m" (v->counter)
                 : "er" (i), "m" (v->counter));
}
We have code that's similar to your problem description. We use a memory-mapped file, offsets, and file locking. We haven't found an alternative.
You shouldn't be afraid to make up an address at random, because the kernel will just reject addresses it doesn't like (ones that conflict). See my shmat() answer above, using 0x20000000000
With mmap:
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
If addr is not NULL, then the kernel takes it as a hint about where to place the mapping; on Linux, the mapping will be created at the next higher page boundary. The address of the new mapping is returned as the result of the call.

The flags argument determines whether updates to the mapping are visible to other processes mapping the same region, and whether updates are carried through to the underlying file. This behavior is determined by including exactly one of the following values in flags:

MAP_SHARED Share this mapping. Updates to the mapping are visible to other processes that map this file, and are carried through to the underlying file. The file may not actually be updated until msync(2) or munmap() is called.

ERRORS
EINVAL We don’t like addr, length, or offset (e.g., they are too large, or not aligned on a page boundary).
Adding the offset to the pointer does not create the potential for a race - it already exists. Since neither ARM nor x86 (at least) can atomically read a pointer and then access the memory it refers to in a single operation, you need to protect the pointer access with a lock regardless of whether you add an offset.
