Is there something wrong with my spin lock? - linux

Here is my implementation of a spin lock, but it seems it cannot protect the critical code. Is there something wrong with my implementation?
static __inline__ int xchg_asm(int* lock, int val)
{
int ret;
__asm__ __volatile__(
LOCK "movl (%1),%%eax;
xchg (%1),%2;
movl %%eax, %0" :"=m" (ret) :"d"(lock), "c"(val)
);
return ret;
}
void spin_init(spinlock_t* sl)
{
sl->val = 0;
}
void spin_lock(spinlock_t* sl)
{
int ret;
do {
ret = xchg_asm(&(sl->val), 1);
} while ( ret==0 );
}
void spin_unlock(spinlock_t* sl)
{
xchg_asm(&(sl->val), 0);
}

Your code is equivalent to:
static __inline__ int xchg_asm(int* lock, int val) {
int save_old_value_at_eax;
save_old_value_at_eax = *lock; /* with a misplaced lock prefix */
/* xchg *lock with val and discard the original value of *lock */
return save_old_value_at_eax; /* but this is not the real original value of *lock */
}
As you can see from the code, save_old_value_at_eax is not the real original value at the moment the CPU performs the xchg. You should get the old value from the xchg instruction itself, not by reading it before the xchg. ("Not the real original value" means: if another CPU takes the lock after this CPU reads the value but before this CPU performs the xchg, this CPU gets a stale old value and thinks it took the lock successfully, so two CPUs enter the critical section at the same time.) You have split one read-modify-write instruction into three instructions, and the three together are not atomic (even if you move the lock prefix to the xchg).
I guess you thought the lock prefix would lock ALL three instructions, but a lock prefix applies only to the single instruction it is attached to (and not every instruction accepts one).
Also, xchg does not need a lock prefix on SMP. Quote from linux_kernel_src/arch/x86/include/asm/cmpxchg.h:
/*
* Note: no "lock" prefix even on SMP: xchg always implies lock anyway.
* Since this is generally used to protect other memory information, we
* use "asm volatile" and "memory" clobbers to prevent gcc from moving
* information around.
*/
My suggestions:
DON'T REPEAT YOURSELF: use the spin lock of the Linux kernel.
DON'T REPEAT YOURSELF: use the xchg() and cmpxchg() of the Linux kernel if you really do want to implement a spin lock yourself.
Learn more about the instructions. You can also look at how the Linux kernel implements them.
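As a minimal user-space sketch of the point above (take the old value from the exchange itself), the lock can be written with C11 atomics instead of hand-written assembly. This is an illustration and not the kernel's implementation: the type and function names are hypothetical, and the question's spinlock_t is not shown, so a small struct is assumed here. Note that the loop keeps spinning while the old value is non-zero, which is the opposite of the question's `while ( ret==0 )` test.

```c
#include <stdatomic.h>

/* Hypothetical lock type for this sketch: 0 = free, 1 = held. */
typedef struct {
    atomic_int val;
} sketch_spinlock_t;

void sketch_spin_init(sketch_spinlock_t *sl)
{
    atomic_init(&sl->val, 0);
}

/* One attempt: returns 1 if the lock was free and is now ours, 0 otherwise.
 * The old value comes from the exchange itself, so no other CPU can slip
 * in between reading the old value and writing the new one. */
int sketch_spin_trylock(sketch_spinlock_t *sl)
{
    return atomic_exchange_explicit(&sl->val, 1, memory_order_acquire) == 0;
}

void sketch_spin_lock(sketch_spinlock_t *sl)
{
    /* Spin while the previous value was 1, i.e. someone else held the lock. */
    while (!sketch_spin_trylock(sl)) {
        /* busy-wait */
    }
}

void sketch_spin_unlock(sketch_spinlock_t *sl)
{
    /* Release ordering keeps the critical section's writes before the unlock. */
    atomic_store_explicit(&sl->val, 0, memory_order_release);
}
```

On x86, compilers typically lower the exchange to a single xchg instruction, which matches the kernel comment quoted above.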

Related

Do I need a QMutex for variable that is accessed by single statement?

In this document, a QMutex is used to protect "number" from being modified by multiple threads at the same time.
I have code in which a thread is instructed to do different work according to a flag set by another thread.
//In thread1
if(flag)
dowork1;
else
dowork2;
//In thread2
void setflag(bool f)
{
flag=f;
}
I want to know if a QMutex is needed to protect flag, i.e.,
//In thread1
mutex.lock();
if(flag)
{
mutex.unlock();
dowork1;
}
else
{
mutex.unlock();
dowork2;
}
//In thread2
void setflag(bool f)
{
mutex.lock();
flag=f;
mutex.unlock();
}
The code differs from the document in that flag is accessed (read or written) by a single statement in both threads, and only one thread modifies the value of flag.
PS:
I always see the example in multi-thread programming tutorials that one thread does "count++", the other thread does "count--", and the tutorials say you should use a Mutex to protect the variable "count". I cannot get the point of using a mutex. Does it mean the execution of single statement "count++" or "count--" can be interrupted in the middle and produce unexpected result? What unexpected results can be gotten?
Does it mean the execution of single statement "count++" or "count--"
can be interrupted in the middle and produce unexpected result? What
unexpected results can be gotten?
Just answering this part: yes, the execution can be interrupted in the middle of a statement.
Let's imagine a simple case:
class A {
void foo(){
++a;
}
int a = 0;
};
The single statement ++a is translated in assembly to
mov eax, DWORD PTR [rdi]
add eax, 1
mov DWORD PTR [rdi], eax
which can be seen as
eax = a;
eax += 1;
a = eax;
If foo() is called on the same instance of A in 2 different threads (be it on a single core, or multiple cores) you cannot predict what will be the result of the program.
It can behave nicely:
thread 1 > eax = a // eax in thread 1 is equal to 0
thread 1 > eax += 1 // eax in thread 1 is equal to 1
thread 1 > a = eax // a is set to 1
thread 2 > eax = a // eax in thread 2 is equal to 1
thread 2 > eax += 1 // eax in thread 2 is equal to 2
thread 2 > a = eax // a is set to 2
or not:
thread 1 > eax = a // eax in thread 1 is equal to 0
thread 2 > eax = a // eax in thread 2 is equal to 0
thread 2 > eax += 1 // eax in thread 2 is equal to 1
thread 2 > a = eax // a is set to 1
thread 1 > eax += 1 // eax in thread 1 is equal to 1
thread 1 > a = eax // a is set to 1
In a well defined program, N calls to foo() should result in a == N.
But calling foo() on the same instance of A from multiple threads creates undefined behavior. There is no way to know the value of a after N calls to foo().
The result will depend on how you compiled your program, which compiler and optimization flags were used, the load on your CPU, the number of cores, and so on.
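The bad interleaving above can be replayed by hand to make the lost update concrete. The sketch below is C11 rather than C++ (std::atomic<int> would be the C++ counterpart), runs in a single thread, and uses hypothetical function names. The first function executes the six steps in the "or not" order with two private register variables; the second performs the same two increments as indivisible read-modify-write operations.

```c
#include <stdatomic.h>

/* Replays the "or not" interleaving step by step: both logical threads
 * load a before either one stores. Returns the final value of a. */
int lost_update_replay(void)
{
    int a = 0;
    int eax1, eax2;   /* each logical thread's private register */

    eax1 = a;         /* thread 1 > eax = a   (0) */
    eax2 = a;         /* thread 2 > eax = a   (0) */
    eax2 += 1;        /* thread 2 > eax += 1  (1) */
    a = eax2;         /* thread 2 > a = eax   (1) */
    eax1 += 1;        /* thread 1 > eax += 1  (1) */
    a = eax1;         /* thread 1 > a = eax   (1) */
    return a;
}

/* The same two increments done as atomic read-modify-write operations:
 * no other access can land between the load and the store of either one. */
int atomic_increment_twice(void)
{
    atomic_int a = 0;
    atomic_fetch_add(&a, 1);
    atomic_fetch_add(&a, 1);
    return atomic_load(&a);
}
```

The replay ends with a equal to 1 after two increments, while the atomic version ends with 2.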
NB
class A {
public:
bool check() const { return a == b; }
int get_a() const { return a; }
int get_b() const { return b; }
void foo(){
++a;
++b;
}
private:
int a = 0;
int b = 0;
};
Now we have a class that, for an external observer, keeps a and b equal at all times.
The optimizer could optimize this class into:
class A {
public:
bool check() const { return true; }
int get_a() const { return a; }
int get_b() const { return b; }
void foo(){
++a;
++b;
}
private:
int a = 0;
int b = 0;
};
because it does not change the observable behavior of the program.
However, if you invoke undefined behavior by calling foo() on the same instance of A from multiple threads, you could end up with a == 3, b == 2 and check() still returning true. Your code has lost its meaning: the program is not doing what it is supposed to do and could be doing just about anything.
From here you can imagine more complex cases, like if A manages network connections, you can end up sending the data for client #10 to client #6. If your program is running in a factory, you can end up activating the wrong tool.
If you want the definition of undefined behavior, you can look here: https://en.cppreference.com/w/cpp/language/ub and in the C++ standard.
For a better understanding of UB you can look for CppCon talks on the topic.
For any standard object (including bool) that is accessed from multiple threads, where at least one of the threads may modify the object's state, you need to protect access to that object using a mutex, otherwise you will invoke undefined behavior.
As a practical matter, for a bool that undefined behavior probably won't come in the form of a crash, but more likely in the form of thread B sometimes not "seeing" changes made to the bool by thread A, due to caching and/or optimization issues (e.g. the optimizer "knows" that the bool can't change during a function call, so it doesn't bother checking it more than once)
If you don't want to guard your accesses with a mutex, the other option is to change flag from a bool to a std::atomic<bool>; the std::atomic<bool> type has exactly the semantics you are looking for, i.e. it can be read and/or written from any thread without invoking undefined behavior.
Look here for an explanation: Do I have to use atomic<bool> for "exit" bool variable?
To synchronize access to flag you can make it a std::atomic<bool>.
Or you can use a QReadWriteLock together with a QReadLocker and a QWriteLocker. Compared to using a QMutex this gives you the advantage that you do not need to care about the call to QMutex::unlock() if you use exceptions or early return statements.
Alternatively you can use a QMutexLocker if the QReadWriteLock does not match your use case.
QReadWriteLock lock;
...
//In thread1
{
QReadLocker readLocker(&lock);
if(flag)
dowork1;
else
dowork2;
}
...
//In thread2
void setflag(bool f)
{
QWriteLocker writeLocker(&lock);
flag=f;
}
Keeping your program expressing its intent (ie. accessing shared vars under locks) is a big win for program maintenance and clarity. You need to have some pretty good reasons to abandon that clarity for obscure approaches like the atomics and devising consistent race conditions.
Good reasons include having measured that your program spends too much time toggling the mutex. In any decent implementation, the difference between a non-contested mutex and an atomic is minute: the mutex lock and unlock typically employ an optimistic compare-and-swap and return quickly. If your vendor doesn't provide a decent implementation, you might bring that up with them.
In your example, dowork1 and dowork2 are invoked with the mutex locked; so the mutex isn't just protecting flag, but also serializing these functions. If that is just an artifact of how you posed the question, then race conditions (variants of atomics travesty) are less scary.
In your PS (dup of comment above):
Yes, count++ is best thought of as:
mov $_count, %r1
ld (%r1), %r0
add $1, %r0, %r2
st %r2,(%r1)
Even on machines with a natural atomic increment instruction (x86, 68k, 370, dinosaurs), the compiler might not use it consistently.
So, if two threads do count--; and count++; at close to the same time, the result could be -1, 0, 1. (ignoring the language weenies that say your house might burn down).
barriers:
if CPU0 executes:
store $1 to b
store $2 to c
and CPU1 executes:
load barrier -- discard speculatively read values.
load b to r0
load c to r1
Then CPU1 could read r0,r1 as: (0,0), (1,0), (1,2), (0,2).
This is because the observable order of the memory writes is weak; the processor may make them visible in an arbitrary fashion.
So, we change CPU0 to execute:
store $1 to b
store barrier -- stop storing until all previous stores are visible
store $2 to c
Then, if CPU1 saw that r1 (c) was 2, then r0 (b) has to be 1. The store barrier enforces that.
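The store-barrier pairing can be sketched with C11 atomics, where a release store stands in for the store barrier and an acquire load for the load barrier. This is an illustration under assumptions: the variable and function names are hypothetical, and the reader here loads c first and then b, because the guarantee "c is 2 implies b is 1" needs the load of c to be ordered before the load of b.

```c
#include <stdatomic.h>

/* Shared locations, named after b and c in the text above. */
atomic_int shared_b = 0;
atomic_int shared_c = 0;

/* CPU0: the release store to c acts as the store barrier. The earlier
 * store to b becomes visible no later than the store to c. */
void cpu0_publish(void)
{
    atomic_store_explicit(&shared_b, 1, memory_order_relaxed);
    atomic_store_explicit(&shared_c, 2, memory_order_release);
}

/* CPU1: the acquire load of c acts as the load barrier. If it returns 2,
 * the following load of b is guaranteed to return 1. */
void cpu1_observe(int *r0, int *r1)
{
    *r1 = atomic_load_explicit(&shared_c, memory_order_acquire);
    *r0 = atomic_load_explicit(&shared_b, memory_order_relaxed);
}
```

With this pairing the observable (r0, r1) results are (0,0), (1,0) and (1,2); the (0,2) outcome is ruled out.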
To me, it seems handier to use a mutex here.
In general, not using a mutex when sharing references can lead to problems.
The only downside of using a mutex here seems to be a slight decrease in performance, because your threads have to wait for each other.
What kind of errors could happen?
As somebody said in the comments, the situation differs depending on whether you share a fundamental data type (e.g. int, bool, float) or an object reference. I added a Qt code example that highlights 2 possible problems when NOT using a mutex. Problem #3 is a fundamental one and is described in detail by Benjamin T in his nice answer.
main.cpp
#include <QCoreApplication>
#include <QThread>
#include <QtDebug>
#include <QTimer>
#include "countingthread.h"
int main(int argc, char *argv[])
{
QCoreApplication a(argc, argv);
int amountThread = 3;
int counter = 0;
QString *s = new QString("foo");
QMutex *mutex = new QMutex();
//we construct a lot of thread
QList<CountingThread*> threadList;
//we create all threads
for(int i=0;i<amountThread;i++)
{
CountingThread *t = new CountingThread();
#ifdef TEST_ATOMIC_VAR_SHARE
t->addCounterdRef(&counter);
#endif
#ifdef TEST_OBJECT_VAR_SHARE
t->addStringRef(s);
//we add a mutex, which is shared to read read write
//just used with TEST_OBJECT_SHARE_FIX define uncommented
t->addMutexRef(mutex);
#endif
//t->moveToThread(t);
threadList.append(t);
}
//we start all with low prio, otherwise we produce something like a fork bomb
for(int i=0;i<amountThread;i++)
threadList.at(i)->start(QThread::Priority::LowPriority);
return a.exec();
}
countingthread.h
#ifndef COUNTINGTHREAD_H
#define COUNTINGTHREAD_H
#include <QThread>
#include <QtDebug>
#include <QTimer>
#include <QMutex>
//atomic var is shared
//#define TEST_ATOMIC_VAR_SHARE
//more complex object var is shared
#define TEST_OBJECT_VAR_SHARE
// we add the fix
#define TEST_OBJECT_SHARE_FIX
class CountingThread : public QThread
{
Q_OBJECT
int *m_counter;
QString *m_string;
QMutex *m_locker;
public :
void addCounterdRef(int *r);
void addStringRef(QString *s);
void addMutexRef(QMutex *m);
void run() override;
};
#endif // COUNTINGTHREAD_H
countingthread.cpp
#include "countingthread.h"
void CountingThread::run()
{
//forever
while(1)
{
#ifdef TEST_ATOMIC_VAR_SHARE
//first use of counter
int counterUse1Copy= (*m_counter);
//some other operations, here sleep 10 ms
this->msleep(10);
//we will retry to use a second time
int counterUse2Copy= (*m_counter);
if(counterUse1Copy != counterUse2Copy)
qDebug()<<this->thread()->currentThreadId()<<" problem #1 found, counter not like we expect";
//we increment afterwards our counter
(*m_counter) +=1; //this works for fundamental types, like float, int, ...
#endif
#ifdef TEST_OBJECT_VAR_SHARE
#ifdef TEST_OBJECT_SHARE_FIX
m_locker->lock();
#endif
m_string->replace("#","-");
//this will crash here !!, with problem #2,
//segmentation fault, is not handle by try catch
m_string->append("foomaster");
m_string->append("#");
if(m_string->length()>10000)
qDebug()<<this->thread()->currentThreadId()<<" string is: " << *m_string; //dereference to print the text, not the pointer
#ifdef TEST_OBJECT_SHARE_FIX
m_locker->unlock();
#endif
#endif
}//end forever
}
void CountingThread::addCounterdRef(int *r)
{
m_counter = r;
qDebug()<<this->thread()->currentThreadId()<<" add counter with value: " << *m_counter << " and address : "<< m_counter ;
}
void CountingThread::addStringRef(QString *s)
{
m_string = s;
qDebug()<<this->thread()->currentThreadId()<<" add string with value: " << *m_string << " and address : "<< m_string ;
}
void CountingThread::addMutexRef(QMutex *m)
{
m_locker = m;
}
If you follow the code, you can perform 2 tests.
If you uncomment TEST_ATOMIC_VAR_SHARE and comment out TEST_OBJECT_VAR_SHARE in countingthread.h, you will see problem #1: if you use your variable multiple times in your thread, it can be changed in the background by another thread. Contrary to my expectation, there was no app crash or weird exception in my build environment when using an int counter.
If you uncomment TEST_OBJECT_VAR_SHARE and comment out TEST_OBJECT_SHARE_FIX and TEST_ATOMIC_VAR_SHARE in countingthread.h, you will see problem #2: you get a segmentation fault, which cannot be handled via try/catch. It occurs because multiple threads call modifying string functions on the same object.
If you uncomment TEST_OBJECT_SHARE_FIX too, you see the correct handling via a mutex.
Problem #3: see the answer from Benjamin T.
What is a mutex:
I really like the chicken explanation which vallabh suggested.
I also found a good explanation here

Why my implementation of sbrk system call does not work?

I am trying to write a very simple OS to better understand the basic principles, and I need to implement a user-space malloc. So first I want to implement and test it on my Linux machine.
First I implemented the sbrk() function in the following way:
void* sbrk( int increment ) {
return ( void* )syscall(__NR_brk, increment );
}
But this code does not work. When I use the sbrk provided by the OS instead, it works fine.
I have tried another implementation of sbrk():
static void *sbrk(signed increment)
{
size_t newbrk;
static size_t oldbrk = 0;
static size_t curbrk = 0;
if (oldbrk == 0)
curbrk = oldbrk = brk(0);
if (increment == 0)
return (void *) curbrk;
newbrk = curbrk + increment;
if (brk(newbrk) == curbrk)
return (void *) -1;
oldbrk = curbrk;
curbrk = newbrk;
return (void *) oldbrk;
}
sbrk is invoked from this function:
static Header *morecore(unsigned nu)
{
char *cp;
Header *up;
if (nu < NALLOC)
nu = NALLOC;
cp = sbrk(nu * sizeof(Header));
if (cp == (char *) -1)
return NULL;
up = (Header *) cp;
up->s.size = nu; // ***Segmentation fault
free((void *)(up + 1));
return freep;
}
This code does not work either: I get a segmentation fault on the line marked (***).
Where is the problem?
Thanks, all. I have solved my problem using a new implementation of sbrk. The following code works fine.
void* __sbrk__(intptr_t increment)
{
void *new, *old = (void *)syscall(__NR_brk, 0);
new = (void *)syscall(__NR_brk, ((uintptr_t)old) + increment);
return (((uintptr_t)new) == (((uintptr_t)old) + increment)) ? old :
(void *)-1;
}
The first sbrk should probably take a long increment. You also forgot to handle errors (and to set errno).
The second sbrk function does not change the address space (as sbrk does). You could use mmap to change it (but using mmap instead of sbrk won't update the kernel's view of the data segment end as sbrk does). You could use cat /proc/1234/maps to query the address space of the process with pid 1234, or even read /proc/self/maps (e.g. with fopen and fgets) from inside your program.
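Reading /proc/self/maps from inside the program could look roughly like the sketch below. It is Linux-specific, the helper name find_mapping is hypothetical, and it assumes the line being searched for carries a bracketed tag such as "[heap]" or "[stack]".

```c
#include <stdio.h>
#include <string.h>

/* Copies the first line of /proc/self/maps that contains tag into out.
 * Returns 1 if a line was found, 0 if none matched, and -1 if the file
 * could not be opened (for example on a system without /proc). */
int find_mapping(const char *tag, char *out, size_t outlen)
{
    char line[512];
    int found = 0;
    FILE *f = fopen("/proc/self/maps", "r");

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL) {
        if (strstr(line, tag) != NULL) {
            snprintf(out, outlen, "%s", line);
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}
```

Calling find_mapping("[heap]", buf, sizeof buf) before and after a break change is one way to check whether the data segment end actually moved; the "[heap]" line may be absent until the break has grown.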
BTW, sbrk is obsolete (most malloc implementations use mmap), and by definition every system call (listed in syscalls(2)) is executed by the kernel (for sbrk the kernel maintains the "data segment" limit!). So you cannot avoid the kernel, and I don't even understand why you want to emulate any system call. Almost by definition, you cannot emulate syscalls since they are the only way to interact with the kernel from a user application. From the user application, every syscall is an atomic elementary operation (done by a single SYSENTER machine instruction with appropriate contents in machine registers).
You could use strace(1) to understand the actual syscalls done by your running program.
BTW, GNU libc is free software. You could look into its source code. musl-libc is a simpler libc and its code is more readable.
Finally, compile with gcc -Wall -Wextra -g and use the gdb debugger (you can even query the registers if you want to). Perhaps read the x86-64 ABI specification and the Linux Assembly HOWTO.

How does linux know when to allocate more pages to a call stack?

Given the program below, segfault() will (As the name suggests) segfault the program by accessing 256k below the stack. nofault() however, gradually pushes below the stack all the way to 1m below, but never segfaults.
Additionally, running segfault() after nofault() doesn't result in an error either.
If I put sleep()s in nofault() and use the time to cat /proc/$pid/maps I see the allocated stack space grows between the first and second call, this explains why segfault() doesn't crash afterwards - there's plenty of memory.
But the disassembly shows there's no change to %rsp. This makes sense since that would screw up the call stack.
I presumed that the maximum stack size would be baked into the binary at compile time (In retrospect that would be very hard for a compiler to do) or that it would just periodically check %rsp and add a buffer after that.
How does the kernel know when to increase the stack memory?
#include <stdio.h>
#include <unistd.h>
void segfault(){
char * x;
int a;
for( x = (char *)&x-1024*256; x<(char *)(&x+1); x++){
a = *x & 0xFF;
printf("%p = 0x%02x\n",x,a);
}
}
void nofault(){
char * x;
int a;
sleep(20);
for( x = (char *)(&x); x>(char *)&x-1024*1024; x--){
a = *x & 0xFF;
printf("%p = 0x%02x\n",x,a);
}
sleep(20);
}
int main(){
nofault();
segfault();
}
The processor raises a page fault when you access an unmapped page. The kernel's page fault handler checks whether the address is reasonably close to the process's %rsp and if so, it allocates some memory and resumes the process. If you are too far below %rsp, the kernel passes the fault along to the process as a signal.
I tried to find the precise definition of what addresses are close enough to %rsp to trigger stack growth, and came up with this from linux/arch/x86/mm/fault.c:
/*
* Accessing the stack below %sp is always a bug.
* The large cushion allows instructions like enter
* and pusha to work. ("enter $65535, $31" pushes
* 32 pointers and then decrements %sp by 65535.)
*/
if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
bad_area(regs, error_code, address);
return;
}
But experimenting with your program I found that 65536+32*sizeof(unsigned long) isn't the actual cutoff point between segfault and no segfault. It seems to be about twice that value. So I'll just stick with the vague "reasonably close" as my official answer.
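For reference, the arithmetic in the quoted check can be restated as a small user-space predicate. This only mirrors the quoted condition; it is not kernel code, the function name is hypothetical, and as noted above the cutoff observed in practice may differ from it.

```c
#include <stdint.h>

/* Returns 1 when address lies further below sp than the cushion in the
 * quoted check allows (the case where the kernel snippet calls bad_area),
 * and 0 otherwise. */
int below_stack_cushion(uintptr_t address, uintptr_t sp)
{
    /* 65536 bytes plus room for 32 pointers, as in the quoted code. */
    const uintptr_t cushion = 65536 + 32 * sizeof(unsigned long);
    return address + cushion < sp;
}
```

On a build where unsigned long is 8 bytes, the cushion works out to 65536 + 256 = 65792 bytes.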

Why accessing pthread keys' sequence number is not synchronized in glibc's NPTL implementation?

Recently, while looking into how thread-local storage is implemented in glibc, I found the following code, which implements the API pthread_key_create():
int
__pthread_key_create (key, destr)
pthread_key_t *key;
void (*destr) (void *);
{
/* Find a slot in __pthread_kyes which is unused. */
for (size_t cnt = 0; cnt < PTHREAD_KEYS_MAX; ++cnt)
{
uintptr_t seq = __pthread_keys[cnt].seq;
if (KEY_UNUSED (seq) && KEY_USABLE (seq)
/* We found an unused slot. Try to allocate it. */
&& ! atomic_compare_and_exchange_bool_acq (&__pthread_keys[cnt].seq,
seq + 1, seq))
{
/* Remember the destructor. */
__pthread_keys[cnt].destr = destr;
/* Return the key to the caller. */
*key = cnt;
/* The call succeeded. */
return 0;
}
}
return EAGAIN;
}
__pthread_keys is a global array accessed by all threads. I don't understand why the read of its member seq is not synchronized, as in the following line:
uintptr_t seq = __pthread_keys[cnt].seq;
although it is synchronized when modified later.
FYI, __pthread_keys is an array of type struct pthread_key_struct, which is defined as follows:
/* Thread-local data handling. */
struct pthread_key_struct
{
/* Sequence numbers. Even numbers indicated vacant entries. Note
that zero is even. We use uintptr_t to not require padding on
32- and 64-bit machines. On 64-bit machines it helps to avoid
wrapping, too. */
uintptr_t seq;
/* Destructor for the data. */
void (*destr) (void *);
};
Thanks in advance.
In this case, the loop can avoid an expensive lock acquisition. The atomic compare and swap operation done later (atomic_compare_and_exchange_bool_acq) will make sure only one thread can successfully increment the sequence value and return the key to the caller. Other threads reading the same value in the first step will keep looping since the CAS can only succeed for a single thread.
This works because the sequence value alternates between even (empty) and odd (occupied). Incrementing the value to odd prevents other threads from acquiring the slot.
A plain read typically costs fewer cycles than a CAS instruction, so it makes sense to peek at the value before attempting the CAS.
There are many wait-free and lock-free algorithms that take advantage of the CAS instruction to achieve low-overhead synchronization.
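The peek-then-CAS pattern described above can be sketched in portable C11. This is not glibc's code: the table size and all names are hypothetical, and the peek is written as a relaxed atomic load, which is the C11 way to spell a read that takes no lock and imposes no ordering.

```c
#include <stdatomic.h>
#include <stdint.h>

#define SKETCH_SLOTS 4   /* hypothetical; stands in for PTHREAD_KEYS_MAX */

/* Sequence numbers: even = vacant, odd = in use (zero is even). */
static atomic_uintptr_t sketch_seq[SKETCH_SLOTS];

/* Returns the index of the slot claimed by the caller, or -1 if every
 * slot is in use. */
int sketch_claim_slot(void)
{
    for (int i = 0; i < SKETCH_SLOTS; ++i) {
        /* Cheap peek: an ordinary load, no read-modify-write. */
        uintptr_t seq = atomic_load_explicit(&sketch_seq[i],
                                             memory_order_relaxed);
        if ((seq & 1) != 0)
            continue;   /* odd: already taken, skip without a CAS */

        /* The CAS succeeds only if the slot still holds the peeked value,
         * so at most one thread can move it from even to odd. */
        if (atomic_compare_exchange_strong_explicit(&sketch_seq[i], &seq,
                                                    seq + 1,
                                                    memory_order_acquire,
                                                    memory_order_relaxed))
            return i;
    }
    return -1;
}

/* Marks a claimed slot vacant again by moving its sequence from odd to even. */
void sketch_release_slot(int i)
{
    atomic_fetch_add_explicit(&sketch_seq[i], 1, memory_order_release);
}
```

If the peek returns a stale value, the only consequence is a failed CAS or a skipped slot, which is why the read itself needs no lock.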

Is it possible to store pointers in shared memory without using offsets?

When using shared memory, each process may mmap the shared region into a different area of its respective address space. This means that when storing pointers within the shared region, you need to store them as offsets of the start of the shared region. Unfortunately, this complicates use of atomic instructions (e.g. if you're trying to write a lock free algorithm). For example, say you have a bunch of reference counted nodes in shared memory, created by a single writer. The writer periodically atomically updates a pointer 'p' to point to a valid node with positive reference count. Readers want to atomically write to 'p' because it points to the beginning of a node (a struct) whose first element is a reference count. Since p always points to a valid node, incrementing the ref count is safe, and makes it safe to dereference 'p' and access other members. However, this all only works when everything is in the same address space. If the nodes and the 'p' pointer are stored in shared memory, then clients suffer a race condition:
x = read p
y = x + offset
Increment refcount at y
During step 2, p may change and x may no longer point to a valid node. The only workaround I can think of is somehow forcing all processes to agree on where to map the shared memory, so that real pointers rather than offsets can be stored in the mmap'd region. Is there any way to do that? I see MAP_FIXED in the mmap documentation, but I don't know how I could pick an address that would be safe.
Edit: Using inline assembly and the 'lock' prefix on x86 maybe it's possible to build a "increment ptr X with offset Y by value Z"? Equivalent options on other architectures? Haven't written a lot of assembly, don't know if the needed instructions exist.
At a low level, an x86 atomic instruction can do all three steps at once:
x = read p
y = x + offset
Increment refcount at y
//
mov edi, Destination
mov edx, DataOffset
mov ecx, NewData
#Repeat:
mov eax, [edi + edx] //load OldData
//Here you can also increment eax and save to [edi + edx]
lock cmpxchg dword ptr [edi + edx], ecx
jnz #Repeat
//
This is trivial on a UNIX system; just use the shared memory functions:
shmget, shmat, shmctl, shmdt
void *shmat(int shmid, const void *shmaddr, int shmflg);
shmat() attaches the shared memory
segment identified by shmid to the
address space of the calling process.
The attaching address is specified by
shmaddr with one of the following
criteria:
If shmaddr is NULL, the system chooses
a suitable (unused) address at which
to attach the segment.
Just specify your own address here; e.g. 0x20000000000
If you shmget() using the same key and size in every process, you will get the same shared memory segment. If you shmat() at the same address, the virtual addresses will be the same in all processes. The kernel doesn't care what address range you use, as long as it doesn't conflict with wherever it normally assigns things. (If you leave out the address, you can see the general region that it likes to put things; also, check addresses on the stack and returned from malloc() / new[] .)
On Linux, make sure root sets SHMMAX in /proc/sys/kernel/shmmax to a large enough number to accommodate your shared memory segments (default is 32MB).
As for atomic operations, you can get them all from the Linux kernel source, e.g.
include/asm-x86/atomic_64.h
/*
* Make sure gcc doesn't try to be clever and move things around
* on us. We need to use _exactly_ the address the user gave us,
* not some alias that contains the same information.
*/
typedef struct {
int counter;
} atomic_t;
/**
* atomic_read - read atomic variable
* @v: pointer of type atomic_t
*
* Atomically reads the value of @v.
*/
#define atomic_read(v) ((v)->counter)
/**
* atomic_set - set atomic variable
* @v: pointer of type atomic_t
* @i: required value
*
* Atomically sets the value of @v to @i.
*/
#define atomic_set(v, i) (((v)->counter) = (i))
/**
* atomic_add - add integer to atomic variable
* @i: integer value to add
* @v: pointer of type atomic_t
*
* Atomically adds @i to @v.
*/
static inline void atomic_add(int i, atomic_t *v)
{
asm volatile(LOCK_PREFIX "addl %1,%0"
: "=m" (v->counter)
: "ir" (i), "m" (v->counter));
}
64-bit version:
typedef struct {
long counter;
} atomic64_t;
/**
* atomic64_add - add integer to atomic64 variable
* @i: integer value to add
* @v: pointer to type atomic64_t
*
* Atomically adds @i to @v.
*/
static inline void atomic64_add(long i, atomic64_t *v)
{
asm volatile(LOCK_PREFIX "addq %1,%0"
: "=m" (v->counter)
: "er" (i), "m" (v->counter));
}
We have code that's similar to your problem description. We use a memory-mapped file, offsets, and file locking. We haven't found an alternative.
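A minimal sketch of the offset approach, assuming a hypothetical node layout in which links are stored as byte offsets from the start of the region and offset 0 is reserved to mean "no node". Each process converts offsets to pointers with its own mapping base, so the stored data stays valid wherever the region is mapped. Locking is left out here.

```c
#include <stddef.h>

/* Hypothetical node: the link is a byte offset from the region start. */
struct shm_node {
    int    refcount;
    size_t next_off;   /* 0 means "no next node" */
};

/* Offset to pointer, using this process's mapping base. */
struct shm_node *shm_resolve(void *base, size_t off)
{
    return off == 0 ? NULL : (struct shm_node *)((char *)base + off);
}

/* Pointer to offset, for storing a link inside the shared region. */
size_t shm_offset_of(const void *base, const struct shm_node *n)
{
    return n == NULL ? 0 : (size_t)((const char *)n - (const char *)base);
}
```

Because offset 0 is the null marker in this sketch, the first bytes of the region would hold a header rather than a node.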
You shouldn't be afraid to make up an address at random, because the kernel will just reject addresses it doesn't like (ones that conflict). See my shmat() answer above, using 0x20000000000
With mmap:
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
If addr is not NULL, then the kernel
takes it as a hint about where to
place the mapping; on Linux, the
mapping will be created at the next
higher page boundary. The address of
the new mapping is returned as the
result of the call.
The flags argument determines whether
updates to the mapping are visible to
other processes mapping the same
region, and whether updates are
carried through to the underlying
file. This behavior is determined by
including exactly one of the following
values in flags:
MAP_SHARED Share this mapping.
Updates to the mapping are visible to
other processes that map this
file, and are carried through to the
underlying file. The file may not
actually be updated until msync(2) or
munmap() is called.
ERRORS
EINVAL We don’t like addr, length, or
offset (e.g., they are too large, or
not aligned on a page boundary).
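Since addr is only a hint unless MAP_FIXED is given, a caller that wants every process to use the same address has to check what mmap actually returned. The sketch below shows that check; the helper name is hypothetical, and it uses an anonymous shared mapping only to stay self-contained, whereas unrelated processes would pass a file descriptor for a common file or shared memory object instead.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS under strict -std modes */
#include <stddef.h>
#include <sys/mman.h>

/* Asks for len bytes of shared memory at hint. Stores the mapping in *out
 * and returns 1 if the kernel placed it exactly at hint, 0 if it chose a
 * different address, and -1 if the mapping failed. */
int map_at_hint(void *hint, size_t len, void **out)
{
    void *p = mmap(hint, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    *out = p;
    return p == hint;
}
```

A return value of 0 means real pointers stored in the region would not be valid in this process, so the caller would have to unmap and try another address or fall back to offsets.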
Adding the offset to the pointer does not create the potential for a race; the race already exists. Since neither ARM nor x86 (at least) can atomically read a pointer and then access the memory it refers to, you need to protect the pointer access with a lock regardless of whether you add an offset.

Resources