Why is accessing pthread keys' sequence number not synchronized in glibc's NPTL implementation?

Recently, while looking into how thread-local storage is implemented in glibc, I found the following code, which implements the API pthread_key_create():
int
__pthread_key_create (key, destr)
     pthread_key_t *key;
     void (*destr) (void *);
{
  /* Find a slot in __pthread_keys which is unused.  */
  for (size_t cnt = 0; cnt < PTHREAD_KEYS_MAX; ++cnt)
    {
      uintptr_t seq = __pthread_keys[cnt].seq;

      if (KEY_UNUSED (seq) && KEY_USABLE (seq)
          /* We found an unused slot.  Try to allocate it.  */
          && ! atomic_compare_and_exchange_bool_acq (&__pthread_keys[cnt].seq,
                                                     seq + 1, seq))
        {
          /* Remember the destructor.  */
          __pthread_keys[cnt].destr = destr;

          /* Return the key to the caller.  */
          *key = cnt;

          /* The call succeeded.  */
          return 0;
        }
    }

  return EAGAIN;
}
__pthread_keys is a global array accessed by all threads. I don't understand why the read of its member seq is not synchronized, as in the following line:
uintptr_t seq = __pthread_keys[cnt].seq;
although it is synchronized when it is modified later.
FYI, __pthread_keys is an array of type struct pthread_key_struct, which is defined as follows:
/* Thread-local data handling.  */
struct pthread_key_struct
{
  /* Sequence numbers.  Even numbers indicated vacant entries.  Note
     that zero is even.  We use uintptr_t to not require padding on
     32- and 64-bit machines.  On 64-bit machines it helps to avoid
     wrapping, too.  */
  uintptr_t seq;

  /* Destructor for the data.  */
  void (*destr) (void *);
};
Thanks in advance.

In this case, the loop can avoid an expensive lock acquisition. The atomic compare-and-swap done later (atomic_compare_and_exchange_bool_acq) ensures that only one thread can successfully increment the sequence value and return that key to the caller. Other threads that read the same value keep looping, since the CAS can succeed for only a single thread.
This works because the sequence value alternates between even (vacant) and odd (in use). Incrementing an even value to an odd one prevents other threads from acquiring the slot.
A plain read typically costs fewer cycles than a CAS instruction, so it makes sense to peek at the value before attempting the CAS.
Many wait-free and lock-free algorithms exploit the CAS instruction in this way to achieve low-overhead synchronization.
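To make the pattern concrete, here is a minimal sketch of the same even/odd slot-allocation scheme using C11 atomics. The names (slot_alloc, SLOT_MAX, struct slot) are invented for the example, and it omits glibc's KEY_USABLE wrap-around check:
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define SLOT_MAX 128                    /* hypothetical table size */

struct slot
{
  atomic_uintptr_t seq;                 /* even = vacant, odd = in use */
  void (*destr) (void *);
};

static struct slot slots[SLOT_MAX];

/* Hypothetical allocator following the same even/odd CAS scheme. */
static int
slot_alloc (size_t *out, void (*destr) (void *))
{
  for (size_t cnt = 0; cnt < SLOT_MAX; ++cnt)
    {
      /* Cheap, unsynchronized peek at the current sequence number. */
      uintptr_t seq = atomic_load_explicit (&slots[cnt].seq,
                                            memory_order_relaxed);

      if ((seq & 1) == 0                /* even: the slot looks vacant */
          /* Only one thread can win the transition seq -> seq + 1. */
          && atomic_compare_exchange_strong (&slots[cnt].seq, &seq, seq + 1))
        {
          slots[cnt].destr = destr;
          *out = cnt;
          return 0;
        }
    }
  return EAGAIN;
}
A thread whose peek was stale simply fails the compare-exchange and moves on, which is exactly why the initial read needs no synchronization.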

Related

Why don't multiple threads have to share a lock to call mmap like they do malloc/calloc/sbrk?

I'm working with ptmalloc, and something interesting I came across is what happens when an arena runs out of available chunks (and the top chunk is not large enough): it has to either extend the arena using sbrk() or allocate a non-contiguous region using mmap(). What particularly stood out to me is that in order to allocate more memory using sbrk(), a lock has to be acquired before the call (in addition to the lock previously obtained to gain sole possession of the current arena). However, no lock needs to be acquired before calling mmap(). I have included the relevant parts of the sys_alloc() function from the malloc.c file of the ptmalloc implementation below for reference:
Call to extend arena using sbrk():
if (HAVE_MORECORE && tbase == CMFAIL) {  /* Try noncontiguous MORECORE */
    size_t asize = granularity_align(nb + TOP_FOOT_SIZE + SIZE_T_ONE);
    if (asize < HALF_MAX_SIZE_T) {
        char* br = CMFAIL;
        char* end = CMFAIL;
        ACQUIRE_MORECORE_LOCK();    /* LOCK */
        br = (char*)(CALL_MORECORE(asize));
        end = (char*)(CALL_MORECORE(0));
        RELEASE_MORECORE_LOCK();    /* UNLOCK */
        if (br != CMFAIL && end != CMFAIL && br < end) {
            size_t ssize = end - br;
            if (ssize > nb + TOP_FOOT_SIZE) {
                tbase = br;
                tsize = ssize;
            }
        }
    }
}
Call to extend arena using mmap():
if (HAVE_MMAP && tbase == CMFAIL) {  /* Try MMAP */
    size_t req = nb + TOP_FOOT_SIZE + SIZE_T_ONE;
    size_t rsize = granularity_align(req);
    if (rsize > nb) { /* Fail if wraps around zero */
        char* mp = (char*)(CALL_MMAP(rsize));
        if (mp != CMFAIL) {
            tbase = mp;
            tsize = rsize;
            mmap_flag = IS_MMAPPED_BIT;
        }
    }
}
Any help understanding why this is able to work even with multiple threads that have the exact same memory pattern (and thus have to extend their arenas at the same time) without having to use locks (i.e., how mmap() is guaranteed to return distinct addresses, even if called simultaneously with a NULL suggested address) would be greatly appreciated.
In the code snippet using sbrk(), the process-wide heap area is being grown. Two calls are issued: the first extends the heap area by asize bytes, and the second fetches the resulting address of the new top of the heap (the so-called program break). The heap area is shared by all the threads of the process, and its current top is in effect a global variable, so it is protected by a mutex whenever a thread modifies it (shrink/grow operations).
In the code snippet using mmap(), the current thread allocates a memory-mapped area for itself. The resulting address is returned only to the calling thread, so no mutual exclusion is necessary from the point of view of ptmalloc's global data structures, because they are not modified. The IS_MMAPPED_BIT flag is set in the header of the allocated region to tell ptmalloc that this is a memory-mapped region when it is later asked to free it. As for mmap() internals, the mutual exclusion on the process address space is managed inside the kernel.
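As a toy demonstration of that last point (this is not ptmalloc code), several threads can call mmap() concurrently with a NULL hint and each receives its own, non-overlapping region; the kernel serializes the address-space bookkeeping internally:
#define _DEFAULT_SOURCE
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

#define REGION_SIZE (1 << 20)           /* 1 MiB per thread, arbitrary */

/* Each thread asks the kernel for its own anonymous mapping.  No
   user-space lock is needed and every thread gets a distinct range. */
static void *worker(void *arg)
{
    long id = (long) (intptr_t) arg;
    void *p = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        perror("mmap");
    else
        printf("thread %ld got region at %p\n", id, p);
    return p;
}

int main(void)                          /* build with: gcc -pthread demo.c */
{
    pthread_t t[4];
    for (long i = 0; i < 4; ++i)
        pthread_create(&t[i], NULL, worker, (void *) (intptr_t) i);
    for (int i = 0; i < 4; ++i)
        pthread_join(t[i], NULL);
    return 0;
}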

Where can PTHRED_MUTEX_ADAPTIVE_NP be specified and how does it work?

I found that there's a macro called PTHRED_MUTEX_ADAPTIVE_NP which is somehow given as a value to a mutex so that the mutex does adaptive spinning, meaning that it spins for roughly as long as an immediate wakeup through the kernel would take. But how do I apply this configuration macro to a thread?
And since I've developed an improved shared readers-writer lock (it needs only one atomic operation in the best case, in contrast to the three operations of the Wikipedia solution) with relative writer priority (further readers are stalled when there is a writer, while the readers that arrived earlier are allowed to proceed), which could also make use of adaptive spinning: how is the number of spinning cycles calculated?
I found that there's a macro called PTHRED_MUTEX_ADAPTIVE_NP
Some pthreads implementations provide a macro PTHREAD_MUTEX_ADAPTIVE_NP (note spelling) that is one of the possible values of the kind_np mutex attribute, but neither that attribute nor the macro are standard. It looks like at least BSD and AIX have them, or at least did at one time, but this is not something you should be using in new code.
But how do I utilize this configuration-macro to a thread ?
You don't. Even if you are using a pthreads implementation that supports it, this is the value of a mutex attribute, not a thread attribute. You obtain a mutex with that attribute value by explicitly requesting it when you initialize the mutex. It would look something like this:
pthread_mutexattr_t attr;
pthread_mutex_t mutex;
int rval;
// Return-value checks omitted for brevity and clarity
rval = pthread_mutexattr_init(&attr);
rval = pthread_mutexattr_setkind_np(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
rval = pthread_mutex_init(&mutex, &attr);
There are other mutex attributes that you can set in analogous ways, which is one of the reasons I wrote this answer. Although you should not be using the kind_np attribute, you can follow this general model for other mutex attributes. There are also thread attributes, which work similarly.
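For comparison, here is the same pattern with a standard attribute, using the portable pthread_mutexattr_settype() interface (PTHREAD_MUTEX_ERRORCHECK is just one example of a standard mutex type). This is a sketch with the return-value checks collapsed:
#include <pthread.h>

int make_errorcheck_mutex(pthread_mutex_t *mutex)
{
    pthread_mutexattr_t attr;
    int rval;

    rval  = pthread_mutexattr_init(&attr);
    /* pthread_mutexattr_settype() is the portable counterpart of the
       non-portable pthread_mutexattr_setkind_np() call shown above. */
    rval |= pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    rval |= pthread_mutex_init(mutex, &attr);
    /* The attribute object may be destroyed once the mutex is initialized;
       the mutex keeps its own copy of the settings. */
    pthread_mutexattr_destroy(&attr);
    return rval;
}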
I found the relevant code in glibc. This is the "adaptive" mutex-locking path of pthread_mutex_lock in glibc 2.31:
else if (__builtin_expect (PTHREAD_MUTEX_TYPE (mutex)
                           == PTHREAD_MUTEX_ADAPTIVE_NP, 1))
  {
    if (! __is_smp)
      goto simple;

    if (LLL_MUTEX_TRYLOCK (mutex) != 0)
      {
        int cnt = 0;
        int max_cnt = MIN (max_adaptive_count (),
                           mutex->__data.__spins * 2 + 10);
        do
          {
            if (cnt++ >= max_cnt)
              {
                LLL_MUTEX_LOCK (mutex);
                break;
              }
            atomic_spin_nop ();
          }
        while (LLL_MUTEX_TRYLOCK (mutex) != 0);

        mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
      }
    assert (mutex->__data.__owner == 0);
  }
So the allowed spin count for an attempt is the stored estimate doubled plus 10, capped at a system-configured maximum (1000 if there is no configuration), and after the lock has been taken, one eighth of the difference between the actual spin count and the stored estimate is added to the estimate used for the next attempt.
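In plain C the bookkeeping amounts to something like the following sketch (the names and the default of 1000 come from the description above; this is an illustration, not the glibc source):
#define MAX_ADAPTIVE_COUNT 1000                 /* default ceiling */
#define MIN(a, b) ((a) < (b) ? (a) : (b))

static int spins_estimate;                      /* plays the role of __data.__spins */

/* How many times we are allowed to spin on this attempt. */
static int spin_budget(void)
{
    return MIN(MAX_ADAPTIVE_COUNT, spins_estimate * 2 + 10);
}

/* Called after the lock has been taken; cnt is the number of spins
   actually performed.  Exponential moving average with weight 1/8. */
static void spin_feedback(int cnt)
{
    spins_estimate += (cnt - spins_estimate) / 8;
}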

Confusing result from counting page faults in Linux

I was writing programs to count the number of page faults in a Linux system, more precisely the number of times the kernel executes the function __do_page_fault.
So I added two global variables, pfcount_at_beg and pfcount_at_end, which are each incremented once, at different locations in __do_page_fault, every time the function is executed.
To illustrate, the modified function goes as:
unsigned long pfcount_at_beg = 0;
unsigned long pfcount_at_end = 0;

static void __kprobes
__do_page_fault(...)
{
        struct vm_area_struct *vma;
        ... // VARIABLES DEFINITION
        unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

        pfcount_at_beg++; // I add THIS
        ...
        ...
        // ORIGINAL CODE OF THE FUNCTION
        ...
        pfcount_at_end++; // I add THIS
}
I expected the value of pfcount_at_end to be smaller than the value of pfcount_at_beg.
My reasoning: every time the kernel executes pfcount_at_end++, it must already have executed pfcount_at_beg++ (every call starts at the very beginning of the function), while there are many conditional returns between these two lines, so some executions increment only the first counter.
However, the result turns out to be the opposite: the value of pfcount_at_end is larger than the value of pfcount_at_beg.
I use printk to print these kernel variables from a self-defined syscall, and I wrote a user-level program to call that system call.
Here is my simple syscall and user-level program:
// syscall
asmlinkage int sys_mysyscall(void)
{
        printk(KERN_INFO "total pf_at_beg %lu\ntotal pf_at_end %lu\n",
               pfcount_at_beg, pfcount_at_end);
        return 0;
}

// user-level program
#include <linux/unistd.h>
#include <sys/syscall.h>

#define __NR_mysyscall 223

int main()
{
        syscall(__NR_mysyscall);
        return 0;
}
Does anybody know what exactly is happening here?
Just now I modified the code to make pfcount_at_beg and pfcount_at_end static. However, the result did not change: the value of pfcount_at_end is still larger than the value of pfcount_at_beg.
So possibly it is caused by the increment operation not being atomic. Would it be better if I used a read-write lock?
The ++ operator is not guaranteed to be atomic, so your counters may suffer from concurrent access and end up with incorrect values. You should protect the increments with a critical section, or use the atomic_t type defined in <asm/atomic.h> and its related atomic_set() and atomic_add() functions (and many more).
Not directly connected to your issue, but a dedicated syscall for this is overkill (unless it is an exercise). A lighter solution would be a /proc entry (also an interesting exercise).
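To see why a plain ++ can lose updates (which is what lets the two counters drift apart), here is a small user-space demonstration; it is an analogy rather than kernel code, and the atomic version uses the GCC/Clang __atomic builtins:
#include <pthread.h>
#include <stdio.h>

#define NTHREADS   4
#define PER_THREAD 1000000UL

static unsigned long plain_counter;     /* incremented with ++ */
static unsigned long atomic_counter;    /* incremented atomically */

static void *worker(void *arg)
{
    (void) arg;
    for (unsigned long i = 0; i < PER_THREAD; i++) {
        plain_counter++;                /* load, add, store: updates can be lost */
        __atomic_fetch_add(&atomic_counter, 1, __ATOMIC_RELAXED);
    }
    return NULL;
}

int main(void)                          /* build with: gcc -O2 -pthread demo.c */
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);

    /* Expected: 4000000 for both.  The plain counter usually falls short. */
    printf("plain  = %lu\natomic = %lu\n", plain_counter, atomic_counter);
    return 0;
}
Inside the kernel, the equivalent fix is to declare the counters as atomic_t (or atomic_long_t) and increment them with atomic_inc()/atomic_long_inc().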

Is there something wrong with my spin lock?

Here is my implementation of a spin lock, but it seems it can not protect the critical code. Is there something wrong with my implementation?
static __inline__ int xchg_asm(int* lock, int val)
{
    int ret;
    __asm__ __volatile__(
        LOCK "movl (%1),%%eax;
        xchg (%1),%2;
        movl %%eax, %0" :"=m" (ret) :"d"(lock), "c"(val)
    );
    return ret;
}

void spin_init(spinlock_t* sl)
{
    sl->val = 0;
}

void spin_lock(spinlock_t* sl)
{
    int ret;
    do {
        ret = xchg_asm(&(sl->val), 1);
    } while (ret == 0);
}

void spin_unlock(spinlock_t* sl)
{
    xchg_asm(&(sl->val), 0);
}
Your code is equivalent to:
static __inline__ int xchg_asm(int* lock, int val) {
    int save_old_value_at_eax;
    save_old_value_at_eax = *lock;  /* with a wrong lock prefix */
    /* xchg *lock with val and discard the original value of *lock */
    return save_old_value_at_eax;   /* but it is not the real original value of *lock */
}
As you can see from this, save_old_value_at_eax is not necessarily the value that *lock holds at the moment the CPU performs the xchg. You should obtain the old/original value from the xchg instruction itself, not by reading it before the xchg. ("Not the real old/original value" means: if another CPU takes the lock after this CPU reads the value but before it performs the xchg, this CPU gets a stale old value, believes it took the lock successfully, and two CPUs end up in the critical section at the same time.) You have split a single read-modify-write instruction into three instructions, and those three instructions taken together are not atomic (even if you move the lock prefix to the xchg).
I guess you thought the lock prefix would lock the WHOLE three-instruction sequence, but the lock prefix applies only to the single instruction it is attached to (and not every instruction can take it).
And a lock prefix is not needed for xchg, even on SMP. Quoting linux_kernel_src/arch/x86/include/asm/cmpxchg.h:
/*
* Note: no "lock" prefix even on SMP: xchg always implies lock anyway.
* Since this is generally used to protect other memory information, we
* use "asm volatile" and "memory" clobbers to prevent gcc from moving
* information around.
*/
My suggestions:
DON'T REPEAT YOURSELF: use the spin lock of the Linux kernel.
DON'T REPEAT YOURSELF: use the kernel's xchg() and cmpxchg() if you really do want to implement a spin lock yourself (a corrected sketch is shown below).
Learn more about the instructions; you can also look at how the Linux kernel implements its locks.
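For reference, here is what a minimal test-and-set spin lock looks like when the whole read-modify-write is left to a single atomic exchange. This sketch uses the GCC/Clang __atomic builtins instead of hand-written assembly, and is an illustration, not the kernel's implementation:
typedef struct { volatile int val; } spinlock_t;

static inline void spin_init(spinlock_t *sl)
{
    sl->val = 0;
}

static inline void spin_lock(spinlock_t *sl)
{
    /* __atomic_exchange_n performs the whole read-modify-write atomically
       (a single xchg on x86), so the value it returns really is the value
       that was swapped out. */
    while (__atomic_exchange_n(&sl->val, 1, __ATOMIC_ACQUIRE) != 0)
        ;   /* spin while another CPU/thread holds the lock */
}

static inline void spin_unlock(spinlock_t *sl)
{
    __atomic_store_n(&sl->val, 0, __ATOMIC_RELEASE);
}
Note also that the loop condition here is the opposite of the one in the question: you keep spinning while the old value was non-zero (someone else held the lock) and stop as soon as the exchange returns 0.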

Does epoll() do its job in O(1)?

Wikipedia says: "unlike the older system calls, which operate at O(n), epoll operates in O(1) [2]" (http://en.wikipedia.org/wiki/Epoll).
However, the source code in fs/eventpoll.c of Linux 2.6.38 seems to implement it with an RB tree for searching, which is O(log N):
/*
* Search the file inside the eventpoll tree. The RB tree operations
* are protected by the "mtx" mutex, and ep_find() must be called with
* "mtx" held.
*/
static struct epitem *ep_find(struct eventpoll *ep, struct file *file, int fd)
{
In fact, I couldn't find any man page stating that the complexity of epoll() is O(1).
Why is it known as O(1)?
This makes sense once you look at where ep_find is used. I only spent a few minutes with it, and I see that ep_find is only called from epoll_ctl.
So indeed, the costly operation is performed when you add descriptors (EPOLL_CTL_ADD). But when doing the real work (epoll_wait) it isn't: you only add the descriptors at the beginning.
In conclusion, it's not enough to ask for the complexity of epoll, since there is no single epoll system call; you want the individual complexities of epoll_ctl, epoll_wait, and so on.
Other stuff
There are other reasons to avoid select and use epoll. When using select, you don't know how many descriptors need attention, so you must keep track of the biggest one and loop up to it:
rc = select(...);
/* check rc */
for (s = 0; s <= maxfd; s++) {
    if (FD_ISSET(s, &readfds)) {   /* readfds: the fd_set passed to select() */
        /* ... */
    }
}
Now with epoll it's a lot cleaner:
nfds = epoll_wait(epollfd, events, MAX_EVENTS, -1);
/* check nfds */
for (n = 0; n < nfds; ++n) {
    /* events[n].data.fd needs attention */
}
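For completeness, a minimal end-to-end skeleton (error handling trimmed; listen_fd stands for whatever already-open descriptor you want to watch):
#include <sys/epoll.h>
#include <unistd.h>

#define MAX_EVENTS 64

int event_loop(int listen_fd)
{
    struct epoll_event ev, events[MAX_EVENTS];

    int epollfd = epoll_create1(0);            /* create the epoll instance */
    if (epollfd < 0)
        return -1;

    ev.events = EPOLLIN;
    ev.data.fd = listen_fd;
    /* The RB-tree insertion (the ep_find lookup) happens here, once. */
    if (epoll_ctl(epollfd, EPOLL_CTL_ADD, listen_fd, &ev) < 0) {
        close(epollfd);
        return -1;
    }

    for (;;) {
        /* Only ready descriptors are reported; no O(maxfd) rescan. */
        int nfds = epoll_wait(epollfd, events, MAX_EVENTS, -1);
        if (nfds < 0)
            break;                             /* error (EINTR handling omitted) */
        for (int n = 0; n < nfds; ++n) {
            /* events[n].data.fd needs attention */
        }
    }

    close(epollfd);
    return 0;
}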
I think epoll_wait is O(1) with EPOLLET if you ask for one event.
And updating and adding descriptors could be amortized O(1) if a decent hash-table implementation were used.
This needs checking, and the man pages should mention complexity!
