Thread local storage does not work using threads created with clone() - linux

The program below creates ten threads using Linux's clone() system call. The static variable tls has the C11 thread_local attribute. The threads execute the child_func function, which just increments the tls variable and stores the incremented value in the location pointed to by the argument arg. The incremented values are stored, one for each thread, in the array tlsvals. The main function waits for the threads to finish and then prints the values of the ten instances of the tls variable. (I have skipped error checking and the freeing of the allocated stacks in this version of the code, to keep the example brief.)
#define _GNU_SOURCE
#include <sched.h>
#include <sys/wait.h>
#include <stdio.h>
#include <stdlib.h>
#include <threads.h> /* For thread_local. */
#define NTHREADS 10
static thread_local int tls = 0;
static int child_func(void *arg)
{
tls++;
*((int *)arg) = tls;
return 0;
}
int main(int argc, char** argv)
{
int tlsvals[NTHREADS];
for (int n = 0; n < NTHREADS; n++) {
const int STACK_SIZE = 65536;
if (clone(child_func, malloc(STACK_SIZE) + STACK_SIZE, CLONE_VM | SIGCHLD, tlsvals+n) == -1) {
perror("clone");
exit(1);
}
}
int ret, status;
while ((ret = wait(&status)) != -1)
;
for (int *p = tlsvals; p < tlsvals + NTHREADS; p++)
printf("The value of tls is %d\n", *p);
return 0;
}
Since the tls variable is marked thread_local, I expected the program to print ten lines of The value of tls is 1, since each thread increments the variable once from its initial zero value. Instead I get the following output:
The value of tls is 1
The value of tls is 3
The value of tls is 2
The value of tls is 4
The value of tls is 5
The value of tls is 6
The value of tls is 7
The value of tls is 8
The value of tls is 9
The value of tls is 10
So it seems the tls variable is not thread-local at all, but shared by all the threads, and each thread increments the same variable.
The code was compiled with GCC version 9.1.0 on x86_64 using the following command line:
gcc -O2 -Wall -std=c11 tfoo.c -o tfooc
I also tried using the GCC specific __thread attribute instead of thread_local, with the same result.
Looking at the assembly produced by GCC, I can see that the variable is accessed via the %fs register, as it should be:
child_func:
.LFB24:
.cfi_startproc
movl %fs:tls@tpoff, %eax
addl $1, %eax
movl %eax, %fs:tls@tpoff
movl %eax, (%rdi)
xorl %eax, %eax
ret
.cfi_endproc
What am I doing wrong, or is there a bug in GCC's implementation of thread_local?

Related

What caused the performance degradation when reading a local modified cache line with concurrent readers accessing it?

Assume that multiple threads access the same cache line in parallel. One of them (the writer) repeatedly writes to that cache line and reads from it. The other threads (the readers) only read from it, repeatedly. I understand that the readers suffer a performance degradation: under the invalidation-based MESI protocol, the readers' copies of the line must be invalidated each time the writer writes to it, which happens frequently. But I would expect the writer's own read of that cache line to be fast, because a local write does not invalidate the writer's own copy.
However, the strange thing is that when I run this experiment on a dual-socket machine with two Intel Xeon Scalable Gold 5220R processors (24 cores each) running at 2.20GHz, the writer's read of that cache line becomes a performance bottleneck.
This is my test program (compiled with gcc 8.4.0, -O2):
#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <sched.h>
#include <unistd.h>
#include <sys/syscall.h>
#define CACHELINE_SIZE 64
volatile struct {
/* cacheline 1 */
size_t x, y;
char padding[CACHELINE_SIZE - 2 * sizeof(size_t)];
/* cacheline 2 */
size_t p, q;
} __attribute__((aligned(CACHELINE_SIZE))) data;
static inline void bind_core(int core) {
cpu_set_t mask;
CPU_ZERO(&mask);
CPU_SET(core, &mask);
if ((sched_setaffinity(0, sizeof(cpu_set_t), &mask)) != 0) {
perror("bind core failed\n");
}
}
#define smp_mb() asm volatile("lock; addl $0,-4(%%rsp)" ::: "memory", "cc")
void *writer_work(void *arg) {
long id = (long) arg;
int i;
bind_core(id);
printf("writer tid: %ld\n", syscall(SYS_gettid));
while (1) {
/* read after write */
data.x = 1;
data.y;
for (i = 0; i < 50; i++) __asm__("nop"); // to highlight bottleneck
}
}
void *reader_work(void *arg) {
long id = (long) arg;
bind_core(id);
while (1) {
/* read */
data.y;
}
}
#define NR_THREAD 48
int main() {
pthread_t threads[NR_THREAD];
int i;
printf("%p %p\n", &data.x, &data.p);
data.x = data.y = data.p = data.q = 0;
pthread_create(&threads[0], NULL, writer_work, 0);
for (i = 1; i < NR_THREAD; i++) {
pthread_create(&threads[i], NULL, reader_work, (void *)(long)i); /* pass the core id through the void * argument */
}
for (i = 0; i < NR_THREAD; i++) {
pthread_join(threads[i], NULL);
}
return 0;
}
I use perf record -t <tid> to collect the cycles event for the writer thread, and perf annotate writer_work to show the per-instruction breakdown within writer_work:
: while (1) {
: /* read after write */
: data.x = 1;
0.20 : a50: movq $0x1,0x200625(%rip) # 201080 <data>
: data.y;
94.40 : a5b: mov 0x200626(%rip),%rax # 201088 <data+0x8>
0.03 : a62: mov $0x32,%eax
0.00 : a67: nopw 0x0(%rax,%rax,1)
: for (i = 0; i < 50; i++) __asm__("nop");
0.03 : a70: nop
0.03 : a71: sub $0x1,%eax
5.17 : a74: jne a70 <writer_work+0x50>
0.15 : a76: jmp a50 <writer_work+0x30>
It seems that the load instruction of data.y becomes the bottleneck.
First, I think the performance degradation is related to the cache line modification, because when I comment out the writer's store, the perf result indicates that the writer's read is no longer the bottleneck. Second, I think it has something to do with the concurrent readers, because when I comment out the readers' read, perf no longer blames the writer's read. Concurrent readers accessing the same cache line also slow down the writer, because when I change sum += data.y to sum += data.z on the writer side, which leads to a load from another cache line, the count decreases.
There is another possibility, suggested by Peter Cordes: that whatever instruction follows the store gets the blame. However, when I move the nop loop before the load, perf still blames the load instruction.
: writer_work():
: while (1) {
: /* read after write */
: data.x = 1;
0.00 : a50: movq $0x1,0x200625(%rip) # 201080 <data>
0.03 : a5b: mov $0x32,%eax
: for (i = 0; i < 50; i++) __asm__("nop");
0.03 : a60: nop
0.09 : a61: sub $0x1,%eax
6.24 : a64: jne a60 <writer_work+0x40>
: data.y;
93.60 : a66: mov 0x20061b(%rip),%rax # 201088 <data+0x8>
: data.x = 1;
0.02 : a6d: jmp a50 <writer_work+0x30>
So my question is, what caused the performance degradation when reading a local modified cache line with concurrent readers accessing it? Any help would be appreciated!

symbol without any name and completed.7392 symbol in .bss section

In my sample C program, compiled with gcc, .bss section has an index [24], as shown by readelf -S.
When I try to see the things stored in .bss, by running
readelf -s ./pointer | grep 24, I get
   Num:    Value          Size Type    Bind   Vis      Ndx Name
    24: 00000000000040a0     0 SECTION LOCAL  DEFAULT   24
    31: 00000000000040a8     1 OBJECT  LOCAL  DEFAULT   24 completed.7392
    54: 00000000000040b0     8 OBJECT  GLOBAL DEFAULT   24 label
    68: 00000000000040c0     0 NOTYPE  GLOBAL DEFAULT   24 _end
    70: 00000000000040b8     4 OBJECT  GLOBAL DEFAULT   24 i
    71: 0000000000004090     0 NOTYPE  GLOBAL DEFAULT   24 __bss_start
    79: 00000000000040ac     4 OBJECT  GLOBAL DEFAULT   24 err
    81: 00000000000040a0     8 OBJECT  GLOBAL DEFAULT   24 stderr@@GLIBC_2.2.5
size ./pointer gives me
   text    data     bss     dec     hex filename
   3196     680      32    3908     f44 ./pointer
What is the symbol without any name, and what is the symbol named completed.7392?
Also, why don't the symbol sizes add up to the 32 bytes reported by size? (They sum to 25.)
As a side question, where are the stdin and stdout symbols? I can find only stderr, and that is in the .bss section.
program source attached below. compiled with gcc version 9.2
/*
* A program that will read and print printable characters in its memory given a memory address
* until it segfaults
*/
#define _GNU_SOURCE /* Bring REG_XXX names from /usr/include/sys/ucontext.h */
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <signal.h>
#include <ucontext.h>
int i,err=0;
void sighandler(int signum);
void* label;
void readmem(){
long loc;
i=0;
err=0;
printf("enter mem location:");
scanf("%lx",&loc);
printf("--dump begin--\n");
char * addr = (void *) loc;
char ch;
label=&&l;
/* was kept there so that I can find how many bytes to increment rip to recover from segfault
address can be found using gdb also
*/
while(1){
ch=addr[i];
l:
printf("%c",ch);
i++;
if ( err == 1)
break;
// printf("%d\n",i);
}
}
static void sigaction_segv(int signal, siginfo_t *si, void *arg)
{
ucontext_t *ctx = (ucontext_t *)arg;
/* We are on linux x86, the returning IP is stored in RIP (64bit) or EIP (32bit).
In this example, the length of the offending instruction is 6 bytes.
So we skip the offender !
&&l will return address pointed by label l
(gdb) disass readmem
...
0x0000555555555284 <+139>: mov -0x10(%rbp),%rax
0x0000555555555288 <+143>: add %rdx,%rax
=> 0x000055555555528b <+146>: movzbl (%rax),%eax -> rip on segfault
0x000055555555528e <+149>: mov %al,-0x19(%rbp)
0x0000555555555291 <+152>: movsbl -0x19(%rbp),%eax
0x0000555555555295 <+156>: mov %eax,%edi
0x0000555555555297 <+158>: callq 0x555555555030 <putchar@plt>
...
>>
(gdb) p (void*) label
$1 = (void *) 0x555555555291 <readmem+152> ->address of next instruction
>>
we need to go to <readmem+152> ie, next instruction
so we add decimal 6 to rip in sighandler
*/
#if __WORDSIZE == 64
printf("\nCaught SIGSEGV, addr %p, RIP 0x%lx\n",si->si_addr,ctx->uc_mcontext.gregs[REG_RIP]);
ctx->uc_mcontext.gregs[REG_RIP] += 6;
#else
printf("Caught SIGSEGV, addr %p, EIP 0x%x\n",si->si_addr,ctx->uc_mcontext.gregs[REG_EIP]);
ctx->uc_mcontext.gregs[REG_EIP] += 6;
#endif
err=1;
printf("no of bytes read:%d\n",i);
}
int main () {
// 0x0 is hex literal that defaults to signed integer
// here we are casting it to a void pointer
// and then assigning it to a value declared to be a void pointer
// this is the correct way to create an arbitrary pointer in C
struct sigaction sa;
memset(&sa, 0, sizeof(sa));
sigemptyset(&sa.sa_mask);
sa.sa_sigaction = sigaction_segv;
sa.sa_flags = SA_SIGINFO;
if (sigaction(SIGSEGV, &sa, NULL) == -1) {
fprintf(stderr, "failed to setup SIGSEGV handler\n");
exit(1);
}
char c[25];
sprintf(c,"cat /proc/%d/maps",getpid());
system(c);
while(1){
readmem();
}
}

VMX performance issue with rdtsc (no rdtsc exiting, using rdtsc offsetting)

I am working on a Linux kernel module (a VMM) to test Intel VMX by running a self-made VM (the VM starts in real mode, then switches to 32-bit protected mode with paging enabled).
The VMM is configured NOT to use RDTSC exiting, and to use TSC offsetting.
The VM then runs rdtsc to check the performance, as shown below.
static void cpuid(uint32_t code, uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx) {
__asm__ volatile(
"cpuid"
:"=a"(*eax),"=b"(*ebx),"=c"(*ecx), "=d"(*edx)
:"a"(code)
:"cc");
}
uint64_t rdtsc(void)
{
uint32_t lo, hi;
// RDTSC copies contents of 64-bit TSC into EDX:EAX
asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
return (uint64_t)hi << 32 | lo;
}
void i386mode_tests(void)
{
u32 eax, ebx, ecx, edx;
u32 i = 0;
asm ("mov %%cr0, %%eax\n"
"mov %%eax, %0 \n" : "=m" (eax) : :);
my_printf("Guest CR0 = 0x%x\n", eax);
cpuid(0x80000001, &eax, &ebx, &ecx, &edx);
vm_tsc[0]= rdtsc();
for (i = 0; i < 100; i ++) {
rdtsc();
}
vm_tsc[1]= rdtsc();
my_printf("Rdtsc takes %d\n", vm_tsc[1] - vm_tsc[0]);
}
The output is something like this,
Guest CR0 = 0x80050033
Rdtsc takes 2742
For comparison, I wrote a host application that does the same thing as the code above:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
static void cpuid(uint32_t code, uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx) {
__asm__ volatile(
"cpuid"
:"=a"(*eax),"=b"(*ebx),"=c"(*ecx), "=d"(*edx)
:"a"(code)
:"cc");
}
uint64_t rdtsc(void)
{
uint32_t lo, hi;
// RDTSC copies contents of 64-bit TSC into EDX:EAX
asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
return (uint64_t)hi << 32 | lo;
}
int main(int argc, char **argv)
{
uint64_t vm_tsc[2];
uint32_t eax, ebx, ecx, edx, i;
cpuid(0x80000001, &eax, &ebx, &ecx, &edx);
vm_tsc[0]= rdtsc();
for (i = 0; i < 100; i ++) {
rdtsc();
}
vm_tsc[1]= rdtsc();
printf("Rdtsc takes %llu\n", (unsigned long long)(vm_tsc[1] - vm_tsc[0])); /* unsigned format for the uint64_t difference */
return 0;
}
It outputs the following:
Rdtsc takes 2325
Running the two programs above for 40 iterations each gives the following averages:
avag(VM) = 3188.000000
avag(host) = 2331.000000
The performance difference between running the code in the VM and on the host can NOT be ignored, and it is NOT expected.
My understanding is that with TSC offsetting and no RDTSC exiting, there should be little difference in the cost of rdtsc between the VM and the host.
Here are VMCS fields,
0xA501E97E = control_VMX_cpu_based
0xFFFFFFFFFFFFFFF0 = control_CR0_mask
0x0000000080050033 = control_CR0_shadow
In the last level of EPT PTEs, bit[5:3] = 6 (Write Back), bit[6] = 1. EPTP[2:0] = 6 (Write Back)
I tested on bare metal and in VMware, and got similar results.
I am wondering if there is anything I missed in this case.

what's the difference between gcc __sync_bool_compare_and_swap and cmpxchg?

To use CAS, GCC provides some useful builtins such as
__sync_bool_compare_and_swap
But we can also use inline asm with cmpxchg, like this:
bool ret;
__asm__ __volatile__(
"lock cmpxchg16b %1;\n"
"sete %0;\n"
:"=m"(ret),"+m" (*(volatile pointer_t *) (addr))
:"a" (old_value.ptr), "d" (old_value.tag), "b" (new_value.ptr), "c" (new_value.tag));
return ret;
I grepped the source code of GCC 4.6.3 and found that __sync_bool_compare_and_swap is implemented using
typedef int (__kernel_cmpxchg_t) (int oldval, int newval, int *ptr);
#define __kernel_cmpxchg (*(__kernel_cmpxchg_t *) 0xffff0fc0)
It seems that 0xffff0fc0 is the address of a kernel helper function,
but in GCC 4.1.2 there is no code like __kernel_cmpxchg_t, and I can't find the implementation of __sync_bool_compare_and_swap.
So what's the difference between __sync_bool_compare_and_swap and cmpxchg?
Is __sync_bool_compare_and_swap implemented with cmpxchg?
And is the kernel helper function __kernel_cmpxchg_t implemented with cmpxchg?
Thanks!
I think the __kernel_cmpxchg is a fallback which Linux makes available on some architectures which don't have native hardware support for CAS. E.g. ARMv5 or something like that.
Usually, GCC expands the __sync_* builtins inline. Unless you're really interested in GCC internals, an easier way to find out what it does is to write a simple C example and look at the asm the compiler generates.
Consider
#include <stdbool.h>
bool my_cmpchg(int *ptr, int oldval, int newval)
{
return __sync_bool_compare_and_swap(ptr, oldval, newval);
}
Compiling this on an x86_64 Linux machine with GCC 4.4 the following asm is generated:
my_cmpchg:
.LFB0:
.cfi_startproc
movl %esi, %eax
lock cmpxchgl %edx, (%rdi)
sete %al
ret
.cfi_endproc

Pointer initialization doubt

We can initialize a character pointer like this in C.
char *c="test";
Here c points to the first character (t).
But when I write code like the one below, it gives a segmentation fault.
#include<stdio.h>
#include<stdlib.h>
main()
{
int *i=0;
printf("%d",*i);
}
Also, when I write
#include<stdio.h>
#include<stdlib.h>
main()
{
int *i;
i=(int *)malloc(2);
*i=0;
printf("%d",*i);
}
It worked (gave output 0).
When I used malloc(0), it also worked (gave output 0).
Please tell me what is happening.
Your first example is seg faulting because you are trying to de-reference a null pointer which you have created with the line:
int *i=0;
You can't de-reference a pointer that doesn't point to anything and expect good things to happen. =)
The second code segment appears to work because malloc has given your pointer memory that you may de-reference. It is still incorrect, though: typically an int is 4 bytes and you have only requested 2. Both *i=0 and the read in printf access all 4 bytes, so 2 of them lie outside the block you asked for, which is undefined behavior. You got 0 back because the same 4 bytes were written and then read. Nothing crashed, probably because allocators commonly round small requests up to a larger minimum block, but the write could just as well corrupt whatever sits next to your allocation. The same goes for malloc(0), which may return either NULL or a pointer that must not be de-referenced at all. You could get strange behavior from code like this, so you should malloc the size of memory needed for the type you are trying to use/point at.
(i.e. int *i = (int *) malloc(sizeof(int)); )
Once you have the pointer pointing at memory that is of the correct size, you can then set the values as such:
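#include <stdlib.h>
#include <stdio.h>
int main (int argc, char *argv[])
{
int *i = (int *)malloc(sizeof(int)); /* room for exactly one int */
if (i == NULL) /* malloc can fail */
return 1;
*i = 25;
printf("i = %d\n",*i);
*i = 12;
printf("i = %d\n",*i);
free(i); /* release the allocation when done */
return 0;
}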
Edit based on comment:
A pointer points to memory, not to values. When initializing char *ptr="test"; you're not assigning the value of "test", you're assigning the memory address of where the compiler places "test", which is in your process's data segment and is read-only. If you tried to modify the string "test", your program would likely seg-fault.
What you need to realize about a char * is that it points at a single character (i.e. the first) in the string. When you de-reference the char *, you see one character and one character only. C uses null-terminated strings, and notice that you do not de-reference ptr when printing the string with printf: you pass it the pointer itself, which points at just the first character. How this is displayed depends on the format passed to printf. With the '%c' format and the de-referenced pointer (*ptr), printf prints the single character ptr points at; with the format '%p' it prints the address that ptr points to. To get the entire string, you pass '%s' as the format. This makes printf start at the pointer you passed in and read each successive byte until a null is reached. Below is some code that demonstrates these.
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
int main (int argc, char *argv[])
{
// Initialize to data segment/read only string
char *ptr = "test";
printf("ptr points at = %p\n", ptr); // Prints the address ptr points to
printf("ptr dereferenced = %c\n", *ptr); // Prints the value at address ptr
printf("ptr value = %s\n", ptr); // Prints the string of chars pointed to by ptr
// Uncomment this to see bad behavior!
// ptr[1] = 'E'; // SEG FAULT -> Attempting to modify read-only memory
printf("--------------------\n");
// Use memory you have allocated explicitly and can modify
ptr = malloc(10);
strncpy(ptr, "foo", 10);
printf("ptr now points at = %p\n", ptr); // Prints the address ptr points to
printf("ptr dereferenced = %c\n", *ptr); // Prints the value at address ptr
printf("ptr value = %s\n", ptr); // Prints the string of chars pointed to by ptr
ptr[1] = 'F'; // Change the second char in string to F
printf("ptr value (mod) = %s\n", ptr);
return 0;
}

Resources