Race condition on ticket-based ARM spinlock - linux

I found that spinlocks in the Linux kernel all use "ticket-based" spinlocks now. However, after looking at the ARM implementation, I'm confused, because the "load-add-store" operation does not look atomic at all. Please see the code below:
74 static inline void arch_spin_lock(arch_spinlock_t *lock)
75 {
76         unsigned long tmp;
77         u32 newval;
78         arch_spinlock_t lockval;
79
80         __asm__ __volatile__(
81 "1:     ldrex   %0, [%3]\n"        /* Why this load-add-store is not atomic? */
82 "       add     %1, %0, %4\n"
83 "       strex   %2, %1, [%3]\n"
84 "       teq     %2, #0\n"
85 "       bne     1b"
86         : "=&r" (lockval), "=&r" (newval), "=&r" (tmp)
87         : "r" (&lock->slock), "I" (1 << TICKET_SHIFT)
88         : "cc");
89
90         while (lockval.tickets.next != lockval.tickets.owner) {
91                 wfe();
92                 lockval.tickets.owner = ACCESS_ONCE(lock->tickets.owner);
93         }
94
95         smp_mb();
96 }
As you can see, on lines 81~83 it loads lock->slock into "lockval", increments it, and then stores it back to lock->slock.
However, I don't see anywhere that this sequence is ensured to be atomic. So it seems possible that:
Two users on different CPUs read lock->slock into their own variable "lockval" at the same time; then each increments "lockval" and stores it back.
This would leave the two users holding the same ticket "number", and once the "owner" field reaches that number, both of them would acquire the lock and operate on the shared resource!
I don't think the kernel can have such a bug in a spinlock. Where am I wrong?

STREX is a conditional store: this code has Load-Linked/Store-Conditional semantics, even if ARM doesn't use that name.
The operation either completes atomically or fails.
The assembly block tests for failure (the tmp variable indicates failure) and reattempts the modification using the new value (updated by another core).


Where and when does the Linux kernel set up the GDT?

I have some doubts regarding the GDT in Linux. I am trying to get GDT info in kernel space (Ring 0), and my test code is called in system-call context. In the test code, I print the ss register (segment selector), and get the ss segment descriptor via the GDTR and the ss segment selector.
77 void printGDTInfo(void) {
78     struct desc_ptr pgdt, *pss_desc;
79     unsigned long ssr;
80     struct desc_struct *ss_desc;
81
82     // Get GDTR
83     native_store_gdt(&pgdt);
84     unsigned long gdt_addr = pgdt.address;
85     unsigned long gdt_size = pgdt.size;
86     printk("[GDT] Addr:%lu |Size:%lu\n", gdt_addr, gdt_size);
87
88     // Get SS Register
89     asm("mov %%ss, %%eax"
90         :"=a"(ssr));
91     printk("SSR In Kernel:%lu\n", ssr);
92     unsigned long desc_index = ssr >> 3; // SHIFT for Descriptor Index
93     printk("SSR Shift:%lu\n", desc_index);
94     ss_desc = (struct desc_struct*)(gdt_addr + desc_index * sizeof(struct desc_struct));
95     printk("SSR:Base0:%lu, Base1:%lu,Base2:%lu\n", ss_desc->base0, ss_desc->base1, ss_desc->base2);
96 }
What confused me most is that the "base" fields in the ss descriptor are all zero (the print on line 95). I tried to print the __USER_DS segment descriptor, and its "base" fields are also zero.
Is that true? Do all segments in Linux use the same base address (zero)?
I want to check the GDT initialization in the Linux source code, but I am not sure where and when Linux sets up the GDT.
I found code like this in "arch/x86/kernel/cpu/common.c"; the second parameter of GDT_ENTRY_INIT is zero, which means the base0/base1/base2 fields in the segment descriptor are all zero.
125 [GDT_ENTRY_KERNEL32_CS] = GDT_ENTRY_INIT(0xc09b, 0, 0xfffff),
126 [GDT_ENTRY_KERNEL_CS] = GDT_ENTRY_INIT(0xa09b, 0, 0xfffff),
127 [GDT_ENTRY_KERNEL_DS] = GDT_ENTRY_INIT(0xc093, 0, 0xfffff),
If that is true, all segments have the same base address (zero). As a result, the same virtual address in Ring 0 and Ring 3 maps to the same linear address?
I'd appreciate your help :)

How does a BL instruction that jumps to an invalid instruction still manage to work correctly?

I'm practicing reverse engineering on an il2cpp Unity project.
Things I've done:
get the apk
use Apktool to extract the files
open libunity.so with Ghidra (IDA works too)
And I found a weird block of instructions like:
004ac818 f4 0f 1e f8 str x20,[sp, #local_20]!
004ac81c f3 7b 01 a9 stp x19,x30,[sp, #local_10]
004ac820 e1 03 1f 2a mov w1,wzr
004ac824 77 b5 00 94 bl FUN_004d9e00
I follow bl FUN_004d9e00 and I found :
FUN_004d9e00
004d9e00 6e ?? 6Eh n
004d9e01 97 ?? 97h
004d9e02 85 ?? 85h
004d9e03 60 ?? 60h `
004d9e04 6d ?? 6Dh m
But here is the thing: the instruction at FUN_004d9e00 is not a valid one. How can libunity.so still work properly?
Perhaps there is a relocation symbol for address 0x004ac824? In that case the linker would modify the instruction when libunity.so is loaded, and it would end up calling a different address (maybe in a different shared library).

Find frame base and variable locations using DWARF version 4

I'm following Eli Bendersky's blog on parsing the DWARF debug information. He shows an example of parsing the binary with DWARF version 2 in his blog. The frame base of a function (further used for retrieving local variables) can be retrieved from the location list:
<1><71>: Abbrev Number: 5 (DW_TAG_subprogram)
<72> DW_AT_external : 1
<73> DW_AT_name : (...): do_stuff
<77> DW_AT_decl_file : 1
<78> DW_AT_decl_line : 4
<79> DW_AT_prototyped : 1
<7a> DW_AT_low_pc : 0x8048604
<7e> DW_AT_high_pc : 0x804863e
<82> DW_AT_frame_base : 0x0 (location list)
<86> DW_AT_sibling : <0xb3>
...
$ objdump --dwarf=loc tracedprog2
Contents of the .debug_loc section:
Offset Begin End Expression
00000000 08048604 08048605 (DW_OP_breg4: 4 )
00000000 08048605 08048607 (DW_OP_breg4: 8 )
00000000 08048607 0804863e (DW_OP_breg5: 8 )
However, I find in DWARF version 4 there is no such .debug_loc section. Here is the function info on my machine:
<1><300>: Abbrev Number: 17 (DW_TAG_subprogram)
<301> DW_AT_external : 1
<301> DW_AT_name : (indirect string, offset: 0x1e0): do_stuff
<305> DW_AT_decl_file : 1
<306> DW_AT_decl_line : 3
<307> DW_AT_decl_column : 6
<308> DW_AT_prototyped : 1
<308> DW_AT_low_pc : 0x1149
<310> DW_AT_high_pc : 0x47
<318> DW_AT_frame_base : 1 byte block: 9c (DW_OP_call_frame_cfa)
<31a> DW_AT_GNU_all_tail_call_sites: 1
Line <318> indicates the frame base is 1 byte block: 9c (DW_OP_call_frame_cfa). Any idea how to find the frame base for the DWARF v4 binaries?
Update based on @Employed Russian's answer:
The frame_base of a subprogram seems to point to the Canonical Frame Address (CFA), which is the RSP value just before the call instruction.
<2><329>: Abbrev Number: 19 (DW_TAG_variable)
<32a> DW_AT_name : (indirect string, offset: 0x7d): my_local
<32e> DW_AT_decl_file : 1
<32f> DW_AT_decl_line : 5
<330> DW_AT_decl_column : 9
<331> DW_AT_type : <0x65>
<335> DW_AT_location : 2 byte block: 91 6c (DW_OP_fbreg: -20)
So a local variable (my_local in the above example) can be located by the CFA using this calculation: &my_local = CFA - 20 = (current RBP + 16) - 20 = current RBP - 4.
Verify it by checking the assembly:
void do_stuff(int my_arg)
{
1149: f3 0f 1e fa endbr64
114d: 55 push %rbp
114e: 48 89 e5 mov %rsp,%rbp
1151: 48 83 ec 20 sub $0x20,%rsp
1155: 89 7d ec mov %edi,-0x14(%rbp)
int my_local = my_arg + 2;
1158: 8b 45 ec mov -0x14(%rbp),%eax
115b: 83 c0 02 add $0x2,%eax
115e: 89 45 fc mov %eax,-0x4(%rbp)
my_local is at -0x4(%rbp).
This isn't about DWARF v2 vs. DWARF v4: with either version the compiler may choose to use or not use location lists. Your compiler chose not to.
Any idea how to find the frame base for the DWARF v4 binaries?
It tells you right there: use the CFA pseudo-register, also known as "canonical frame address".
That "imaginary" register has the same value that %rsp had just before the current function was called. That is, the current function's return address is always stored at CFA-8, and %rsp == CFA-8 on entry into the function (which is consistent with the CFA = RBP + 16 arithmetic above).
If the function uses a frame pointer, the previous value of %rbp is usually stored at CFA-16.

How can sys_sigsuspend be atomic in Linux kernel 2.6.11?

I'm reading Linux 2.6.11.
The implementation of sys_sigsuspend is the following:
/*
 * Atomically swap in the new signal mask, and wait for a signal.
 */
asmlinkage int
sys_sigsuspend(int history0, int history1, old_sigset_t mask)
{
        struct pt_regs * regs = (struct pt_regs *) &history0;
        sigset_t saveset;

        mask &= _BLOCKABLE;
        spin_lock_irq(&current->sighand->siglock);
        saveset = current->blocked;
        siginitset(&current->blocked, mask);
        recalc_sigpending();
        spin_unlock_irq(&current->sighand->siglock);

        regs->eax = -EINTR;
        while (1) {
                current->state = TASK_INTERRUPTIBLE;
                schedule();
                if (do_signal(regs, &saveset))
                        return -EINTR;
        }
}
In ULK3 the author says:
the sigsuspend( ) system call does not allow signals to be sent after unblocking and before the schedule( ) invocation, because other processes cannot grab the CPU during that time interval.
But between spin_unlock_irq and schedule the syscall can be interrupted and preempted, so another process has enough time to send the process a signal that is not blocked.
In that case the signal would be lost, because the process calls schedule after the signal has been delivered.
That's why sigsuspend should be atomic, but it is NOT, according to its implementation.
The sigsuspend implementation is correct, but the explanation in ULK seems to be misleading.
When a process executes kernel code, that execution is never interrupted by user signals. Instead, such signals are accumulated in the current task structure. At the moment the process leaves kernel code and returns to user mode, all accumulated (and unblocked) signals are fired.
The kernel's schedule() function checks whether some signals are accumulated. If they are, and current->state is TASK_INTERRUPTIBLE, schedule() returns. So signals collected before the schedule() call are not lost.
Atomicity of the sigsuspend() system call means that if signals temporarily unblocked by the call are emitted, the call is guaranteed to see them and return. This atomicity is achieved simply by placing both the unblocking and the checking of signals inside the same kernel function.

How does the Linux kernel get info about the processors and the cores?

Assume we have a blank computer without any OS and we are installing Linux. Where in the kernel is the code that identifies the processors and the cores and gets information about/from them?
This info eventually shows up in places like /proc/cpuinfo but how does the kernel get it in the first place?!
Short answer
The kernel uses the special CPU instruction cpuid and saves the results in an internal structure: cpuinfo_x86 for x86.
Long answer
The kernel source is your best friend.
Start from the entry point: the file /proc/cpuinfo.
Like any proc file, it has to be created somewhere in the kernel and declared with some file_operations. This is done in fs/proc/cpuinfo.c. The interesting piece is seq_open, which uses a reference to cpuinfo_op. These ops are declared in arch/x86/kernel/cpu/proc.c, where we see a show_cpuinfo function, defined in the same file on line 57.
Here you can see
seq_printf(m, "processor\t: %u\n"
           "vendor_id\t: %s\n"
           "cpu family\t: %d\n"
           "model\t\t: %u\n"
           "model name\t: %s\n",
           cpu,
           c->x86_vendor_id[0] ? c->x86_vendor_id : "unknown",
           c->x86,
           c->x86_model,
           c->x86_model_id[0] ? c->x86_model_id : "unknown");
The structure c is declared at the top of show_cpuinfo as struct cpuinfo_x86. That structure is defined in arch/x86/include/asm/processor.h, and if you search for references to it you will find the function cpu_detect, which calls cpuid, which finally resolves to native_cpuid:
static inline void native_cpuid(unsigned int *eax, unsigned int *ebx,
                                unsigned int *ecx, unsigned int *edx)
{
        /* ecx is often an input as well as an output. */
        asm volatile("cpuid"
            : "=a" (*eax),
              "=b" (*ebx),
              "=c" (*ecx),
              "=d" (*edx)
            : "0" (*eax), "2" (*ecx)
            : "memory");
}
And here you see the assembler instruction cpuid. This little thing does the real work.
This information comes from the BIOS + hardware DB. You can also get it directly with dmidecode, for example (if you need more info, check the dmidecode source code):
sudo dmidecode -t processor
