CMPXCHG is not atomic in this NASM program

I have implemented a spinlock in 64-bit NASM on Windows. I'm using the spinlock to guard a shared memory buffer so that the cores write to it one at a time, in core order (core 0 first, core 1 second, core 2 third, etc.). As far as I know, a mutex or semaphore will not let me do the buffer writes in core order, and a spinlock is preferable because it doesn't need an OS call.
Here is the code section at issue. This is not a complete example because the problem is in this small section of a much larger assembly program. I used cmpxchg for atomicity.
On entry to this section, rax contains the core number and rbx contains a memory variable called spinlock_core, which is set to zero (first core) on entry to this section. After each core is finished, spinlock_core is incremented to the next core number.
mov rdi,Test_Array
movq rbx,xmm11
mov [rdi+rax],rbx ; rax contains the core number offset (0, 8, 16, 24)
push rax
; Spin Lock
spin_lock_01:
lock cmpxchg [spinlock_core],rax ; spinlock_core is set to zero on first entry
jnz spin_lock_01
; To test the result:
mov rdi,Test_Array
mov [rdi+rax+32],rax
jmp out_of_here
The results of this test are in Test_Array, which is populated with the number of bytes to write for each core. On return it contains:
40, 40, 40, 16, 0, 8, 16, 24
showing that cores 0-2 each have 40 bytes to write and core 3 has 16 bytes to write. However, the last four elements of Test_Array contain the core offset for each of the four cores. If the spinlock was working correctly (cmpxchg rax,rbx), the last four elements should all contain zero, showing that only the first core was allowed through. But it shows that all four cores were allowed through.
I assume that my cmpxchg is not atomic, and that's why other cores leak through -- they should only be allowed in when spinlock_core is incremented to the next core, but that doesn't happen before we exit with jmp out_of_here. According to the docs, cmpxchg should be preceded by a "lock" prefix, as in lock cmpxchg rax,rbx, but when I do that the NASM assembler says "warning: instruction is not lockable [-w+lock]." Felix Cloutier's site says the lock prefix is only needed if a memory operand is involved, but when I write "cmpxchg rax,[spinlock_core]" I get "invalid combination of opcodes and operands."
To summarize, my questions are: why is cmpxchg not atomic as written above, and why does the NASM assembler not allow the use of the lock prefix?
There are a number of detailed posts on Stack Overflow and elsewhere on the issue of atomicity but I haven't found any that address this specific issue.
Thanks for any help.

Here is how I solved this, with the help of comments below from @Peter Cordes and @prl. The changes are (1) use bl, not al, as the source register and (2) use only the low 8-bit registers al and bl, not rax and rbx.
xor rbx,rbx
spin_lock_01:
pop rax ; the core number is stored in the stack
mov bl,al ; put core number into bl
mov rcx,rax ; preserve core number for later use
push rax ; push core number back to stack
lock cmpxchg [spinlock_core],bl ; spinlock_core is set to zero on first entry
jnz spin_lock_01
; now test it
mov rdi,Test_Array
mov rbx,[spinlock_core]
mov rdx,[order_of_execution]
add rdx,1
mov [order_of_execution],rdx
mov [rdi+rcx+0],rcx
mov [rdi+rcx+32],rdx
add rbx,8
mov [spinlock_core],rbx
jmp out_of_here
The output in Test_Array shows as: 0,8,16,24,1,2,3,4. That shows that each of the cores is passed through in core order (1,2,3,4). The memory location spinlock_core is incremented by each core when it completes, signalling the next core.
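For readers who find the retry loop easier to follow in C, here is a minimal sketch of one plausible reading of the difference between the two versions. It is not the original program: the function names buggy_wait, wait_turn and pass_turn are made up for illustration, and C11's atomic_compare_exchange_strong stands in for lock cmpxchg. The C function overwrites its "expected" argument on failure, much as cmpxchg loads the accumulator with the memory value when the compare fails.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Turn counter, named after spinlock_core in the post: the byte offset
 * (0, 8, 16, ...) of the core that may enter next. */
_Atomic uint64_t spinlock_core = 0;

/* Models the first loop: one register (rax) is both comparand and source,
 * and it is not reloaded between attempts. */
uint64_t buggy_wait(uint64_t my_offset)
{
    uint64_t rax = my_offset;
    for (;;) {
        uint64_t src = rax; /* the source operand is the same register */
        /* On failure the current value of spinlock_core is copied into rax,
         * so the next attempt compares the variable with itself. */
        if (atomic_compare_exchange_strong(&spinlock_core, &rax, src))
            return rax; /* may no longer be the caller's offset */
    }
}

/* Models the second loop: the comparand is reloaded before every attempt. */
void wait_turn(uint64_t my_offset)
{
    assert(my_offset % 8 == 0); /* offsets are multiples of 8 in the post */
    for (;;) {
        uint64_t expected = my_offset;
        if (atomic_compare_exchange_strong(&spinlock_core, &expected, my_offset))
            return;
    }
}

/* Hand the turn to the next core (offsets advance by 8 in the post). */
void pass_turn(void)
{
    atomic_fetch_add(&spinlock_core, 8);
}
```

In this model, buggy_wait(8) returns on its second attempt even while the turn counter is still 0, whereas wait_turn(8) keeps spinning until another thread has called pass_turn().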

Related

Read file from a specific position in x86

Is it possible to start reading a file from a specific line or byte. Currently I use this code to read 4 bytes of a file:
section .data
filename db "file.txt", 0
section .bss
read_data resb 4
section .text
global _start
_start:
mov rax, SYS_OPEN
mov rdi, filename
mov rsi, O_RDONLY
mov rdx, 0
syscall
push rax
mov rdi, rax
mov rax, SYS_READ
mov rsi, read_data
mov rdx, 4
syscall
mov rax, SYS_CLOSE
pop rdi
syscall
This code always reads the first 4 bytes, but I want to start reading from other parts of the file, like the middle for example. What do I need to add or change?

A freshly-opened file descriptor starts at position = 0. If you keep reading from the same fd in a loop, you'll get successive chunks. (Use a larger buffer like 8kiB and loop over dwords in user-space, though, using the value that read returned as an upper limit! A system call is very expensive in CPU time.)
Is it possible to start reading a file from a specific line or byte.
Byte: yes
Line: no. In Unix/Linux, the kernel doesn't have an index of line-start byte offsets or any other line-oriented API. The line handling in stdio fgets for example is purely done in user-space. There have been some historical OSes with record-based files, but Unix files are flat arrays of bytes. (They can have holes, unwritten extents, and extended attributes... But the kernel APIs for the main file contents only operate with byte offsets).
If you want to do lines, read a big block and loop forward until you've seen some number of newlines. If you're not there yet, read another block; repeat until you find the start and end of the line number you want, or you hit EOF. x86-64 can efficiently search 16 bytes at a time with pcmpeqb / pmovmskb / popcnt (popcnt requires SSE4.2 or the specific popcnt feature bit).
Or with just SSE2, or when optimizing for large blocks, with pcmpeqb / psadbw (against all-zero) to hsum bytes to qwords / paddd. Then check every so often, with some scalar code, how many lines you've passed. Or keep it simple and branch on finding the first newline in a SIMD vector.
Obviously the slow and simple option is a byte-at-a-time loop that counts '\n' characters - if you know how to do strchr with SSE2 it should be straightforward to vectorize that search using the above suggestions.
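As a rough illustration of the slow and simple option, here is a C sketch that reads 8 KiB blocks and counts '\n' bytes until it reaches the start of the requested line. The function name line_start_offset and the buffer size are assumptions made for this example; a SIMD version would replace the inner loop.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns the byte offset at which line `line_no` (1-based) starts, or -1
 * if the file cannot be read or has fewer newlines than needed.
 * Line 1 always starts at offset 0, so the file is not opened for it. */
off_t line_start_offset(const char *path, unsigned long line_no)
{
    assert(line_no >= 1);
    if (line_no == 1)
        return 0;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    char buf[8192];          /* one read() per block, not per byte */
    off_t pos = 0;           /* file offset of buf[0] */
    unsigned long line = 1;  /* line number that the next byte belongs to */
    ssize_t n;

    while ((n = read(fd, buf, sizeof buf)) > 0) {
        for (ssize_t i = 0; i < n; i++) {
            if (buf[i] == '\n' && ++line == line_no) {
                close(fd);
                return pos + i + 1; /* first byte after the newline */
            }
        }
        pos += n;
    }
    close(fd);
    return -1; /* hit EOF (or an error) before reaching the line */
}
```

The returned offset can then be passed to lseek or pread. Note that for a file ending in a newline, asking for the line after the last one returns the end-of-file offset rather than -1.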
But if you only want some specific byte positions, you have two main options:
seek with lseek(2) before read(2) (see @Nicolae Natea's answer)
Use POSIX/Linux pread(2) to read from a specified offset, without moving the fd's file offset for future read calls. The Linux system call name is pread64 (__NR_pread64 equ 17 from asm/unistd_64.h)
ssize_t pread(int fd, void *buf, size_t count, off_t offset); The only difference from read is the offset arg, the 4th arg thus passed in R10 (not RCX like the user-space function calling convention). off_t is a 64-bit type simply passed in a single register in 64-bit code.
Other than the pread64 name in the .h, there's nothing special about the asm interface compared to the C interface; it follows the standard system-calling convention. (It has existed since Linux 2.1.60; before that, glibc's wrapper emulated it with lseek.)
There are other things you can do like mmap, or a preadv system call, but pread is most exactly what you're looking for if you have a known position you want to read from.
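For comparison with the asm interface, here is a small C sketch of the same pread call through the libc wrapper. The helper name read_at is an assumption for this example; the argument order matches the prototype quoted above, with the offset as the 4th argument.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Reads up to `count` bytes starting at byte `offset` of the file at `path`.
 * Returns the number of bytes read (possibly fewer than count near EOF),
 * or -1 on error. */
ssize_t read_at(const char *path, void *buf, size_t count, off_t offset)
{
    assert(buf != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    /* pread does not move the fd's file position */
    ssize_t n = pread(fd, buf, count, offset);
    close(fd);
    return n;
}
```

In the question's program, the equivalent change would be to use the pread64 system call number instead of SYS_READ and to put the desired offset in R10.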

Before performing the read, you should perform an lseek so that the file position is updated.
Something along these lines:
mov rdi, rax ; fd
mov rax, SYS_LSEEK
mov rsi, <whatever offset you want>
mov rdx, 0 ; whence = 0 (SEEK_SET): the offset is relative to the beginning of the file
syscall
note: RDI will still hold the same fd value after a syscall so you don't need extra save/restore for the fd across lseek / read / close.
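The same seek-then-read sequence can be sketched in C as follows. The helper name read_from is hypothetical; the caller is assumed to have opened the fd with open(), as in the question.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>     /* open(), for the caller */
#include <sys/types.h>
#include <unistd.h>

/* Moves the file position of `fd` to `offset` bytes from the start of the
 * file, then reads up to `count` bytes from there.
 * Returns the number of bytes read, or -1 on error. */
ssize_t read_from(int fd, void *buf, size_t count, off_t offset)
{
    assert(buf != NULL);
    if (lseek(fd, offset, SEEK_SET) == (off_t)-1)
        return -1;
    /* unlike pread, this leaves the position at offset + bytes read */
    return read(fd, buf, count);
}
```

Because the file position moves, a following plain read on the same fd continues right after the bytes returned here.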
Tip:
It might be easier to write the code in C, compile it with gcc -g -S -fverbose-asm -Og -c main.c, and then look at main.s. (How to remove "noise" from GCC/clang assembly output?). But that will only show the compiler making calls to libc wrapper functions, unless you use inline system call macros like MUSL libc provides.

CPU_ZERO "undefined symbol" using pthread_setaffinity_np in NASM

I am using the pthreads library in NASM under Ubuntu 18.04. Thread creation works correctly, but I want to assign each thread to a separate core with pthread_setaffinity_np.
Following is the section of code I use to initialize threads. It compiles as written but at run time I get "undefined symbol: CPU_ZERO."
Using examples from C, I inserted %define _GNU_SOURCE at the top of the program, but I still get the undefined symbol CPU_ZERO error.
section .data align=16
; For thread scheduling:
cpuset: times 4 dq 0
section .text
label_0:
mov rdi,ThreadID ; ThreadCount
mov rsi,pthread_attr_t ; Thread Attributes
mov rdx,Test_fn ; Function Pointer
mov rcx,pthread_arg
call pthread_create wrt ..plt
; Set affinity mask
mov rdi,cpuset
call CPU_ZERO wrt ..plt
call pthread_self wrt ..plt
push rax
mov rdi,rax
mov rsi,cpuset
call CPU_SET wrt ..plt
pop rax
mov rdi,rax
mov rsi,32
mov rdx,cpuset
call pthread_setaffinity_np wrt ..plt
; check the result with pthread_getaffinity_np
mov rax,[tcounter]
add rax,8
mov [tcounter],rax
mov rbx,[Number_Of_Cores]
cmp rax,rbx
jl label_0
My question is: how do I use CPU_ZERO and CPU_SET in NASM (or any other assembly language; I can translate to NASM).
Thanks for any help.

CPU_ZERO and CPU_SET are C macros, not functions which you can call.
You'll have to roll your own function to perform equivalent zeroing / setting.

Those are CPP macros, not functions. You can tell from the all-caps names, and from the fact that the man page calls them macros.
As usual, the notes section of the man page has details that are useful for asm:
Since CPU sets are bit masks allocated in units of long words, the
actual number of CPUs in a dynamically allocated CPU set will be
rounded up to the next multiple of sizeof(unsigned long). An
application should consider the contents of these extra bits to be
undefined.
Notwithstanding the similarity in the names, note that the constant
CPU_SETSIZE indicates the number of CPUs in the cpu_set_t data type
(thus, it is effectively a count of the bits in the bit mask), while
the setsize argument of the CPU_*_S() macros is a size in bytes.
On my system (Arch Linux, glibc 2.29-4)
/usr/include/bits/cpu-set.h says
...
#define __CPU_SETSIZE 1024
#define __NCPUBITS (8 * sizeof (__cpu_mask))
...
typedef __CPU_MASK_TYPE __cpu_mask; // ultimately unsigned long via some other headers
...
typedef struct
{
__cpu_mask __bits[__CPU_SETSIZE / __NCPUBITS];
} cpu_set_t;
So a cpu_set_t is 1024 bits = 128 bytes = times 16 dq 0 or resq 16, at least on my system with that kernel config.
CPU_ZERO is free in your case; your statically-allocated cpu_set_t is statically zero-initialized. For some reason you put it in .data instead of .bss, so the executable will have to actually contain those zeros, but same difference.
If you did want to zero one on the stack, for example, rep stosd is one easy way, or xorps xmm0, xmm0 and 8x movups stores would also work.
Since high performance is not essential (CPU affinity-setting code probably only runs once), bts is a very convenient way to set bits in a bitmap (CPU_SET). With a memory destination, it takes a bit-index that can go outside the dword selected by the addressing mode. bts mem, reg is slow and microcoded (like 10 uops on Skylake), but nice for code size. bts mem, imm is only 3 uops, but or byte [mem + i/8], 1<<(i%8) is only 2 uops.
or also lets you set more than 1 bit at once, or more simply just mov store some bytes that contain the desired pattern of zeros and ones.
But TL;DR: it's just a bitmap, manipulate it however you like using asm, or even statically initialize it with non-zero values.
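To make the bitmap layout concrete, here is a C sketch of hand-rolled equivalents of CPU_ZERO and CPU_SET. The names mask_zero and mask_set and the constant MASK_WORDS are invented for this example, and the 16-word size assumes the 1024-bit cpu_set_t with 64-bit unsigned long shown in the header excerpt above.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

/* 1024 bits / 64 bits per unsigned long = 16 words (assumed layout) */
enum { MASK_WORDS = 16 };
enum { BITS_PER_WORD = CHAR_BIT * sizeof(unsigned long) };

/* Equivalent in effect to CPU_ZERO: clear every bit. */
void mask_zero(unsigned long mask[MASK_WORDS])
{
    memset(mask, 0, MASK_WORDS * sizeof mask[0]);
}

/* Equivalent in effect to CPU_SET: set bit number `cpu` in the bitmap.
 * The word index is cpu / 64 and the bit within it is cpu % 64, which
 * is the same addressing that bts with a memory operand performs. */
void mask_set(unsigned long mask[MASK_WORDS], unsigned cpu)
{
    assert(cpu < MASK_WORDS * BITS_PER_WORD);
    mask[cpu / BITS_PER_WORD] |= 1UL << (cpu % BITS_PER_WORD);
}
```

The resulting array is what would be passed, together with its size in bytes, as the last two arguments of pthread_setaffinity_np.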

My approach to solving this problem is summarized in my answer at Reproduce these C types in assembly?.

Is there advantage of reading data without using pop operation?

According to this PDF document (Page-66), the following bunch of statements
mov eax, DWORD PTR SS:[esp]
mov eax, DWORD PTR SS:[esp + 4]
mov eax, DWORD PTR SS:[esp + 8]
are equivalent to the following bunch of statements:
pop eax
pop eax
pop eax
Is there any advantage of the former over the latter?

mov leaves the data on the stack, pop removes it so you can only read it once, and only in order. Data below ESP has to be considered "lost", unless you're using a calling convention / ABI that includes a red-zone below the stack pointer.
Data is usually still there below ESP, but asynchronous stuff like signal handlers, or a debugger evaluating a call fflush(0) in the context of your process, can step on it.
Also, pop modifies ESP, so each pop requires stack-unwind metadata (see footnote 1) in another section of the executable/library, for it to be fully ABI compliant with SEH on Windows or the i386 / x86-64 System V ABI on other OSes (which specifies that all functions need unwind metadata, even if they're not C++ functions that actually support propagating exceptions).
But if you're reading data for the last time, and you actually need it all, then yes pop is an efficient way to read it on modern CPUs (like Pentium-M and later, with a stack engine to handle the ESP updates without a separate uop.)
On older CPUs, like Pentium III, pop was actually slower than 3x mov + add esp,12 and compilers did generate code the way Brendan's answer shows.
void foo() {
asm("" ::: "ebx", "esi", "edi");
}
This function forces the compiler to save/restore 3 call-preserved registers (by declaring clobbers on them.) It doesn't actually touch them; the inline asm string is empty. But this makes it easy to see what compilers will do for saving/restoring. (Which is the only time they'll use pop normally.)
GCC's default (tune=generic) code-gen, or with -march=skylake for example, is like this (from the Godbolt compiler explorer)
foo: # gcc8.3 -O3 -m32
push edi
push esi
push ebx
pop ebx
pop esi
pop edi
ret
But telling it to tune for an old CPU without a stack engine makes it do this:
foo: # gcc8.3 -march=pentium3 -O3 -m32
sub esp, 12
mov DWORD PTR [esp], ebx
mov DWORD PTR [esp+4], esi
mov DWORD PTR [esp+8], edi
mov ebx, DWORD PTR [esp]
mov esi, DWORD PTR [esp+4]
mov edi, DWORD PTR [esp+8]
add esp, 12
ret
gcc thinks -march=pentium-m doesn't have a stack engine, or at least chooses not to use push/pop there. I think that's a mistake, because Agner Fog's microarch pdf definitely describes the stack engine as being present in Pentium-M.
On P-M and later, push/pop are single-uop instructions, with the ESP update handled outside the out-of-order backend, and for push the store-address+store-data uops are micro-fused.
On Pentium 3, they're 2 or 3 uops each. (Again, see Agner Fog's instruction tables.)
On in-order P5 Pentium, push and pop are actually fine. (But memory-destination instructions like add [mem], reg were generally avoided, because P5 didn't split them into uops to pipeline better.)
Mixing pop with direct references to [esp] will actually be potentially slower than just one or the other, on modern Intel CPUs, because it costs extra stack-sync uops.
Obviously writing EAX 3 times back to back means the first 2 loads are useless in both sequences.
See Extreme Fibonacci for an example of pop (1 uop, or like 1.1 uop with the stack sync uops amortized) being more efficient than lodsd (2 uops on Skylake) for reading through an array. (In evil code that assumes a large red-zone because it doesn't install signal handlers. Don't actually do this unless you know exactly what you're doing and when it will break; this is more of a silly computer tricks / extreme optimization for code-golf than anything that's practically useful.)
Footnote 1: The Godbolt compiler explorer normally filters out extra assembler directives, but if you uncheck that box you can see that gcc's function that uses push/pop has a .cfi_def_cfa_offset directive after every push/pop.
pop ebx
.cfi_restore 3
.cfi_def_cfa_offset 12
pop esi
.cfi_restore 6
.cfi_def_cfa_offset 8
pop edi
.cfi_restore 7
.cfi_def_cfa_offset 4
The .cfi_restore 7 metadata directives have to be there regardless of push/pop vs. mov, because that lets stack unwinding restore call-preserved registers as it unwinds. (7 is the register number).
But for other uses of push/pop inside a function (like pushing args to a function call, or a dummy pop to remove it from the stack), you wouldn't have .cfi_restore, only metadata for the stack pointer changing relative to the stack frame.
Normally you don't worry about this in hand-written asm, but compilers have to get this right so there's a small extra cost to using push/pop in terms of total executable size. But only in parts of the file that aren't mapped into memory normally, and not mixed with code.

This:
pop eax
pop ebx
pop ecx
.. is sort of equivalent to this:
mov eax,[esp]
add esp,4
mov ebx,[esp]
add esp,4
mov ecx,[esp]
add esp,4
..which can be like this:
mov eax,[esp] ; do this instruction
add esp,4 ; ...and this instruction in parallel
; stall until the previous instruction completes (and the value in ESP becomes known)
mov ebx,[esp] ; then do this instruction
add esp,4 ; ...and this instruction in parallel
; stall until the previous instruction completes (and the value in ESP becomes known)
mov ecx,[esp] ; then do this instruction
add esp,4 ; ...and this instruction in parallel
For this code:
mov eax, [esp]
mov ebx, [esp + 4]
mov ecx, [esp + 8]
add esp,12
.. all of the instructions can happen in parallel (in theory).
Note: In practice all of the above depends on which CPU, etc.

Why does VC++ 2010 often use ebx as a "zero register"?

Yesterday I was looking at some 32 bit code generated by VC++ 2010 (most probably; don't know about the specific options, sorry) and I was intrigued by a curious recurring detail: in many functions, it zeroed out ebx in the prologue, and it always used it like a "zero register" (think $zero on MIPS). In particular, it often:
used it to zero out memory; this is not unusual, as the encoding for a mov mem,imm is 1 to 4 bytes bigger than mov mem,reg (the full immediate value size has to be encoded even for 0), but usually (gcc) the necessary register is zeroed out "on demand", and kept for more useful purposes otherwise;
used it for compares against zero - as in cmp reg,ebx. This is what struck me as really unusual, as it should be exactly the same as test reg,reg, but adds a dependency on an extra register. Now, keep in mind that this happened in non-leaf functions, with ebx being often pushed (by the callee) on and off the stack, so I would not trust this dependency to be always completely free. It also used test reg,reg in the exact same fashion (test/cmp => jg).
Most importantly, registers on "classic" x86 are a scarce resource, if you start having to spill registers you waste a lot of time for no good reason; why waste one through all the function just to keep a zero in it? (still, thinking about it, I don't remember seeing much register spillage in functions that used this "zero-register" pattern).
So: what am I missing? Is it a compiler blooper or some incredibly smart optimization that was particularly interesting in 2010?
Here's an excerpt:
; standard prologue: ebp/esp, SEH, overflow protection, ... then:
xor ebx, ebx
mov [ebp+4], ebx ; zero out some locals
mov [ebp], ebx
call function_1
xor ecx, ecx ; ebx _not_ used to zero registers
cmp eax, ebx ; ... but used for compares?! why not test eax,eax?
setnz cl ; what? it goes through cl to check if eax is not zero?
cmp ecx, ebx ; still, why not test ecx,ecx?
jnz function_body
push 123456
call throw_something
function_body:
mov edx, [eax]
mov ecx, eax ; it's not like it was interested in ecx anyway...
mov eax, [edx+0Ch]
call eax ; virtual method call; ebx is preserved but possibly pushed/popped
lea esi, [eax+10h]
mov [ebp+0Ch], esi
mov eax, [ebp+10h]
mov ecx, [eax-0Ch]
xor edi, edi ; again, registers are zeroed as usual
mov byte ptr [ebp+4], 1
mov [ebp+8], ecx
cmp ecx, ebx ; why not test ecx,ecx?
jg somewhere
label1:
lea eax, [esi-10h]
mov byte ptr [ebp+4], bl ; ok, uses bl to write a zero to memory
lea ecx, [eax+0Ch]
or edx, 0FFFFFFFFh
lock xadd [ecx], edx
dec edx
test edx, edx ; now it's using the regular test reg,reg!
jg somewhere_else
Notice: an earlier version of this question said that it used mov reg,ebx instead of xor ebx,ebx; this was just me not remembering stuff correctly. Sorry if anybody put too much thought trying to understand that.

Everything you commented on as odd looks sub-optimal to me. test eax,eax sets all flags (except AF) the same as cmp against zero, and is preferred for performance and code-size.
On P6 (PPro through Nehalem), reading long-dead registers is bad because it can lead to register-read stalls. P6 cores can only read 2 or 3 not-recently-modified architectural registers from the permanent register file per clock (to fetch operands for the issue stage: the ROB holds operands for uops, unlike on SnB-family where it only holds references to the physical register file).
Since this is from VS2010, Sandybridge wasn't released yet, so it should have put a lot of weight on tuning for Pentium II/III, Pentium-M, Core2, and Nehalem where reading "cold" registers is a possible bottleneck.
IDK if anything like this ever made sense for integer regs, but I don't know much about optimizing for CPUs older than P6.
The cmp / setnz / cmp / jnz sequence looks particularly braindead. Maybe it came from a compiler-internal canned sequence for producing a boolean value from something, and it failed to optimize a test of the boolean back into just using the flags directly? That still doesn't explain the use of ebx as a zero-register, which is also completely useless there.
Is it possible that some of that was from inline-asm that returned a boolean integer (using a silly that wanted a zero in a register)?
Or maybe the source code was comparing two unknown values, and it was only after inlining and constant-propagation that it turned into a compare against zero? Which MSVC failed to optimize fully, so it still kept 0 as a constant in a register instead of using test?
(the rest of this was written before the question included code).
Sounds weird, or like a case of CSE / constant-hoisting run amok. i.e. treating 0 like any other constant that you might want to load once and then reg-reg copy throughout the function.
Your analysis of the data-dependency behaviour is correct: moving from a register that was zeroed a while ago essentially starts a new dependency chain.
When gcc wants two zeroed registers, it often xor-zeroes one and then uses a mov or movdqa to copy to the other.
This is sub-optimal on Sandybridge where xor-zeroing doesn't need an execution port, but a possible win on Bulldozer-family where mov can run on the AGU or ALU, but xor-zeroing still needs an ALU port.
For vector moves, it's a clear win on Bulldozer: handled in register rename with no execution unit. But xor-zeroing of an XMM or YMM register still needs an execution port on Bulldozer-family (or two for ymm, so always use xmm with implicit zero-extension).
Still, I don't think that justifies tying up a register for the duration of a whole function, especially not if it costs extra saves/restores. And not for P6-family CPUs where register-read stalls are a thing.

How to limit the address space of 32bit application on 64bit Linux to 3GB?

Is it possible to make the 64-bit Linux loader limit the address space of a loaded 32-bit program to some upper bound?
Or to reserve some holes in the address space that the kernel will not allocate?
I mean for a specific executable, not globally for all processes, and not through kernel configuration. Some code or ELF executable flags would be examples of an appropriate solution.
The limit should be enforced for all loaded shared libraries as well.
Clarification:
The problem I want to fix is that my code uses numbers above 0xc0000000 as handle values, and I want to clearly distinguish between handle values and memory addresses, even when the memory addresses are allocated and returned by some third-party library function.
Because the address space of a 32-bit process on 64-bit Linux extends almost to the 4 GB limit, there is not enough address space left for the handle values.
On the other hand, 3 GB or even less is more than enough for all my needs.

OK, I found the answer to this question elsewhere.
The solution is to change the "personality" of your program to PER_LINUX32_3GB, using the Linux system call sys_personality.
But there is a problem. After switching to PER_LINUX32_3GB, the Linux kernel will not allocate space in the upper 1 GB, but space that is already allocated there, for example the application stack, remains there.
The solution is to "restart" your program through the sys_execve system call.
Here is the code where I packed everything in one:
proc ___SwitchLinuxTo3GB
begin
cmp esp, $c0000000
jb .finish ; the system is native 32bit
; check the current personality.
mov eax, sys_personality
mov ebx, -1
int $80
; and exit if it is what intended
test eax, ADDR_LIMIT_3GB
jnz .finish ; everything is OK.
; set the needed personality
mov eax, sys_personality
mov ebx, PER_LINUX32_3GB
int $80
; and restart the process
mov eax, [esp+4] ; argument count
mov ebx, [esp+8] ; the filename of the executable.
lea ecx, [esp+8] ; the arguments list.
lea edx, [ecx+4*eax+4] ; the environment list.
mov eax, sys_execve
int $80
; if something went wrong, execution comes here and stops!
int3
.finish:
return
endp
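The same idea can be sketched in C with the libc wrappers, for readers who are not using assembly. This is only an outline under some assumptions: the function names needs_3gb_switch and switch_to_3gb are made up, the program is assumed to be built as a 32-bit executable, argv[0] is assumed to be a usable path to it (as in the asm above), and the check of esp against $c0000000 for native 32-bit systems is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/personality.h>
#include <unistd.h>

/* True if `persona` does not yet carry the 3 GB address-limit flag. */
bool needs_3gb_switch(unsigned long persona)
{
    return (persona & ADDR_LIMIT_3GB) == 0;
}

/* Switches the process to PER_LINUX32_3GB and re-executes it so that
 * already-existing mappings (such as the stack) are recreated under the
 * new limit. Returns 0 if the flag was already set, -1 on failure;
 * on success execv does not return. */
int switch_to_3gb(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);

    /* 0xffffffff queries the current personality without changing it */
    int current = personality(0xffffffff);
    if (current == -1)
        return -1;
    if (!needs_3gb_switch((unsigned long)current))
        return 0; /* already restarted with the 3 GB limit */

    if (personality(PER_LINUX32_3GB) == -1)
        return -1;

    /* restart the program; the environment is inherited via environ */
    execv(argv[0], argv);
    return -1; /* reached only if execv failed */
}
```

Calling switch_to_3gb(argv) at the top of main would mirror the asm procedure: the first run sets the personality and re-executes, and the second run sees the flag and carries on.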
