What's the meaning of mov 0x8(%r14,%r15,8),%rax


What does 0x8(%r14,%r15,8) mean here? I know it is the source operand, but I don't understand why two registers, %r14 and %r15, are used, or how to calculate the source address.
Thanks so much for any input.

Information pulled from http://flint.cs.yale.edu/cs421/papers/x86-asm/asm.html
AT&T Addressing:
Memory Address Reference: Address_or_Offset(%base_or_offset, %Index_Register, Scale)
Final Address Calculation: Address_or_Offset + %base_or_offset + [Scale * %Index_Reg]
Example:
mov (%esi,%ebx,4), %edx /* Move the 4 bytes of data at address ESI+4*EBX into EDX. */
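Applied to the operand in the question, the formula gives %r14 + %r15*8 + 0x8: %r14 is the base, %r15 is the index, 8 is the scale and 0x8 is the displacement. A minimal Go sketch of that arithmetic follows; the function name effectiveAddress and the register values are made up for illustration and are not taken from the question.

```go
package main

import "fmt"

// effectiveAddress mirrors the AT&T operand form disp(base, index, scale):
// the address used is base + index*scale + disp.
// (Hypothetical helper, for illustration only.)
func effectiveAddress(disp int64, base, index, scale uint64) uint64 {
	// A negative displacement wraps correctly under unsigned addition.
	return base + index*scale + uint64(disp)
}

func main() {
	// Assumed register contents, chosen only to make the arithmetic visible.
	r14 := uint64(0x1000) // base, e.g. the start of an array or struct
	r15 := uint64(3)      // index, e.g. an element number

	// mov 0x8(%r14,%r15,8),%rax loads 8 bytes from this address into %rax.
	addr := effectiveAddress(0x8, r14, r15, 8)
	fmt.Printf("%#x\n", addr) // 0x1000 + 3*8 + 0x8 = 0x1020
}
```

A typical use of this form is indexing an array of 8-byte elements that starts 8 bytes into some structure pointed to by the base register.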

Related

How to use unsafe get a byte slice from a string without memory copy

I have read about "https://github.com/golang/go/issues/25484" about no-copy conversion from []byte to string.
I am wondering if there is a way to convert a string to a byte slice without memory copy?
I am writing a program that processes terabytes of data; if every string is copied twice in memory, it will slow things down. I do not care about mutability or safety, since this is for internal use only; I just need it to be as fast as possible.
Example:
var s string
// some processing on s, for some reasons, I must use string here
// ...
// then output to a writer
gzipWriter.Write([]byte(s)) // !!! Here I want to avoid the memory copy, no WriteString
So the question is: is there a way to avoid the memory copy? I know I may need the unsafe package, but I do not know how. I have searched for a while and found no answer so far; none of the related answers on SO worked.

Getting the content of a string as a []byte without copying in general is only possible using unsafe, because strings in Go are immutable, and without a copy it would be possible to modify the contents of the string (by changing the elements of the byte slice).
So using unsafe, this is how it could look (corrected, working solution):
func unsafeGetBytes(s string) []byte {
	return (*[0x7fff0000]byte)(unsafe.Pointer(
		(*reflect.StringHeader)(unsafe.Pointer(&s)).Data),
	)[:len(s):len(s)]
}
This solution is from Ian Lance Taylor.
One thing to note here: the empty string "" has no bytes as its length is zero. This means there is no guarantee what the Data field may be, it may be zero or an arbitrary address shared among the zero-size variables. If an empty string may be passed, that must be checked explicitly (although there's no need to get the bytes of an empty string without copying...):
func unsafeGetBytes(s string) []byte {
	if s == "" {
		return nil // or []byte{}
	}
	return (*[0x7fff0000]byte)(unsafe.Pointer(
		(*reflect.StringHeader)(unsafe.Pointer(&s)).Data),
	)[:len(s):len(s)]
}
Original, wrong solution was:
func unsafeGetBytesWRONG(s string) []byte {
	return *(*[]byte)(unsafe.Pointer(&s)) // WRONG!!!!
}
See Nuno Cruces's answer below for reasoning.
Testing it:
s := "hi"
data := unsafeGetBytes(s)
fmt.Println(data, string(data))
data = unsafeGetBytes("gopher")
fmt.Println(data, string(data))
Output (try it on the Go Playground):
[104 105] hi
[103 111 112 104 101 114] gopher
BUT: You wrote you want this because you need performance. You also mentioned you want to compress the data. Please know that compressing data (using gzip) requires a lot more computation than just copying a few bytes! You will not see any noticeable performance gain by using this!
Instead, when you want to write strings to an io.Writer, it's recommended to do it via the io.WriteString() function, which, if possible, will do so without making a copy of the string (by checking for and calling a WriteString() method, which, if it exists, most likely does it better than copying the string). For details, see What's the difference between ResponseWriter.Write and io.WriteString?
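As a rough sketch of that advice for the gzip case in the question: to my knowledge *gzip.Writer has no WriteString method, so this example puts a bufio.Writer (which does have one) in front of it. The helper names gzipString and gunzip are hypothetical, and whether this is measurably faster than Write([]byte(s)) would need benchmarking.

```go
package main

import (
	"bufio"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// gzipString compresses s without an explicit []byte(s) conversion.
// io.WriteString calls bw.WriteString, which copies the string straight
// into bufio's internal buffer instead of allocating a temporary slice.
func gzipString(s string) ([]byte, error) {
	var out bytes.Buffer
	zw := gzip.NewWriter(&out)
	bw := bufio.NewWriter(zw)
	if _, err := io.WriteString(bw, s); err != nil {
		return nil, err
	}
	if err := bw.Flush(); err != nil { // push buffered bytes into gzip
		return nil, err
	}
	if err := zw.Close(); err != nil { // finish the gzip stream
		return nil, err
	}
	return out.Bytes(), nil
}

// gunzip reverses gzipString; it is only here to check the round trip.
func gunzip(data []byte) (string, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return "", err
	}
	defer zr.Close()
	b, err := io.ReadAll(zr)
	return string(b), err
}

func main() {
	z, err := gzipString("hello, gopher")
	if err != nil {
		panic(err)
	}
	s, err := gunzip(z)
	fmt.Println(s, err)
}
```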
There are also ways to access the contents of a string without converting it to []byte, such as indexing, or using a loop where the compiler optimizes away the copy:
s := "something"
for i, v := range []byte(s) { // Copying s is optimized away
// ...
}
Also see related questions:
[]byte(string) vs []byte(*string)
What are the possible consequences of using unsafe conversion from []byte to string in go?
What is the difference between the string and []byte in Go?
Does conversion between alias types in Go create copies?
How does type conversion internally work? What is the memory utilization for the same?

After some extensive investigation, I believe I've discovered the most efficient way of getting a []byte from a string as of Go 1.17 (this is for i386/x86_64 gc; I haven't tested other architectures.) The trade-off of being efficient code here is being inefficient to code, though.
Before I say anything else, it should be made clear that the differences are ultimately very small and probably inconsequential -- the info below is for fun/educational purposes only.
Summary
With some minor alterations, the accepted answer illustrating the technique of slicing a pointer to array is the most efficient way. That being said, I wouldn't be surprised if unsafe.Slice becomes the (decisively) better choice in the future.
unsafe.Slice
unsafe.Slice currently has the advantage of being slightly more readable, but I'm skeptical about its performance. It looks like it makes a call to runtime.unsafeslice. The following is the gc amd64 1.17 assembly of the function provided in Atamiri's answer (FUNCDATA omitted). Note the stack check (lack of NOSPLIT):
unsafeGetBytes_pc0:
TEXT "".unsafeGetBytes(SB), ABIInternal, $48-16
CMPQ SP, 16(R14)
PCDATA $0, $-2
JLS unsafeGetBytes_pc86
PCDATA $0, $-1
SUBQ $48, SP
MOVQ BP, 40(SP)
LEAQ 40(SP), BP
PCDATA $0, $-2
MOVQ BX, ""..autotmp_4+24(SP)
MOVQ AX, "".s+56(SP)
MOVQ BX, "".s+64(SP)
MOVQ "".s+56(SP), DX
PCDATA $0, $-1
MOVQ DX, ""..autotmp_5+32(SP)
LEAQ type.uint8(SB), AX
MOVQ BX, CX
MOVQ DX, BX
PCDATA $1, $1
CALL runtime.unsafeslice(SB)
MOVQ ""..autotmp_5+32(SP), AX
MOVQ ""..autotmp_4+24(SP), BX
MOVQ BX, CX
MOVQ 40(SP), BP
ADDQ $48, SP
RET
unsafeGetBytes_pc86:
NOP
PCDATA $1, $-1
PCDATA $0, $-2
MOVQ AX, 8(SP)
MOVQ BX, 16(SP)
CALL runtime.morestack_noctxt(SB)
MOVQ 8(SP), AX
MOVQ 16(SP), BX
PCDATA $0, $-1
JMP unsafeGetBytes_pc0
Other unimportant fun facts about the above (easily subject to change): compiled size of 3326B; has an inline cost of 7; correct escape analysis: s leaks to ~r1 with derefs=0.
Carefully Modifying *reflect.SliceHeader
This method has the advantage/disadvantage of letting one modify the internal state of a slice directly. Unfortunately, due to its multiline nature and use of uintptr, the GC can easily mess things up if one is not careful about keeping a reference to the original string. (Here I avoided creating temporary pointers to reduce inline cost and to avoid needing to add runtime.KeepAlive):
func unsafeGetBytes(s string) (b []byte) {
(*reflect.SliceHeader)(unsafe.Pointer(&b)).Data = (*reflect.StringHeader)(unsafe.Pointer(&s)).Data
(*reflect.SliceHeader)(unsafe.Pointer(&b)).Cap = len(s)
(*reflect.SliceHeader)(unsafe.Pointer(&b)).Len = len(s)
return
}
The corresponding assembly on amd64 (FUNCDATA omitted):
TEXT "".unsafeGetBytes(SB), NOSPLIT|ABIInternal, $32-16
SUBQ $32, SP
MOVQ BP, 24(SP)
LEAQ 24(SP), BP
MOVQ AX, "".s+40(SP)
MOVQ BX, "".s+48(SP)
MOVQ $0, "".b(SP)
MOVUPS X15, "".b+8(SP)
MOVQ "".s+40(SP), DX
MOVQ DX, "".b(SP)
MOVQ "".s+48(SP), CX
MOVQ CX, "".b+16(SP)
MOVQ "".s+48(SP), BX
MOVQ BX, "".b+8(SP)
MOVQ "".b(SP), AX
MOVQ 24(SP), BP
ADDQ $32, SP
RET
Other unimportant fun facts about the above (easily subject to change): compiled size of 3700B; has an inline cost of 20; subpar escape analysis: s leaks to {heap} with derefs=0.
Unsafer version of modifying SliceHeader
Adapted from Nuno Cruces' answer. This relies on the inherent structural similarity between StringHeader and SliceHeader, so in a sense it breaks "more easily". Additionally, it temporarily creates an illegal state where cap(b) (being 0) is less than len(b).
func unsafeGetBytes(s string) (b []byte) {
*(*string)(unsafe.Pointer(&b)) = s
(*reflect.SliceHeader)(unsafe.Pointer(&b)).Cap = len(s)
return
}
Corresponding assembly (FUNCDATA omitted):
TEXT "".unsafeGetBytes(SB), NOSPLIT|ABIInternal, $32-16
SUBQ $32, SP
MOVQ BP, 24(SP)
LEAQ 24(SP), BP
MOVQ AX, "".s+40(FP)
MOVQ $0, "".b(SP)
MOVUPS X15, "".b+8(SP)
MOVQ AX, "".b(SP)
MOVQ BX, "".b+8(SP)
MOVQ BX, "".b+16(SP)
MOVQ "".b(SP), AX
MOVQ BX, CX
MOVQ 24(SP), BP
ADDQ $32, SP
NOP
RET
Other unimportant details: compiled size 3636B, inline cost of 11, with subpar escape analysis: s leaks to {heap} with derefs=0.
Slicing a pointer to array
This is the accepted answer (shown here for comparison) -- its primary disadvantage is its ugliness (viz. magic number 0x7fff0000). There's also the tiniest possibility of getting a string bigger than the array, and an unavoidable bounds check.
func unsafeGetBytes(s string) []byte {
return (*[0x7fff0000]byte)(unsafe.Pointer(
(*reflect.StringHeader)(unsafe.Pointer(&s)).Data),
)[:len(s):len(s)]
}
Corresponding assembly (FUNCDATA removed).
TEXT "".unsafeGetBytes(SB), NOSPLIT|ABIInternal, $24-16
SUBQ $24, SP
MOVQ BP, 16(SP)
LEAQ 16(SP), BP
PCDATA $0, $-2
MOVQ AX, "".s+32(SP)
MOVQ BX, "".s+40(SP)
MOVQ "".s+32(SP), AX
PCDATA $0, $-1
TESTB AL, (AX)
NOP
CMPQ BX, $2147418112
JHI unsafeGetBytes_pc54
MOVQ BX, CX
MOVQ 16(SP), BP
ADDQ $24, SP
RET
unsafeGetBytes_pc54:
MOVQ BX, DX
MOVL $2147418112, BX
PCDATA $1, $1
NOP
CALL runtime.panicSlice3Alen(SB)
XCHGL AX, AX
Other unimportant details: compiled size 3142B, inline cost of 9, with correct escape analysis: s leaks to ~r1 with derefs=0
Note the runtime.panicSlice3Alen call: this is the bounds check that verifies len(s) is within 0x7fff0000.
Improved slicing pointer to array
This is what I've concluded to be the most efficient method as of Go 1.17. I basically modified the accepted answer to eliminate the bounds check, and found a "more meaningful" constant (math.MaxInt32) to use than 0x7fff0000. Using MaxInt32 preserves 32-bit compatibility.
func unsafeGetBytes(s string) []byte {
const MaxInt32 = 1<<31 - 1
return (*[MaxInt32]byte)(unsafe.Pointer((*reflect.StringHeader)(
unsafe.Pointer(&s)).Data))[:len(s)&MaxInt32:len(s)&MaxInt32]
}
Corresponding assembly (FUNCDATA removed):
TEXT "".unsafeGetBytes(SB), NOSPLIT|ABIInternal, $0-16
PCDATA $0, $-2
MOVQ AX, "".s+8(SP)
MOVQ BX, "".s+16(SP)
MOVQ "".s+8(SP), AX
PCDATA $0, $-1
TESTB AL, (AX)
ANDQ $2147483647, BX
MOVQ BX, CX
RET
Other unimportant details: compiled size 3188B, inline cost of 13, and correct escape analysis: s leaks to ~r1 with derefs=0

In go 1.17, I'd recommend unsafe.Slice as more readable:
unsafe.Slice((*byte)(unsafe.Pointer((*reflect.StringHeader)(unsafe.Pointer(&s)).Data)), len(s))
I think that this also works (doesn't violate any unsafe.Pointer rules), with the benefit that it works for a const s:
*(*[]byte)(unsafe.Pointer(&struct{string; int}{s, len(s)}))
Commentary below is regarding the accepted answer as it originally stood. The accepted answer now mentions an (authoritative) solution from Ian Lance Taylor. Keeping it as it points out a common error.
The accepted answer is wrong, and may produce the panic @RFC mentioned in the comments. The explanation by @icza about GC and keep alive is misguided.
The reason capacity is zero (or even an arbitrary value) is more prosaic.
A slice is:
type SliceHeader struct {
Data uintptr
Len int
Cap int
}
A string is:
type StringHeader struct {
Data uintptr
Len int
}
Converting a byte slice to a string can be "safely" done as the strings.Builder does it:
func (b *Builder) String() string {
return *(*string)(unsafe.Pointer(&b.buf))
}
This will copy the Data pointer and Len from the slice to the string.
The opposite conversion is not "safe" because Cap doesn't get set to the correct value.
The following (originally by me) is also wrong because it violates unsafe.Pointer rule #1.
This is the correct code, that fixes the panic:
var buf = *(*[]byte)(unsafe.Pointer(&str))
(*reflect.SliceHeader)(unsafe.Pointer(&buf)).Cap = len(str)
Or perhaps:
var buf []byte
*(*string)(unsafe.Pointer(&buf)) = str
(*reflect.SliceHeader)(unsafe.Pointer(&buf)).Cap = len(str)
I should add that all these conversions are unsafe in the sense that strings are expected to be immutable, and byte arrays/slices mutable.
But if you know for sure that the byte slice won't be mutated, you won't get bounds (or GC) issues with the above conversions.

In Go 1.17, one can now use unsafe.Slice, so the accepted answer can be rewritten as follows:
func unsafeGetBytes(s string) []byte {
return unsafe.Slice((*byte)(unsafe.Pointer((*reflect.StringHeader)(unsafe.Pointer(&s)).Data)), len(s))
}

I managed to get the goal by this:
func TestString(t *testing.T) {
b := []byte{'a', 'b', 'c', '1', '2', '3', '4'}
s := *(*string)(unsafe.Pointer(&b))
sb := *(*[]byte)(unsafe.Pointer(&s))
addr1 := unsafe.Pointer(&b)
addr2 := unsafe.Pointer(&s)
addr3 := unsafe.Pointer(&sb)
fmt.Print("&b=", addr1, "\n&s=", addr2, "\n&sb=", addr3, "\n")
hdr1 := (*reflect.StringHeader)(unsafe.Pointer(&b))
hdr2 := (*reflect.SliceHeader)(unsafe.Pointer(&s))
hdr3 := (*reflect.SliceHeader)(unsafe.Pointer(&sb))
fmt.Print("b.data=", hdr1.Data, "\ns.data=", hdr2.Data, "\nsb.data=", hdr3.Data, "\n")
b[0] = 'X'
sb[1] = 'Y' // if sb is from a string directly, this will cause nil panic
fmt.Print("s=", s, "\nsb=")
for _, c := range sb {
fmt.Printf("%c", c)
}
fmt.Println()
}
Output:
=== RUN TestString
&b=0xc000218000
&s=0xc00021a000
&sb=0xc000218020
b.data=824635867152
s.data=824635867152
sb.data=824635867152
s=XYc1234
sb=XYc1234
These variables all share the same memory.

Go 1.20 (February 2023)
You can use unsafe.StringData to greatly simplify YenForYang's answer:
StringData returns a pointer to the underlying bytes of str. For an empty string the return value is unspecified, and may be nil.
Since Go strings are immutable, the bytes returned by StringData must not be modified.
func main() {
str := "foobar"
d := unsafe.StringData(str)
b := unsafe.Slice(d, len(str))
fmt.Printf("%T, %s\n", b, b) // []uint8, foobar (byte is alias of uint8)
}
Go tip playground: https://go.dev/play/p/FIXe0rb8YHE?v=gotip
Remember that you can't assign to b[n]. The memory is still read-only.
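Wrapped up as a helper, the Go 1.20 approach might look like the sketch below. It assumes Go 1.20 or later; the empty-string check is there because StringData's result is unspecified for "", as quoted above. The function name is reused from the earlier answers for comparison only.

```go
package main

import (
	"fmt"
	"unsafe"
)

// unsafeGetBytes returns a []byte that aliases the bytes of s without
// copying. The result must be treated as read-only: writing to it is
// undefined behavior because Go strings are immutable.
func unsafeGetBytes(s string) []byte {
	if s == "" {
		return nil // StringData is unspecified for the empty string
	}
	return unsafe.Slice(unsafe.StringData(s), len(s))
}

func main() {
	b := unsafeGetBytes("gopher")
	fmt.Println(b, string(b), len(b), cap(b))
}
```

Both the length and the capacity of the result equal len(s), so an append on the returned slice reallocates instead of writing past the string.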

Simple, no reflect, and I think it is portable. s is your string and b is your byte slice:
var b []byte
bb:=(*[3]uintptr)(unsafe.Pointer(&b))[:]
copy(bb, (*[2]uintptr)(unsafe.Pointer(&s))[:])
bb[2] = bb[1]
// use b
Remember, the byte values should not be modified (doing so will panic). Re-slicing is OK (for example: bytes.Split(b, []byte{','})).

What is the point of atomic.Load and atomic.Store

Go's memory model states nothing about atomics and their relation to memory fencing.
Although many internal packages seem to rely on the memory ordering that could be provided if atomics created memory fences around them. See this issue for details.
After not understanding how it really works, I went to the sources, in particular src/runtime/internal/atomic/atomic_amd64.go and found following implementations of Load and Store:
//go:nosplit
//go:noinline
func Load(ptr *uint32) uint32 {
return *ptr
}
Store is implemented in asm_amd64.s in the same package.
TEXT runtime∕internal∕atomic·Store(SB), NOSPLIT, $0-12
MOVQ ptr+0(FP), BX
MOVL val+8(FP), AX
XCHGL AX, 0(BX)
RET
Both look as if they had nothing to do with parallelism.
I did look into other architectures but implementation seems to be equivalent.
However, if atomics are indeed weak and provide no memory ordering guarantees, then the code below could fail, but it does not.
As an addition I tried replacing atomic calls with simple assignments but it still produces consistent and "successful" result in both cases.
func try() {
	var a, b int32
	go func() {
		// atomic.StoreInt32(&a, 1)
		// atomic.StoreInt32(&b, 1)
		a = 1
		b = 1
	}()
	for {
		// if n := atomic.LoadInt32(&b); n == 1 {
		if n := b; n == 1 {
			if a != 1 {
				panic("fail")
			}
			break
		}
		runtime.Gosched()
	}
}
func main() {
	n := 1000000000
	for i := 0; i < n; i++ {
		try()
	}
}
The next thought was that the compiler does some magic to provide ordering guarantees. So below is the listing of the variant with atomic Store and Load not commented. Full listing is available on the pastebin.
// Anonymous function implementation with atomic calls inlined
TEXT "".try.func1(SB) gofile../path/atomic.go
atomic.StoreInt32(&a, 1)
0x816 b801000000 MOVL $0x1, AX
0x81b 488b4c2408 MOVQ 0x8(SP), CX
0x820 8701 XCHGL AX, 0(CX)
atomic.StoreInt32(&b, 1)
0x822 b801000000 MOVL $0x1, AX
0x827 488b4c2410 MOVQ 0x10(SP), CX
0x82c 8701 XCHGL AX, 0(CX)
}()
0x82e c3 RET
// Important "cycle" part of try() function
0x6ca e800000000 CALL 0x6cf [1:5]R_CALL:runtime.newproc
for {
0x6cf eb12 JMP 0x6e3
runtime.Gosched()
0x6d1 90 NOPL
checkTimeouts()
0x6d2 90 NOPL
mcall(gosched_m)
0x6d3 488d0500000000 LEAQ 0(IP), AX [3:7]R_PCREL:runtime.gosched_m·f
0x6da 48890424 MOVQ AX, 0(SP)
0x6de e800000000 CALL 0x6e3 [1:5]R_CALL:runtime.mcall
if n := atomic.LoadInt32(&b); n == 1 {
0x6e3 488b442420 MOVQ 0x20(SP), AX
0x6e8 8b08 MOVL 0(AX), CX
0x6ea 83f901 CMPL $0x1, CX
0x6ed 75e2 JNE 0x6d1
if a != 1 {
0x6ef 488b442428 MOVQ 0x28(SP), AX
0x6f4 833801 CMPL $0x1, 0(AX)
0x6f7 750a JNE 0x703
0x6f9 488b6c2430 MOVQ 0x30(SP), BP
0x6fe 4883c438 ADDQ $0x38, SP
0x702 c3 RET
As you can see, no fences or locks are in place again.
Note: all tests are done on x86_64 and i5-8259U
The question:
So, is there any point of wrapping simple pointer dereference in a function call or is there some hidden meaning to it and why do these atomics still work as memory barriers? (if they do)

I don't know Go at all, but it looks like the x86-64 implementations of .load() and .store() are sequentially-consistent. Presumably on purpose / for a reason!
//go:noinline on the load means the compiler can't reorder around a blackbox non-inline function, I assume. On x86 that's all you need for the load side of sequential-consistency, or acq-rel. A plain x86 mov load is an acquire load.
The compiler-generated code gets to take advantage of x86's strongly-ordered memory model, which is sequential consistency + a store buffer (with store forwarding), i.e. acq/rel. To recover sequential consistency, you only need to drain the store buffer after a release-store.
.store() is written in asm, loading its stack args and using xchg as a seq-cst store.
XCHG with memory has an implicit lock prefix which is a full barrier; it's an efficient alternative to mov+mfence to implement what C++ would call a memory_order_seq_cst store.
It flushes the store buffer before later loads and stores are allowed to touch L1d cache. Why does a std::atomic store with sequential consistency use XCHG?
See
https://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/
C/C++11 mappings to processors
describes the sequences of instructions that implement relaxed load/store, acq/rel load/store, seq-cst load/store, and various barriers, on various ISAs. So you can recognize things like xchg with memory.
Does lock xchg have the same behavior as mfence? (TL:DR: yes except for maybe some corner cases with NT loads from WC memory, e.g. from video RAM). You may see a dummy lock add $0, (SP) used as an alternative to mfence in some code.
IIRC, AMD's optimization manual even recommends this. It's good on Intel as well, especially on Skylake where mfence was strengthened by microcode update to fully block out-of-order exec even of ALU instructions (like lfence) as well as memory reordering. (To fix an erratum with NT loads.)
https://preshing.com/20120913/acquire-and-release-semantics/
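To connect this back to the question's test program, here is a small Go sketch of the same "publish a value, then set a flag" pattern written with the atomics left in. The function name publish is hypothetical. It relies on the Go memory model as revised for Go 1.19, which documents sync/atomic operations as behaving in a sequentially consistent way; that guarantee was not yet written down when the question was asked.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// publish writes a plain (non-atomic) payload and then sets a flag with an
// atomic store. The reader spins on an atomic load of the flag; once it
// observes 1, the earlier write to data is guaranteed to be visible.
func publish() int {
	var data int    // ordinary variable, written before the flag
	var ready int32 // accessed only through sync/atomic

	go func() {
		data = 42
		atomic.StoreInt32(&ready, 1) // on amd64 this is the XCHG seen above
	}()

	for atomic.LoadInt32(&ready) == 0 {
		runtime.Gosched()
	}
	return data
}

func main() {
	fmt.Println(publish())
}
```

With plain assignments instead of the atomic calls, the program has a data race and the compiler is free to reorder or hoist the accesses, even if it happens to work on x86 in practice.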

In MSVC, why do InterlockedOr and InterlockedAnd generate a loop instead of a simple locked instruction?

On MSVC for x64 (19.10.25019),
InterlockedOr(&g, 1)
generates this code sequence:
prefetchw BYTE PTR ?g@@3JC
mov eax, DWORD PTR ?g@@3JC ; g
npad 3
$LL3@f:
mov ecx, eax
or ecx, 1
lock cmpxchg DWORD PTR ?g@@3JC, ecx ; g
jne SHORT $LL3@f
I would have expected the much simpler (and loopless):
mov eax, 1
lock or [?g@@3JC], eax
InterlockedAnd generates analogous code to InterlockedOr.
It seems wildly inefficient to have to have a loop for this instruction. Why is this code generated?
(As a side note: the whole reason I was using InterlockedOr was to do an atomic load of the variable - I have since learned that InterlockedCompareExchange is the way to do this. It is odd to me that there is no InterlockedLoad(&x), but I digress...)

The documented contract for InterlockedOr has it returning the original value:
InterlockedOr
Performs an atomic OR operation on the specified LONG values. The function prevents more than one thread from using the same variable simultaneously.
LONG __cdecl InterlockedOr(
_Inout_ LONG volatile *Destination,
_In_ LONG Value
);
Parameters:
Destination [in, out]
A pointer to the first operand. This value will be replaced with the result of the operation.
Value [in]
The second operand.
Return value
The function returns the original value of the Destination parameter.
This is why the unusual code that you've observed is required. The compiler cannot simply emit an OR instruction with a LOCK prefix, because the OR instruction does not return the previous value. Instead, it has to use the odd workaround with LOCK CMPXCHG in a loop. In fact, this apparently unusual sequence is the standard pattern for implementing interlocked operations when they aren't natively supported by the underlying hardware: capture the old value, perform an interlocked compare-and-exchange with the new value, and keep trying in a loop until the old value from this attempt is equal to the captured old value.
As you observed, you see the same thing with InterlockedAnd, for exactly the same reason: the x86 AND instruction doesn't return the original value, so the code generator has to fall back on the general pattern involving compare-and-exchange, which is directly supported by the hardware.
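The general pattern described above (capture the old value, attempt a compare-and-exchange, retry on failure) can be sketched in Go as follows. This is only an illustration of the loop, not MSVC's implementation; fetchOr is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// fetchOr atomically ORs mask into *addr and returns the value that was
// there before, using a compare-and-swap retry loop.
func fetchOr(addr *int32, mask int32) int32 {
	for {
		old := atomic.LoadInt32(addr) // capture the current value
		// Install old|mask only if nobody changed *addr in the meantime.
		if atomic.CompareAndSwapInt32(addr, old, old|mask) {
			return old
		}
		// Another goroutine won the race; reload and try again.
	}
}

func main() {
	v := int32(42)
	old := fetchOr(&v, 1)
	fmt.Println(old, v) // 42 43
}
```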
Note that, at least on x86 where InterlockedOr is implemented as an intrinsic, the optimizer is smart enough to figure out whether you're using the return value or not. If you are, then it uses the workaround code involving CMPXCHG. If you are ignoring the return value, then it goes ahead and emits code using LOCK OR, just like you would expect.
#include <intrin.h>
LONG InterlockedOrWithReturn()
{
LONG val = 42;
return _InterlockedOr(&val, 8);
}
void InterlockedOrWithoutReturn()
{
LONG val = 42;
LONG old = _InterlockedOr(&val, 8);
}
InterlockedOrWithoutReturn, COMDAT PROC
mov DWORD PTR [rsp+8], 42
lock or DWORD PTR [rsp+8], 8
ret 0
InterlockedOrWithoutReturn ENDP
InterlockedOrWithReturn, COMDAT PROC
mov DWORD PTR [rsp+8], 42
prefetchw BYTE PTR [rsp+8]
mov eax, DWORD PTR [rsp+8]
LoopTop:
mov ecx, eax
or ecx, 8
lock cmpxchg DWORD PTR [rsp+8], ecx
jne SHORT LoopTop
ret 0
InterlockedOrWithReturn ENDP
The optimizer is equally smart for InterlockedAnd, and should be for the other Interlocked* functions, as well.
As intuition would tell you, the LOCK OR implementation is more efficient than the LOCK CMPXCHG in a loop. Not only is there the expanded code size and the overhead of looping, but you risk branch prediction misses, which can cost a large number of cycles. In performance-critical code, if you can avoid relying on the return value for interlocked operations, you can gain a performance boost.
However, what you really should be using in modern C++ is std::atomic, which allows you to specify the desired memory model/semantics, and then let the standard library maintainers deal with the complexity.

Compare user-inputted string/character to another string/character

So I'm a bit of a beginner to ARM Assembly (assembly in general, too). Right now I'm writing a program and one of the biggest parts of it is that the user will need to type in a letter, and then I will compare that letter to some other pre-inputted letter to see if the user typed the same thing.
For instance, in my code I have
.balign 4 /* Forces the next data declaration to be on a 4 byte segment */
dime: .asciz "D\n"
at the top of the file and
addr_dime : .word dime
at the bottom of the file.
Also, based on what I've been reading online I put
.balign 4
inputChoice: .asciz "%d"
at the top of the file, and put
inputVal : .word 0
at the bottom of the file.
Near the middle of the file (just trust me that there is something wrong with this standalone code, and the rest of the file doesn't matter in this context) I have this block of code:
ldr r3, addr_dime
ldr r2, addr_inputChoice
cmp r2, r3 /*See if the user entered D*/
addeq r5, r5, #10 /*add 10 to the total if so*/
Which I THINK should load "D" into r3, load whatever String or character the user inputted into r2, and then add 10 to r5 if they are the same.
For some reason this doesn't work, and the r5, r5, #10 code only works if addne comes before it.

addr_dime : .word dime is pointlessly over-complicated. The address is already a link-time constant. Storing the address in memory (at another location which has its own address) doesn't help you at all, it just adds another layer of indirection. (Which is actually the source of your problem.)
Anyway, cmp doesn't dereference its register operands, so you're comparing pointers. If you single-step with a debugger, you'll see that the values in registers are pointers.
To load the single byte at dime, zero-extended into r3, do
ldrb r3, dime
Using ldr to do a 32-bit load would also get the \n byte, and a 32-bit comparison would have to match that too for eq to be true.
But this can only work if dime is close enough for a PC-relative addressing mode to fit; like most RISC machines, ARM can't use arbitrary absolute addresses because the instruction-width is fixed.
For the constant, the easiest way to avoid that is not to store it in memory in the first place. Use .equ dime, 'D' to define a numeric constant, then you can use
cmp r2, #dime @ compare with immediate operand
Or ldr r3, =dime to ask the assembler to get the constant into a register for you. You can do this with addresses, so you could do
ldr r2, =inputVal @ r2 = &inputVal
ldrb r2, [r2] @ load first byte of inputVal
This is the generic way to handle loading from static data that might be too far away for a PC-relative addressing mode.
You could avoid that by using a stack address (sub sp, #16 / mov r5, sp or something). Then you already have the address in a register.
This is exactly what a C compiler does:
char dime[4] = "D\n";
char input[4] = "xyz";
int foo(int start) {
if (dime[0] == input[0])
start += 10;
return start;
}
From ARM32 gcc6.3 on the Godbolt compiler explorer:
foo:
ldr r3, .L4 @ load a pointer to the data section at dime / input
ldrb r2, [r3]
ldrb r3, [r3, #4]
cmp r2, r3
addeq r0, r0, #10
bx lr
.L4:
# gcc created this "literal pool" next to the code
# holding a pointer it can use to access the data section,
# wherever the linker ends up putting it.
.word .LANCHOR0
.section .data
.p2align 2
### These are in a different section, near each other.
### On Godbolt, click the .text button to see full assembler directives.
.LANCHOR0: @ actually defined with a .set directive, but same difference.
dime:
.ascii "D\012\000"
input:
.ascii "xyz\000"
Try changing the C to compare with a literal character instead of a global the compiler can't optimize into a constant, and see what you get.

Using relative values in array sorting ( asm )

I need to go through an array and sort each individual row into ascending order. It doesn't seem to be going so well (surprise!) as I keep getting hit with two errors:
a2101: cannot add two relocatable labels
and
a2026: constant expected
Here's my sort. It makes sense to me, but I think I'm still trying to apply high-level language techniques in assembly. Is there a way to get around not being able to use relative values? (The array is 7 rows by 9 columns, btw.)
mov cx, 7; cx = number of rows
outer: ; outer loop walk through the rows
push cx
mov cx, 9
mov row, cx ;rows
middle: ; middle-loop walk through the columns
push cx
sub cx, 1 ;cx = cx-1
mov column, cx ;columns
inner: ;inner loop - compare and exchange column values
cmp mArray[row*9 + column], mArray[row*9 + column+1]
xchg mArray[row*9 + column+1], mArray[row*9 + column]
; compare and exchange values from mArray table
inc column
loop inner
pop cx
loop middle ;end middle loop
pop cx
loop outer ; end outer loop
ret
Thanks for any help.

The following lines are problematic:
cmp mArray[row*9 + column], mArray[row*9 + column+1]
xchg mArray[row*9 + column+1], mArray[row*9 + column]
Unlike HLL, assembly does NOT allow for arbitrary expressions in place of constants or variables. That's why HLL's were invented in the first place. Calculate the offset in the registers before using:
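mov ax, row
mov bx, ax
shl bx, 3 ; bx = row*8 now (shift left multiplies; shr would divide)
add bx, ax ; bx = row*9 now
add bx, column ; bx = row*9+column now
mov dx, mArray[bx] ; first comparand, read relative to the array base
inc bx ; next element (assumes byte-sized elements; for words, scale the offset by 2)
cmp dx, mArray[bx] ; that's your compare!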
Also, you don't use any branching, so the cmp instruction is utterly pointless: you waste its result, and xchg is not executed conditionally. Read up on conditional jump instructions (jz/jnz etc.).
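For reference, this is the algorithm the assembly is trying to express, written as a Go sketch: a per-row bubble sort over a flat 7 by 9 array, with the offset row*9+column computed explicitly and the exchange done only when the comparison says so. The element type int16 and the function name sortRows are assumptions; the question does not give them.

```go
package main

import "fmt"

const rows, cols = 7, 9 // array dimensions from the question

// sortRows sorts each row of a flat rows*cols array into ascending order.
func sortRows(a []int16) {
	for row := 0; row < rows; row++ {
		base := row * cols // what the shl/add sequence computes in asm
		for pass := 0; pass < cols-1; pass++ {
			for col := 0; col < cols-1-pass; col++ {
				i := base + col
				// The conditional branch missing from the original loop:
				// swap only when the pair is out of order.
				if a[i] > a[i+1] {
					a[i], a[i+1] = a[i+1], a[i]
				}
			}
		}
	}
}

func main() {
	a := make([]int16, rows*cols)
	for i := range a {
		a[i] = int16(len(a) - i) // descending test data
	}
	sortRows(a)
	fmt.Println(a[:cols]) // first row after sorting
}
```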
Also, I seriously hope this is an exercise, not a real project. If it's for real, please reconsider using assembly. For something as trivial as this, assembly is a wrong, wrong choice. Especially considering how bad you are at it.
