I need to walk through an array and sort each individual row into ascending order. It doesn't seem to be going so well (surprise!), as I keep getting hit with two errors:
a2101: cannot add two relocatable labels
a2026: constant expected
Here's my sort. It makes sense to me, but I think I'm still trying to apply high-level language techniques in assembly. Is there a way to get around not being able to use relative values? (The array is 7 rows by 9 columns, by the way.)
mov cx, 7; cx = number of rows
outer: ; outer loop walk through the rows
push cx
mov cx, 9
mov row, cx ;rows
middle: ; middle-loop walk through the columns
push cx
sub cx, 1 ;cx = cx-1
mov column, cx ;columns
inner: ;inner loop - compare and exchange column values
cmp mArray[row*9 + column], mArray[row*9 + column+1]
xchg mArray[row*9 + column+1], mArray[row*9 + column]
; compare and exchange values from mArray table
inc column
loop inner
pop cx
loop middle ;end middle loop
pop cx
loop outer ; end outer loop
Thanks for any help.

The following lines are problematic:
cmp mArray[row*9 + column], mArray[row*9 + column+1]
xchg mArray[row*9 + column+1], mArray[row*9 + column]
Unlike a high-level language (HLL), assembly does NOT allow arbitrary expressions in place of constants or variables. That's why HLLs were invented in the first place. Calculate the offset in registers before using it:
mov ax, row
mov bx, ax
shl bx, 3 ; bx = row*8 now
add bx, ax ; bx = row*9 now
add bx, column ; bx = row*9+column now
mov dx, mArray[bx] ; first comparand
inc bx
cmp dx, mArray[bx] ; that's your compare!
Also, you don't use any branching. The cmp instruction is pointless as written: you discard its result, and xchg is not executed conditionally. Read up on the conditional jump instructions (jz/jnz etc.).
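For readers more used to high-level code, the two points above (compute the offset first, then exchange only when the comparison says so) can be sketched in Go. This is only an illustration of the logic, not the assembly itself. The 7x9 byte layout comes from the question; the names rows, cols, offset and sortRows are made up for this sketch.

```go
package main

// Illustrative only: a flat 7x9 byte table, mirroring the mArray layout
// described in the question. All names here are hypothetical.
const (
	rows = 7
	cols = 9
)

// offset computes row*9 + column the way the assembly does:
// shift left by 3 for row*8, then add row once more, then the column.
func offset(row, column int) int {
	return (row << 3) + row + column
}

// sortRows bubble-sorts each row of the flat table in ascending order.
// The swap happens only when the comparison requires it, which is the
// job of a conditional jump placed after cmp in assembly.
func sortRows(table *[rows * cols]byte) {
	for row := 0; row < rows; row++ {
		for pass := 0; pass < cols-1; pass++ {
			for column := 0; column < cols-1-pass; column++ {
				i := offset(row, column)
				if table[i] > table[i+1] { // cmp + conditional jump
					table[i], table[i+1] = table[i+1], table[i] // xchg
				}
			}
		}
	}
}
```

Calling sortRows(&table) on a [63]byte table leaves each 9-byte row in ascending order. In assembly, the if corresponds to a cmp followed by a conditional jump that skips the exchange.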
Also, I seriously hope this is an exercise, not a real project. If it's for real, please reconsider using assembly. For something as trivial as this, assembly is the wrong choice, especially considering how bad you are at it.


How to use unsafe get a byte slice from a string without memory copy

I have read about no-copy conversion from []byte to string.
I am wondering if there is a way to convert a string to a byte slice without memory copy?
I am writing a program that processes terabytes of data. If every string is copied twice in memory, it will slow things down. I do not care about mutability or safety; this is for internal use only, and I just need it to be as fast as possible.
var s string
// some processing on s, for some reasons, I must use string here
// ...
// then output to a writer
gzipWriter.Write([]byte(s)) // !!! Here I want to avoid the memory copy, no WriteString
So the question is: is there a way to avoid the memory copy? I know I may need the unsafe package, but I do not know how to use it here. I have searched for a while without finding an answer, and none of the related answers on SO worked.
Getting the content of a string as a []byte without copying in general is only possible using unsafe, because strings in Go are immutable, and without a copy it would be possible to modify the contents of the string (by changing the elements of the byte slice).
Using unsafe, this is what it could look like (corrected, working solution):
func unsafeGetBytes(s string) []byte {
	return (*[0x7fff0000]byte)(unsafe.Pointer(
		(*reflect.StringHeader)(unsafe.Pointer(&s)).Data))[:len(s):len(s)]
}
This solution is from Ian Lance Taylor.
One thing to note here: the empty string "" has no bytes as its length is zero. This means there is no guarantee what the Data field may be, it may be zero or an arbitrary address shared among the zero-size variables. If an empty string may be passed, that must be checked explicitly (although there's no need to get the bytes of an empty string without copying...):
func unsafeGetBytes(s string) []byte {
	if s == "" {
		return nil // or []byte{}
	}
	return (*[0x7fff0000]byte)(unsafe.Pointer(
		(*reflect.StringHeader)(unsafe.Pointer(&s)).Data))[:len(s):len(s)]
}
Original, wrong solution was:
func unsafeGetBytesWRONG(s string) []byte {
	return *(*[]byte)(unsafe.Pointer(&s)) // WRONG!!!!
}
See Nuno Cruces's answer below for reasoning.
Testing it:
s := "hi"
data := unsafeGetBytes(s)
fmt.Println(data, string(data))
data = unsafeGetBytes("gopher")
fmt.Println(data, string(data))
Output (try it on the Go Playground):
[104 105] hi
[103 111 112 104 101 114] gopher
BUT: You wrote you want this because you need performance. You also mentioned you want to compress the data. Please know that compressing data (using gzip) requires a lot more computation than just copying a few bytes! You will not see any noticeable performance gain by using this!
Instead, when you want to write strings to an io.Writer, it's recommended to do it via the io.WriteString() function, which will do so without making a copy of the string if possible (it checks for a WriteString() method and calls it if it exists, which most likely handles the string better than copying it). For details, see What's the difference between ResponseWriter.Write and io.WriteString?
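As a rough sketch of that advice applied to the gzip case from the question: wrapping the gzip writer in a bufio.Writer gives io.WriteString a WriteString method to call, so no temporary []byte has to be allocated for the conversion (the bytes are still copied into the buffer). The names writeCompressed and roundTrip are hypothetical, and this is one possible arrangement rather than the only one.

```go
package main

import (
	"bufio"
	"bytes"
	"compress/gzip"
	"io"
)

// writeCompressed gzips s into dst. The name and signature are
// assumptions made for this sketch, not part of the original answer.
func writeCompressed(dst io.Writer, s string) error {
	zw := gzip.NewWriter(dst)
	// bufio.Writer implements WriteString, so io.WriteString can hand
	// the string over without first converting it to a []byte.
	bw := bufio.NewWriter(zw)
	if _, err := io.WriteString(bw, s); err != nil {
		return err
	}
	if err := bw.Flush(); err != nil {
		return err
	}
	return zw.Close()
}

// roundTrip compresses s and decompresses it again, as a sanity check.
func roundTrip(s string) (string, error) {
	var buf bytes.Buffer
	if err := writeCompressed(&buf, s); err != nil {
		return "", err
	}
	zr, err := gzip.NewReader(&buf)
	if err != nil {
		return "", err
	}
	defer zr.Close()
	out, err := io.ReadAll(zr)
	return string(out), err
}
```

In a long-running program the bufio.Writer and gzip.Writer would normally be created once and reused across many strings, rather than per call as in this sketch.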
There are also ways to access the contents of a string without converting it to []byte, such as indexing, or using a loop where the compiler optimizes away the copy:
s := "something"
for i, v := range []byte(s) { // Copying s is optimized away
	// ...
}
Also see related questions:
[]byte(string) vs []byte(*string)
What are the possible consequences of using unsafe conversion from []byte to string in go?
What is the difference between the string and []byte in Go?
Does conversion between alias types in Go create copies?
How does type conversion internally work? What is the memory utilization for the same?
After some extensive investigation, I believe I've discovered the most efficient way of getting a []byte from a string as of Go 1.17 (this is for i386/x86_64 gc; I haven't tested other architectures.) The trade-off of being efficient code here is being inefficient to code, though.
Before I say anything else, it should be made clear that the differences are ultimately very small and probably inconsequential -- the info below is for fun/educational purposes only.
With some minor alterations, the accepted answer illustrating the technique of slicing a pointer to array is the most efficient way. That being said, I wouldn't be surprised if unsafe.Slice becomes the (decisively) better choice in the future.
unsafe.Slice currently has the advantage of being slightly more readable, but I'm skeptical about its performance. It looks like it makes a call to runtime.unsafeslice. The following is the gc amd64 1.17 assembly of the function provided in Atamiri's answer (FUNCDATA omitted). Note the stack check (lack of NOSPLIT):
TEXT "".unsafeGetBytes(SB), ABIInternal, $48-16
CMPQ SP, 16(R14)
PCDATA $0, $-2
JLS unsafeGetBytes_pc86
PCDATA $0, $-1
SUBQ $48, SP
PCDATA $0, $-2
MOVQ BX, ""..autotmp_4+24(SP)
MOVQ AX, "".s+56(SP)
MOVQ BX, "".s+64(SP)
MOVQ "".s+56(SP), DX
PCDATA $0, $-1
MOVQ DX, ""..autotmp_5+32(SP)
LEAQ type.uint8(SB), AX
PCDATA $1, $1
CALL runtime.unsafeslice(SB)
MOVQ ""..autotmp_5+32(SP), AX
MOVQ ""..autotmp_4+24(SP), BX
ADDQ $48, SP
PCDATA $1, $-1
PCDATA $0, $-2
CALL runtime.morestack_noctxt(SB)
PCDATA $0, $-1
JMP unsafeGetBytes_pc0
Other unimportant fun facts about the above (easily subject to change): compiled size of 3326B; has an inline cost of 7; correct escape analysis: s leaks to ~r1 with derefs=0.
Carefully Modifying *reflect.SliceHeader
This method has the advantage/disadvantage of letting one modify the internal state of a slice directly. Unfortunately, due to its multiline nature and use of uintptr, the GC can easily mess things up if one is not careful about keeping a reference to the original string. (Here I avoided creating temporary pointers to reduce inline cost and to avoid needing to add runtime.KeepAlive):
func unsafeGetBytes(s string) (b []byte) {
	(*reflect.SliceHeader)(unsafe.Pointer(&b)).Data = (*reflect.StringHeader)(unsafe.Pointer(&s)).Data
	(*reflect.SliceHeader)(unsafe.Pointer(&b)).Cap = len(s)
	(*reflect.SliceHeader)(unsafe.Pointer(&b)).Len = len(s)
	return
}
The corresponding assembly on amd64 (FUNCDATA omitted):
TEXT "".unsafeGetBytes(SB), NOSPLIT|ABIInternal, $32-16
SUBQ $32, SP
MOVQ AX, "".s+40(SP)
MOVQ BX, "".s+48(SP)
MOVQ $0, "".b(SP)
MOVUPS X15, "".b+8(SP)
MOVQ "".s+40(SP), DX
MOVQ DX, "".b(SP)
MOVQ "".s+48(SP), CX
MOVQ CX, "".b+16(SP)
MOVQ "".s+48(SP), BX
MOVQ BX, "".b+8(SP)
MOVQ "".b(SP), AX
ADDQ $32, SP
Other unimportant fun facts about the above (easily subject to change): compiled size of 3700B; has an inline cost of 20; subpar escape analysis: s leaks to {heap} with derefs=0.
Unsafer version of modifying SliceHeader
Adapted from Nuno Cruces' answer. This relies on the inherent structural similarity between StringHeader and SliceHeader, so in a sense it breaks "more easily". Additionally, it temporarily creates an illegal state where cap(b) (being 0) is less than len(b).
func unsafeGetBytes(s string) (b []byte) {
	*(*string)(unsafe.Pointer(&b)) = s
	(*reflect.SliceHeader)(unsafe.Pointer(&b)).Cap = len(s)
	return
}
Corresponding assembly (FUNCDATA omitted):
TEXT "".unsafeGetBytes(SB), NOSPLIT|ABIInternal, $32-16
SUBQ $32, SP
MOVQ AX, "".s+40(FP)
MOVQ $0, "".b(SP)
MOVUPS X15, "".b+8(SP)
MOVQ AX, "".b(SP)
MOVQ BX, "".b+8(SP)
MOVQ BX, "".b+16(SP)
MOVQ "".b(SP), AX
ADDQ $32, SP
Other unimportant details: compiled size 3636B, inline cost of 11, with subpar escape analysis: s leaks to {heap} with derefs=0.
Slicing a pointer to array
This is the accepted answer (shown here for comparison) -- its primary disadvantage is its ugliness (viz. magic number 0x7fff0000). There's also the tiniest possibility of getting a string bigger than the array, and an unavoidable bounds check.
func unsafeGetBytes(s string) []byte {
	return (*[0x7fff0000]byte)(unsafe.Pointer(
		(*reflect.StringHeader)(unsafe.Pointer(&s)).Data))[:len(s):len(s)]
}
Corresponding assembly (FUNCDATA removed).
TEXT "".unsafeGetBytes(SB), NOSPLIT|ABIInternal, $24-16
SUBQ $24, SP
PCDATA $0, $-2
MOVQ AX, "".s+32(SP)
MOVQ BX, "".s+40(SP)
MOVQ "".s+32(SP), AX
PCDATA $0, $-1
CMPQ BX, $2147418112
JHI unsafeGetBytes_pc54
ADDQ $24, SP
MOVL $2147418112, BX
PCDATA $1, $1
CALL runtime.panicSlice3Alen(SB)
Other unimportant details: compiled size 3142B, inline cost of 9, with correct escape analysis: s leaks to ~r1 with derefs=0
Note the runtime.panicSlice3Alen -- this is the bounds check that verifies len(s) is within 0x7fff0000.
Improved slicing pointer to array
This is what I've concluded to be the most efficient method as of Go 1.17. I basically modified the accepted answer to eliminate the bounds check, and found a "more meaningful" constant (math.MaxInt32) to use than 0x7fff0000. Using MaxInt32 preserves 32-bit compatibility.
func unsafeGetBytes(s string) []byte {
	const MaxInt32 = 1<<31 - 1
	return (*[MaxInt32]byte)(unsafe.Pointer((*reflect.StringHeader)(
		unsafe.Pointer(&s)).Data))[:len(s)&MaxInt32:len(s)&MaxInt32]
}
Corresponding assembly (FUNCDATA removed):
TEXT "".unsafeGetBytes(SB), NOSPLIT|ABIInternal, $0-16
PCDATA $0, $-2
MOVQ AX, "".s+8(SP)
MOVQ BX, "".s+16(SP)
MOVQ "".s+8(SP), AX
PCDATA $0, $-1
ANDQ $2147483647, BX
Other unimportant details: compiled size 3188B, inline cost of 13, and correct escape analysis: s leaks to ~r1 with derefs=0
In go 1.17, I'd recommend unsafe.Slice as more readable:
unsafe.Slice((*byte)(unsafe.Pointer((*reflect.StringHeader)(unsafe.Pointer(&s)).Data)), len(s))
I think that this also works (doesn't violate any unsafe.Pointer rules), with the benefit that it works for a const s:
*(*[]byte)(unsafe.Pointer(&struct{string; int}{s, len(s)}))
The commentary below concerns the accepted answer as it originally stood. The accepted answer now mentions an (authoritative) solution from Ian Lance Taylor. I'm keeping it because it points out a common error.
The accepted answer is wrong, and may produce the panic @RFC mentioned in the comments. The explanation by @icza about GC and keep alive is misguided.
The reason capacity is zero (or even an arbitrary value) is more prosaic.
A slice is:
type SliceHeader struct {
	Data uintptr
	Len  int
	Cap  int
}
A string is:
type StringHeader struct {
	Data uintptr
	Len  int
}
Converting a byte slice to a string can be "safely" done as the strings.Builder does it:
func (b *Builder) String() string {
	return *(*string)(unsafe.Pointer(&b.buf))
}
This will copy the Data pointer and Len from the slice to the string.
The opposite conversion is not "safe" because Cap doesn't get set to the correct value.
The following (originally by me) is also wrong because it violates unsafe.Pointer rule #1.
This is the correct code, that fixes the panic:
var buf = *(*[]byte)(unsafe.Pointer(&str))
(*reflect.SliceHeader)(unsafe.Pointer(&buf)).Cap = len(str)
Or perhaps:
var buf []byte
*(*string)(unsafe.Pointer(&buf)) = str
(*reflect.SliceHeader)(unsafe.Pointer(&buf)).Cap = len(str)
I should add that all these conversions are unsafe in the sense that strings are expected to be immutable, and byte arrays/slices mutable.
But if you know for sure that the byte slice won't be mutated, you won't get bounds (or GC) issues with the above conversions.
In Go 1.17, one can now use unsafe.Slice, so the accepted answer can be rewritten as follows:
func unsafeGetBytes(s string) []byte {
	return unsafe.Slice((*byte)(unsafe.Pointer((*reflect.StringHeader)(unsafe.Pointer(&s)).Data)), len(s))
}
I managed to get the goal by this:
func TestString(t *testing.T) {
	b := []byte{'a', 'b', 'c', '1', '2', '3', '4'}
	s := *(*string)(unsafe.Pointer(&b))
	sb := *(*[]byte)(unsafe.Pointer(&s))
	addr1 := unsafe.Pointer(&b)
	addr2 := unsafe.Pointer(&s)
	addr3 := unsafe.Pointer(&sb)
	fmt.Print("&b=", addr1, "\n&s=", addr2, "\n&sb=", addr3, "\n")
	hdr1 := (*reflect.StringHeader)(unsafe.Pointer(&b))
	hdr2 := (*reflect.SliceHeader)(unsafe.Pointer(&s))
	hdr3 := (*reflect.SliceHeader)(unsafe.Pointer(&sb))
	// Print the three data pointers, one per line
	fmt.Print(hdr1.Data, "\n", hdr2.Data, "\n", hdr3.Data, "\n")
	b[0] = 'X'
	sb[1] = 'Y' // if sb is from a string directly, this will cause nil panic
	fmt.Print("s=", s, "\nsb=")
	for _, c := range sb {
		fmt.Printf("%c", c)
	}
}
=== RUN TestString
These variables all share the same memory.
Go 1.20 (February 2023)
You can use unsafe.StringData to greatly simplify YenForYang's answer:
StringData returns a pointer to the underlying bytes of str. For an empty string the return value is unspecified, and may be nil.
Since Go strings are immutable, the bytes returned by StringData must not be modified.
func main() {
	str := "foobar"
	d := unsafe.StringData(str)
	b := unsafe.Slice(d, len(str))
	fmt.Printf("%T, %s\n", b, b) // []uint8, foobar (byte is alias of uint8)
}
Remember that you can't assign to b[n]. The memory is still read-only.
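Because StringData is unspecified for the empty string, a small wrapper that handles that case explicitly may be convenient. The following is a minimal sketch assuming Go 1.20 or newer; the name stringBytes is made up for illustration.

```go
package main

import "unsafe"

// stringBytes returns a read-only view of s without copying.
// Requires Go 1.20+. The function name is hypothetical.
// The returned slice must never be written to.
func stringBytes(s string) []byte {
	if s == "" {
		return nil // StringData is unspecified for the empty string
	}
	// unsafe.Slice sets both len and cap to len(s).
	return unsafe.Slice(unsafe.StringData(s), len(s))
}
```

Re-slicing and read-only functions such as bytes.Index are fine on the result; anything that writes to it, including append within capacity, is not.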
Simple, no reflect, and I think it is portable. s is your string and b is your byte slice:
var b []byte
bb := (*[3]uintptr)(unsafe.Pointer(&b))[:] // view b's header as {data, len, cap}
copy(bb, (*[2]uintptr)(unsafe.Pointer(&s))[:]) // copy data and len from s
bb[2] = bb[1] // cap = len
// use b
Remember, the byte values should not be modified (doing so will panic). Re-slicing is OK (for example: bytes.Split(b, []byte{','})).

Cannot find segmentation fault in insertion sort code for arm-v8 assembly

I have generated an array of random integers between 0 and 256 and tried to sort them using an insertion sort. However, at some point along the line I messed up and got the error "Segmentation fault (core dumped)".
I have no idea exactly why I am getting this but I believe the problem has something to do with the macro j representing a loop counter. I have been working away at this for a few hours and I have been stuck on this for a while.
EDIT: After using the debugger I have isolated the problem to be the line:
bl printf
right at the very end when it tries to print the second sorted value.
I have tried allocating more RAM, up to 416 bytes, and I have tried moving the top2 loop label around (top2 is the loop that uses j as its counter). The farthest I have gotten is for the program to print the first element of the sorted array and then give the error.
ALLOC =-(16+400*1)&-16
size = 50
define(arraybase, x19)
i .req x20
define(j, x21)
define(temp, w22)
define(sort1, w23)
print1: .asciz "Array[%d]: %d\n"
print2: .string "\nSorted Array: \n"
print3: .asciz "top value equals %d \n"
mov i, 0
top1: add i, i, 1
ldrb temp, [arraybase, i ]
mov j, i
top2: mov x25, j
sub x25, x25, 1
ldrb sort1, [arraybase, x25 ]
cmp temp, sort1 skip1
strb sort1, [arraybase, j]
skip1: sub j, j, 1
cmp j, 0 top2
strb temp, [arraybase, j]
cmp i, size-1 top1
ldr x0, =print2
bl printf
mov i, 0
ldr x0, =print1
tprint: mov x1, i
ldrb w2, [arraybase, i]
bl printf
add i, i, 1
cmp i, size-1
b.le tprint
mov x0, 0
ldp x29, x30, [sp], DEALLOC
The array should be printed in the random order it was initialized in, then it should print the value at the top of the array, then it is meant to print the sorted array in increasing order.
The exact error message I got was:
Segmentation fault (core dumped)
This appears after it prints the first value of the sorted array

What's the meaning of mov 0x8(%r14,%r15,8),%rax

What is the meaning of 0x8(%r14,%r15,8) here? I know that 0x8(%r14,%r15,8) is the source operand, but I don't understand why two registers, %r14 and %r15, are used, or how to calculate the source address.
Thanks so much for any input.
Information pulled from
AT&T Addressing:
Memory Address Reference: Address_or_Offset(%base_or_offset, %Index_Register, Scale)
Final Address Calculation: Address_or_Offset + %base_or_offset + [Scale * %Index_Reg]
mov (%esi,%ebx,4), %edx /* Move the 4 bytes of data at address ESI+4*EBX into EDX. */
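Applied to the operand from the question, the formula gives 0x8 + %r14 + 8 * %r15. The arithmetic can be written out as a small Go function; the function name and the register values used below are invented purely for illustration.

```go
package main

// effectiveAddress evaluates the AT&T operand disp(base, index, scale)
// as disp + base + index*scale. The name is hypothetical; the register
// contents passed in are whatever the registers hold at run time.
func effectiveAddress(disp int64, base, index, scale uint64) uint64 {
	// Converting a negative displacement to uint64 wraps around, which
	// matches two's-complement address arithmetic.
	return uint64(disp) + base + index*scale
}
```

For example, if %r14 held 0x1000 and %r15 held 3, then effectiveAddress(0x8, 0x1000, 3, 8) is 0x8 + 0x1000 + 24 = 0x1020, and the mov would load the 8 bytes at that address into %rax. This shape is typical for indexing an array of 8-byte elements that starts 8 bytes into some structure pointed to by %r14.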

Two digit string number Assembly

I have two strings s1 and s2, and I have to obtain the string d that contains the maximum number for each position of s1 and s2.
For example:
S1: 1, 3, 6, 2, 3, 10
S2: 6, 3, 11, 1, 2, 5
D: 6, 3, 11, 2, 3, 10
So this is the code
bits 32
global start
extern exit,printf
import exit msvcrt.dll
import printf msvcrt.dll
segment data use32 class=data
format db "%s",0
s1 db "1","3","6","2","3","10"
l equ $-s1
s2 db "6","3" ,"11","1","2", "5"
d times l db 0
segment code use32 class=code
mov esi,0
mov edi,0
mov al,[s1+esi]
mov bl,[s2+esi]
cmp al,bl
jg et1
inc edi
inc esi
jmp et2
inc edi
inc esi
cmp esi,l
jne Repeta
push d
push format
add esp,4*2
push dword 0
call [exit]
The problem is that when it reaches a double digit element(10 or 11) it takes only the first digit(1) and compares it with the number from the other string on the same position and after that it takes the second digit and compares it with the next number from the other string.
How can I solve this?
it says that it should be a string of bytes
The phrase "of bytes" very strongly implies array to me. Ask your instructor for clarification, but I think s1: db 1, 3, 6, 2, 3, 10 is what you're supposed to be working with, so the elements are fixed width single byte integers. (And not ASCII strings at all).
This means you can use a simple pairwise max like SSE2 pmaxub (for unsigned bytes) or SSE4.1 pmaxsb (for signed bytes).
segment data use32 class=data
format db "%s",0
s1 db 1, 3, 6, 2, 3, 10
l equ $-s1
s2 db 6, 3, 11, 1, 2, 5
d times l db 0
movq xmm1, [s1] ; load all 6 elements, plus 2 bytes past the end but that's ok. We ignore those bytes
movq xmm2, [s2]
pmaxub xmm1, xmm2 ; element-wise vertical max
;but avoid storing outside of 6-byte d
movd [d], xmm1 ; store first 4 bytes of the result
pextrw [d+4], xmm1, 2 ; store bytes 4 and 5 (word 2 of xmm1 = 3rd word)
... ; the result isn't a string, you can't print it with printf.
For byte counts that aren't a multiple of 2, e.g. if l was 7, you could use this instead of pextrw:
psrldq xmm1, 3 ; bring the data you want to store down into the low 4 bytes of the register
movd [d+3], xmm1 ; 4-byte store (bytes 3..6) that overlaps the first store by 1
BTW, I realize that you're intended to loop over the elements 1 byte at a time. Maybe use cmp cl, al / cmovg eax, ecx / mov [edi], al to store what was originally in cl if cl > al (signed), otherwise store what was originally in al.
I think your loop structure is a bit broken, because you have one path that doesn't store to d. You always need to store to d, regardless of which source was greater.
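The intended byte-at-a-time loop, with an unconditional store on every iteration, looks roughly like this when written in Go. It is only a sketch of the control flow to aim for in the assembly; it assumes single-byte unsigned elements and equal-length inputs, and the name pairwiseMax is made up.

```go
package main

// pairwiseMax returns, for each position, the larger of the two
// unsigned bytes. It assumes both inputs have the same length, as in
// the question. The function name is hypothetical.
func pairwiseMax(s1, s2 []byte) []byte {
	d := make([]byte, len(s1))
	for i := range s1 {
		m := s1[i]
		if s2[i] > m { // the compare and conditional move/jump
			m = s2[i]
		}
		d[i] = m // always store, whichever source was greater
	}
	return d
}
```

With the data from the question, pairwiseMax([]byte{1, 3, 6, 2, 3, 10}, []byte{6, 3, 11, 1, 2, 5}) yields 6, 3, 11, 2, 3, 10. Note that there is a single index and a single store per iteration, so no path through the loop can skip writing to d.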

In MSVC, why do InterlockedOr and InterlockedAnd generate a loop instead of a simple locked instruction?

On MSVC for x64 (19.10.25019),
InterlockedOr(&g, 1)
generates this code sequence:
prefetchw BYTE PTR ?g@@3JC
mov eax, DWORD PTR ?g@@3JC ; g
npad 3
$LL3@f:
mov ecx, eax
or ecx, 1
lock cmpxchg DWORD PTR ?g@@3JC, ecx ; g
jne SHORT $LL3@f
I would have expected the much simpler (and loopless):
mov eax, 1
lock or [?g@@3JC], eax
InterlockedAnd generates analogous code to InterlockedOr.
It seems wildly inefficient to have to have a loop for this instruction. Why is this code generated?
(As a side note: the whole reason I was using InterlockedOr was to do an atomic load of the variable - I have since learned that InterlockedCompareExchange is the way to do this. It is odd to me that there is no InterlockedLoad(&x), but I digress...)
The documented contract for InterlockedOr has it returning the original value:
Performs an atomic OR operation on the specified LONG values. The function prevents more than one thread from using the same variable simultaneously.
LONG __cdecl InterlockedOr(
_Inout_ LONG volatile *Destination,
_In_ LONG Value
);
Destination [in, out]
A pointer to the first operand. This value will be replaced with the result of the operation.
Value [in]
The second operand.
Return value
The function returns the original value of the Destination parameter.
This is why the unusual code that you've observed is required. The compiler cannot simply emit an OR instruction with a LOCK prefix, because the OR instruction does not return the previous value. Instead, it has to use the odd workaround with LOCK CMPXCHG in a loop. In fact, this apparently unusual sequence is the standard pattern for implementing interlocked operations when they aren't natively supported by the underlying hardware: capture the old value, perform an interlocked compare-and-exchange with the new value, and keep trying in a loop until the old value from this attempt is equal to the captured old value.
As you observed, you see the same thing with InterlockedAnd, for exactly the same reason: the x86 AND instruction doesn't return the original value, so the code generator has to fall back on the general pattern involving compare-and-exchange, which is directly supported by the hardware.
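The general pattern described above (capture the old value, attempt a compare-and-exchange with the new value, retry on failure) can be sketched in Go with sync/atomic. This is an illustration of the pattern only, not what MSVC emits; the name fetchOr is made up.

```go
package main

import "sync/atomic"

// fetchOr atomically ORs mask into *addr and returns the value that
// was there before, using a compare-and-swap retry loop.
// The function name is hypothetical.
func fetchOr(addr *uint32, mask uint32) uint32 {
	for {
		old := atomic.LoadUint32(addr) // capture the old value
		// Succeeds only if *addr still equals old; otherwise another
		// thread changed it in between and we try again.
		if atomic.CompareAndSwapUint32(addr, old, old|mask) {
			return old
		}
	}
}
```

The loop is needed precisely because the caller wants the previous value back. If the result were discarded, a single locked read-modify-write instruction would be enough.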
Note that, at least on x86 where InterlockedOr is implemented as an intrinsic, the optimizer is smart enough to figure out whether you're using the return value or not. If you are, then it uses the workaround code involving CMPXCHG. If you are ignoring the return value, then it goes ahead and emits code using LOCK OR, just like you would expect.
#include <intrin.h>
LONG InterlockedOrWithReturn()
{
    LONG val = 42;
    return _InterlockedOr(&val, 8);
}
void InterlockedOrWithoutReturn()
{
    LONG val = 42;
    LONG old = _InterlockedOr(&val, 8);
}
InterlockedOrWithoutReturn, COMDAT PROC
mov DWORD PTR [rsp+8], 42
lock or DWORD PTR [rsp+8], 8
ret 0
InterlockedOrWithoutReturn ENDP
InterlockedOrWithReturn, COMDAT PROC
mov DWORD PTR [rsp+8], 42
prefetchw BYTE PTR [rsp+8]
mov eax, DWORD PTR [rsp+8]
LoopTop:
mov ecx, eax
or ecx, 8
lock cmpxchg DWORD PTR [rsp+8], ecx
jne SHORT LoopTop
ret 0
InterlockedOrWithReturn ENDP
The optimizer is equally smart for InterlockedAnd, and should be for the other Interlocked* functions as well.
As intuition would tell you, the LOCK OR implementation is more efficient than the LOCK CMPXCHG in a loop. Not only is there the expanded code size and the overhead of looping, but you risk branch prediction misses, which can cost a large number of cycles. In performance-critical code, if you can avoid relying on the return value for interlocked operations, you can gain a performance boost.
However, what you really should be using in modern C++ is std::atomic, which allows you to specify the desired memory model/semantics, and then let the standard library maintainers deal with the complexity.
