Understanding GHC assembly output - haskell

When compiling a Haskell source file with GHC's -S option, the generated assembly is hard to follow. There is no clear distinction between which parts of the assembly belong to which parts of the Haskell code, unlike GCC, where each label is named after the function it corresponds to.
Is there a convention behind the names GHC produces? How can I relate parts of the generated assembly to the corresponding parts of the Haskell code?

For top level declarations, it's not too hard. Local definitions can be harder to recognize as their names get mangled and they are likely to get inlined.
Let's see what happens when we compile this simple module.
module Example where
add :: Int -> Int -> Int
add x y = x + y
.data
.align 8
.globl Example_add_closure
.type Example_add_closure, @object
Example_add_closure:
.quad Example_add_info
.text
.align 8
.quad 8589934604
.quad 0
.quad 15
.globl Example_add_info
.type Example_add_info, @object
Example_add_info:
.LckX:
jmp base_GHCziBase_plusInt_info
.data
.align 8
_module_registered:
.quad 0
.text
.align 8
.globl __stginit_Example_
.type __stginit_Example_, @object
__stginit_Example_:
.Lcl7:
cmpq $0,_module_registered
jne .Lcl8
.Lcl9:
movq $1,_module_registered
addq $-8,%rbp
movq $__stginit_base_Prelude_,(%rbp)
.Lcl8:
addq $8,%rbp
jmp *-8(%rbp)
.text
.align 8
.globl __stginit_Example
.type __stginit_Example, @object
__stginit_Example:
.Lcld:
jmp __stginit_Example_
.section .note.GNU-stack,"",@progbits
.ident "GHC 7.0.2"
You can see that our function Example.add resulted in the generation of Example_add_closure and Example_add_info. The _closure part, as the name suggests, has to do with closures. The _info part contains the actual instructions of the function. In this case, this is simply a jump to the built-in function GHC.Base.plusInt.
Note that assembly generated from Haskell code looks quite different from what you might get from other languages. The calling conventions are different, and things can get reordered a lot.
In most cases you don't want to jump straight to assembly. It is usually much easier to understand Core, a simplified version of Haskell (simpler to compile, not necessarily to read). To get at the Core, compile with the -ddump-simpl option.
Example.add :: GHC.Types.Int -> GHC.Types.Int -> GHC.Types.Int
[GblId, Arity=2]
Example.add =
\ (x_abt :: GHC.Types.Int) (y_abu :: GHC.Types.Int) ->
GHC.Num.+ @ GHC.Types.Int GHC.Num.$fNumInt x_abt y_abu
For some good resources on how to read core, see this question.

Related

RIP register doesn't understand valid memory address [duplicate]

I want a simple C method to be able to run hex bytecode on a Linux 64 bit machine. Here's the C program that I have:
char code[] = "\x48\x31\xc0";
#include <stdio.h>
int main(int argc, char **argv)
{
int (*func) ();
func = (int (*)()) code;
(int)(*func)();
printf("%s\n","DONE");
}
The code that I am trying to run ("\x48\x31\xc0") I obtained by writing this simple assembly program (it's not supposed to really do anything):
.text
.globl _start
_start:
xorq %rax, %rax
and then compiling and objdump-ing it to obtain the bytecode.
However, when I run my C program I get a segmentation fault. Any ideas?
Machine code has to be in an executable page. Your char code[] is in the read+write data section, without exec permission, so the code cannot be executed from there.
Here is a simple example of allocating an executable page with mmap:
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
int main ()
{
char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi]
0xC3 // ret
};
int (*sum) (int, int) = NULL;
// allocate executable buffer
sum = mmap (0, sizeof(code), PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
// copy code to buffer
memcpy (sum, code, sizeof(code));
// doesn't actually flush cache on x86, but ensure memcpy isn't
// optimized away as a dead store.
__builtin___clear_cache ((char *)sum, (char *)sum + sizeof(code)); // GNU C; range covers the copied bytes
// run code
int a = 2;
int b = 3;
int c = sum (a, b);
printf ("%d + %d = %d\n", a, b, c);
}
See another answer on this question for details about __builtin___clear_cache.
Until recent Linux kernel versions (sometime before 5.4), you could simply compile with gcc -z execstack - that would make all pages executable, including read-only data (.rodata), and read-write data (.data) where char code[] = "..." goes.
Now -z execstack only applies to the actual stack, so it currently works only for non-const local arrays. i.e. move char code[] = ... into main.
See Linux default behavior against `.data` section for the kernel change, and Unexpected exec permission from mmap when assembly files included in the project for the old behaviour: enabling Linux's READ_IMPLIES_EXEC process for that program. (In Linux 5.4, that Q&A shows you'd only get READ_IMPLIES_EXEC for a missing PT_GNU_STACK, like a really old binary; modern GCC -z execstack would set PT_GNU_STACK = RWX metadata in the executable, which Linux 5.4 would handle as making only the stack itself executable. At some point before that, PT_GNU_STACK = RWX did result in READ_IMPLIES_EXEC.)
The other option is to make system calls at runtime to copy into an executable page, or change permissions on the page it's in. That's still more complicated than using a local array to get GCC to copy code into executable stack memory.
(I don't know if there's an easy way to enable READ_IMPLIES_EXEC under modern kernels. Having no GNU-stack attribute at all in an ELF binary does that for 32-bit code, but not 64-bit.)
Yet another option is __attribute__((section(".text"))) const char code[] = ...;
Working example: https://godbolt.org/z/draGeh.
If you need the array to be writeable, e.g. for shellcode that inserts some zeros into strings, you could maybe link with ld -N. But probably best to use -z execstack and a local array.
Two problems in the question:
exec permission on the page, because you used an array that will go in the noexec read+write .data section.
your machine code doesn't end with a ret instruction so even if it did run, execution would fall into whatever was next in memory instead of returning.
And BTW, the REX prefix is totally redundant. "\x31\xc0" xor eax,eax has exactly the same effect as xor rax,rax.
You need the page containing the machine code to have execute permission. x86-64 page tables have a separate bit for execute separate from read permission, unlike legacy 386 page tables.
The easiest way to get static arrays to be in read+exec memory was to compile with gcc -z execstack. (Used to make the stack and other sections executable, now only the stack).
Until recently (2018 or 2019), the standard toolchain (binutils ld) would put section .rodata into the same ELF segment as .text, so they'd both have read+exec permission. Thus using const char code[] = "..."; was sufficient for executing manually-specified bytes as data, without execstack.
But on my Arch Linux system with GNU ld (GNU Binutils) 2.31.1, that's no longer the case. readelf -a shows that the .rodata section went into an ELF segment with .eh_frame_hdr and .eh_frame, and it only has Read permission. .text goes in a segment with Read + Exec, and .data goes in a segment with Read + Write (along with the .got and .got.plt). (What's the difference of section and segment in ELF file format)
I assume this change is to make ROP and Spectre attacks harder by not having read-only data in executable pages where sequences of useful bytes could be used as "gadgets" that end with the bytes for a ret or jmp reg instruction.
// TODO: use char code[] = {...} inside main, with -z execstack, for current Linux
// Broken on recent Linux, used to work without execstack.
#include <stdio.h>
// can be non-const if you use gcc -z execstack. static is also optional
static const char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi] // retval = a+b;
0xC3 // ret
};
static const char ret0_code[] = "\x31\xc0\xc3"; // xor eax,eax ; ret
// the compiler will append a 0 byte to terminate the C string,
// but that's fine. It's after the ret.
int main () {
// void* cast is easier to type than a cast to function pointer,
// and in C can be assigned to any other pointer type. (not C++)
int (*sum) (int, int) = (void*)code;
int (*ret0)(void) = (void*)ret0_code;
// run code
int c = sum (2, 3);
return ret0();
}
On older Linux systems: gcc -O3 shellcode.c && ./a.out (Works because of const on global/static arrays)
On Linux before 5.5 (or so) gcc -O3 -z execstack shellcode.c && ./a.out (works because of -zexecstack regardless of where your machine code is stored). Fun fact: gcc allows -zexecstack with no space, but clang only accepts clang -z execstack.
These also work on Windows, where read-only data goes in .rdata instead of .rodata.
The compiler-generated main looks like this (from objdump -drwC -Mintel). You can run it inside gdb and set breakpoints on code and ret0_code
(I actually used gcc -no-pie -O3 -zexecstack shellcode.c, hence the addresses near 401000.)
0000000000401020 <main>:
401020: 48 83 ec 08 sub rsp,0x8 # stack aligned by 16 before a call
401024: be 03 00 00 00 mov esi,0x3
401029: bf 02 00 00 00 mov edi,0x2 # 2 args
40102e: e8 d5 0f 00 00 call 402008 <code> # note the target address in the next page
401033: 48 83 c4 08 add rsp,0x8
401037: e9 c8 0f 00 00 jmp 402004 <ret0_code> # optimized tailcall
Or use system calls to modify page permissions
Instead of compiling with gcc -zexecstack, you can instead use mmap(PROT_EXEC) to allocate new executable pages, or mprotect(PROT_EXEC) to change existing pages to executable. (Including pages holding static data.) You also typically want at least PROT_READ and sometimes PROT_WRITE, of course.
Using mprotect on a static array means you're still executing the code from a known location, maybe making it easier to set a breakpoint on it.
On Windows you can use VirtualAlloc or VirtualProtect.
Telling the compiler that data is executed as code
Normally compilers like GCC assume that data and code are separate. This is like type-based strict aliasing, but even using char* doesn't make it well-defined to store into a buffer and then call that buffer as a function pointer.
In GNU C, you also need to use __builtin___clear_cache(buf, buf + len) after writing machine code bytes to a buffer, because the optimizer doesn't treat dereferencing a function pointer as reading bytes from that address. Dead-store elimination can remove the stores of machine code bytes into a buffer, if the compiler proves that the store isn't read as data by anything. https://codegolf.stackexchange.com/questions/160100/the-repetitive-byte-counter/160236#160236 and https://godbolt.org/g/pGXn3B has an example where gcc really does do this optimization, because gcc "knows about" malloc.
(And on non-x86 architectures where I-cache isn't coherent with D-cache, it actually will do any necessary cache syncing. On x86 it's purely a compile-time optimization blocker and doesn't expand to any instructions itself.)
Re: the weird name with three underscores: It's the usual __builtin_name pattern, but name is __clear_cache.
My edit on @AntoineMathys's answer added this.
In practice GCC/clang don't "know about" mmap(MAP_ANONYMOUS) the way they know about malloc. So in practice the optimizer will assume that the memcpy into the buffer might be read as data by the non-inline function call through the function pointer, even without __builtin___clear_cache(). (Unless you declared the function type as __attribute__((const)).)
On x86, where I-cache is coherent with data caches, having the stores happen in asm before the call is sufficient for correctness. On other ISAs, __builtin___clear_cache() will actually emit special instructions as well as ensuring the right compile-time ordering.
It's good practice to include it when copying code into a buffer because it doesn't cost performance, and stops hypothetical future compilers from breaking your code. (e.g. if they do understand that mmap(MAP_ANONYMOUS) gives newly-allocated anonymous memory that nothing else has a pointer to, just like malloc.)
With current GCC, I was able to provoke GCC into really doing an optimization we don't want by using __attribute__((const)) to tell the optimizer sum() is a pure function (that only reads its args, not global memory). GCC then knows sum() can't read the result of the memcpy as data.
With another memcpy into the same buffer after the call, GCC does dead-store elimination into just the 2nd store after the call. This results in no store before the first call so it executes the 00 00 add [rax], al bytes, segfaulting.
// demo of a problem on x86 when not using __builtin___clear_cache
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
int main ()
{
char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi]
0xC3 // ret
};
__attribute__((const)) int (*sum) (int, int) = NULL;
// copy code to executable buffer
sum = mmap (0,sizeof(code),PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANON,-1,0);
memcpy (sum, code, sizeof(code));
//__builtin___clear_cache(sum, sum + sizeof(code));
int c = sum (2, 3);
//printf ("%d + %d = %d\n", a, b, c);
memcpy(sum, (char[]){0x31, 0xc0, 0xc3, 0}, 4); // xor-zero eax, ret, padding for a dword store
//__builtin___clear_cache(sum, sum + 4);
return sum(2,3);
}
Compiled on the Godbolt compiler explorer with GCC9.2 -O3
main:
push rbx
xor r9d, r9d
mov r8d, -1
mov ecx, 34
mov edx, 7
mov esi, 4
xor edi, edi
sub rsp, 16
call mmap
mov esi, 3
mov edi, 2
mov rbx, rax
call rax # call before store
mov DWORD PTR [rbx], 12828721 # 0xC3C031 = xor-zero eax, ret
add rsp, 16
pop rbx
ret # no 2nd call, CSEd away because const and same args
Passing different args would have gotten another call reg, but even with __builtin___clear_cache the two sum(2,3) calls can CSE. __attribute__((const)) doesn't respect changes to the machine code of a function. Don't do it. It's safe if you're going to JIT the function once and then call many times, though.
Uncommenting the first __clear_cache results in
mov DWORD PTR [rax], -1019804531 # lea; ret
call rax
mov DWORD PTR [rbx], 12828721 # xor-zero; ret
... still CSE and use the RAX return value
The first store is there because of __clear_cache and the sum(2,3) call. (Removing the first sum(2,3) call does let dead-store elimination happen across the __clear_cache.)
The second store is there because the side-effect on the buffer returned by mmap is assumed to be important, and that's the final value main leaves.
Godbolt's ./a.out option to run the program still seems to always fail (exit status of 255); maybe it sandboxes JITing? It works on my desktop with __clear_cache and crashes without.
mprotect on a page holding existing C variables.
You can also give a single existing page read+write+exec permission. This is an alternative to compiling with -z execstack
You don't need __clear_cache on a page holding read-only C variables because there's no store to optimize away. You would still need it for initializing a local buffer (on the stack). Otherwise GCC will optimize away the initializer for this private buffer that a non-inline function call definitely doesn't have a pointer to. (Escape analysis). It doesn't consider the possibility that the buffer might hold the machine code for the function unless you tell it that via __builtin___clear_cache.
#include <stdio.h>
#include <sys/mman.h>
#include <stdint.h>
// can be non-const if you want, we're using mprotect
static const char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi] // retval = a+b;
0xC3 // ret
};
static const char ret0_code[] = "\x31\xc0\xc3";
int main () {
// void* cast is easier to type than a cast to function pointer,
// and in C can be assigned to any other pointer type. (not C++)
int (*sum) (int, int) = (void*)code;
int (*ret0)(void) = (void*)ret0_code;
// hard-coding x86's 4k page size for simplicity.
// also assume that `code` doesn't span a page boundary and that ret0_code is in the same page.
uintptr_t page = (uintptr_t)code & -4096ULL; // round down to a 4k page boundary (clears the low 12 bits)
mprotect((void*)page, 4096, PROT_READ|PROT_EXEC|PROT_WRITE); // +write in case the page holds any writeable C vars that would crash later code.
// run code
int c = sum (2, 3);
return ret0();
}
I used PROT_READ|PROT_EXEC|PROT_WRITE in this example so it works regardless of where your variable is. If it was a local on the stack and you left out PROT_WRITE, call would fail after making the stack read only when it tried to push a return address.
Also, PROT_WRITE lets you test shellcode that self-modifies, e.g. to edit zeros into its own machine code, or other bytes it was avoiding.
$ gcc -O3 shellcode.c # without -z execstack
$ ./a.out
$ echo $?
0
$ strace ./a.out
...
mprotect(0x55605aa3f000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC) = 0
exit_group(0) = ?
+++ exited with 0 +++
If I comment out the mprotect, it does segfault with recent versions of GNU Binutils ld which no longer put read-only constant data into the same ELF segment as the .text section.
If I did something like ret0_code[2] = 0xc3;, I would need __builtin___clear_cache(ret0_code+2, ret0_code+3) after that to make sure the store wasn't optimized away (the end pointer is exclusive, so it has to point one past the modified byte). If I don't modify the static arrays, it's not needed after mprotect. It is needed after mmap+memcpy or manual stores, because we want to execute bytes that have been written in C (with memcpy).
You need to include the assembly in-line via a special compiler directive so that it'll properly end up in a code segment. See this guide, for example: http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html
Your machine code may be all right, but your CPU objects.
Modern CPUs manage memory in segments. In normal operation, the operating system loads a new program into a program-text segment and sets up a stack in a data segment. The operating system tells the CPU never to run code in a data segment. Your code is in code[], in a data segment. Thus the segfault.
This will take some effort.
Your code variable is stored in the .data section of your executable:
$ readelf -p .data exploit
String dump of section '.data':
[ 10] H1À
H1À is the value of your variable.
The .data section is not executable:
$ readelf -S exploit
There are 30 section headers, starting at offset 0x1150:
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
[...]
[24] .data PROGBITS 0000000000601010 00001010
0000000000000014 0000000000000000 WA 0 0 8
All 64-bit processors I'm familiar with support non-executable pages natively in the pagetables. Most newer 32-bit processors (the ones that support PAE) provide enough extra space in their pagetables for the operating system to emulate hardware non-executable pages. You'll need to run either an ancient OS or an ancient processor to get a .data section marked executable.
Because these are just flags in the executable, you ought to be able to set the X flag through some other mechanism, but I don't know how to do so. And your OS might not even let you have pages that are both writable and executable.
You may need to set the page executable before you may call it.
On MS-Windows, see the VirtualProtect -function.
URL: http://msdn.microsoft.com/en-us/library/windows/desktop/aa366898%28v=vs.85%29.aspx
Sorry, I couldn't follow above examples which are complicated.
So, I created an elegant solution for executing hex code from C.
Basically, you could use asm and .word keywords to place your instructions in hex format.
See below example:
asm volatile(".rept 1024\n"
CNOP
".endr\n");
where CNOP is defined as below:
#define CNOP ".word 0x00010001 \n"
The c.nop instruction was not supported by my assembler, so I defined CNOP as the hex encoding of c.nop, written with .word syntax that the assembler does accept, and used it inside asm.
.rept <NUM> .endr will basically, repeat the instruction NUM times.
This solution is working and verified.

Using pointers to return results in x86 Assembly

We've been given the following function to try and implement in C as part of a CS course. We are programming on x86 Linux.
function(float x, float y, float *z);
For a function such as example(int x, int y) I understand that the x value resides at [ebp+8] and y at [ebp+12] on the stack, is the same convention used when pushing floats?
We also have to perform some masking and calculations on the float numbers. Do these float numbers behave the same as 32-bit integers just in IEEE-754 format?
Here is a simple function and its asm code:
void function(float x, float y, float *z){
float sum = x + y;
float neg = sum - *z;
}
The asm of the above function will look like this:
function:
pushl %ebp
movl %esp,%ebp
subl $8,%esp
pushl %ebx
flds 8(%ebp)
fadds 12(%ebp)
fstps -4(%ebp)
movl 16(%ebp),%ebx
flds -4(%ebp)
fsubs (%ebx)
fstps -8(%ebp)
leal -12(%ebp),%esp
popl %ebx
leave
ret
As you can see from the asm above, the parameters are read from the stack at offsets 8, 12, and 16 from ebp.
So, as fuz pointed out in the comments, the float arguments are indeed stored on the stack.

How to use unsafe get a byte slice from a string without memory copy

I have read https://github.com/golang/go/issues/25484 about no-copy conversion from []byte to string.
I am wondering if there is a way to convert a string to a byte slice without memory copy?
I am writing a program which processes terabytes of data; if every string is copied twice in memory, it will slow down progress. I do not care about mutability or unsafety, since this is for internal use only; I just need it to be as fast as possible.
Example:
var s string
// some processing on s, for some reasons, I must use string here
// ...
// then output to a writer
gzipWriter.Write([]byte(s)) // !!! Here I want to avoid the memory copy, no WriteString
So the question is: is there a way to avoid the memory copy? I know I may need the unsafe package, but I do not know how to use it. I have searched for a while without finding an answer, and none of the related answers that SO suggested work.
Getting the content of a string as a []byte without copying in general is only possible using unsafe, because strings in Go are immutable, and without a copy it would be possible to modify the contents of the string (by changing the elements of the byte slice).
So using unsafe, this is how it could look (corrected, working solution):
func unsafeGetBytes(s string) []byte {
return (*[0x7fff0000]byte)(unsafe.Pointer(
(*reflect.StringHeader)(unsafe.Pointer(&s)).Data),
)[:len(s):len(s)]
}
This solution is from Ian Lance Taylor.
One thing to note here: the empty string "" has no bytes as its length is zero. This means there is no guarantee what the Data field may be, it may be zero or an arbitrary address shared among the zero-size variables. If an empty string may be passed, that must be checked explicitly (although there's no need to get the bytes of an empty string without copying...):
func unsafeGetBytes(s string) []byte {
if s == "" {
return nil // or []byte{}
}
return (*[0x7fff0000]byte)(unsafe.Pointer(
(*reflect.StringHeader)(unsafe.Pointer(&s)).Data),
)[:len(s):len(s)]
}
Original, wrong solution was:
func unsafeGetBytesWRONG(s string) []byte {
return *(*[]byte)(unsafe.Pointer(&s)) // WRONG!!!!
}
See Nuno Cruces's answer below for reasoning.
Testing it:
s := "hi"
data := unsafeGetBytes(s)
fmt.Println(data, string(data))
data = unsafeGetBytes("gopher")
fmt.Println(data, string(data))
Output (try it on the Go Playground):
[104 105] hi
[103 111 112 104 101 114] gopher
BUT: You wrote you want this because you need performance. You also mentioned you want to compress the data. Please know that compressing data (using gzip) requires a lot more computation than just copying a few bytes! You will not see any noticeable performance gain by using this!
Instead, when you want to write strings to an io.Writer, it's recommended to use the io.WriteString() function, which avoids copying the string when possible: it checks whether the writer has a WriteString() method and calls it if so, and such a method most likely handles the string better than copying it would. For details, see What's the difference between ResponseWriter.Write and io.WriteString?
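As a rough sketch of the check described above: io.WriteString looks for the io.StringWriter interface on the destination and only falls back to Write([]byte(s)) when it is missing. The helper names below (hasWriteString, bufferHasWriteString, gzipHasWriteString) are made up for illustration. As far as I can tell, gzip.Writer does not provide WriteString, so for the writer in the question io.WriteString would still end up doing a converting Write.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// hasWriteString reports whether w provides a WriteString method,
// which is the check io.WriteString performs before falling back to Write.
// (Hypothetical helper name, for illustration only.)
func hasWriteString(w io.Writer) bool {
	_, ok := w.(io.StringWriter)
	return ok
}

// bufferHasWriteString applies the check to a bytes.Buffer.
func bufferHasWriteString() bool {
	return hasWriteString(new(bytes.Buffer))
}

// gzipHasWriteString applies the check to a gzip.Writer.
func gzipHasWriteString() bool {
	return hasWriteString(gzip.NewWriter(io.Discard))
}

func main() {
	fmt.Println("bytes.Buffer has WriteString:", bufferHasWriteString())
	fmt.Println("gzip.Writer has WriteString: ", gzipHasWriteString())
}
```

Wrapping the destination in a bufio.Writer, which does have WriteString, is one way to let io.WriteString take the non-converting path.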
There are also ways to access the contents of a string without converting it to []byte, such as indexing, or using a loop where the compiler optimizes away the copy:
s := "something"
for i, v := range []byte(s) { // Copying s is optimized away
// ...
}
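One way to check the claim about the range loop is to count allocations with testing.AllocsPerRun. This is only a sketch: the names rangeAllocs, copyAllocs, sink and total are invented here, and the behaviour is a property of the gc compiler rather than a language guarantee. I would expect the range loop to report no allocations, while a conversion whose result escapes allocates.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

var (
	sink  []byte // package-level sink forces the converted slice to escape
	total int    // keeps the loop result observable
)

// rangeAllocs measures allocations per run when ranging over []byte(s).
func rangeAllocs(s string) float64 {
	return testing.AllocsPerRun(100, func() {
		n := 0
		for _, v := range []byte(s) {
			n += int(v)
		}
		total = n
	})
}

// copyAllocs measures allocations per run for a conversion that escapes.
func copyAllocs(s string) float64 {
	return testing.AllocsPerRun(100, func() {
		sink = []byte(s)
	})
}

func main() {
	s := strings.Repeat("x", 1<<10)
	fmt.Println("range over []byte(s):", rangeAllocs(s))
	fmt.Println("escaping []byte(s):  ", copyAllocs(s))
}
```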
Also see related questions:
[]byte(string) vs []byte(*string)
What are the possible consequences of using unsafe conversion from []byte to string in go?
What is the difference between the string and []byte in Go?
Does conversion between alias types in Go create copies?
How does type conversion internally work? What is the memory utilization for the same?
After some extensive investigation, I believe I've discovered the most efficient way of getting a []byte from a string as of Go 1.17 (this is for i386/x86_64 gc; I haven't tested other architectures.) The trade-off of being efficient code here is being inefficient to code, though.
Before I say anything else, it should be made clear that the differences are ultimately very small and probably inconsequential -- the info below is for fun/educational purposes only.
Summary
With some minor alterations, the accepted answer illustrating the technique of slicing a pointer to array is the most efficient way. That being said, I wouldn't be surprised if unsafe.Slice becomes the (decisively) better choice in the future.
unsafe.Slice
unsafe.Slice currently has the advantage of being slightly more readable, but I'm skeptical about its performance. It looks like it makes a call to runtime.unsafeslice. The following is the gc amd64 1.17 assembly of the function provided in Atamiri's answer (FUNCDATA omitted). Note the stack check (lack of NOSPLIT):
unsafeGetBytes_pc0:
TEXT "".unsafeGetBytes(SB), ABIInternal, $48-16
CMPQ SP, 16(R14)
PCDATA $0, $-2
JLS unsafeGetBytes_pc86
PCDATA $0, $-1
SUBQ $48, SP
MOVQ BP, 40(SP)
LEAQ 40(SP), BP
PCDATA $0, $-2
MOVQ BX, ""..autotmp_4+24(SP)
MOVQ AX, "".s+56(SP)
MOVQ BX, "".s+64(SP)
MOVQ "".s+56(SP), DX
PCDATA $0, $-1
MOVQ DX, ""..autotmp_5+32(SP)
LEAQ type.uint8(SB), AX
MOVQ BX, CX
MOVQ DX, BX
PCDATA $1, $1
CALL runtime.unsafeslice(SB)
MOVQ ""..autotmp_5+32(SP), AX
MOVQ ""..autotmp_4+24(SP), BX
MOVQ BX, CX
MOVQ 40(SP), BP
ADDQ $48, SP
RET
unsafeGetBytes_pc86:
NOP
PCDATA $1, $-1
PCDATA $0, $-2
MOVQ AX, 8(SP)
MOVQ BX, 16(SP)
CALL runtime.morestack_noctxt(SB)
MOVQ 8(SP), AX
MOVQ 16(SP), BX
PCDATA $0, $-1
JMP unsafeGetBytes_pc0
Other unimportant fun facts about the above (easily subject to change): compiled size of 3326B; has an inline cost of 7; correct escape analysis: s leaks to ~r1 with derefs=0.
Carefully Modifying *reflect.SliceHeader
This method has the advantage/disadvantage of letting one modify the internal state of a slice directly. Unfortunately, due to its multiline nature and use of uintptr, the GC can easily mess things up if one is not careful about keeping a reference to the original string. (Here I avoided creating temporary pointers to reduce inline cost and to avoid needing to add runtime.KeepAlive):
func unsafeGetBytes(s string) (b []byte) {
(*reflect.SliceHeader)(unsafe.Pointer(&b)).Data = (*reflect.StringHeader)(unsafe.Pointer(&s)).Data
(*reflect.SliceHeader)(unsafe.Pointer(&b)).Cap = len(s)
(*reflect.SliceHeader)(unsafe.Pointer(&b)).Len = len(s)
return
}
The corresponding assembly on amd64 (FUNCDATA omitted):
TEXT "".unsafeGetBytes(SB), NOSPLIT|ABIInternal, $32-16
SUBQ $32, SP
MOVQ BP, 24(SP)
LEAQ 24(SP), BP
MOVQ AX, "".s+40(SP)
MOVQ BX, "".s+48(SP)
MOVQ $0, "".b(SP)
MOVUPS X15, "".b+8(SP)
MOVQ "".s+40(SP), DX
MOVQ DX, "".b(SP)
MOVQ "".s+48(SP), CX
MOVQ CX, "".b+16(SP)
MOVQ "".s+48(SP), BX
MOVQ BX, "".b+8(SP)
MOVQ "".b(SP), AX
MOVQ 24(SP), BP
ADDQ $32, SP
RET
Other unimportant fun facts about the above (easily subject to change): compiled size of 3700B; has an inline cost of 20; subpar escape analysis: s leaks to {heap} with derefs=0.
Unsafer version of modifying SliceHeader
Adapted from Nuno Cruces' answer. This relies on the inherent structural similarity between StringHeader and SliceHeader, so in a sense it breaks "more easily". Additionally, it temporarily creates an illegal state where cap(b) (being 0) is less than len(b).
func unsafeGetBytes(s string) (b []byte) {
*(*string)(unsafe.Pointer(&b)) = s
(*reflect.SliceHeader)(unsafe.Pointer(&b)).Cap = len(s)
return
}
Corresponding assembly (FUNCDATA omitted):
TEXT "".unsafeGetBytes(SB), NOSPLIT|ABIInternal, $32-16
SUBQ $32, SP
MOVQ BP, 24(SP)
LEAQ 24(SP), BP
MOVQ AX, "".s+40(FP)
MOVQ $0, "".b(SP)
MOVUPS X15, "".b+8(SP)
MOVQ AX, "".b(SP)
MOVQ BX, "".b+8(SP)
MOVQ BX, "".b+16(SP)
MOVQ "".b(SP), AX
MOVQ BX, CX
MOVQ 24(SP), BP
ADDQ $32, SP
NOP
RET
Other unimportant details: compiled size 3636B, inline cost of 11, with subpar escape analysis: s leaks to {heap} with derefs=0.
Slicing a pointer to array
This is the accepted answer (shown here for comparison) -- its primary disadvantage is its ugliness (viz. magic number 0x7fff0000). There's also the tiniest possibility of getting a string bigger than the array, and an unavoidable bounds check.
func unsafeGetBytes(s string) []byte {
return (*[0x7fff0000]byte)(unsafe.Pointer(
(*reflect.StringHeader)(unsafe.Pointer(&s)).Data),
)[:len(s):len(s)]
}
Corresponding assembly (FUNCDATA removed).
TEXT "".unsafeGetBytes(SB), NOSPLIT|ABIInternal, $24-16
SUBQ $24, SP
MOVQ BP, 16(SP)
LEAQ 16(SP), BP
PCDATA $0, $-2
MOVQ AX, "".s+32(SP)
MOVQ BX, "".s+40(SP)
MOVQ "".s+32(SP), AX
PCDATA $0, $-1
TESTB AL, (AX)
NOP
CMPQ BX, $2147418112
JHI unsafeGetBytes_pc54
MOVQ BX, CX
MOVQ 16(SP), BP
ADDQ $24, SP
RET
unsafeGetBytes_pc54:
MOVQ BX, DX
MOVL $2147418112, BX
PCDATA $1, $1
NOP
CALL runtime.panicSlice3Alen(SB)
XCHGL AX, AX
Other unimportant details: compiled size 3142B, inline cost of 9, with correct escape analysis: s leaks to ~r1 with derefs=0
Note the runtime.panicSlice3Alen -- this is the bounds check that verifies len(s) is within 0x7fff0000.
Improved slicing pointer to array
This is what I've concluded to be the most efficient method as of Go 1.17. I basically modified the accepted answer to eliminate the bounds check, and found a "more meaningful" constant (math.MaxInt32) to use than 0x7fff0000. Using MaxInt32 preserves 32-bit compatibility.
func unsafeGetBytes(s string) []byte {
const MaxInt32 = 1<<31 - 1
return (*[MaxInt32]byte)(unsafe.Pointer((*reflect.StringHeader)(
unsafe.Pointer(&s)).Data))[:len(s)&MaxInt32:len(s)&MaxInt32]
}
Corresponding assembly (FUNCDATA removed):
TEXT "".unsafeGetBytes(SB), NOSPLIT|ABIInternal, $0-16
PCDATA $0, $-2
MOVQ AX, "".s+8(SP)
MOVQ BX, "".s+16(SP)
MOVQ "".s+8(SP), AX
PCDATA $0, $-1
TESTB AL, (AX)
ANDQ $2147483647, BX
MOVQ BX, CX
RET
Other unimportant details: compiled size 3188B, inline cost of 13, and correct escape analysis: s leaks to ~r1 with derefs=0
In go 1.17, I'd recommend unsafe.Slice as more readable:
unsafe.Slice((*byte)(unsafe.Pointer((*reflect.StringHeader)(unsafe.Pointer(&s)).Data)), len(s))
I think that this also works (doesn't violate any unsafe.Pointer rules), with the benefit that it works for a const s:
*(*[]byte)(unsafe.Pointer(&struct{string; int}{s, len(s)}))
The commentary below is about the accepted answer as it originally stood. The accepted answer now mentions an (authoritative) solution from Ian Lance Taylor. I am keeping this commentary because it points out a common error.
The accepted answer is wrong, and may produce the panic @RFC mentioned in the comments. The explanation by @icza about GC and keep alive is misguided.
The reason capacity is zero (or even an arbitrary value) is more prosaic.
A slice is:
type SliceHeader struct {
Data uintptr
Len int
Cap int
}
A string is:
type StringHeader struct {
Data uintptr
Len int
}
Converting a byte slice to a string can be "safely" done as the strings.Builder does it:
func (b *Builder) String() string {
return *(*string)(unsafe.Pointer(&b.buf))
}
This will copy the Data pointer and Len from the slice to the string.
The opposite conversion is not "safe" because Cap doesn't get set to the correct value.
The following (originally by me) is also wrong because it violates unsafe.Pointer rule #1.
This is the correct code, that fixes the panic:
var buf = *(*[]byte)(unsafe.Pointer(&str))
(*reflect.SliceHeader)(unsafe.Pointer(&buf)).Cap = len(str)
Or perhaps:
var buf []byte
*(*string)(unsafe.Pointer(&buf)) = str
(*reflect.SliceHeader)(unsafe.Pointer(&buf)).Cap = len(str)
I should add that all these conversions are unsafe in the sense that strings are expected to be immutable, and byte arrays/slices mutable.
But if you know for sure that the byte slice won't be mutated, you won't get bounds (or GC) issues with the above conversions.
In Go 1.17, one can now use unsafe.Slice, so the accepted answer can be rewritten as follows:
func unsafeGetBytes(s string) []byte {
return unsafe.Slice((*byte)(unsafe.Pointer((*reflect.StringHeader)(unsafe.Pointer(&s)).Data)), len(s))
}
I managed to get the goal by this:
func TestString(t *testing.T) {
b := []byte{'a', 'b', 'c', '1', '2', '3', '4'}
s := *(*string)(unsafe.Pointer(&b))
sb := *(*[]byte)(unsafe.Pointer(&s))
addr1 := unsafe.Pointer(&b)
addr2 := unsafe.Pointer(&s)
addr3 := unsafe.Pointer(&sb)
fmt.Print("&b=", addr1, "\n&s=", addr2, "\n&sb=", addr3, "\n")
hdr1 := (*reflect.StringHeader)(unsafe.Pointer(&b))
hdr2 := (*reflect.SliceHeader)(unsafe.Pointer(&s))
hdr3 := (*reflect.SliceHeader)(unsafe.Pointer(&sb))
fmt.Print("b.data=", hdr1.Data, "\ns.data=", hdr2.Data, "\nsb.data=", hdr3.Data, "\n")
b[0] = 'X'
sb[1] = 'Y' // if sb is from a string directly, this will cause nil panic
fmt.Print("s=", s, "\nsb=")
for _, c := range sb {
fmt.Printf("%c", c)
}
fmt.Println()
}
Output:
=== RUN TestString
&b=0xc000218000
&s=0xc00021a000
&sb=0xc000218020
b.data=824635867152
s.data=824635867152
sb.data=824635867152
s=XYc1234
sb=XYc1234
These variables all share the same memory.
Go 1.20 (February 2023)
You can use unsafe.StringData to greatly simplify YenForYang's answer:
StringData returns a pointer to the underlying bytes of str. For an empty string the return value is unspecified, and may be nil.
Since Go strings are immutable, the bytes returned by StringData must not be modified.
func main() {
str := "foobar"
d := unsafe.StringData(str)
b := unsafe.Slice(d, len(str))
fmt.Printf("%T, %s\n", b, b) // []uint8, foobar (byte is alias of uint8)
}
Go tip playground: https://go.dev/play/p/FIXe0rb8YHE?v=gotip
Remember that you can't assign to b[n]. The memory is still read-only.
Simple, no reflect, and I think it is portable. s is your string and b is your bytes slice
var b []byte
bb:=(*[3]uintptr)(unsafe.Pointer(&b))[:]
copy(bb, (*[2]uintptr)(unsafe.Pointer(&s))[:])
bb[2] = bb[1]
// use b
Remember, the bytes must not be modified (doing so can crash the program). Re-slicing is OK (for example: bytes.Split(b, []byte{','})).

Ownership and conditionally executed code

I read the Rust book over the weekend and I have a question about the concept of ownership. The impression I got is that ownership is used to statically determine where a resource can be deallocated. Now, suppose that we have the following:
{ // 1
let x; // 2
{ // 3
let y = Box::new(1); // 4
x = if flip_coin() {y} else {Box::new(2)} // 5
} // 6
} // 7
I was surprised to see that the compiler accepts this program. By inserting println!s and implementing the Drop trait for the boxed value, I saw that the box containing the value 1 will be deallocated at either line 6 or 7 depending on the return value of flip_coin. How does the compiler know when to deallocate that box? Is this decided at run-time using some run-time information (like a flag to indicate if the box is still in use)?
After some research I found out that Rust currently adds a flag to every type that implements the Drop trait so that it knows whether the value has been dropped or not, which of course incurs a run-time cost. There have been proposals to avoid that cost by using static drops or eager drops but those solutions had problems with their semantics, namely that drops could occur at places that you wouldn't expect (e.g. in the middle of a code block), especially if you are used to C++ style RAII. There is now consensus that the best compromise is a different solution where the flags are removed from the types. Instead flags will be added to the stack, but only when the compiler cannot figure out when to do the drop statically (while having the same semantics as C++) which specifically happens when there are conditional moves like the example given in this question. For all other cases there will be no run-time cost. It appears though, that this proposal will not be implemented in time for 1.0.
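The conditional move described above can be observed on a current compiler without reading any assembly. The sketch below is illustrative only: Noisy, Log, run, and the coin parameter (standing in for flip_coin) are made-up names, not anything from the question. It records when each value is dropped, depending on which branch was taken.

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Shared event log so the drop order can be inspected afterwards.
type Log = Rc<RefCell<Vec<String>>>;

// Hypothetical type whose only job is to announce its own drop.
struct Noisy {
    name: &'static str,
    log: Log,
}

impl Drop for Noisy {
    fn drop(&mut self) {
        self.log.borrow_mut().push(format!("drop {}", self.name));
    }
}

// Reproduces the nested scopes from the question and returns the event log.
// `coin` stands in for the question's flip_coin().
fn run(coin: bool) -> Vec<String> {
    let log: Log = Rc::new(RefCell::new(Vec::new()));
    {
        let _x;
        {
            let y = Noisy { name: "y", log: log.clone() };
            // y is moved into _x only on one branch (a conditional move).
            _x = if coin { y } else { Noisy { name: "other", log: log.clone() } };
            log.borrow_mut().push("end of inner scope".to_string());
        } // y is dropped here only if it was NOT moved into _x
        log.borrow_mut().push("end of outer scope".to_string());
    } // _x (either y or the replacement value) is dropped here
    let events = log.borrow().clone();
    events
}

fn main() {
    for coin in [true, false] {
        println!("coin = {}: {:?}", coin, run(coin));
    }
}
```

With coin set to true, y is moved into the outer variable and is dropped only at the end of the outer scope. With coin set to false, y is dropped at the end of the inner scope and the replacement value at the end of the outer one. How the compiler keeps track of this internally is not visible from the program itself.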
Note that C++ has similar run-time costs associated with unique_ptr. When the new Drop is implemented, Rust will be strictly better than C++ in that respect.
I hope this is a correct summary of the situation. Credit goes to u/dyoll1013, u/pcwalton, u/kibwen, u/Kimundi on reddit, and Chris Morgan here on SO.
In non-optimized code, Rust uses dynamic checks, but it's likely that they will be eliminated in optimized code.
I looked at the behavior of the following code:
#[derive(Debug)]
struct A {
s: String
}
impl Drop for A {
fn drop(&mut self) {
println!("Dropping {:?}", &self);
}
}
fn flip_coin() -> bool { false }
#[allow(unused_variables)]
pub fn test() {
let x;
{
let y1 = A { s: "y1".to_string() };
let y2 = A { s: "y2".to_string() };
x = if flip_coin() { y1 } else { y2 };
println!("leaving inner scope");
}
println!("leaving middle scope");
}
Consistent with your comment on the other answer, the call to drop for the String that was left alone occurs after the "leaving inner scope" println. That does seem consistent with one's expectation that the y's scopes extend to the end of their block.
Looking at the assembly language, compiled without optimization, it seems that the if statement not only copies either y1 or y2 to x, but also zeroes out whichever variable provided the source for the move. Here's the test:
.LBB14_8:
movb -437(%rbp), %al
andb $1, %al
movb %al, -177(%rbp)
testb $1, -177(%rbp)
jne .LBB14_11
jmp .LBB14_12
Here's the 'then' branch, which moves the "y1" String to x. Note especially the call to memset, which is zeroing out y1 after the move:
.LBB14_11:
xorl %esi, %esi
movl $32, %eax
movl %eax, %edx
leaq -64(%rbp), %rcx
movq -64(%rbp), %rdi
movq %rdi, -176(%rbp)
movq -56(%rbp), %rdi
movq %rdi, -168(%rbp)
movq -48(%rbp), %rdi
movq %rdi, -160(%rbp)
movq -40(%rbp), %rdi
movq %rdi, -152(%rbp)
movq %rcx, %rdi
callq memset@PLT
jmp .LBB14_13
(It looks horrible until you realize that all those movq instructions are just copying 32 bytes from %rbp-64, which is y1, to %rbp-176, which is x, or at least some temporary that'll eventually be x.) Note that it copies 32 bytes, not the 24 you'd expect for a Vec (one pointer plus two usizes). This is because Rust adds a hidden "drop flag" to the structure, indicating whether the value is live or not, following the three visible fields.
And here's the 'else' branch, doing exactly the same for y2:
.LBB14_12:
xorl %esi, %esi
movl $32, %eax
movl %eax, %edx
leaq -128(%rbp), %rcx
movq -128(%rbp), %rdi
movq %rdi, -176(%rbp)
movq -120(%rbp), %rdi
movq %rdi, -168(%rbp)
movq -112(%rbp), %rdi
movq %rdi, -160(%rbp)
movq -104(%rbp), %rdi
movq %rdi, -152(%rbp)
movq %rcx, %rdi
callq memset@PLT
.LBB14_13:
This is followed by the code for the "leaving inner scope" println, which is painful to behold, so I won't include it here.
We then call a "glue_drop" routine on both y1 and y2. This seems to be a compiler-generated function that takes an A, checks its String's Vec's drop flag, and if that's set, invokes A's drop routine, followed by the drop routine for the String it contains.
If I'm reading this right, it's pretty clever: even though it's the A that has the drop method we need to call first, Rust knows that it can use ... inhale ... the drop flag of the Vec inside the String inside the A as the flag that indicates whether the A needs to be dropped.
Now, when compiled with optimization, inlining and flow analysis should recognize situations where the drop definitely will happen (and omit the run-time check), or definitely will not happen (and omit the drop altogether). And I believe I have heard of optimizations that duplicate the code following a then/else clause into both paths, and then specialize them. This would eliminate all run-time checks from this code (but duplicate the println! call).
As the original poster points out, there's an RFC proposal to move drop flags out of the values and instead associate them with the stack slots holding the values.
So it's plausible that the optimized code might not have any run-time checks at all. I can't bring myself to read the optimized code, though. Why not give it a try yourself?
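One way to check whether a value still carries a hidden in-value drop flag is to compare its size with the sum of its visible fields. The following is a small sketch, assuming a current compiler; the sizes function is a made-up helper, and the layout of String is an implementation detail rather than a language guarantee.

```rust
use std::mem::size_of;

// Same shape as the struct in the answer above.
#[allow(dead_code)]
struct A {
    s: String,
}

impl Drop for A {
    fn drop(&mut self) {}
}

// Returns (size of usize, size of String, size of A) in bytes.
fn sizes() -> (usize, usize, usize) {
    (size_of::<usize>(), size_of::<String>(), size_of::<A>())
}

fn main() {
    let (word, string, a) = sizes();
    println!("word = {}, String = {}, A = {}", word, string, a);
}
```

On current toolchains this should report the same three-word size (24 bytes on x86-64) for both String and A, that is, without the extra flag word seen in the 32-byte copies above.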

Can I use Rust instead of C++ in OS development

I want to know whether compiled Rust code contains OS-dependent code (I'm not talking about things like printing).
For example:
let x = (4i,2i,3i);
let y = (3i,4i,4i);
If I now compare x == y, does that use some library routine, and if so, is it platform dependent?
Edited:
In C++ kernel code, we should not use new, try/catch, or anything from the standard library.
What are the things we should avoid when writing in Rust?
You can see the code that the rust compiler will generate for a snippet like that yourself, without having to even install Rust locally.
Just visit the web-based playpen, and type your snippet in there. You can run the program (and thus observe what it does via print statements), or, more usefully in this case, you can compile the program down to the generated assembly and then inspect it to see if it has calls to underlying system routines.
If you go to this link: http://is.gd/Be6YVJ I have already put such a program into the playpen. (See bottom of this post for the actual program text.)
If you hit the asm button, you can then see the assembly for each routine. (I have added inline(never) attributes to the relevant functions to ensure that they do not get optimized away by the compiler.)
Here is the generated assembly for bar below, a function that calls out to a higher-order function to get a pair of 3-tuples, and then compares them for equality:
.section .text._ZN3bar20h2bb2fd5b9c9e987beaaE,"ax",@progbits
.align 16, 0x90
.type _ZN3bar20h2bb2fd5b9c9e987beaaE,@function
_ZN3bar20h2bb2fd5b9c9e987beaaE:
.cfi_startproc
cmpq %fs:112, %rsp
ja .LBB0_2
movabsq $56, %r10
movabsq $0, %r11
callq __morestack
retq
.LBB0_2:
subq $56, %rsp
.Ltmp0:
.cfi_def_cfa_offset 64
movq %rdi, %rax
leaq 8(%rsp), %rdi
callq *%rax
movq 8(%rsp), %rcx
xorl %eax, %eax
cmpq 32(%rsp), %rcx
jne .LBB0_5
movq 40(%rsp), %rcx
cmpq %rcx, 16(%rsp)
jne .LBB0_5
movq 48(%rsp), %rax
cmpq %rax, 24(%rsp)
sete %al
.LBB0_5:
addq $56, %rsp
retq
.Ltmp1:
.size _ZN3bar20h2bb2fd5b9c9e987beaaE, .Ltmp1-_ZN3bar20h2bb2fd5b9c9e987beaaE
.cfi_endproc
So you can see that the only thing it is calling out to is a helper routine, __morestack, that checks for stack overflow (or allocates more stack, on systems with segmented stack support). (So for an example like this, that is the only core functionality you will need to provide yourself; note that you could just have it halt the kernel.)
Here is the program I put into the playpen:
#[inline(never)]
fn bar(f: fn() -> ((int, int, int), (int, int, int))) -> bool {
let (x, y) = f();
x == y
}
#[inline(never)]
fn foo_1() -> ((int,int,int), (int,int,int)) {
let x = (4i,2i,3i);
let y = (3i,4i,4i);
(x, y)
}
#[inline(never)]
fn foo_2() -> ((int,int,int), (int,int,int)) {
let x = (4i,2i,3i);
(x, x)
}
fn main() {
println!("bar(foo_1): {}", bar(foo_1));
println!("bar(foo_2): {}", bar(foo_2));
}
Rust was designed to allow one to implement an operating system kernel, drivers, or an application that has no operating system at all and runs on bare-metal hardware.
Currently, Rust's standard runtime can be disabled with the #![no_std] attribute in the code. You can still use some libraries, such as libcore. Among the things you will not get without the runtime are the format! and println! macros, the sprintf() and printf() equivalents.
For an example of something you can do today, take a look at the Zinc project.
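As a rough illustration of working without format! and println!, the sketch below formats text into a fixed-size buffer using only items from core (core::fmt::Write and the write! macro). FixedBuf and describe are hypothetical names, the 64-byte capacity is arbitrary, and main with println! is included only so the sketch runs as an ordinary hosted program; a real #![no_std] build would send the bytes to its own output device instead.

```rust
use core::fmt::{self, Write};

// Hypothetical fixed-capacity text buffer; no heap allocation involved.
struct FixedBuf {
    data: [u8; 64],
    len: usize,
}

impl FixedBuf {
    fn new() -> Self {
        FixedBuf { data: [0; 64], len: 0 }
    }

    fn as_str(&self) -> &str {
        // Only whole &str values are ever appended, so the bytes are valid UTF-8.
        core::str::from_utf8(&self.data[..self.len]).unwrap_or("")
    }
}

impl Write for FixedBuf {
    fn write_str(&mut self, s: &str) -> fmt::Result {
        let bytes = s.as_bytes();
        let end = self.len + bytes.len();
        if end > self.data.len() {
            return Err(fmt::Error); // buffer full: report an error instead of truncating
        }
        self.data[self.len..end].copy_from_slice(bytes);
        self.len = end;
        Ok(())
    }
}

// Formats the result of a tuple comparison using only `core` facilities.
fn describe(x: (i32, i32, i32), y: (i32, i32, i32)) -> FixedBuf {
    let mut buf = FixedBuf::new();
    // A formatting error (buffer overflow) is ignored here; the buffer then
    // simply holds whatever fit before the error.
    let _ = write!(buf, "{:?} == {:?}: {}", x, y, x == y);
    buf
}

fn main() {
    // main and println! exist only so this sketch can run on a hosted system.
    println!("{}", describe((4, 2, 3), (3, 4, 4)).as_str());
    println!("{}", describe((4, 2, 3), (4, 2, 3)).as_str());
}
```

The tuple comparison itself needs no library support beyond core, which matches what the assembly in the other answer suggests; only the final step of getting the text out of the buffer depends on the platform.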

Resources