Operations on 64-bit words on a 32-bit system - Linux

I'm new here, just as I'm new to assembly. I hope you can help me get started.
I'm using 32-bit (i686) Ubuntu to write programs in assembly, using gcc.
I know that the general-purpose registers are 32 bits (4 bytes) at most, but what do I do when I have to operate on 64-bit numbers? Intel's manual says that the high bits are stored in %edx and the low bits in %eax.
Great...
So how can I do anything with this two-register number? I have to convert a 64-bit decimal to hex, then save it to memory and show it on the screen.
How do I define the 64-bit quadword in the .data section at the start of the program?
EDIT:
When I defined a global variable llu (long long unsigned) in C and compiled it to assembly, it produced:
.data
a:
.long <low bits>
.long <high bits>
Is that because the values are stored backwards on the stack, or is there something more to it?

1) Write a trivial C program that uses long long numbers (which are 64 bits on Linux/ix86).
2) Compile that program into assembly with gcc -S t.c.
3) Study the resulting assembly.
4) Modify your program to do something more complicated, and repeat steps 2 and 3.
After several iterations, you should have a good handle on what you need to do in assembly.

Related

Minimal assembly program ARM

I am learning assembly, but I'm having trouble understanding how a program is executed by the CPU when using GNU/Linux on ARM. I will elaborate.
Problem:
I want my program to return 5 as its exit status.
Assembly for this is:
.text
.align 2
.global main
.type main, %function
main:
mov w0, 5 //move 5 to register w0
ret //return
I then assemble it with:
as prog.s -o prog.o
Everything is OK up to here. I understand I then have to link my object file against the C library in order to add the additional code that will make my program run. I then link with (paths omitted for clarity):
ld crti.o crtn.o crt1.o libc.so prog.o ld-linux-aarch64.so.1 -o prog
After this, things work as expected:
./prog; echo $?
5
My problem is that I can't figure out what the C standard library is actually doing here. I more or less understand that crti/n/1 add entry code to my program (e.g. the .init and .start sections), but I have no clue what libc's purpose is.
I am interested in what a minimal assembly implementation of "returning 5 as exit status" would be.
Most resources on the web focus on the instructions and program flow once you are in main. I am really interested in all the steps that happen once I execute the program with ./. I am now going through computer architecture textbooks, but I hope I can get a little help here.
The C language starts at main(), but for C to work you in general need at least a minimal bootstrap. For example, before main can be called, the C bootstrap must have prepared:
1) stack/stackpointer
2) .data initialized
3) .bss initialized
4) argc/argv prepared
And then there is the C library. There are many/countless C libraries, and each has its own design and requirements that must be satisfied before main is called. The C library makes system calls into the system, so this starts to become system-dependent (operating system: Linux, Windows, etc.); depending on the design of the C library, it may be a thin shim, heavily integrated, or somewhere in between.
Likewise, assume the operating system takes the "binary" from non-volatile media like a hard drive or SSD and copies the relevant parts into memory. (The binary formats supported by the operating system, and the rules for each format, are defined by the operating system, and the toolchain must conform; likewise the C library. Even though you sometimes see the same brand name, assume the toolchain and C library are separate entities, one designed to work with the other.) Some percentage of a popular, supported binary file format is there for debugging or file-format bookkeeping, and is not actually code or data used for execution.
So this leaves a system-level design option: does the binary file format indicate .data, .bss, .text, etc.? (Note that .data, .bss, and .text are not standards, just convention; most people know what they mean, even if a particular toolchain did not choose to use those names for sections, or even the term sections.)
If so, the operating system's loader, which takes the program and loads it into memory, can choose to put .data in the right place and zero .bss for you, so that the bootstrap does not have to. In a bare-metal situation the bootstrap would normally handle the read/write items, because the program is not loaded from media by some other software; it is often simply in the address space of the processor, on a ROM of some flavor.
Likewise, argv/argc could be handled by the operating system's tool that loads the binary, since it had to parse the location of the binary out of the command line (assuming the operating system has/uses a command-line interface). But it could just as easily pass the command line to the bootstrap and make the bootstrap do it. These are system-level design choices that have little to do with C, but everything to do with what happens before main is called.
The memory-space rules are defined by the operating system, and shared between the operating system and the C library, which often contains the bootstrap due to their intimate relationship (though I suppose the C library and bootstrap could be separate). So linking plays a role as well: does this operating system support protection, or is it all just read/write memory you can simply put things into, or are there separate regions for read-only (.text, .rodata, etc.) and read/write (.data, .bss, etc.) material? Someone needs to handle that. The linker script and bootstrap often have a very intimate relationship, and a linker-script solution is specific to a toolchain, not assumed to be portable (why would it be?). So while there are other solutions, the common solution is a C library with a bootstrap and a linker solution that are heavily tied to the operating system and the target processor.
And then you can talk about what happens after main(). I am happy to see you are using ARM rather than x86 to learn first, although aarch64 is a nightmare for a first one; not the instruction set, just the execution levels and all the protections. You can go a long, long way with this approach, but there are some things and some instructions you cannot touch without going bare metal. (Assuming you are using a Pi, there is a very good bare-metal forum with a lot of good resources.)
The GNU tools are such that binutils and gcc are separate but intimately related projects. gcc knows where things are relative to itself, so assuming you combined gcc with binutils and glibc (or you just use the toolchain you found), gcc knows where to find these other items relative to where it executed, and what items to pass when it calls the linker (gcc is to some extent just a shell that calls a preprocessor, a compiler, the assembler, and then the linker, if not instructed to skip these steps). But the GNU binutils linker does not. So, as distasteful as it feels, it is easier just to
gcc test.o -o test
rather than figure out, for that machine on that day, everything you need on the ld command line, what the paths are, and, depending on the design, the order of the arguments on the command line.
Note you can probably get away with this as a minimum
.global main
.type main, %function
main:
mov w0, 5 //move 5 to register w0
ret //return
or see what gcc generates
unsigned int fun ( void )
{
return 5;
}
.arch armv8-a
.file "so.c"
.text
.align 2
.p2align 4,,11
.global fun
.type fun, %function
fun:
mov w0, 5
ret
.size fun, .-fun
.ident "GCC: (GNU) 10.2.0"
I am used to seeing more fluff in there:
.arch armv5t
.fpu softvfp
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.eabi_attribute 26, 2
.eabi_attribute 30, 2
.eabi_attribute 34, 0
.eabi_attribute 18, 4
.file "so.c"
.text
.align 2
.global fun
.syntax unified
.arm
.type fun, %function
fun:
# args = 0, pretend = 0, frame = 0
# frame_needed = 0, uses_anonymous_args = 0
# link register save eliminated.
mov r0, #5
bx lr
.size fun, .-fun
.ident "GCC: (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609"
.section .note.GNU-stack,"",%progbits
Either way, you can look up each of the assembly-language items and decide whether you really need them or not. It depends in part on whether you feel the need to use a debugger or binutils tools to tear apart the binary (do you really need to know the size of fun, for example, in order to learn assembly language?)
If you wish to control all of the code and not link with a C library, you are more than welcome to; you need to know the memory-space rules for the operating system and create a linker script (the default one may in part be tied to the C library, and is no doubt overly complicated and not something you would want to use as a starting point). In this case, being two instructions in main, you simply need the one address space valid for the binary, and a label where the operating system enters (ideally declared with ENTRY(label); the label could be main if you want, but often is not; _start is often found in linker scripts, but that is not a rule either, you choose). And as pointed out in comments, you would need to make the system call to exit the program. System calls are specific to the operating system, and possibly its version, not to a target (ARM), so you would need to use the right one in the right way. Very doable: your whole project, linker script and assembly language, could be maybe a couple dozen lines of code total. We are not here to google those for you, so you would be on your own for that.
Part of your problem here is that you are searching for compiler solutions, when the compiler in general has absolutely nothing to do with any of this. A compiler takes one language and turns it into another language. An assembler is the same deal, except one language is simple and the other is usually machine code, bits (some compilers output bits rather than text as well). It is equivalent to looking up the user's manual for a table saw to figure out how to build a house. The table saw is just a tool, one of the tools you need, but just a generic tool. The compiler, specifically GNU's gcc, is generic; it does not even know what main() is. GNU follows the Unix way, so it has separate binutils and C library projects, separate developments, and you do not have to combine them if you do not want to; you can use them separately. And then there is the operating system, so half your question is buried in operating-system details, and the other half in a particular C library or other solution for connecting main() to the operating system.
Being open source, you can go look at the bootstrap for glibc and others and see what they do. Be warned that for this type of open source project the code is nearly unreadable; it is sometimes much easier to disassemble the result instead, YMMV.
You can search for the Linux system calls for ARM aarch64 and find the one for exit. You will likely see that the open-source C libraries or bootstrap solutions buried under what you are using today call exit; if not, then there is some other call they need to make to return to the operating system. It is unlikely to be a simple ret with a return value in a register, but technically that is how someone could choose to do it for their operating system.
I think you will find for Linux on ARM that Linux is going to parse the command line and pass argc/argv in registers, so you can simply use them. It is also likely going to prepare .data and .bss, so long as you build (link) the binary correctly.
Here's a bare-minimum example (this one is x86-64; on AArch64 the registers and syscall numbers differ, but the idea is the same).
Run it with:
gcc -c thisfile.S && ld thisfile.o && ./a.out
Source code:
#include <sys/syscall.h>
.global _start
_start:
    movq $SYS_write, %rax       # write(1, st, ed - st)
    movq $1, %rdi
    movq $st, %rsi
    movq $(ed - st), %rdx
    syscall
    movq $SYS_exit, %rax        # exit(1)
    movq $1, %rdi
    syscall
st:
    .ascii "\033[01;31mHello, OS World\033[0m\n"
ed:
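The question itself is about AArch64, where the same idea looks roughly like this (a sketch I have not run here; on arm64 the syscall number goes in x8, and the write and exit call numbers are 64 and 93):

```asm
.global _start
_start:
    mov x0, 1           // fd 1 = stdout
    adr x1, msg         // buffer address
    mov x2, 16          // length of msg
    mov x8, 64          // __NR_write on arm64
    svc 0
    mov x0, 5           // exit status 5, as in the question
    mov x8, 93          // __NR_exit on arm64
    svc 0
msg:
    .ascii "Hello, OS World\n"
```

Assemble and link with as prog.s -o prog.o && ld prog.o -o prog, with no C library involved; then ./prog; echo $? should print 5.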

GCC compiled code: why does an integer declaration need several statements?

I'm learning AT&T assembly. I know arrays/variables can be declared using .int/.long, or using .equ to declare a symbol that is replaced at assembly time.
They're declared inside either the .data section (initialized) or the .bss section (uninitialized).
But when I used gcc to compile a very simple .c file with the '-S' command-line option to check the generated assembly, I noticed that:
(1) The .s file is not using both .data and .bss, but only .data.
(2) The declaration of an integer (.long) costs several statements, some of which seem redundant or useless to me.
Shown as below, I've added some comments as per my questions.
$ cat n.c
int i=23;
int j;
int main(){
return 0;
}
$ gcc -S n.c
$ cat n.s
.file "n.c"
.globl i
.data
.align 4
.type i, #object #declare i, I think it's useless
.size i, 4 #There's '.long 23', we know it's 4 bytes, why need this line?
i:
.long 23 #Only this line is needed, I think
.comm j,4,4 #Why is j not put inside the .bss section?
.text
.globl main
.type main, #function
main:
.LFB0: #What does this symbol mean? I don't find it useful.
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl $0, %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0: #What does this symbol mean? I don't find it useful.
.size main, .-main
.ident "GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.2) 5.4.0 20160609"
.section .note.GNU-stack,"",#progbits
All my questions are in the comments above; I re-emphasize here:
.type i, #object
.size i, 4
i:
.long 23
I really think the above code is redundant; it should be as simple as:
i:
.long 23
Also, "j" doesn't have a symbol tag, and is not put inside the .bss section.
Did I get anything wrong? Please help to correct me. Thanks a lot.
(I am guessing you are using some Linux system)
They're declared inside either the .data section (initialized) or the .bss section (uninitialized).
No, you have many other possibilities, notably common symbols via .comm (data "common" to several object files, which the linker will merge) and the .rodata section for read-only data. The ELF format is flexible enough to permit many sections and many segments (some of which are not loaded, more precisely memory-mapped, into memory).
The description of sections in ELF files is much more complex than what you believe. Take time to read more, e.g. Linkers and Loaders by Levine. Read also the documentation and the scripts of GNU binutils, and ld(1) & as(1). Use objdump(1) and readelf(1) to explore existing ELF executables, object files, and shared objects. Read also execve(2) & elf(5).
But when I used gcc to compiled a very simple .c file with -S command line option
When examining the assembler file generated by gcc, I strongly recommend passing at least -fverbose-asm to ask gcc to emit some additional and useful comments in the assembler file. I also usually recommend using some optimization flag, e.g. at least -O1 (or perhaps -Og on recent versions of gcc).
I noticed that:
(1) .s is not using both .data and .bss, but only .data
No, your generated code uses the .comm directive and puts j there.
(2) The declaration of an integer(.long) cost several statements, some of them seems redundant or useless to me.
These are mostly not assembler statements (translated into machine code) but assembler directives; they are very useful (and they don't waste space in the memory segment produced by ld; the ELF format stores this information elsewhere). In particular, .size and .type are both needed because the symbol tables in ELF files contain more than addresses (they also have a notion of size, and a very primitive notion of type).
The .LFB0 is a gcc-generated (actually cc1-generated) label. GCC does not care about generating useless labels (it is simpler for the assembler generator in the GCC backend), since they don't appear in object files.
There's '.long 23', we know it's 4 bytes,
you might know that it is 4 bytes, but that information (the size of i) should go into the ELF file, so it requires explicit assembler directives.
(I don't have the space or time to explain the ELF format; you need to read many pages about it, and it is a lot more complex and complete than what you believe.)
BTW, Drepper's How To Write Shared Libraries is quite long (more than 40 pages) and explains a good deal about ELF files, focusing on shared libraries.

How to write C++ code that the compiler can efficiently compile to SSE or AVX?

Let's say I have a function written in C++ that performs matrix-vector multiplications on a lot of vectors. It takes a pointer to the array of vectors to transform. Am I correct to assume that the compiler cannot efficiently optimize that to SIMD instructions, because it does not know the alignment of the passed pointer at compile time (SSE requires 16-byte alignment and AVX 32-byte alignment)? Or is the memory alignment of the data irrelevant for optimal SIMD code, with data alignment only affecting cache performance?
If alignment is important for the generated code, how can I let the (Visual C++) compiler know that I intend to only pass values with a certain alignment to the function?
In theory alignment should not matter on Intel processors since Nehalem. Therefore, your compiler should be able to produce code in which a pointer being aligned or not is not an issue.
Unaligned load/store instructions have had the same performance as aligned ones on Intel processors since Nehalem. However, until AVX arrived with Sandy Bridge, unaligned loads could not be folded into another operation for micro-op fusion.
Additionally, even before AVX, having 16-byte-aligned memory could still be helpful to avoid the penalty of cache-line splits, so it would still be reasonable for a compiler to add code to run scalar iterations until the pointer is 16-byte aligned.
Since AVX there is no advantage to using aligned load/store instructions anymore, and there is no reason for a compiler to add code to make a pointer 16-byte or 32-byte aligned.
However, there is still a reason to use aligned memory even with AVX: avoiding cache-line splits. Therefore, it would be reasonable for a compiler to add code to make the pointer 32-byte aligned, even if it still used an unaligned load instruction.
So in practice some compilers produce much simpler code when they are told to assume that a pointer is aligned.
I'm not aware of a method to tell MSVC that a pointer is aligned. With GCC and Clang (since 3.6) you can use the built-in __builtin_assume_aligned. With ICC, and also GCC, you can use #pragma omp simd aligned. With ICC you can also use __assume_aligned.
For example with GCC compiling this simple loop
void foo(float * __restrict a, float * __restrict b, int n)
{
//a = (float*)__builtin_assume_aligned (a, 16);
//b = (float*)__builtin_assume_aligned (b, 16);
for(int i=0; i<(n & (-4)); i++) {
b[i] = 3.14159f*a[i];
}
}
with gcc -O3 -march=nehalem -S test.c, and then wc test.s, gives 160 lines. Whereas if I use __builtin_assume_aligned, wc test.s gives only 45 lines. When I did this with Clang, it returned 110 lines in both cases.
So with Clang, informing the compiler that the arrays were aligned made no difference (in this case), but with GCC it did. Counting lines of code is not a sufficient metric to gauge performance, and I'm not going to post all the assembly here; I just want to illustrate that your compiler may produce very different code when it is told the arrays are aligned.
Of course, the additional overhead that GCC has for not assuming the arrays are aligned may make no difference in practice. You have to test and see.
In any case, if you want to get the most from SIMD, I would not rely on the compiler to do it correctly (especially with MSVC). Your example of matrix*vector is a poor one in general (but maybe not for some special cases), since it's memory-bandwidth bound. But if you choose matrix*matrix, no compiler is going to optimize that well without a lot of help that does not conform to the C++ standard. In these cases you will need intrinsics/built-ins/assembly, in which you have explicit control of the alignment anyway.
Edit:
The assembly from GCC contains a lot of extraneous lines which are not part of the text segment. Doing gcc -O3 -march=nehalem -S test.c and then using objdump -d and counting the lines in the text (code) segment gives 108 lines without using __builtin_assume_aligned and only 16 lines with it. This shows more clearly that GCC produces very different code when it assumes the arrays are aligned.
Edit:
I went ahead and tested the foo function above in MSVC 2013. It produces unaligned loads and the code is much shorter than GCC (I only show the main loop here):
$LL3#foo:
movsxd rax, r9d
vmulps xmm1, xmm0, XMMWORD PTR [r10+rax*4]
vmovups XMMWORD PTR [r11+rax*4], xmm1
lea eax, DWORD PTR [r9+4]
add r9d, 8
movsxd rcx, eax
vmulps xmm1, xmm0, XMMWORD PTR [r10+rcx*4]
vmovups XMMWORD PTR [r11+rcx*4], xmm1
cmp r9d, edx
jl SHORT $LL3#foo
This should be fine on processors since Nehalem (late 2008). But MSVC still has cleanup code for arrays that are not a multiple of four, even though I told the compiler it was a multiple of four with (n & (-4)). At least GCC gets that right.
Since AVX can fold unaligned loads, I checked GCC with AVX to see if the code was the same.
void foo(float * __restrict a, float * __restrict b, int n)
{
//a = (float*)__builtin_assume_aligned (a, 32);
//b = (float*)__builtin_assume_aligned (b, 32);
for(int i=0; i<(n & (-8)); i++) {
b[i] = 3.14159f*a[i];
}
}
Without __builtin_assume_aligned, GCC produces 168 lines of assembly; with it, only 17 lines.
My original answer became too messy to edit so I am adding a new answer here and making my original answer community wiki.
I did some tests using aligned and unaligned memory on a pre Nehalem system and on a Haswell system with GCC, Clang, and MSVC.
The assembly shows that only GCC adds code to check and fix alignment. Due to this, GCC produces much simpler code with __builtin_assume_aligned. But using __builtin_assume_aligned with Clang only changes unaligned instructions to aligned ones (the number of instructions stays the same). MSVC just uses unaligned instructions.
The performance result is that on pre-Nehalem systems, Clang and MSVC are much slower than GCC with auto-vectorization when the memory is not aligned.
But the penalty for cache-line splits has been small since Nehalem. It turns out the extra code GCC adds to check and align the memory more than makes up for the small penalty due to cache-line splits. This explains why neither Clang nor MSVC worries about cache-line splits with vectorization.
So my original claim, that auto-vectorization does not need to know about the alignment, is more or less correct since Nehalem. That's not the same thing as saying that aligning memory is not useful since Nehalem.

Switch from 32-bit mode to 64-bit mode (long mode) on 64-bit Linux

My program is in 32-bit mode running on an x86_64 CPU (64-bit OS, Ubuntu 8.04). Is it possible to switch to 64-bit mode (long mode) in user mode temporarily? If so, how?
Background story: I'm writing a library linked into a 32-bit-mode program, so it must start in 32-bit mode. However, I'd like to use the faster x86_64 instructions for better performance. So I want to switch to 64-bit mode, do some pure computation (no OS interaction; no need for 64-bit addressing), and come back to 32-bit mode before returning to the caller.
I found there are some related but different questions. For example,
run 32 bit code in 64 bit program
run 64 bit code in 32 bit OS
My question is "run 64 bit code in 32 bit program, 64 bit OS"
Contrary to the other answers, I assert that in principle the short answer is YES. This is likely not supported officially in any way, but it appears to work. At the end of this answer I present a demo.
On Linux-x86_64, a 32 bit (and X32 too, according to GDB sources) process gets CS register equal to 0x23 — a selector of 32-bit ring 3 code segment defined in GDT (its base is 0). And 64 bit processes get another selector: 0x33 — a selector of long mode (i.e. 64 bit) ring 3 code segment (bases for ES, CS, SS, DS are treated unconditionally as zeros in 64 bit mode). Thus if we do far jump, far call or something similar with target segment selector of 0x33, we'll load the corresponding descriptor to the shadow part of CS and will end up in a 64 bit segment.
The demo at the bottom of this answer uses the jmp far instruction to jump to 64-bit code. Note that I've chosen a special constant to load into rax, so that to 32-bit code the same bytes decode as
dec eax
mov eax, 0xfafafafa
ud2
cli ; these two are unnecessary, but leaving them here for fun :)
hlt
This must fail if we execute it with a 32-bit descriptor in the CS shadow part (it will raise SIGILL on the ud2 instruction).
Now here's the demo (compile it with fasm).
format ELF executable
segment readable executable
SYS_EXIT_32BIT=1
SYS_EXIT_64BIT=60
SYS_WRITE=4
STDERR=2
entry $
mov ax,cs
cmp ax,0x23 ; 32 bit process on 64 bit kernel has this selector in CS
jne kernelIs32Bit
jmp 0x33:start64 ; switch to 64-bit segment
start64:
use64
mov rax, qword 0xf4fa0b0ffafafafa ; would crash inside this if executed as 32 bit code
xor rdi,rdi
mov eax, SYS_EXIT_64BIT
syscall
ud2
use32
kernelIs32Bit:
mov edx, msgLen
mov ecx, msg
mov ebx, STDERR
mov eax, SYS_WRITE
int 0x80
dec ebx
mov eax, SYS_EXIT_32BIT
int 0x80
msg:
db "Kernel appears to be 32 bit, can't jump to long mode segment",10
msgLen = $-msg
The answer is NO. Just because you are running 64-bit code (presumably with 64-bit data types, variables, etc.) does not mean you are running in 64-bit mode on a 32-bit box. Compilers have workarounds to provide 64-bit data types on 32-bit machines. For example, gcc has unsigned long long and uint64_t, which are 8-byte data types on both x86 and x86_64 machines. Data types are portable between x86 and x86_64 for that reason. That does not mean you get a 64-bit address space on a 32-bit box; it means the compiler can handle 64-bit data types.
You will run into instances where you cannot run some 64-bit code on 32-bit boxes. In that case, you will need preprocessor instructions to compile the correct 64-bit code on x86_64 and the correct 32-bit code on x86. A simple example is where different data types are explicitly required. In that case you can provide a preprocessor check to determine whether the host computer is 64-bit or 32-bit with:
#if defined(__LP64__) || defined(_LP64)
# define BUILD_64 1
#endif
You can then provide conditionals to compile the correct code with the following:
#ifdef BUILD_64
printf (" x : %ld, hex: %lx,\nfmtbinstr_64 (d, 4, \"-\"): %s\n",
d, d, fmtbinstr_64 (d, 4, "-"));
#else
printf (" x : %lld, hex: %llx,\nfmtbinstr_64 (d, 4, \"-\"): %s\n",
d, d, fmtbinstr_64 (d, 4, "-"));
#endif
Hopefully this provides a starting point for you to work with. If you have more specific questions, please post more details.

How to compile this asm code under linux with nasm and gcc?

I have the following source snippet from a book that I'm reading now. So I created an asm file and typed it in exactly. Then I used the nasm command (nasm -f elf test.asm), then tried to compile it into an executable file using gcc (gcc test.o -o test), and I get the following error.
Error:
ld: warning: ignoring file test.o, file was built for unsupported file format which is not the architecture being linked (x86_64)
Source code:
[BITS 16]
[SECTION .text]
START:
mov dx, eatmsg
mov ah, 9
int 21H
mov ax, 04C00H
int 21H
[SECTION .data]
eatmsg db "Eat at Joe's!", 13, 10, "$"
I guess the source code is not compatible with the current generation of CPUs (the book is old...).
How do I fix this source code to run on x86_64 CPUs?
If you want to continue using that old book to learn the basics (which is just fine, nothing wrong with learning the basics/old way before moving on to modern OS), you can run it in DOSBox, or a FreeDOS VM.
That's 16-bit code; it was made to create pure binary code, not executables. You cannot run it on modern OSes like Linux without heavy modifications. And by the way, that's MS-DOS assembly (it uses int 21h, which invokes MS-DOS services), so it will not work on Linux anyway.
If you want to learn assembly, I suggest getting a more modern book, or setting up a virtual machine in which to learn with your book (although learning 16-bit assembly is really not that useful nowadays).
First of all, the code contains interrupts that work only in real mode under DOS (int 21h with values in registers); Linux runs in protected mode, and you cannot call these interrupts directly.
Next, the code is 16-bit code; to make it 64-bit code you need [BITS 64].
Third, you have no entry point to the code. To make one, you can write a main function in C and then call the starting label in the assembly code as a function.
Have a look at this thing: PC Assembly Language by Paul A. Carter
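Putting those three points together, a 64-bit Linux translation of the example could look roughly like this (a sketch I have not assembled here; it keeps a C-style main entry point so that gcc test.o -o test works, and replaces the DOS int 21h services with Linux system calls):

```asm
global main

section .data
eatmsg  db "Eat at Joe's!", 10     ; 10 = newline; no DOS "$" terminator
msglen  equ $ - eatmsg

section .text
main:
    mov rax, 1          ; sys_write
    mov rdi, 1          ; fd 1 = stdout
    mov rsi, eatmsg     ; buffer
    mov rdx, msglen     ; length
    syscall
    xor eax, eax        ; return 0 from main
    ret
```

Build with nasm -f elf64 test.asm and then gcc test.o -o test (on distributions that default to PIE you may need gcc -no-pie, since the mov rsi, eatmsg uses an absolute address).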
