Should we store long strings on stack in assembly?

Should we store long strings on stack in assembly? - string

The general way to store strings in NASM is to use db like in msg: db 'hello world', 0xA. I think this stores the string in the bss section. So the string will occupy the storage for the duration of the entire program. Instead, if we store it in the stack, it will be alive only during the local frame. For small strings (less than 8 bytes), this can be done using mov dword [rsp], 'foo'. But for longer strings, the string has to be split and be stored using multiple instructions. So this would increase the executable size (I thought so).
So now, which is better in large programs with multiple strings? Are any of the assumptions I made above, wrong?

mov dword [rsp] 'foo' assembles to C70424666F6F00, it takes 7 bytes to encode 4 payload characters.
In comparison with standard static way DB 'foo',0 the definition of string as immediate operand in code section increases the code size by 75 %.
Your dynamic method may be profitable only if you could eliminate the .rodata or .data section entirely (which is seldom the case of large programs). Each additional section takes more space in executable program than its netto contents because of its file-alignment (in PE format it is 512 bytes).
And even when your program has no other static data in data sections beside long strings, you could spare more space with static declaration in .text (code) section.

But for longer strings string has to be split and be stored using multiple instructions. So this would increase the executable size (I thought so).
Yep, and in almost all cases, the number of bytes used by those instructions will exceed the number of bytes that would be needed to just store the string in memory normally. (The instruction includes all the bytes of the immediate, with a few exceptions like zero- and sign-extension, as well as additional bytes to encode the opcode, destination address, etc). And code, of course, also occupies (virtual) memory for the entire duration of the program. There's no free lunch.
As such, you should just assemble the strings directly into memory with db as you have done. But try to arrange for them to go in a read-only section such as .text or .rodata depending on what your system supports. A modern operating system will then be able to discard those pages from physical memory if there is memory pressure, knowing that they can be reloaded from the executable if needed again. This is entirely transparent to your program. If there are many strings that will be needed at the same times, you can optimize a little by trying to arrange for them to be adjacent in your program's memory (defining them all together in one asm file should suffice).
You could also design something where at runtime, you read your strings from an external file into dynamically allocated memory, then free it when done with them. This was a common technique for programs on ancient operating systems with limited physical memory and no virtual memory support (think MS-DOS).

The string data has to be somewhere. Existing as immediates in your machine code takes space in the .text section of your program, normally linked into the same program segment as .rodata where you'd keep string literals. Running instructions to store strings to the stack means the data has to come into I-cache, then go out into D-cache.
But for long redundant strings, code to generate them in memory may be smaller than the string itself. This comes down to the Kolmogorov complexity; minimum size of a program that can output (or generate in an array) the given data. That program could be a gzip or zstd decompressor with some input constant data, could be good for a very large string or set of strings, much larger than the decompression code. Or for a specific case, have a look at code-golf questions like The alphabet in programming languages where my answer is 9 bytes of x86 machine code (including a ret) which inefficiently stores 1 byte at a time, incrementing a pointer in a loop, to produce the 26-byte string (without a terminating 0). Slow but compact.
Pushing / Storing from immediates
For just 4 bytes (not including the 0 terminator) you'd use push 'foo' = 5 bytes = 80% efficiency. On x86-64, that's a qword push (sign-extending the imm32 to 64-bit), so we get 4 bytes of zeros for free.
After that, getting the pointer into a register with mov rdi, rsp (3 bytes) is 4 bytes smaller than lea rdi, [rel msg_foo] (7 bytes), so it's an overall win for total size (unless padding for function alignment bumps it up or hides it). But anyway, comparing against the best option instead of the worst might be a good idea for your answer.
Compilers will sometimes use immediate data to init a local struct or array that has to be on the stack anyway (i.e. they have to pass a non-const pointer to another function); their threshold for switching to copying from .rodata (with movdqa xmm load/store) is higher than 8 bytes. But when you just want to print the string, you just need to pass a pointer to .rodata without copying it to the stack at all, so the threshold is much lower for it to be worth it to use stack space. Maybe 8 bytes, maybe 16, maybe only 4, depending on I-cache vs. D-cache pressure in your program.
For short but not tiny strings
mov rcx, imm64 + push rcx is 10+1 = 11 total bytes for 8 bytes of payload = 73% efficiency, vs. 4/7 = 57%. (At an offset from RSP it would be even worse, but to save code size you could use RBP+imm8 instead of RSP+imm8, but that's still 4 bytes per 7. You could mix push sign_extended_imm32 with mov dword [rsp+4], imm32 but that's also bad.)
This does have overhead that scales with string size, so it quickly becomes more size-efficient to just copy from .rodata, e.g. with an XMM loop, call memcpy, or even rep movsb if you're optimizing for size.
Or of course just using the string in .rodata if possible, if you don't need to make a copy you can modify on the stack.
Shellcode is a common use-case for techniques like this. You need a single "flat" sequence of bytes, usually not containing any 00 bytes which would terminate a C string.
You can have some data past the end of the actual machine code part, and mov store some zeros into that and generate pointers to it, but that's somewhat cumbersome. And you need a position-independent way to get pointers into registers. call/pop works, if you jump forward and call backward so the rel32 doesn't involve any 00 bytes. Same for RIP-relative lea.
It's often just as easy to push a zeroed register, or an imm8 or imm32, and get some zeros into memory along with ASCII data that way. Generating it a bit at a time makes it easy to mov rsi, rsp or whatever, instead of needing position-independent addressing relative to RIP.

Related

declare mutable string with varying size dynamically

I made my own string declarator with macro in GNU Assembler in x64 machine.
.macro declareString identifier, value
.pushsection .data
\identifier: .ascii "\value"
"lengthof.\identifier"= . - \identifier
.popsection
.endm
This is how to use it:
.data
ANOTHER_VARIABEL: .ascii "why"
.text
_start:
declareString myString, "good"
lea rax, myString
mov rbx, lengthof.myString
So basically my macro is just label maker.
So in debugger, rax will be value of myString address because myString basically just a label.
And rbx will be value of myString length (4).
So If I want change myString from good which has 4 chars to rocker which has 6 chars at runtime, I'm afraid it will overwrite another variabel due to size collision.
I heard about heap memory but I'm not sure how does it work. I only know about stack, but I'm used to use stack as temporary backup.
So how do I declare mutable string in assembly?

Same as in C; if you want to use static storage (in .data) for the characters themselves (like static char myString[4] = "good"; note not including a terminating 0 byte), you actually need to reserve enough space for the largest you ever want this string to be. Like static char myString[128] = "good".
In your case, like .space 124 after your .ascii would be a total of 128 bytes reserved after the label myString.
You're correct that you shouldn't write past the size you reserve, although it would come after ANOTHER_VARIABEL, not before, since that part of your .data section is earlier in your source than where you use the macro containing .pushsection
If you wanted to do dynamic allocation, you might want to make an mmap(..., MAP_PRIVATE|MAP_ANONYMOUS, PROT_READ|PROT_WRITE, ...) system call to allocate 1 or more pages of memory, then copy bytes into it. (You're writing a _start so I'm assuming you're not linking libc.)
You could make your global variable a pointer, but if you free the old pointer and allocate a new one (or mremap to the new size), you should start with runtime allocation and init of it. You don't want to munmap a page of .rodata.
Or instead of mmap/munmap, Linux mremap(MREMAP_MAYMOVE) can ensure that the allocation is big enough to hold your whole string without copying it. MAYMOVE lets it pick a new virtual address if there aren't enough free virtual pages following the currently pointed-to page, but without copying the actual data in the already-allocated pages.
There are tons of Q&As about making system calls in assembly so I'm not going into details about that; you can think about how you're managing memory in terms of system calls to make, separately from implementing it in asm. Thinking about it in C is pretty much equally valid; asm doesn't add much in terms of being able to declare static data or make system calls. Other than the fact that you could in asm make sure that a page of .data was exclusively dedicated to this string, with no other things using it, so you could potentially let mremap unmap that page from your .data and map it at a different virtual address.

Linux x86 CPU Instruction Layout Confusion

In x86, I understand multi-byte objects are stored in memory little endian style.
Now generally speaking, when it comes to CPU instructions, the OPCODE determines the purpose of the instruction and data/memory addresses may follow the opcode in it's encoded format. My understanding is the Opcode portion of the instruction should be the most significant byte and thus appear at the highest address of any given instruction encoding representation.
Can someone explain the memory layout on this x86 linux gdb example? I would imagine that the opcode 0xb8 would appear at a higher address due to it being the most significant byte.
(gdb) disassemble _start
Dump of assembler code for function _start:
0x08048080 <+0>: mov eax,0x11223344
(gdb) x/1xb _start+0
0x8048080 <_start>: 0xb8
(gdb) x/1xb _start+1
0x8048081 <_start+1>: 0x44
(gdb) x/1xb _start+2
0x8048082 <_start+2>: 0x33
(gdb) x/1xb _start+3
0x8048083 <_start+3>: 0x22
(gdb) x/1xb _start+4
0x8048084 <_start+4>: 0x11
It appears the instruction mov eax, 0x11223344 is encoding as 0x11 0x22 0x33 0x44 0xb8.
Questions.
1.) How does the CPU know how many bytes the instruction will take up if the first byte it see's is not an opcode?
2.) I'm wondering if perhaps x86 cpu instructions do not even have endian-ness and are considering some type of string? (probably way off here)

x86 is a variable length instruction set, you start with a single byte which has no endianness, it is wherever it is.
Then depending on the opcode there may be more bytes and those might for example be a 32 bit immediate, and (if that group of bytes is an immediate or address of some sort) THOSE bytes will be little endian. Say you have the five bytes ABCDE (no endianess, think array or string). The A byte is the opcode, the B byte would then be the lower 8 bits of the immediate and the E the upper 8 bits of the immediate.
Opcode is a hard to use term, in these older 8/16 bit CISC processors like x86 the entire byte was an opcode that you basically looked up in a table to see what it meant (and inside the processor they did use a table to figure out how to execute it). When you look at MIPS or ARM or other (certainly RISC) instruction sets like those, only a portion of the 32 bits are the "opcode" and in neither of those cases is it the same set of bits from one instruction to another, you have to look at various places in the instruction (yes there is overlap to make the decoding sane), MIPS is a lot more consistent you have one blob in one place you look at but one of those patterns requires you to look at another blob of bits to fully decode. ARM you start at a common bit and as you work your way across you are further decoding the instruction, then you may have to grab some random looking spots to fully decode. The rest of the bits are operands, what register to use or immediate or whatever the kind of thing that in a CISC you needed a look up table for (are implied by the opcode but not defined by bits in the opcode).
1) the next byte after the prior instruction will be interpreted as an opcode even if not intended to be one (if execution continues to that byte and doesnt branch). I dont remember my x86 table off hand to know if there are any undefined instructions or not, if undefined then it will hit a handler, otherwise it will decode what it finds as machine code and if it is not properly formed instructions will likely crash, sometimes you get lucky and it just messes something up and keeps going, or even more lucky and you cant tell that it almost crashed.
2) you are right for these 8/16 bit CISC or similar instruction sets they are treated more like strings that you parse through linearly.

Calculate number of bytes to fetch , Assembly

I'm tying to calculate how much byte the "fetch" need.
I'm writing in assembly this code
jmp [2*eax]
and the command in the list file is 3 bytes.
when i'm writing this command :
jmp [4*eax]
I got 7 bytes
does anyone know why ?

I suspect your assembler is being smart and is encoding the jmp [2*eax] as jmp [eax+eax] which takes fewer bytes since it doesn't require a displacement. Whereas jmp [4*eax] is really the equivalent of jmp [4*eax+0x00000000] which requires an extra 4 bytes for the displacement.
It has to do with the was the SIB (scaled index byte) works. Typically this encodes addresses in the form base + index*scale + displacement. The displacement is optional, but only if a base register is included. If you want to leave off the base register, then you are forced to include a 32 bit displacement.
So to get eax*4 you need to use the form index*4 + displacement even though you don't need that displacement. But to get eax*2, you can use the form base + index*scale (i.e. eax+eax*1), and avoid having to include the displacement.

Appending two string in x86 assembly

I'm currently working on an assignment in AT&T Assembly and now I have to append two strings:
message: .asciz "String 1"
before: .asciz "String 2"
I have really no idea how to do this or how to begin. I've already searched on internet but I couldn't find any helpful information. I think I have to manually copy the characters of the second string to the end of the first string but I'm not sure about that.
Could anyone please explain to me how to do this? :)

This question fails to mention the target memory, which makes it somewhat difficult to answer. I also don't know if you're in 16 bit, 32 bit or 64 bit. For convenience's sake, I'll also just assume they're C style 0-terminated strings.
Anyway, this seems to be the general procedure:
Get the length of the first string (instructions on writing an asm strlen can be found here: http://www.int80h.org/strlen/)
Set the ptr to the target memory
Copy the first string to the destination memory, using rep(e/ne) movsb with the size in ecx.
This can be CPU-optimized by using 'movsd' by first doing a shr ecx, 2 on your length to get it in batches of 4 bytes, and then doing the remainder with movsb. I've seen this done like this:
mov edi, dest
mov esi, string_address
mov ecx, string_length
mov eax, ecx
shr ecx, 2
repne movsd
mov cl, al
and cl, 3
repne movsb ; esi and edi move along the addresses as they copy, meaning they are already set correctly here
Get the length of the second string (be sure to back up your edi in stack or another register if needed; it contains the address you need to copy the next string to)
Copy the second string to the destination memory (as I said, the correct address should be in edi after the first string operation)
For safety, add a new 0 behind it.
If you're copying the second string to the end of the first string, you need one less copy operation, but you have to make sure there is actually enough space there to copy the second string without overwriting other vital stuff.

This is not a trivial matter. Strings are variable in length and occupy different spaces in memory, and there has to be some way to know how long they are or where they end. With C or C++, a nul bytes (bytes of zero value) indicates the end of the string. With some other
program languages, you have a pointer to the start of the string and the length of the string stored separately, which has the advantage of letting you store binary (including bytes of
zero value) in the string. Even with C and the rest, you have to have a pointer to where the string starts.
What generally has to happen is that you have to use asm to contact the operating system and request a block of memory that is currently free that is big enough to contain the contents of the two strings once they are attached. This would be memory separate from either of the
two strings to start with, and it comes from what is referred to as the Memory Heap, Once you are given the beginning point of that memory block, you copy the contents of the first
string into it, then you continue on while copying the contents of the second string in
there right behind the first. Then you release the memory that had been assigned to the
first string and reassign the block to that string by changing its pointer, and possibly
its length. The released memory is returned to the Memory Heap by the Operating System for reuse elsewhere.
Actually, the operating system is not the only source of freed up memory. Some compilers, even assemblers, either handle memory management on their own, or provide suitable tools
to the programmer to do it as the need arises.
In other words, this can be a very ambitious undertaking, and you have to know quite a
bit about what is going on to do it right. You do it wrong, you can expect consequences
like crashing your system and needing to reboot.

Memory allocation in VC++

I am working with VC++ 2005
I have overloaded the new and delete operators.
All is fine.
My question is related to some of the magic VC++ is adding to the memory allocation.
When I use the C++ call:
data = new _T [size];
The return (for example) from the global memory allocation is 071f2ea0 but data is set to 071f2ea4
When overloaded delete [] is called the 071f2ea0 address is passed in.
Another note is when using something like:
data = new _T;
both data an the return from the global memory allocation are the same.
I am pretty sure Microsoft is adding something at the head of the memory allocation to use for book keeping. My question is, does anyone know of the rules Microsoft is using.
I want to pass in the value of "data" into some memory testing routines so I need to get back to the original memory reference from the global allocation call.
I could assume the 4 byte are an index but I wanted to make sure. I could easily be a flag plus offset, or count or and index into some other table, or just an alignment to cache line of the CPU. I need to find out for sure. I have not been able to find any references to outline the details.
I also think on one of my other runs that the offset was 6 bytes not 4

The 4 bytes most likely contains the total number of objects in the allocation so delete [] will be able to loop over all objects in the array calling their destructor..
To get back the original address, you could keep a lookup-table keyed on address / 16, which stores the base address and length. This will enable you to find the original allocation. You need to ensure that your allocation+4 doesn't cross a 16-byte boundary, however.
EDIT:
I went ahead and wrote a test program that creates 50 objects with a destructor via new, and calls delete []. The destructor just calls printf, so it won't be optimized away.
#include <stdio.h>
class MySimpleClass
{
public:
~MySimpleClass() {printf("Hi\n");}
};
int main()
{
MySimpleClass* arr = new MySimpleClass[50];
delete [] arr;
return 0;
}
The partial disassembly is below, cleaned up to be a bit more legible. As you can see, VC++ is storing array count in the initial 4 byes.
; Allocation
mov ecx, 36h ; Size of allocation
call scratch!operator new
test rax,rax ; Don't write 4 bytes if NULL.
je scratch!main+0x25
mov dword ptr [rax],32h ; Store 50 in first 4 bytes
add rax,4 ; Increment pointer by 4
; Free
lea rdi,[rax-4] ; Grab previous 4 bytes of allocation
mov ebx,dword ptr [rdi] ; Store in loop counter
jmp StartLoop ; Jump to beginning of loop
Loop:
lea rcx,[scratch!`string' (00000000`ffe11170)] ; 1st param to printf
call qword ptr [scratch!_imp_printf; Destructor
StartLoop:
sub ebx,1 ; Decrement loop counter
jns Loop ; Loop while not negative
This book keeping is distinct from the book keeping that malloc or HeapAlloc do. Those allocators don't care about objects and arrays. They only see blobs of memory with a total size. VC++ can't query the heap manager for the total size of the allocation because that would mean that the heap manager would be bound to allocate a block exactly the size that you requested. The heap manager shouldn't have this limitation - if you ask for 240 bytes to allocate for 20 12 byte objects, it should be free to return a 256 byte block that it has immediately available.

The 4 bytes offset is for the number of elements. When delete[] is invoked it is necessary to know the exact number of elements to be able to call the destructors for exactly the necessary number of objects.
Since the memory allocator could have returned a bigger block than necessary for storing all the objects the only sure way to know the number of elements is to store it in the beginning of the block.

For sure, memory allocation does need some bookkeeping info stored together with the actual memory.
Next to that, heap allocated blocks will also be 'decorated' with some magic values, used to easily detect buffer overruns, double deletion, ... Take a look at this CodeGuru site for more info. If you want to know the last of heap debugging, take a look at the msdn documentation.

The finial answer is:
When you do a malloc (which new uses under the hoods) allocates more memory than needed for the system to manage memory. This debug information is not what I am interested. What I am interested is the difference between using the return from malloc on a C++ array allocation.
What I have been able to determine is sometimes the C++ adds/uses 4 additional byte to keep count of the objects being allocated. The twist is these 4 bytes are only added if the objects being allocated require being destructed.
So given:
void* _cdecl operator new(size_t size)
{
void *ptr = malloc(size);
return(ptr);
}
for the case of:
object * data = new object [size]
data will be ptr plus 4 bytes (assuming object required a destructor)
while:
char *data = new char [size]
data will equal ptr because no destructor is required.
Again, I am not interested in the memory tracking malloc adds to manage memory.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string