Memory allocation in VC++

Memory allocation in VC++ - visual-c++

I am working with VC++ 2005
I have overloaded the new and delete operators.
All is fine.
My question is related to some of the magic VC++ is adding to the memory allocation.
When I use the C++ call:
data = new _T [size];
The return (for example) from the global memory allocation is 071f2ea0 but data is set to 071f2ea4
When overloaded delete [] is called the 071f2ea0 address is passed in.
Another note is when using something like:
data = new _T;
both data an the return from the global memory allocation are the same.
I am pretty sure Microsoft is adding something at the head of the memory allocation to use for book keeping. My question is, does anyone know of the rules Microsoft is using.
I want to pass in the value of "data" into some memory testing routines so I need to get back to the original memory reference from the global allocation call.
I could assume the 4 byte are an index but I wanted to make sure. I could easily be a flag plus offset, or count or and index into some other table, or just an alignment to cache line of the CPU. I need to find out for sure. I have not been able to find any references to outline the details.
I also think on one of my other runs that the offset was 6 bytes not 4

The 4 bytes most likely contains the total number of objects in the allocation so delete [] will be able to loop over all objects in the array calling their destructor..
To get back the original address, you could keep a lookup-table keyed on address / 16, which stores the base address and length. This will enable you to find the original allocation. You need to ensure that your allocation+4 doesn't cross a 16-byte boundary, however.
EDIT:
I went ahead and wrote a test program that creates 50 objects with a destructor via new, and calls delete []. The destructor just calls printf, so it won't be optimized away.
#include <stdio.h>
class MySimpleClass
{
public:
~MySimpleClass() {printf("Hi\n");}
};
int main()
{
MySimpleClass* arr = new MySimpleClass[50];
delete [] arr;
return 0;
}
The partial disassembly is below, cleaned up to be a bit more legible. As you can see, VC++ is storing array count in the initial 4 byes.
; Allocation
mov ecx, 36h ; Size of allocation
call scratch!operator new
test rax,rax ; Don't write 4 bytes if NULL.
je scratch!main+0x25
mov dword ptr [rax],32h ; Store 50 in first 4 bytes
add rax,4 ; Increment pointer by 4
; Free
lea rdi,[rax-4] ; Grab previous 4 bytes of allocation
mov ebx,dword ptr [rdi] ; Store in loop counter
jmp StartLoop ; Jump to beginning of loop
Loop:
lea rcx,[scratch!`string' (00000000`ffe11170)] ; 1st param to printf
call qword ptr [scratch!_imp_printf; Destructor
StartLoop:
sub ebx,1 ; Decrement loop counter
jns Loop ; Loop while not negative
This book keeping is distinct from the book keeping that malloc or HeapAlloc do. Those allocators don't care about objects and arrays. They only see blobs of memory with a total size. VC++ can't query the heap manager for the total size of the allocation because that would mean that the heap manager would be bound to allocate a block exactly the size that you requested. The heap manager shouldn't have this limitation - if you ask for 240 bytes to allocate for 20 12 byte objects, it should be free to return a 256 byte block that it has immediately available.

The 4 bytes offset is for the number of elements. When delete[] is invoked it is necessary to know the exact number of elements to be able to call the destructors for exactly the necessary number of objects.
Since the memory allocator could have returned a bigger block than necessary for storing all the objects the only sure way to know the number of elements is to store it in the beginning of the block.

For sure, memory allocation does need some bookkeeping info stored together with the actual memory.
Next to that, heap allocated blocks will also be 'decorated' with some magic values, used to easily detect buffer overruns, double deletion, ... Take a look at this CodeGuru site for more info. If you want to know the last of heap debugging, take a look at the msdn documentation.

The finial answer is:
When you do a malloc (which new uses under the hoods) allocates more memory than needed for the system to manage memory. This debug information is not what I am interested. What I am interested is the difference between using the return from malloc on a C++ array allocation.
What I have been able to determine is sometimes the C++ adds/uses 4 additional byte to keep count of the objects being allocated. The twist is these 4 bytes are only added if the objects being allocated require being destructed.
So given:
void* _cdecl operator new(size_t size)
{
void *ptr = malloc(size);
return(ptr);
}
for the case of:
object * data = new object [size]
data will be ptr plus 4 bytes (assuming object required a destructor)
while:
char *data = new char [size]
data will equal ptr because no destructor is required.
Again, I am not interested in the memory tracking malloc adds to manage memory.

Related

declare mutable string with varying size dynamically

I made my own string declarator with macro in GNU Assembler in x64 machine.
.macro declareString identifier, value
.pushsection .data
\identifier: .ascii "\value"
"lengthof.\identifier"= . - \identifier
.popsection
.endm
This is how to use it:
.data
ANOTHER_VARIABEL: .ascii "why"
.text
_start:
declareString myString, "good"
lea rax, myString
mov rbx, lengthof.myString
So basically my macro is just label maker.
So in debugger, rax will be value of myString address because myString basically just a label.
And rbx will be value of myString length (4).
So If I want change myString from good which has 4 chars to rocker which has 6 chars at runtime, I'm afraid it will overwrite another variabel due to size collision.
I heard about heap memory but I'm not sure how does it work. I only know about stack, but I'm used to use stack as temporary backup.
So how do I declare mutable string in assembly?

Same as in C; if you want to use static storage (in .data) for the characters themselves (like static char myString[4] = "good"; note not including a terminating 0 byte), you actually need to reserve enough space for the largest you ever want this string to be. Like static char myString[128] = "good".
In your case, like .space 124 after your .ascii would be a total of 128 bytes reserved after the label myString.
You're correct that you shouldn't write past the size you reserve, although it would come after ANOTHER_VARIABEL, not before, since that part of your .data section is earlier in your source than where you use the macro containing .pushsection
If you wanted to do dynamic allocation, you might want to make an mmap(..., MAP_PRIVATE|MAP_ANONYMOUS, PROT_READ|PROT_WRITE, ...) system call to allocate 1 or more pages of memory, then copy bytes into it. (You're writing a _start so I'm assuming you're not linking libc.)
You could make your global variable a pointer, but if you free the old pointer and allocate a new one (or mremap to the new size), you should start with runtime allocation and init of it. You don't want to munmap a page of .rodata.
Or instead of mmap/munmap, Linux mremap(MREMAP_MAYMOVE) can ensure that the allocation is big enough to hold your whole string without copying it. MAYMOVE lets it pick a new virtual address if there aren't enough free virtual pages following the currently pointed-to page, but without copying the actual data in the already-allocated pages.
There are tons of Q&As about making system calls in assembly so I'm not going into details about that; you can think about how you're managing memory in terms of system calls to make, separately from implementing it in asm. Thinking about it in C is pretty much equally valid; asm doesn't add much in terms of being able to declare static data or make system calls. Other than the fact that you could in asm make sure that a page of .data was exclusively dedicated to this string, with no other things using it, so you could potentially let mremap unmap that page from your .data and map it at a different virtual address.

Should we store long strings on stack in assembly?

The general way to store strings in NASM is to use db like in msg: db 'hello world', 0xA. I think this stores the string in the bss section. So the string will occupy the storage for the duration of the entire program. Instead, if we store it in the stack, it will be alive only during the local frame. For small strings (less than 8 bytes), this can be done using mov dword [rsp], 'foo'. But for longer strings, the string has to be split and be stored using multiple instructions. So this would increase the executable size (I thought so).
So now, which is better in large programs with multiple strings? Are any of the assumptions I made above, wrong?

mov dword [rsp] 'foo' assembles to C70424666F6F00, it takes 7 bytes to encode 4 payload characters.
In comparison with standard static way DB 'foo',0 the definition of string as immediate operand in code section increases the code size by 75 %.
Your dynamic method may be profitable only if you could eliminate the .rodata or .data section entirely (which is seldom the case of large programs). Each additional section takes more space in executable program than its netto contents because of its file-alignment (in PE format it is 512 bytes).
And even when your program has no other static data in data sections beside long strings, you could spare more space with static declaration in .text (code) section.

But for longer strings string has to be split and be stored using multiple instructions. So this would increase the executable size (I thought so).
Yep, and in almost all cases, the number of bytes used by those instructions will exceed the number of bytes that would be needed to just store the string in memory normally. (The instruction includes all the bytes of the immediate, with a few exceptions like zero- and sign-extension, as well as additional bytes to encode the opcode, destination address, etc). And code, of course, also occupies (virtual) memory for the entire duration of the program. There's no free lunch.
As such, you should just assemble the strings directly into memory with db as you have done. But try to arrange for them to go in a read-only section such as .text or .rodata depending on what your system supports. A modern operating system will then be able to discard those pages from physical memory if there is memory pressure, knowing that they can be reloaded from the executable if needed again. This is entirely transparent to your program. If there are many strings that will be needed at the same times, you can optimize a little by trying to arrange for them to be adjacent in your program's memory (defining them all together in one asm file should suffice).
You could also design something where at runtime, you read your strings from an external file into dynamically allocated memory, then free it when done with them. This was a common technique for programs on ancient operating systems with limited physical memory and no virtual memory support (think MS-DOS).

The string data has to be somewhere. Existing as immediates in your machine code takes space in the .text section of your program, normally linked into the same program segment as .rodata where you'd keep string literals. Running instructions to store strings to the stack means the data has to come into I-cache, then go out into D-cache.
But for long redundant strings, code to generate them in memory may be smaller than the string itself. This comes down to the Kolmogorov complexity; minimum size of a program that can output (or generate in an array) the given data. That program could be a gzip or zstd decompressor with some input constant data, could be good for a very large string or set of strings, much larger than the decompression code. Or for a specific case, have a look at code-golf questions like The alphabet in programming languages where my answer is 9 bytes of x86 machine code (including a ret) which inefficiently stores 1 byte at a time, incrementing a pointer in a loop, to produce the 26-byte string (without a terminating 0). Slow but compact.
Pushing / Storing from immediates
For just 4 bytes (not including the 0 terminator) you'd use push 'foo' = 5 bytes = 80% efficiency. On x86-64, that's a qword push (sign-extending the imm32 to 64-bit), so we get 4 bytes of zeros for free.
After that, getting the pointer into a register with mov rdi, rsp (3 bytes) is 4 bytes smaller than lea rdi, [rel msg_foo] (7 bytes), so it's an overall win for total size (unless padding for function alignment bumps it up or hides it). But anyway, comparing against the best option instead of the worst might be a good idea for your answer.
Compilers will sometimes use immediate data to init a local struct or array that has to be on the stack anyway (i.e. they have to pass a non-const pointer to another function); their threshold for switching to copying from .rodata (with movdqa xmm load/store) is higher than 8 bytes. But when you just want to print the string, you just need to pass a pointer to .rodata without copying it to the stack at all, so the threshold is much lower for it to be worth it to use stack space. Maybe 8 bytes, maybe 16, maybe only 4, depending on I-cache vs. D-cache pressure in your program.
For short but not tiny strings
mov rcx, imm64 + push rcx is 10+1 = 11 total bytes for 8 bytes of payload = 73% efficiency, vs. 4/7 = 57%. (At an offset from RSP it would be even worse, but to save code size you could use RBP+imm8 instead of RSP+imm8, but that's still 4 bytes per 7. You could mix push sign_extended_imm32 with mov dword [rsp+4], imm32 but that's also bad.)
This does have overhead that scales with string size, so it quickly becomes more size-efficient to just copy from .rodata, e.g. with an XMM loop, call memcpy, or even rep movsb if you're optimizing for size.
Or of course just using the string in .rodata if possible, if you don't need to make a copy you can modify on the stack.
Shellcode is a common use-case for techniques like this. You need a single "flat" sequence of bytes, usually not containing any 00 bytes which would terminate a C string.
You can have some data past the end of the actual machine code part, and mov store some zeros into that and generate pointers to it, but that's somewhat cumbersome. And you need a position-independent way to get pointers into registers. call/pop works, if you jump forward and call backward so the rel32 doesn't involve any 00 bytes. Same for RIP-relative lea.
It's often just as easy to push a zeroed register, or an imm8 or imm32, and get some zeros into memory along with ASCII data that way. Generating it a bit at a time makes it easy to mov rsi, rsp or whatever, instead of needing position-independent addressing relative to RIP.

Results of doing += on a double from multiple threads

Consider the following code:
void add(double& a, double b) {
a += b;
}
which according to godbolt compiles on a Skylake to:
add(double&, double):
vaddsd xmm0, xmm0, QWORD PTR [rdi]
vmovsd QWORD PTR [rdi], xmm0
ret
If I call add(a, 1.23) and add(a, 2.34) from different threads (for the same variable a), will a definitely end up as either a+1.23, a+2.34, or a+1.23+2.34?
That is, will one of these results definitely happen given this assembly, and a will not end up in some other state?

Here is a relevant questions to me:
Does the CPU fetch the word you are dealing with in a single operation?
Some processor might allow memory access to a variable that happens to be not aligned in memory by doing two fetches one after the other - non atomically of course.
In that case, problems would arise if another thread interjects writing on that area of memory while the first thread had fetched already the first part of the word and then fetches the second part when the other thread has already modified the word.
thread 1 fetches first part of a XXXX
thread 1 fetches second part of a YYYY
thread 2 fetches first part of a XXXX
thread 1 increments double represented as XXXXYYYY that becomes ZZZZWWWW by adding b
thread 1 writes back in memory ZZZZ
thread 1 writes back in memory WWWW
thread 2 fetches second part of a that is now WWWW
thread 2 increments double represented as XXXXWWWW that becomes VVVVPPPP by adding b
thread 2 writes back in memory VVVV
thread 2 writes back in memory PPPP
For keeping it compact I used one character to represent 8 bits.
Now XXXXWWWW and VVVVPPPP are going to be representation of total different floating point values than the one you would have expected. That is because you ended up mixing two parts of two different binary representation (IEEE-754) of double variables.
Said that, I know that in certain ARM based architectures data access are not allowed (that would cause a trap to be generated), but I suspect that Intel processors do allow that instead.
Therefore, if your variable a is aligned, your result can be any of
a+1.23, a+2.34, a+1.23+2.34
if your variable might be mis-aligned (i.e. has got an address that is not a multiple of 8) your result can be any of
a+1.23, a+2.34, a+1.23+2.34 or a rubbish value
As a further note, please bear in mind that even if your environment alignof(double) == 8 that is not necessarily enough to conclude you are not going to have misalignment issues. All depends from where your particular variable comes from. Consider the following (or run it here):
#pragma push()
#pragma pack(1)
struct Packet
{
unsigned char val1;
unsigned char val2;
double val3;
unsigned char val4;
unsigned char val5;
};
#pragma pop()
int main()
{
static_assert(alignof(double) == 8);
double d;
add(d,1.23); // your a parameter is aligned
Packet p;
add(p.val3,1.23); // your a parameter is now NOT aligned
return 0;
}
Therefore asserting alignof() doesn't necessarily guarantee your variable is aligned. If your variable is not involved in any packing then you should be OK.
Please allow me just a disclaimer for whoever else is reading this answer: using std::atomic<double> in these situations is the best compromise in term of implementation effort and performance to achieve thread safety. There are CPUs architectures that have special efficient instructions for dealing with atomic variables without injecting heavy fences. That might end up satisfying your performance requirements already.

How can DWORD ever hold values more than 32 bits?

The NSFDbSpaceUsage function in the Lotus Notes C API is defined as:
STATUS LNPUBLIC NSFDbSpaceUsage(
DBHANDLE hDB,
DWORD far *retAllocatedBytes,
DWORD far *retFreeByes);
This function returns the number of bytes allocated and the number of bytes free in a specified database.
Reading SO and elsewhere, I understand that DWORD is associated with unsigned long, which is (usually) 32 bits. What puzzles me is how the above function will ever return the size of a Domino database that is more than 2^32 bytes in size. And in fact, my sample application never returns anything greater than 2,147,483,647 (2^31) for some of my larger databases. An NSF file in Domino can grow to 64 GB, so why would IBM use a DWORD to report the number of bytes allocated, when a DWORD can't represent more than 4,294,967,296 (2^32) bytes?
What am I missing?

I'm guessing this API method was created back when 4 GBs was larger than anyone could ever imagine ;)
According to the comments here, this method is limited to only 4 GBs and there is another method to use: NSFDbSpaceUsageScaled

Appending two string in x86 assembly

I'm currently working on an assignment in AT&T Assembly and now I have to append two strings:
message: .asciz "String 1"
before: .asciz "String 2"
I have really no idea how to do this or how to begin. I've already searched on internet but I couldn't find any helpful information. I think I have to manually copy the characters of the second string to the end of the first string but I'm not sure about that.
Could anyone please explain to me how to do this? :)

This question fails to mention the target memory, which makes it somewhat difficult to answer. I also don't know if you're in 16 bit, 32 bit or 64 bit. For convenience's sake, I'll also just assume they're C style 0-terminated strings.
Anyway, this seems to be the general procedure:
Get the length of the first string (instructions on writing an asm strlen can be found here: http://www.int80h.org/strlen/)
Set the ptr to the target memory
Copy the first string to the destination memory, using rep(e/ne) movsb with the size in ecx.
This can be CPU-optimized by using 'movsd' by first doing a shr ecx, 2 on your length to get it in batches of 4 bytes, and then doing the remainder with movsb. I've seen this done like this:
mov edi, dest
mov esi, string_address
mov ecx, string_length
mov eax, ecx
shr ecx, 2
repne movsd
mov cl, al
and cl, 3
repne movsb ; esi and edi move along the addresses as they copy, meaning they are already set correctly here
Get the length of the second string (be sure to back up your edi in stack or another register if needed; it contains the address you need to copy the next string to)
Copy the second string to the destination memory (as I said, the correct address should be in edi after the first string operation)
For safety, add a new 0 behind it.
If you're copying the second string to the end of the first string, you need one less copy operation, but you have to make sure there is actually enough space there to copy the second string without overwriting other vital stuff.

This is not a trivial matter. Strings are variable in length and occupy different spaces in memory, and there has to be some way to know how long they are or where they end. With C or C++, a nul bytes (bytes of zero value) indicates the end of the string. With some other
program languages, you have a pointer to the start of the string and the length of the string stored separately, which has the advantage of letting you store binary (including bytes of
zero value) in the string. Even with C and the rest, you have to have a pointer to where the string starts.
What generally has to happen is that you have to use asm to contact the operating system and request a block of memory that is currently free that is big enough to contain the contents of the two strings once they are attached. This would be memory separate from either of the
two strings to start with, and it comes from what is referred to as the Memory Heap, Once you are given the beginning point of that memory block, you copy the contents of the first
string into it, then you continue on while copying the contents of the second string in
there right behind the first. Then you release the memory that had been assigned to the
first string and reassign the block to that string by changing its pointer, and possibly
its length. The released memory is returned to the Memory Heap by the Operating System for reuse elsewhere.
Actually, the operating system is not the only source of freed up memory. Some compilers, even assemblers, either handle memory management on their own, or provide suitable tools
to the programmer to do it as the need arises.
In other words, this can be a very ambitious undertaking, and you have to know quite a
bit about what is going on to do it right. You do it wrong, you can expect consequences
like crashing your system and needing to reboot.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string