Appending two string in x86 assembly

Appending two string in x86 assembly - string

I'm currently working on an assignment in AT&T Assembly and now I have to append two strings:
message: .asciz "String 1"
before: .asciz "String 2"
I have really no idea how to do this or how to begin. I've already searched on internet but I couldn't find any helpful information. I think I have to manually copy the characters of the second string to the end of the first string but I'm not sure about that.
Could anyone please explain to me how to do this? :)

This question fails to mention the target memory, which makes it somewhat difficult to answer. I also don't know if you're in 16 bit, 32 bit or 64 bit. For convenience's sake, I'll also just assume they're C style 0-terminated strings.
Anyway, this seems to be the general procedure:
Get the length of the first string (instructions on writing an asm strlen can be found here: http://www.int80h.org/strlen/)
Set the ptr to the target memory
Copy the first string to the destination memory, using rep(e/ne) movsb with the size in ecx.
This can be CPU-optimized by using 'movsd' by first doing a shr ecx, 2 on your length to get it in batches of 4 bytes, and then doing the remainder with movsb. I've seen this done like this:
mov edi, dest
mov esi, string_address
mov ecx, string_length
mov eax, ecx
shr ecx, 2
repne movsd
mov cl, al
and cl, 3
repne movsb ; esi and edi move along the addresses as they copy, meaning they are already set correctly here
Get the length of the second string (be sure to back up your edi in stack or another register if needed; it contains the address you need to copy the next string to)
Copy the second string to the destination memory (as I said, the correct address should be in edi after the first string operation)
For safety, add a new 0 behind it.
If you're copying the second string to the end of the first string, you need one less copy operation, but you have to make sure there is actually enough space there to copy the second string without overwriting other vital stuff.

This is not a trivial matter. Strings are variable in length and occupy different spaces in memory, and there has to be some way to know how long they are or where they end. With C or C++, a nul bytes (bytes of zero value) indicates the end of the string. With some other
program languages, you have a pointer to the start of the string and the length of the string stored separately, which has the advantage of letting you store binary (including bytes of
zero value) in the string. Even with C and the rest, you have to have a pointer to where the string starts.
What generally has to happen is that you have to use asm to contact the operating system and request a block of memory that is currently free that is big enough to contain the contents of the two strings once they are attached. This would be memory separate from either of the
two strings to start with, and it comes from what is referred to as the Memory Heap, Once you are given the beginning point of that memory block, you copy the contents of the first
string into it, then you continue on while copying the contents of the second string in
there right behind the first. Then you release the memory that had been assigned to the
first string and reassign the block to that string by changing its pointer, and possibly
its length. The released memory is returned to the Memory Heap by the Operating System for reuse elsewhere.
Actually, the operating system is not the only source of freed up memory. Some compilers, even assemblers, either handle memory management on their own, or provide suitable tools
to the programmer to do it as the need arises.
In other words, this can be a very ambitious undertaking, and you have to know quite a
bit about what is going on to do it right. You do it wrong, you can expect consequences
like crashing your system and needing to reboot.

Related

brk segment overflow error in x86 assembly [duplicate]

When to use size directives in x86 seems a bit ambiguous. This x86 assembly guide says the following:
In general, the intended size of the of the data item at a given memory
address can be inferred from the assembly code instruction in which it is
referenced. For example, in all of the above instructions, the size of
the memory regions could be inferred from the size of the register
operand. When we were loading a 32-bit register, the assembler could
infer that the region of memory we were referring to was 4 bytes wide.
When we were storing the value of a one byte register to memory, the
assembler could infer that we wanted the address to refer to a single
byte in memory.
The examples they give are pretty trivial, such as mov'ing an immediate value into a register.
But what about more complex situations, such as the following:
mov QWORD PTR [rip+0x21b520], 0x1
In this case, isn't the QWORD PTR size directive redundant since, according to the above guide, it can be assumed that we want to move 8 bytes into the destination register due to the fact that RIP is 8 bytes? What are the definitive rules for size directives on the x86 architecture? I couldn't find an answer for this anywhere, thanks.
Update: As Ross pointed out, the destination in the above example isn't a register. Here's a more relevant example:
mov esi, DWORD PTR [rax*4+0x419260]
In this case, can't it be assumed that we want to move 4 bytes because ESI is 4 bytes, making the DWORD PTR directive redundant?

You're right; it is rather ambiguous. Assuming we're talking about Intel syntax, it is true that you can often get away with not using size directives. Any time the assembler can figure it out automatically, they are optional. For example, in the instruction
mov esi, DWORD PTR [rax*4+0x419260]
the DWORD PTR specifier is optional for exactly the reason you suppose: the assembler can figure out that it is to move a DWORD-sized value, since the value is being moved into a DWORD-sized register.
Similarly, in
mov rsi, QWORD PTR [rax*4+0x419260]
the QWORD PTR specifier is optional for the exact same reason.
But it is not always optional. Consider your first example:
mov QWORD PTR [rip+0x21b520], 0x1
Here, the QWORD PTR specifier is not optional. Without it, the assembler has no idea what size value you want to store starting at the address rip+0x21b520. Should 0x1 be stored as a BYTE? Extended to a WORD? A DWORD? A QWORD? Some assemblers might guess, but you can't be assured of the correct result without explicitly specifying what you want.
In other words, when the value is in a register operand, the size specifier is optional because the assembler can figure out the size based on the size of the register. However, if you're dealing with an immediate value or a memory operand, the size specifier is probably required to ensure you get the results you want.
Personally, I prefer to always include the size when I write code. It's a couple of characters more typing, but it forces me to think about it and state explicitly what I want. If I screw up and code a mismatch, then the assembler will scream loudly at me, which has caught bugs more than once. I also think having it there enhances readability. So here I agree with old_timer, even though his perspective appears to be somewhat unpopular.
Disassemblers also tend to be verbose in their outputs, including the size specifiers even when they are optional. Hans Passant theorized in the comments this was to preserve backwards-compatibility with old-school assemblers that always needed these, but I'm not sure that's true. It might be part of it, but in my experience, disassemblers tend to be wordy in lots of different ways, and I think this is just to make it easier to analyze code with which you are unfamiliar.
Note that AT&T syntax uses a slightly different tact. Rather than writing the size as a prefix to the operand, it adds a suffix to the instruction mnemonic: b for byte, w for word, l for dword, and q for qword. So, the three previous examples become:
movl 0x419260(,%rax,4), %esi
movq 0x419260(,%rax,4), %rsi
movq $0x1, 0x21b520(%rip)
Again, on the first two instructions, the l and q prefixes are optional, because the assembler can deduce the appropriate size. On the last instruction, just like in Intel syntax, the prefix is non-optional. So, the same thing in AT&T syntax as Intel syntax, just a different format for the size specifiers.

RIP, or any other register in the address is only relevant to the addressing mode, not the width of data transfered. The memory reference [rip+0x21b520] could be used with a 1, 2, 4, or 8-byte access, and the constant value 0x01 could also be 1 to 8 bytes (0x01 is the same as 0x00000001 etc.) So in this case, the operand size has to be explicitly mentioned.
With a register as the source or destination, the operand size would be implicit: if, say, EAX is used, the data is 32 bits or 4 bytes:
mov [rip+0x21b520],eax
And of course, in the awfully beautiful AT&T syntax, the operand size is marked as a suffix to the instruction mnemonic (the l here).
movl $1, 0x21b520(%rip)

it gets worse than that, an assembly language is defined by the assembler, the program that reads/interprets/parses it. And x86 in particular but as a general rule there is no technical reason for any two assemblers for the same target to have the same assembly language, they tend to be similar, but dont have to be.
You have fallen into a couple of traps, first off the specific syntax used for the assembler you are using with respect to the size directive, then second, is there a default. My recommendation is ALWAYS use the size directive (or if there is a unique instruction mnemonic), then you never have to worry about it right?

declare mutable string with varying size dynamically

I made my own string declarator with macro in GNU Assembler in x64 machine.
.macro declareString identifier, value
.pushsection .data
\identifier: .ascii "\value"
"lengthof.\identifier"= . - \identifier
.popsection
.endm
This is how to use it:
.data
ANOTHER_VARIABEL: .ascii "why"
.text
_start:
declareString myString, "good"
lea rax, myString
mov rbx, lengthof.myString
So basically my macro is just label maker.
So in debugger, rax will be value of myString address because myString basically just a label.
And rbx will be value of myString length (4).
So If I want change myString from good which has 4 chars to rocker which has 6 chars at runtime, I'm afraid it will overwrite another variabel due to size collision.
I heard about heap memory but I'm not sure how does it work. I only know about stack, but I'm used to use stack as temporary backup.
So how do I declare mutable string in assembly?

Same as in C; if you want to use static storage (in .data) for the characters themselves (like static char myString[4] = "good"; note not including a terminating 0 byte), you actually need to reserve enough space for the largest you ever want this string to be. Like static char myString[128] = "good".
In your case, like .space 124 after your .ascii would be a total of 128 bytes reserved after the label myString.
You're correct that you shouldn't write past the size you reserve, although it would come after ANOTHER_VARIABEL, not before, since that part of your .data section is earlier in your source than where you use the macro containing .pushsection
If you wanted to do dynamic allocation, you might want to make an mmap(..., MAP_PRIVATE|MAP_ANONYMOUS, PROT_READ|PROT_WRITE, ...) system call to allocate 1 or more pages of memory, then copy bytes into it. (You're writing a _start so I'm assuming you're not linking libc.)
You could make your global variable a pointer, but if you free the old pointer and allocate a new one (or mremap to the new size), you should start with runtime allocation and init of it. You don't want to munmap a page of .rodata.
Or instead of mmap/munmap, Linux mremap(MREMAP_MAYMOVE) can ensure that the allocation is big enough to hold your whole string without copying it. MAYMOVE lets it pick a new virtual address if there aren't enough free virtual pages following the currently pointed-to page, but without copying the actual data in the already-allocated pages.
There are tons of Q&As about making system calls in assembly so I'm not going into details about that; you can think about how you're managing memory in terms of system calls to make, separately from implementing it in asm. Thinking about it in C is pretty much equally valid; asm doesn't add much in terms of being able to declare static data or make system calls. Other than the fact that you could in asm make sure that a page of .data was exclusively dedicated to this string, with no other things using it, so you could potentially let mremap unmap that page from your .data and map it at a different virtual address.

Should we store long strings on stack in assembly?

The general way to store strings in NASM is to use db like in msg: db 'hello world', 0xA. I think this stores the string in the bss section. So the string will occupy the storage for the duration of the entire program. Instead, if we store it in the stack, it will be alive only during the local frame. For small strings (less than 8 bytes), this can be done using mov dword [rsp], 'foo'. But for longer strings, the string has to be split and be stored using multiple instructions. So this would increase the executable size (I thought so).
So now, which is better in large programs with multiple strings? Are any of the assumptions I made above, wrong?

mov dword [rsp] 'foo' assembles to C70424666F6F00, it takes 7 bytes to encode 4 payload characters.
In comparison with standard static way DB 'foo',0 the definition of string as immediate operand in code section increases the code size by 75 %.
Your dynamic method may be profitable only if you could eliminate the .rodata or .data section entirely (which is seldom the case of large programs). Each additional section takes more space in executable program than its netto contents because of its file-alignment (in PE format it is 512 bytes).
And even when your program has no other static data in data sections beside long strings, you could spare more space with static declaration in .text (code) section.

But for longer strings string has to be split and be stored using multiple instructions. So this would increase the executable size (I thought so).
Yep, and in almost all cases, the number of bytes used by those instructions will exceed the number of bytes that would be needed to just store the string in memory normally. (The instruction includes all the bytes of the immediate, with a few exceptions like zero- and sign-extension, as well as additional bytes to encode the opcode, destination address, etc). And code, of course, also occupies (virtual) memory for the entire duration of the program. There's no free lunch.
As such, you should just assemble the strings directly into memory with db as you have done. But try to arrange for them to go in a read-only section such as .text or .rodata depending on what your system supports. A modern operating system will then be able to discard those pages from physical memory if there is memory pressure, knowing that they can be reloaded from the executable if needed again. This is entirely transparent to your program. If there are many strings that will be needed at the same times, you can optimize a little by trying to arrange for them to be adjacent in your program's memory (defining them all together in one asm file should suffice).
You could also design something where at runtime, you read your strings from an external file into dynamically allocated memory, then free it when done with them. This was a common technique for programs on ancient operating systems with limited physical memory and no virtual memory support (think MS-DOS).

The string data has to be somewhere. Existing as immediates in your machine code takes space in the .text section of your program, normally linked into the same program segment as .rodata where you'd keep string literals. Running instructions to store strings to the stack means the data has to come into I-cache, then go out into D-cache.
But for long redundant strings, code to generate them in memory may be smaller than the string itself. This comes down to the Kolmogorov complexity; minimum size of a program that can output (or generate in an array) the given data. That program could be a gzip or zstd decompressor with some input constant data, could be good for a very large string or set of strings, much larger than the decompression code. Or for a specific case, have a look at code-golf questions like The alphabet in programming languages where my answer is 9 bytes of x86 machine code (including a ret) which inefficiently stores 1 byte at a time, incrementing a pointer in a loop, to produce the 26-byte string (without a terminating 0). Slow but compact.
Pushing / Storing from immediates
For just 4 bytes (not including the 0 terminator) you'd use push 'foo' = 5 bytes = 80% efficiency. On x86-64, that's a qword push (sign-extending the imm32 to 64-bit), so we get 4 bytes of zeros for free.
After that, getting the pointer into a register with mov rdi, rsp (3 bytes) is 4 bytes smaller than lea rdi, [rel msg_foo] (7 bytes), so it's an overall win for total size (unless padding for function alignment bumps it up or hides it). But anyway, comparing against the best option instead of the worst might be a good idea for your answer.
Compilers will sometimes use immediate data to init a local struct or array that has to be on the stack anyway (i.e. they have to pass a non-const pointer to another function); their threshold for switching to copying from .rodata (with movdqa xmm load/store) is higher than 8 bytes. But when you just want to print the string, you just need to pass a pointer to .rodata without copying it to the stack at all, so the threshold is much lower for it to be worth it to use stack space. Maybe 8 bytes, maybe 16, maybe only 4, depending on I-cache vs. D-cache pressure in your program.
For short but not tiny strings
mov rcx, imm64 + push rcx is 10+1 = 11 total bytes for 8 bytes of payload = 73% efficiency, vs. 4/7 = 57%. (At an offset from RSP it would be even worse, but to save code size you could use RBP+imm8 instead of RSP+imm8, but that's still 4 bytes per 7. You could mix push sign_extended_imm32 with mov dword [rsp+4], imm32 but that's also bad.)
This does have overhead that scales with string size, so it quickly becomes more size-efficient to just copy from .rodata, e.g. with an XMM loop, call memcpy, or even rep movsb if you're optimizing for size.
Or of course just using the string in .rodata if possible, if you don't need to make a copy you can modify on the stack.
Shellcode is a common use-case for techniques like this. You need a single "flat" sequence of bytes, usually not containing any 00 bytes which would terminate a C string.
You can have some data past the end of the actual machine code part, and mov store some zeros into that and generate pointers to it, but that's somewhat cumbersome. And you need a position-independent way to get pointers into registers. call/pop works, if you jump forward and call backward so the rel32 doesn't involve any 00 bytes. Same for RIP-relative lea.
It's often just as easy to push a zeroed register, or an imm8 or imm32, and get some zeros into memory along with ASCII data that way. Generating it a bit at a time makes it easy to mov rsi, rsp or whatever, instead of needing position-independent addressing relative to RIP.

Hardware VGA Text Mode IO in old dos assembly Issue

After reading about at least the first 3 or 4 chapters of about 4 different books on assembly programming I got to a stage where I can put "Hello World" on a dosbox console using MASM 6.11. Imagine my delight!!
The first version of my program used DOS Function 13h.
The second version of my program used BIOS Function 10h
I now want to do the third version using direct hardware output. I have read the parts of the books that explain the screen is divided into 80x25 on a VGA monitors (not bothered about detecting CGA and all that so my program uses memory address 0B800h for colour VGA, because DOSBox is great and all, and my desire to move to Win Assembler sometime before im 90 years old). I have read that each character on the hardware screen is 2 bytes (1 for the attribute and one for the character itself, therefore you have 80x25x2=4000 bytes). The odd bytes describe the attribute, and the even bytes the ASCII character.
But my problem is this. No matter how I try, I cant get my program to output a simple black and white (which is just the attribute, I assume I can change this reasonably easily) string (which is just an array of bytes) 5 lines from the top of the screen, and 20 characters in from the left edge (which is just the number of blank characters away from a zero based index with 4000 bytes long). (if my calc is correct that is 5x80=400+20=420x2=840 is the starting position of my string within the array of 4000 bytes)
How do I separate the attribute from the character (I got it to work partially but it only shows every second character, then a bunch of random junk (thats how I figured I need some sort of byte pair for the attribute and text), or how do I set it up such that both are recognised together. How do I control he position of the text on the screen once the calcs are made. Where am I going wrong.
I have tried looking around the web for this seemingly simple question but am unable to find a solution. Is there anyone who used to program in DOS and x86 Assembly that can tell me how to do this easy little program by not using BIOS or DOS functions, just with the hardware.
I would really appreicate a simple code snippet if possible. Or a refrence to some site or free e-book. I dont want buying a big book on dos console programming which will end up useless when I move to windows shortly. The only reason I am focused on this is because I want to learn true assembly, not some macro language or some pretensious high level language that claims to be assembly.
I am trying to build a library of routines that will make Assembly easier to learn so people dont have to work though all the 3 to 6 chapters across 10 books of theory esentially explaining again and again the same stuff when really all that is needed is enough to know how to get some output, assign values to variables, get some input, and do some loops and decisions. The theory can come along later, and by the time they get to loops and decisions most people will have done enough assembler to have all the theory anyway. I beleive assembly should be taught no different than any other language starting with a simple hello world program then getting input ect. I want to make this possible. But hey, I'm just a beginner, maybe my taughts will change when I learn more.
One other note, I know for a fact the problem is NOT with DOSBox as I have a very old PC running true MS-DOS V6.2 and the program still doesnt work (but gives almost identical output). In fact, DOSBox actually runs some of my old programs even better than True dos. Gem desktop being one example. Just wanted to get that cleared before people try suggesting its a problem with the emulator. It cant be, not with such simple programs. No im afraid the problem is with my little brain not fully understanding what is needed.
Can anyone out there please help!!
Below is the program I used (MASM 6.1 Under DOSBox on Win 7 64-bit). It uses BIOS Intrrupt 10h Function 13h sub function 0. I want to do the very same using direct hardware IO.
.model small
.stack
.data ;part of the program containing data
;Constants - None
;Variables
MyMsg db 'Hello World'
.code
Main:
GetAddress:
mov ax,#data ;Gets address of data segment
mov es,ax ;Loads segment address into es regrister
mov bp,OFFSET MyMsg ;Load Offset into DX
SetAttributes:
mov bl,01001111b ;BG/FG Colour attributes
mov cx,11 ;Length of string in data segment
SetRowAndCol:
mov dh,24 ;Set the row to start printing at
mov dl,68 ;Set the column to start printing at
GetFunctionAndSub:
mov ah,13h ;BIOS Function 10h - String Output
mov al,0 ;BIOS Sub-Function (0-3)
Execute:
int 10h ;BIOS Interrupt 10h
EndProg:
mov ax,4c00h ;Terminate program return 0 to OS
int 21h ;DOS Interrupt 21h
end Main
end
I want to have this in a format that is easy to explain. So here is my current workings. I've almost got it. But it only prints the attributes, getting the characters on screen is a problem. (Ocasionally when I modify it slightly, I get every second character with random attributes (I think I know the technicalities of why, but dont know enough assembler to fix it)).
.model small
.stack
.data
;Constants
ScreenSeg equ 0B800h
;Variables
MyMsg db 'Hello World'
StrLen equ $-MyMsg
.code
Main:
SetSeg:
mov ax, ScreenSeg ;set segment register:
mov ds, ax
InitializeStringLoop: ;Display all characters: - Not working :( Y!
mov cx, StrLen ;number of characters.
mov di, 00h ;start from byte 'h'
OutputString:
mov [di], offset byte ptr MyMsg[di]
add di, 2 ;skip over next attribute code in vga memory.
loop OutputString
InitializeAttributeLoop:;Color all characters: - Atributes are working fine.
mov cx, StrLen ;number of characters.
mov di, 01h ;start from byte after 'h'
;Assuming I have all chars with same attributes - fine for now - later I would make this
;into a procedure that I will just pass the details into. - But for now I just want a
;basic output tutorial.
OutputAttributes:
mov [di], 11101100b ;light red(1100) on yellow(1110)
add di, 2 ;skip over next ascii code in vga memory.
loop OutputAttributes
EndPrg:
mov ax, 4C00h
int 21h
end Main
Of course I want to reduce the instructions used to the bare bones essentials. (for proper tuition purposes, less to cover when teaching others). Hense the reason I did not use MOVSB/W/D ect with REP. I opted instead for an easy to explain manual loop using standard MOV, INC, ADD ect. These are instructions that are basic enough and easy to explain to newcommers. So if possible I would like to keep it as close to this as possible.
I know esentially all that seems to be wrong is the loop for the actual string handler. Its not letting me increment the address the way I want it to. Its embarasssing to me cause I am actually quite a good progammer using C++, C#, VB, and Delphi (way back when)). I know you wouldnt think that given I cant even get a loop right in assembler, but it is such a different language. There are 2 or 3 loops in high level languages, and it seems there are an infinate combination of ways to do loops in assembler depending on the instructions. So I say "Simple Loop", but in reality there is little simple about it.
I hope someone can help me with this, you would be saving my assembly carreer, and ensuring I eventually become a good assembly teacher. Thanks in advance, and especially for reading this far.

The typical convention would be to use ds:si as source, and es:di as destination.
So it would end up being similar to (untested):
mov ax, #data
mov ds, ax
mov ax, ScreenSeg
mov es, ax
...
mov si, offset MyMsg
OutputString:
mov al, byte ptr ds:[si]
mov byte ptr es:[di], al
add si, 1 ; next character from string
add di, 2 ; skip over next attribute code in vga memory.
loop OutputString

I would suggest getting the Masm32 Package if you don't already have it. It is mainly geared towards easily using Assembly Language in "Windows today", which is very nice, but also will teach you a lot about Asm and also shows where to get the Intel Chip manuals that were mentioned in an earlier reply that are indispensable.
I started programming in the 80's and so I can see why you are so interested in the nitty gritty of it, I know I miss it. If more people were to start out there, it would pay off for them very much. You are doing a great service!
I am playing with exactly what you are talking about, Direct Hardware, and I have also learned that Windows has changed some of the DOS services and BIOS services have changed too, so that some don't work any more. I am in fact writing a small .com program and running it from Win7 in a Command Prompt Window, Prints a msg and waits for a key, Pretty cool considering it's Win7 in 2012!
In fact it was BIOS 10h - 0Eh that did not work and so I tried Dos 21h 02h to write to the screen and it worked. The code is below because it is a .com (Command Program) i thought it might be of use to you.
; This makes a .com program (64k Limit, Code, Data and all
; have to fit in this space. Used for small utilities and
; good for very fast tasks. In fact DOS Commands are mostly
; small .com programs like this (except more useful)!
;
; Assemble with Masm using
; c:\masm32\bin\ml /AT /c bfc.asm
; Link with Masm's Link16 using
; c:\masm32\bin\link16 bfc.obj,bfc.com;
;
; Link16 is the key to making this 16bit .com (Command) file
SEGMT SEGMENT
org 100h
Start:
push CS
pop DS
MOV SI, OFFSET Message
Next:
MOV ah, 02h ; Write Char to Standard out
MOV dl, [si] ; Char
INT 21h ; Write it
INC si ; Next Char
CMP byte ptr[si], 0 ; Done?
JNE Next ; Nope
WaitKey:
XOR ah, ah ; 0
INT 16h ; Wait for any Key
ExitHere:
MOV ah, 4Ch ; Exit with Return Code
xor al, al ; Return Code
INT 21h
Message db "It Works in Windows 7!", 0
SEGMT ENDS
END Start

I used to do all of what you are talking about. Trying to remember the details. Michael Abrash is a name you should be googling for. Mode-X for example a 200 something by 200 (240x200?) 256 color mode was very popular as it broke the 16 color boundary and at the time the games looked really good.
I think that the on the metal register programming was doable but painful and you likely need to get the programmers reference/datasheet for the chip you are interested in. As time passed from CGA to EGA to VGA to VESA the way things worked changed as well. I think typically you use int something calls to get access to the page frame then you could fill that in directly. VESA I think worked that way, VESA was a big livesaver when it came to video card support, you used to have to write your own drivers for each chip before then (if you didnt want the ugly standard modes).
I would look at mode-x or vesa and go from there. You need to have a bit of a hacker inside to get through some of this anyway, it is very rare to find a datasheet/programmers reference manual that is complete and accurate, you always have to just shove some bytes around to see what happens. Start filling those memory blocks that are supposed to be the page frames until you see something change on the screen...
I dont see any specific graphics programming books in my library other than the abrash books like the graphics programming black book, which was at the tail end of this period of time. I have bios and dos programmers references and used ralf browns list too. I am sure that I had copies of the manuals for popular video chips (before the internet remember you called a human on that phone thing with a cord hanging out of it, the human took a printed manual, sometimes nicely bound sometimes just a staple in the corner if that, put it in an envelope and mailed it to you and that was your only copy unless you ran it through the copier). I have stacks of printed stuff that, sorry to say, am not going to go through to answer this question. I will keep this question in my mind though and look around some more for info, actually I may have some of my old programs handy, drawing fractals and other such things (direct as practical to the video card/memory).
EDIT.
I know you are looking for text mode stuff, and this is a graphics mode but it may or may not shed some light on what you are trying to do. combination of int calls and filling pages and palette memory directly.
http://dwelch.s3.amazonaws.com/fly.tar.gz

Trying to understand gcc's complicated stack-alignment at the top of main that copies the return address

hi I have disassembled some programs (linux) I wrote to understand better how it works, and I noticed that the main function always begins with:
lea ecx,[esp+0x4] ; I assume this is for getting the adress of the first argument of the main...why ?
and esp,0xfffffff0 ; ??? is the compiler trying to align the stack pointer on 16 bytes ???
push DWORD PTR [ecx-0x4] ; I understand the assembler is pushing the return adress....why ?
push ebp
mov ebp,esp
push ecx ;why is ecx pushed too ??
so my question is: why all this work is done ??
I only understand the use of:
push ebp
mov ebp,esp
the rest seems useless to me...

I've had a go at it:
;# As you have already noticed, the compiler wants to align the stack
;# pointer on a 16 byte boundary before it pushes anything. That's
;# because certain instructions' memory access needs to be aligned
;# that way.
;# So in order to first save the original offset of esp (+4), it
;# executes the first instruction:
lea ecx,[esp+0x4]
;# Now alignment can happen. Without the previous insn the next one
;# would have made the original esp unrecoverable:
and esp,0xfffffff0
;# Next it pushes the return addresss and creates a stack frame. I
;# assume it now wants to make the stack look like a normal
;# subroutine call:
push DWORD PTR [ecx-0x4]
push ebp
mov ebp,esp
;# Remember that ecx is still the only value that can restore the
;# original esp. Since ecx may be garbled by any subroutine calls,
;# it has to save it somewhere:
push ecx

This is done to keep the stack aligned to a 16-byte boundary. Some instructions require certain data types to be aligned on as much as a 16-byte boundary. In order to meet this requirement, GCC makes sure that the stack is initially 16-byte aligned, and allocates stack space in multiples of 16 bytes. This can be controlled using the option -mpreferred-stack-boundary=num. If you use -mpreferred-stack-boundary=2 (for a 22=4-byte alignment), this alignment code will not be generated because the stack is always at least 4-byte aligned. However you could then have trouble if your program uses any data types that require stronger alignment.
According to the gcc manual:
On Pentium and PentiumPro, double and long double values should be aligned to an 8 byte boundary (see -malign-double) or suffer significant run time performance penalties. On Pentium III, the Streaming SIMD Extension (SSE) data type __m128 may not work properly if it is not 16 byte aligned.
To ensure proper alignment of this values on the stack, the stack boundary must be as aligned as that required by any value stored on the stack. Further, every function must be generated such that it keeps the stack aligned. Thus calling a function compiled with a higher preferred stack boundary from a function compiled with a lower preferred stack boundary will most likely misalign the stack. It is recommended that libraries that use callbacks always use the default setting.
This extra alignment does consume extra stack space, and generally increases code size. Code that is sensitive to stack space usage, such as embedded systems and operating system kernels, may want to reduce the preferred alignment to -mpreferred-stack-boundary=2.
The lea loads the original stack pointer (from before the call to main) into ecx, since the stack pointer is about to modified. This is used for two purposes:
to access the arguments to the main function, since they are relative to the original stack pointer
to restore the stack pointer to its original value when returning from main

lea ecx,[esp+0x4] ; I assume this is for getting the adress of the first argument of the main...why ?
and esp,0xfffffff0 ; ??? is the compiler trying to align the stack pointer on 16 bytes ???
push DWORD PTR [ecx-0x4] ; I understand the assembler is pushing the return adress....why ?
push ebp
mov ebp,esp
push ecx ;why is ecx pushed too ??
Even if every instruction worked perfectly with no speed penalty despite arbitrarily aligned operands, alignment would still increase performance. Imagine a loop referencing a 16-byte quantity that just overlaps two cache lines. Now, to load that little wchar into the cache, two entire cache lines have to be evicted, and what if you need them in the same loop? The cache is so tremendously faster than RAM that cache performance is always critical.
Also, there usually is a speed penalty to shift misaligned operands into the registers.
Given that the stack is being realigned, we naturally have to save the old alignment in order to traverse stack frames for parameters and returning.
ecx is a temporary register so it has to be saved. Also, depending on optimization level, some of the frame linkage ops that don't seem strictly necessary to run the program might well be important in order to set up a trace-ready chain of frames.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string