Understanding Assembly Code - Linux

Could anyone provide some insight into the following assembly code:
More Information:
The bootloader is in fact a small 16-bit bootloader that deciphers, using XOR decryption, a bigger one: a Linux bootloader located in sectors 3 to 34. (1 sector is 512 bytes on that disk.)
The whole thing is a protection system for an executable running on embedded Linux.
The version where the protection was removed has the Linux bootloader already deciphered (we were able to reverse it using IDA), so we assume that the XOR key must consist only of zeros in the version without the protection.
If we look at offsets 0x800 to 0x8FF in the version with the protection removed, they are not filled with zeros, so this cannot be the key; otherwise this version couldn't be loaded, as it would XOR plain data and load nothing but garbage.
Sectors 3-34 are ciphered in the original version and in the clear in our version (protection removed), but the MBR code (the small pre-bootloader) is identical in both versions.
So could it be that there is a small detail in the assembly code of the MBR that slightly changes the location of the XOR key?
This is simply an exercise I am doing to understand assembly-code loaders better, and I found this one quite a challenge to work through. Thank you for your input so far!

Well, reading assembly code is a lot of work. I'll just stick to answering where the "key" comes from.
The BIOS loads the MBR at 0000:7C00 and sets DL to the drive from which it was loaded.
First, your MBR sets up a stack growing down from 0000:7C00.
Then, it copies itself to 0000:0600-0000:07FF and does a far jump, which I assume goes to the next instruction, but in the copied version, at 0000:061D. (EDIT: this copy is a pretty standard thing for an MBR to do, typically to this address; it leaves the address space above free.) (EDIT: it copies itself fully, I misread.)
Then, it tries 5 times to read sector 2 into 0000:0700-0000:08FF, and halts with an error message if that fails (loc_78).
Then, it tries 5 times to read 32 sectors, starting from sector 3, into 2000:0-2000:3FFF (physical addresses 20000-23FFF; that's 16 kB starting at 128 kB).
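Each of those retry loops plausibly looks something like this 16-bit sketch of the sector-2 read (a reconstruction, not the original bytes; the register assignments are guesses, and loc_78 here is just a stand-in halt):

        xor     ax, ax
        mov     es, ax            ; ES:BX = 0000:0700, the destination
        mov     bx, 0x0700
        mov     bp, 5             ; retry counter
retry:  mov     ax, 0x0201        ; AH=02h: BIOS read sectors, AL=1 sector
        mov     cx, 0x0002        ; CH=cylinder 0, CL=sector 2
        mov     dh, 0             ; head 0; DL still holds the boot drive
        int     0x13
        jnc     done              ; CF clear = success
        dec     bp
        jnz     retry
loc_78: hlt                       ; stand-in for the error-message-and-halt path
        jmp     loc_78
done: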
If all goes well we are at loc_60 and we know what we have in memory.
The loop on loc_68 will use as destination the buffer holding those 32 sectors.
It will use as source the buffer starting at 0000:0800 (which is the second 256 bytes of the buffer we read into 0:0700-0:08ff, the second 256 bytes of sector 2 of the disk).
On each iteration, LODSW adds 2 to SI.
When SI reaches 840h, it is set back to 800h by the AND. Therefore, the "key" buffer goes from 800h to 83Fh, which is to say the 64 bytes starting at byte 256 of sector 2 of the disk.
I have the impression that the XORing falls one byte short... that CX should have been set to 4000h, not 3FFFh. I think this is a bug in the code.
After the 16kB buffer at physical address 20000 has been circularly XORed with that "key" buffer, we jump into it with a far jump to 2000:0.
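To make the dataflow concrete, the whole decipher-and-jump sequence plausibly has this shape (a reconstruction from the description above, not the original bytes; the register choices and the exact loop bound are guesses):

        mov     ax, 0x2000
        mov     es, ax            ; ES:DI -> 2000:0000, the 32 sectors read
        xor     di, di
        mov     si, 0x0800        ; DS:SI -> the 64-byte key at 0000:0800
        mov     cx, 0x2000        ; 0x2000 words = 16 kB
loc_68: lodsw                     ; AX = next key word, SI += 2
        and     si, 0x083F        ; 0x840 AND 0x083F = 0x800: wrap the key
        xor     [es:di], ax       ; decipher one word in place
        inc     di
        inc     di
        loop    loc_68
        jmp     0x2000:0x0000     ; far jump into the deciphered bootloader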
Right? Does it fit? Is the key there?

Related

Should we store long strings on the stack in assembly?

The general way to store strings in NASM is to use db, as in msg: db 'hello world', 0xA. I think this stores the string in the data section, so the string will occupy its storage for the duration of the entire program. Instead, if we store it on the stack, it will be alive only during the local frame. For small strings (less than 8 bytes), this can be done using mov dword [rsp], 'foo'. But for longer strings, the string has to be split and stored using multiple instructions, so this would increase the executable size (I thought so).
So now, which is better in large programs with multiple strings? Are any of the assumptions I made above wrong?
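For concreteness, the splitting I have in mind looks something like this:

sub     rsp, 16                  ; reserve stack space for the string
mov     dword [rsp],    'hell'   ; 7 bytes of code per 4 payload characters
mov     dword [rsp+4],  'o wo'
mov     dword [rsp+8],  'rld'    ; 'rld' = 72 6C 64 00: terminator included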
mov dword [rsp], 'foo' assembles to C7 04 24 66 6F 6F 00: it takes 7 bytes to encode 4 payload characters.
In comparison with the standard static way db 'foo',0, defining the string as an immediate operand in the code section increases the code size by 75%.
Your dynamic method may be profitable only if you can eliminate the .rodata or .data section entirely (which is seldom the case in large programs). Each additional section takes more space in the executable than its net contents because of its file alignment (in the PE format that is 512 bytes).
And even when your program has no static data in data sections besides long strings, you could save more space with a static declaration in the .text (code) section.
But for longer strings, the string has to be split and stored using multiple instructions. So this would increase the executable size (I thought so).
Yep, and in almost all cases, the number of bytes used by those instructions will exceed the number of bytes that would be needed to just store the string in memory normally. (The instruction includes all the bytes of the immediate, with a few exceptions like zero- and sign-extension, as well as additional bytes to encode the opcode, destination address, etc). And code, of course, also occupies (virtual) memory for the entire duration of the program. There's no free lunch.
As such, you should just assemble the strings directly into memory with db as you have done. But try to arrange for them to go in a read-only section such as .text or .rodata, depending on what your system supports. A modern operating system will then be able to discard those pages from physical memory if there is memory pressure, knowing that they can be reloaded from the executable if needed again. This is entirely transparent to your program. If there are many strings that will be needed at the same time, you can optimize a little by trying to arrange for them to be adjacent in your program's memory (defining them all together in one asm file should suffice), as in the sketch below.
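For instance (a sketch with made-up names):

section .rodata
err_open:  db 'cannot open file', 0
err_read:  db 'short read', 0
err_write: db 'write failed', 0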
You could also design something where at runtime, you read your strings from an external file into dynamically allocated memory, then free it when done with them. This was a common technique for programs on ancient operating systems with limited physical memory and no virtual memory support (think MS-DOS).
The string data has to be somewhere. Existing as immediates in your machine code takes space in the .text section of your program, normally linked into the same program segment as .rodata where you'd keep string literals. Running instructions to store strings to the stack means the data has to come into I-cache, then go out into D-cache.
But for long redundant strings, code to generate them in memory may be smaller than the string itself. This comes down to Kolmogorov complexity: the minimum size of a program that can output (or generate in an array) the given data. That program could be a gzip or zstd decompressor with some constant input data, which could be good for a very large string or set of strings, much larger than the decompression code. Or, for a specific case, have a look at code-golf questions like The alphabet in programming languages, where my answer is 9 bytes of x86 machine code (including a ret) which inefficiently stores 1 byte at a time, incrementing a pointer in a loop, to produce the 26-byte string (without a terminating 0). Slow but compact.
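The idea there, minus the golfing, is roughly this sketch (it assumes RDI points at a writable 26-byte buffer):

alphabet:
    mov     al, 'A'
.store:
    stosb                   ; [rdi] = AL, RDI++ (one byte at a time)
    inc     al
    cmp     al, 'Z'
    jbe     .store
    ret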
Pushing / Storing from immediates
For just 4 bytes (not including the 0 terminator) you'd use push 'foo' = 5 bytes = 80% efficiency. On x86-64, that's a qword push (sign-extending the imm32 to 64-bit), so we get 4 bytes of zeros for free.
After that, getting the pointer into a register with mov rdi, rsp (3 bytes) is 4 bytes smaller than lea rdi, [rel msg_foo] (7 bytes), so it's an overall win for total size (unless padding for function alignment bumps it up or hides it). But anyway, comparing against the best option instead of the worst might be a good idea for your answer.
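In NASM syntax, that pair is just (a sketch):

push    'foo'           ; 68 66 6F 6F 00: qword push of the sign-extended
                        ; imm32, leaving "foo" plus five 0 bytes on the stack
mov     rdi, rsp        ; 48 89 E7: RDI -> NUL-terminated "foo"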
Compilers will sometimes use immediate data to init a local struct or array that has to be on the stack anyway (i.e. they have to pass a non-const pointer to another function); their threshold for switching to copying from .rodata (with movdqa xmm load/store) is higher than 8 bytes. But when you just want to print the string, you just need to pass a pointer to .rodata without copying it to the stack at all, so the threshold is much lower for it to be worth it to use stack space. Maybe 8 bytes, maybe 16, maybe only 4, depending on I-cache vs. D-cache pressure in your program.
For short but not tiny strings
mov rcx, imm64 + push rcx is 10+1 = 11 total bytes for 8 bytes of payload = 73% efficiency, vs. 4/7 = 57%. (Storing at an offset from RSP would be even worse; to save code size you could use RBP+imm8 instead of RSP+imm8 addressing, but that's still 7 code bytes per 4 payload bytes. You could mix push sign_extended_imm32 with mov dword [rsp+4], imm32, but that's also bad.)
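As a sketch, that 11-byte combination looks like:

mov     rcx, 'abcdefgh' ; 48 B9 + imm64 = 10 bytes
push    rcx             ; 51: 1 byte; the 8 characters are now at [rsp]
mov     rsi, rsp        ; RSI -> the 8 bytes (note: no 0 terminator here)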
This does have overhead that scales with string size, so it quickly becomes more size-efficient to just copy from .rodata, e.g. with an XMM loop, call memcpy, or even rep movsb if you're optimizing for size.
Or of course just using the string in .rodata if possible, if you don't need to make a copy you can modify on the stack.
Shellcode is a common use-case for techniques like this. You need a single "flat" sequence of bytes, usually not containing any 00 bytes which would terminate a C string.
You can have some data past the end of the actual machine code part, and mov store some zeros into that and generate pointers to it, but that's somewhat cumbersome. And you need a position-independent way to get pointers into registers. call/pop works, if you jump forward and call backward so the rel32 doesn't involve any 00 bytes. Same for RIP-relative lea.
It's often just as easy to push a zeroed register, or an imm8 or imm32, and get some zeros into memory along with ASCII data that way. Generating it a bit at a time makes it easy to mov rsi, rsp or whatever, instead of needing position-independent addressing relative to RIP.
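For example, a classic way to materialize "/bin//sh" with no 00 bytes in the code itself (a sketch assuming x86-64 Linux shellcode conventions):

xor     esi, esi        ; 31 F6: RSI = 0, no 00 bytes in the encoding
push    rsi             ; 56: eight bytes of zeros, including the terminator
mov     rdi, '/bin//sh' ; 48 BF + 8 ASCII bytes: no NULs in the imm64
push    rdi             ; 57
mov     rdi, rsp        ; 48 89 E7: RDI -> "/bin//sh", 0-terminated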

x64 opcodes and scaled byte index

I think I'm getting the Mod R/M byte down but I'm still confused by the effective memory address/scaled indexing byte. I'm looking at these sites: http://www.sandpile.org/x86/opc_rm.htm, http://wiki.osdev.org/X86-64_Instruction_Encoding. Can someone encode an example with the destination address being in a register where the SIB is used? Say, for example, adding an 8-bit register to an address in an 8-bit register with a SIB used?
Also, when I use the ModR/M byte of 0x05, is that (*) relative to the current instruction pointer? Is it 32 or 64 bits when in 64-bit mode?
Is the SIB always used with a source or destination address?
A memory address is never in an 8-bit register, but here's an example of using SIB:
add byte [rax + rdx], 1
This is an instance of add rm8, imm8, 80 /0 ib. The /0 indicates that the r field in the ModR/M byte is zero. We must use a SIB here but don't need an immediate offset, so we can use 00b for the mod and 100b for the rm, forming 04h for the ModR/M byte (44h and 84h also work, but waste space encoding a zero offset). Looking in the SIB table now, there are two registers both with "scale 1", so the base and index are mostly interchangeable (rsp cannot be an index, but we're not using it here). So the SIB byte can be 10h or 02h.
Just putting the bytes in a row now:
80 04 10 01
; or
80 04 02 01
Also when I use the ModR/M byte of 0x05 is that (*) relative to the current instruction pointer? Is it 32 or 64 bits when in 64 bit mode?
Yes. You saw the note, I'm sure. So it can be either, depending on whether you used an address-size override or not. In every reasonable case, it will be rip + sdword. Using the other form gives you a truncated result; I can't immediately imagine any circumstances under which that makes sense (for general lea math, sure, but not for pointers). Probably (this is speculation, though) that possibility only exists to make the address-size override work reasonably uniformly.
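In NASM terms, the two encodings look like this (a sketch; the byte comments assume 64-bit mode, and bump/counter are made-up names):

default rel
section .bss
counter: resb 1
section .text
bump:
    inc byte [counter]       ; FE 05 <rel32>: rip + sdword
    inc byte [abs counter]   ; FE 04 25 <disp32>: 32-bit absolute via a SIB escape
    ret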
Is the SIB always used with a source or destination address?
Depends on what you mean. Certainly, if you have a SIB, it will encode a source or destination (because what else is there?) (you might argue that the SIB that can appear in nop rm encodes nothing because nop has neither sources nor destinations). If you mean "which one does it encode", it can be either one. Looking over all instructions, it can most often appear in a source operand. But obviously there are many cases where it can encode the destination - example: see above. If you mean "is it always used", well no, see that table that you were looking at.

Buffer size and chance of ASLR brute force

How does increasing buffer size increase the chances of ASLR brute force succeeding?
This is related to a project I recently did. We had an exploit_1.c program which had a buffer/character array (originally of size 517). The buffer was set to NOPs with memset. Then we were to place shellcode and the return address into the buffer, which was then written to a file called badfile. The program took one argument which was the return address. We had a stack.c program which, in a function called bof, copied the contents of the badfile into a buffer of size 12.
I had read that one method is to put a jump at the end of the NOP sled and have it redirect to the shellcode. However, the shellcode was 24 bytes, and at maximum I only had 16 bytes before the return address. So what I did was put the shellcode at the end of the buffer.
The 3rd task we were given was to pick return addresses such that the exploit program (modified for buffer size, of course) had a higher average chance of succeeding with buffer sizes 1000, 10000, and 100000. We used a bash while loop with a counter to count how many tries the ASLR brute force took.
So what I was thinking is that with more memory, the NOP sled will obviously be longer. But there's got to be something more to it than that.
The addresses I picked were:
0xbf87f030
0xbf82e3d0
0xbfe0fb60

how to boot this code?

I am a newbie to assembly and I program in C (using GCC on Linux).
Can anyone here tell me how to compile C code into assembly and boot it from a pen drive?
I use the command (in a Linux terminal):
gcc -S bootcode.c
which gives me a bootcode.s file.
What do I do with that?
I just want to compile the following code and run it directly from a USB stick:
#include <stdio.h>
void main()
{
    printf("hi");
}
Any help here?
First of all, you should be aware that when you are writing bootloader code, you are CREATING YOUR OWN ENVIRONMENT. That means there is no ready-made C library available to you or anything similar; there are ONLY BIOS SERVICES (interrupt routines).
Once you get this, you will probably figure out that the above code won't boot: at boot time nothing provides the stdio.h environment, so there is no printf for your compiled code to call (printf is a C library function declared in the stdio.h header).
So if you want to print a string, you need to write that function on YOUR OWN, either in a separate file whose object file you link in when creating the final binary, or in the same file; it is up to you. There could be other ways; I'm not well familiar with them, so do some research.
Another thing you should know: it is the BIOS that is responsible for loading this boot code (your above code, in your case) into memory location 0x7C00 (0x0000:0x7C00 in segment:offset representation), so you HAVE to state in your code that it runs at this memory location, either by
1- using the ORG directive
2- or by loading the appropriate segment registers (cs, ds, es)
Also, you should get yourself familiar with the segment:offset memory representation scheme; just google it or read the Intel manuals.
Finally, for the BIOS to load your code to 0x7C00, the boot code must not exceed 512 bytes (it lives ONLY IN THE FIRST SECTOR OF THE BOOTABLE MEDIA, since a sector is 512 bytes), and the BIOS must find the boot signature 0x55AA in the last two bytes of that first sector (bytes 510 and 511); otherwise the BIOS won't consider this code BOOTABLE.
Usually this is coded as :
ORG 0x7C00
; ... your boot code goes here; it will also have to load more code itself,
; since 512 bytes won't be sufficient ...
times 510 - ($ - $$) db 0x00 ; zero-fill up to 510 bytes
dw 0xAA55                    ; boot sector signature; dw stores it in
                             ; little-endian order, so the bytes are 55h AAh
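Putting the pieces together, a minimal NASM boot sector that prints "hi" (about the closest a boot sector gets to the C program above) might look like this sketch; assemble with nasm -f bin boot.asm -o boot.bin and write boot.bin to the first sector of the stick (e.g. dd if=boot.bin of=/dev/sdX, with /dev/sdX being your USB device; be careful with that command):

BITS 16
ORG 0x7C00

start:
    xor  ax, ax
    mov  ds, ax              ; DS = 0 so [msg] resolves with ORG 0x7C00
    mov  si, msg
print:
    lodsb                    ; AL = next character, SI++
    test al, al
    jz   hang                ; stop at the terminating 0
    mov  ah, 0x0E            ; BIOS teletype output (int 10h / AH=0Eh)
    int  0x10
    jmp  print
hang:
    hlt
    jmp  hang

msg: db 'hi', 0

times 510 - ($ - $$) db 0x00
dw 0xAA55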
Just to let you know, I'm not covering everything here, because if I did, I'd be writing pages about it. You need to look for more resources on the net; here is a link to start with (coding in assembly):
http://www.brokenthorn.com/Resources/OSDevIndex.html
That's all, hopefully this was helpful to you...^_^
Khilo - ALGERIA
Booting a computer is not that easy. A bootloader needs to be written. The bootloader must obey certain rules and work with the hardware, such as the ROM BIOS. You also need to disable interrupts, reserve some memory, etc. Look up MikeOS; it's a great project that can better help you understand the process.
Cheers

Purpose of ibs/obs/bs in dd

I have a script that creates a file system in a file on a Linux machine. I see that to create the file system, it uses dd with the bs=x option, reading from /dev/zero and writing to a file. I think specifying ibs/obs/bs is usually useful when reading from real hardware devices, as they have specific block-size constraints. In this case, however, as it is reading from a virtual device and writing to a file, I don't see any point in the bs=x option. Is my understanding wrong here?
(Just in case it helps, this file system is later used to boot a QEMU VM.)
To understand block sizes, you have to be familiar with tape drives. If you're not interested in tape drives - for example, you don't think you're ever going to use one - then you can go back to sleep now.
Remember the tape drives from films in the 60s, 70s, maybe even 80s? The ones where the reel went spinning around, and so on? Not your Exabyte or even QIC - quarter-inch cartridge - tapes; your good old fashioned reel-to-reel half-inch tape drives? On those, block size mattered.
The data on a tape was written in blocks. Each block was separated from the next by an inter-record gap.
----+-------+-----+-------+-----+----
... | block | IRG | block | IRG | ...
----+-------+-----+-------+-----+----
Depending on the tape drive hardware and software, there were a variety of problems that could happen. For example, if the tape was written with a block size of 5120 bytes and you read the tape with a block size of 512 bytes, then the tape drive might read the first block, return you 512 bytes of it, and then discard the remaining data; the next read would start on the next block. Conversely, if the tape was written with a block size of 512 bytes and you requested blocks of 5120 bytes, you would get short reads; each read would return just 512 bytes, and if your software wasn't paying attention, you'd be reading garbage.

There was also the issue that the tape drive had to get up to speed to read the block, and then slow down. The ASCII art suggests that the IRG was smaller than the data blocks; that was not necessarily the case. And it took time to read one block, overshoot the IRG, rewind backwards to get to the next block, and start forwards again. And if the tape drive didn't have the memory to buffer data - the cheaper ones did not - then you could seriously affect your tape drive performance.
War story: work prepared on a newer machine with a slightly more modern tape drive. I wrote a tape using tar without a sensible block size (so it defaulted to 512 bytes). It was a large bit of software - must have been, oh, less than 100 MB in total (a long time ago, in other words). The tape wrote nicely because the machine was modern enough, and it took just a few seconds to do so.

But I had to get the material off the tape on a machine with an older tape drive, one that did not have any on-board buffer. So, it read the material, 512 bytes at a time, and the reel rocked forward, reading one block, and then rocked back all but maybe half an inch, and then read forwards to get to the next block, and then rocked back, and ... well, you could see it doing this, and since it took appreciable portions of a second to read each 512-byte block, the total time taken was horrendous.

My flight was due to leave... and I needed to get that data across too. (It was long enough ago, and in a land far enough away, that last-minute changes to flights weren't much of an option either.) To cut a long story short, it did get read - but if I'd used a sensible block size (such as 5120 bytes instead of the default of 512), I would have been done much, much quicker and with much less danger of missing the plane (but I did actually catch the plane, with maybe 20 minutes to spare, despite a taxi ride across Paris in the rush hour).
With more modern tape drives, there was enough memory on the drive to do buffering and getting a tape drive to stream - write continuously without reversing - was feasible. It used to be that I'd use a block size like 256 KB to get QIC tapes to stream. I've not done much with tape drives recently - let's see, not this millennium and not much for a few years before that, either; certainly not much since CD and DVD became the software distribution mechanisms of choice (when electronic download wasn't used).
But the block size really did matter in the old days. And dd provided good support for it. You could even transfer data from a tape drive that was written with, say, 4 KB blocks to another that you wanted to write with, say, 16 KB blocks, by specifying the ibs (input block size) separately from the obs (output block size). Darned useful!
Also, the count parameter is in terms of the (input) block size. It was useful to say 'dd bs=1024 count=1024 if=/dev/zero of=/my/file/of/zeroes' to copy 1 MB of zeroes around. Or to copy 1 MB of a file.
The importance of dd is vastly diminished; it was an essential part of the armoury for anybody who worked with tape drives a decade or more ago.
The block size is the number of bytes that are read and written at a time. Presumably there is a count= option, and that is specified in units of the block size. If there is a skip= or seek= option, those will also be in block size units. However if you are reading and writing a regular file, and there are no disk errors, then the block size doesn't really matter as long as you can scale those parameters accordingly and they are still integers. However certain sizes may be more efficient than others.
For reading from /dev/zero, it doesn't matter. ibs/obs/bs specify how many bytes will be read at a time. It's helpful to choose a number based on the way bytes are read/written in the operating system. For instance, Linux usually reads from a hard drive in 4096 byte chunks. If you have at least some idea about how the underlying hardware reads/writes, it might be a good idea to specify ibs/obs/bs. By the way, if you specify bs, it will override whatever you specify for ibs and obs.
In addition to the great answer by Jonathan Leffler, keep in mind that the bs= option isn't always a substitute for using both ibs= and obs=, in particular for the old [ugly] days of tape drives.
The bs= option reserves the right for dd to write the data as soon as it's read. This can cause you to no longer have identically sized blocks on the output. Here is GNU's take on this, but the behavior dates back as far as I can remember (the '80s):
(bs=) Set both input and output block sizes to bytes. This makes dd read and write bytes per block, overriding any ‘ibs’ and ‘obs’ settings. In addition, if no data-transforming conv option is specified, input is copied to the output as soon as it’s read, even if it is smaller than the block size.
For instance, back in the QIC days on an old Sun system, if you did this:
tar cvf /dev/rst0c /bla
It would work, but cause an enormous amount of back-and-forth thrashing while the drive wrote a small block, then tried to back up and read to reposition itself properly for the next write.
If you swapped this with:
tar cvf - /bla | dd ibs=16K obs=16K of=/dev/rst0c
You'd get the QIC drive writing much larger chunks and not thrashing quite so much.
However, if you made the mistake of this:
tar cvf - /bla | dd bs=16K of=/dev/rst0c
You'd run the risk of having precisely the same thrashing you had before depending upon how much data was available at the time of each read.
Specifying both ibs= and obs= precludes this from happening.
