How does increasing buffer size increase the chances of ASLR brute force succeeding?
This is related to a project I recently did. We had an exploit_1.c program which had a buffer/character array (originally of size 517). The buffer was set to NOPs with memset. Then we were to place shellcode and the return address into the buffer, which was then written to a file called badfile. The program took one argument which was the return address. We had a stack.c program which, in a function called bof, copied the contents of the badfile into a buffer of size 12.
I had read that one method is to put a jump at the end of the NOP sled and have it redirect to the shellcode. However, the shellcode was 24 bytes, and at maximum I only had 16 bytes before the return address. So what I did is I put the shellcode at the end of the buffer.
The 3rd task we were given is to pick return addresses such that the exploit program (modified for buffer size of course) had a higher average chance of succeeding with buffer sizes 1000, 10000, and 100000. We used a bash while loop with a counter to count how many tries the ASLR brute force took.
So what I was thinking is that with more memory obviously the NOP sled will be longer. But there's got to be something more than that.
The addresses I picked were:
0xbf87f030
0xbf82e3d0
0xbfe0fb60
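The NOP-sled intuition above can be put into rough numbers. The sketch below assumes the stack base is randomized uniformly over a fixed range and that any guess landing inside the sled succeeds; the 8 MiB range and the helper names are illustrative assumptions, not values taken from the project.

```python
# Assumption: the stack base is randomized uniformly over `random_range` bytes.
# 8 MiB is only an illustrative figure, not a measured value for this setup.
ASSUMED_RANDOM_RANGE = 8 * 1024 * 1024


def hit_probability(buffer_size, shellcode_len=24, random_range=ASSUMED_RANDOM_RANGE):
    """Chance that one fixed return-address guess lands inside the NOP sled."""
    # Ignores the few bytes in front of the overwritten return address.
    sled_len = buffer_size - shellcode_len
    return min(1.0, sled_len / random_range)


def expected_attempts(buffer_size, **kwargs):
    """Mean number of runs if every run is re-randomized independently."""
    return 1.0 / hit_probability(buffer_size, **kwargs)


if __name__ == "__main__":
    for size in (517, 1000, 10000, 100000):
        print(size, round(expected_attempts(size)))
```

Under these assumptions the expected number of tries shrinks roughly in proportion to the sled length, which is the main effect of a larger buffer.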
I'm confused by the implementation of dcache_inval_poc(start, end) here: https://github.com/torvalds/linux/blob/v5.15/arch/arm64/mm/cache.S#L134. There is no sanity check on the "end" address. What happens if the range (start, end) passed down from an upper layer, such as dma_sync_single_for_cpu/dma_sync_single_for_device, is larger than the L1 data cache? E.g. dcache_inval_poc(start, start+256KB) when the L1 D-cache size is 32KB.
After going through the source code of dcache_inval_poc(start, end) https://github.com/torvalds/linux/blob/v5.15/arch/arm64/mm/cache.S#L152 , I tried to convert the loop to C-style pseudo-code as follows:
    x0_kaddr = start;
    while (x0_kaddr < end) {
        dc_civac(x0_kaddr);           /* maintenance op on one line, by address */
        x0_kaddr += cache_line_size;  /* step to the next cache line */
    }
If "end - start" > L1 D-cache size, the loop still runs, even though most of the addresses it visits cannot be in the D-cache any more.
Your confusion comes from thinking of cache lines as somehow mapped on top of a memory range. But the function invalidates a range by virtual address, in terms of the available mapped memory.
As long as the start and end parameters are valid virtual addresses of general memory, that's fine.
The memory range does not have to be cached as a whole; only some data from the given range might be cached, or none at all.
Say there is a 2MB buffer in physical DDR memory that's mapped and can be accessed through virtual addresses.
Say L1 is 32KB.
Then up to 32KB of the 2MB buffer might be cached (or none at all). You don't know which part, if any, is in the cache.
For that reason you run a loop over the virtual addresses of your 2MB buffer. If a cache_line_size block of data is in the cache, that cache line is invalidated. If the data is not in the cache and only in DDR memory, the operation is basically a nop.
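As a toy model of the loop described above (not the kernel code), the cache can be treated as a set of line-aligned addresses that happen to be resident. The function name, the 64-byte line size and the example addresses are assumptions made for illustration.

```python
LINE_SIZE = 64  # assumed cache line size in bytes


def invalidate_range(cache, start, end, line_size=LINE_SIZE):
    """Walk [start, end) one line at a time and drop any line that is cached.

    `cache` is a set of line-aligned addresses currently resident.
    Returns how many lines were actually invalidated.
    """
    dropped = 0
    addr = start & ~(line_size - 1)   # begin at the line containing `start`
    while addr < end:
        if addr in cache:             # line is resident: invalidate it
            cache.remove(addr)
            dropped += 1
        # line not resident: nothing to do, effectively a nop
        addr += line_size
    return dropped


if __name__ == "__main__":
    resident = {0x100000, 0x100040, 0x500000}
    n = invalidate_range(resident, 0x100000, 0x100000 + 2 * 1024 * 1024)
    print(n, sorted(hex(a) for a in resident))
```

The range can be far larger than the cache; the loop just visits every line-sized address and only the resident ones are affected.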
It's good practice to provide start and end addresses aligned to cache_line_size, because the memory controller clips the lower bits and you might miss cleaning some data in the buffer tail.
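A minimal sketch of the alignment advice, assuming a power-of-two line size (64 bytes here is only an example): round the start down and the end up before handing the range to the maintenance routine. The helper name is hypothetical.

```python
def align_range(start, end, line_size=64):
    """Round `start` down and `end` up to cache-line boundaries.

    `line_size` must be a power of two.
    """
    mask = line_size - 1
    aligned_start = start & ~mask
    aligned_end = (end + mask) & ~mask
    return aligned_start, aligned_end


if __name__ == "__main__":
    s, e = align_range(0x1010, 0x1090)
    print(hex(s), hex(e))
```

With the rounded range, the partially covered first and last lines are included in full, so no tail bytes are skipped.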
PS: if you want to operate directly on cache lines, there are other functions for that. They take way and set parameters to address cache lines directly.
The general way to store strings in NASM is to use db like in msg: db 'hello world', 0xA. I think this stores the string in the bss section. So the string will occupy the storage for the duration of the entire program. Instead, if we store it in the stack, it will be alive only during the local frame. For small strings (less than 8 bytes), this can be done using mov dword [rsp], 'foo'. But for longer strings, the string has to be split and be stored using multiple instructions. So this would increase the executable size (I thought so).
So which is better in large programs with multiple strings? Are any of the assumptions I made above wrong?
mov dword [rsp], 'foo' assembles to C70424666F6F00; it takes 7 bytes to encode 4 payload bytes.
Compared with the standard static way, DB 'foo',0, defining the string as an immediate operand in the code section increases the size by 75%.
Your dynamic method may be profitable only if you could eliminate the .rodata or .data section entirely (which is seldom the case in large programs). Each additional section takes more space in the executable than its net contents because of its file alignment (in PE format it is 512 bytes).
And even when your program has no static data in data sections besides long strings, you could save more space by declaring them statically in the .text (code) section.
But for longer strings string has to be split and be stored using multiple instructions. So this would increase the executable size (I thought so).
Yep, and in almost all cases, the number of bytes used by those instructions will exceed the number of bytes that would be needed to just store the string in memory normally. (The instruction includes all the bytes of the immediate, with a few exceptions like zero- and sign-extension, as well as additional bytes to encode the opcode, destination address, etc). And code, of course, also occupies (virtual) memory for the entire duration of the program. There's no free lunch.
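To make the size comparison concrete, here is a rough sketch that counts the code bytes needed to build a NUL-terminated string on the stack in 4-byte chunks with mov dword [rsp+off], imm32. The per-instruction sizes (7 bytes with no displacement, 8 with a disp8, 11 with a disp32) are the standard x86-64 encodings; the helper name is made up for this example.

```python
def stack_store_cost(s: bytes):
    """Return (payload_bytes, code_bytes) for storing s plus a NUL on the stack.

    Uses one `mov dword [rsp+off], imm32` per 4-byte chunk.
    """
    data = s + b"\x00"
    padded_len = len(data) + (-len(data) % 4)   # round up to whole dwords
    code_bytes = 0
    for off in range(0, padded_len, 4):
        if off == 0:
            code_bytes += 7     # C7 04 24 imm32
        elif off <= 127:
            code_bytes += 8     # C7 44 24 disp8 imm32
        else:
            code_bytes += 11    # C7 84 24 disp32 imm32
    return len(data), code_bytes


if __name__ == "__main__":
    for text in (b"foo", b"hello world\n"):
        payload, code = stack_store_cost(text)
        print(text, payload, code)
```

For 'foo' this gives 4 payload bytes in 7 code bytes, and the gap grows once displacements are needed.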
As such, you should just assemble the strings directly into memory with db as you have done. But try to arrange for them to go in a read-only section such as .text or .rodata, depending on what your system supports. A modern operating system will then be able to discard those pages from physical memory if there is memory pressure, knowing that they can be reloaded from the executable if needed again. This is entirely transparent to your program. If there are many strings that will be needed at the same time, you can optimize a little by trying to arrange for them to be adjacent in your program's memory (defining them all together in one asm file should suffice).
You could also design something where at runtime, you read your strings from an external file into dynamically allocated memory, then free it when done with them. This was a common technique for programs on ancient operating systems with limited physical memory and no virtual memory support (think MS-DOS).
The string data has to be somewhere. Existing as immediates in your machine code takes space in the .text section of your program, normally linked into the same program segment as .rodata where you'd keep string literals. Running instructions to store strings to the stack means the data has to come into I-cache, then go out into D-cache.
But for long redundant strings, code to generate them in memory may be smaller than the string itself. This comes down to the Kolmogorov complexity: the minimum size of a program that can output (or generate in an array) the given data. That program could be a gzip or zstd decompressor with some constant input data, which could be good for a very large string or set of strings, much larger than the decompression code. Or for a specific case, have a look at code-golf questions like The alphabet in programming languages, where my answer is 9 bytes of x86 machine code (including a ret) which inefficiently stores 1 byte at a time, incrementing a pointer in a loop, to produce the 26-byte string (without a terminating 0). Slow but compact.
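As a rough illustration of the decompressor idea, using Python's zlib as a stand-in for gzip/zstd and an arbitrary repeated string as the assumed input, the compressed constant data is far smaller than the text it regenerates:

```python
import zlib

# Assumed example input: a highly redundant 12000-byte string.
text = b"hello world\n" * 1000

packed = zlib.compress(text, 9)      # the "constant input data" to ship
restored = zlib.decompress(packed)   # what the program would do at run time

if __name__ == "__main__":
    print(len(text), len(packed), restored == text)
```

This only counts the data; a real program would also have to carry the decompression code, which is why the approach pays off only for large or very repetitive strings.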
Pushing / Storing from immediates
For just 4 bytes (not including the 0 terminator) you'd use push 'foo' = 5 bytes = 80% efficiency. On x86-64, that's a qword push (sign-extending the imm32 to 64-bit), so we get 4 bytes of zeros for free.
After that, getting the pointer into a register with mov rdi, rsp (3 bytes) is 4 bytes smaller than lea rdi, [rel msg_foo] (7 bytes), so it's an overall win for total size (unless padding for function alignment bumps it up or hides it). But anyway, comparing against the best option instead of the worst might be a good idea for your answer.
Compilers will sometimes use immediate data to init a local struct or array that has to be on the stack anyway (i.e. they have to pass a non-const pointer to another function); their threshold for switching to copying from .rodata (with movdqa xmm load/store) is higher than 8 bytes. But when you just want to print the string, you just need to pass a pointer to .rodata without copying it to the stack at all, so the threshold is much lower for it to be worth it to use stack space. Maybe 8 bytes, maybe 16, maybe only 4, depending on I-cache vs. D-cache pressure in your program.
For short but not tiny strings
mov rcx, imm64 + push rcx is 10+1 = 11 total bytes for 8 bytes of payload = 73% efficiency, vs. 4/7 = 57%. (At an offset from RSP it would be even worse, but to save code size you could use RBP+imm8 instead of RSP+imm8, but that's still 4 bytes per 7. You could mix push sign_extended_imm32 with mov dword [rsp+4], imm32 but that's also bad.)
This does have overhead that scales with string size, so it quickly becomes more size-efficient to just copy from .rodata, e.g. with an XMM loop, call memcpy, or even rep movsb if you're optimizing for size.
Or of course just using the string in .rodata if possible, if you don't need to make a copy you can modify on the stack.
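The efficiency figures quoted above are just payload bytes divided by machine-code bytes. A small sketch that recomputes them from the instruction sizes given in the text (the table layout and names are this example's own):

```python
# (payload bytes stored, machine-code bytes), as stated in the text above
ENCODINGS = {
    "push imm32": (4, 5),
    "mov dword [rsp], imm32": (4, 7),
    "mov rcx, imm64 + push rcx": (8, 11),
}


def efficiency_percent(payload, code):
    """Payload bytes per code byte, as a rounded percentage."""
    return round(100 * payload / code)


if __name__ == "__main__":
    for name, (payload, code) in ENCODINGS.items():
        print(f"{name:28s} {payload}/{code} = {efficiency_percent(payload, code)}%")
```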
Shellcode is a common use-case for techniques like this. You need a single "flat" sequence of bytes, usually not containing any 00 bytes which would terminate a C string.
You can have some data past the end of the actual machine code part, and mov store some zeros into that and generate pointers to it, but that's somewhat cumbersome. And you need a position-independent way to get pointers into registers. call/pop works, if you jump forward and call backward so the rel32 doesn't involve any 00 bytes. Same for RIP-relative lea.
It's often just as easy to push a zeroed register, or an imm8 or imm32, and get some zeros into memory along with ASCII data that way. Generating it a bit at a time makes it easy to mov rsi, rsp or whatever, instead of needing position-independent addressing relative to RIP.
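A quick way to see the 00-byte constraint is to scan candidate encodings for zero bytes. The byte strings below are the standard encodings of push 'foo', of xor eax, eax followed by push rax, and of push 'abcd'; the helper name is hypothetical.

```python
def null_offsets(code: bytes):
    """Return the offsets of all 00 bytes in a machine-code sequence."""
    return [i for i, b in enumerate(code) if b == 0]


PUSH_FOO = bytes.fromhex("68666f6f00")    # push 'foo'  (imm32 ends in 00)
XOR_PUSH = bytes.fromhex("31c050")        # xor eax, eax ; push rax
PUSH_ABCD = bytes.fromhex("6861626364")   # push 'abcd' (four non-zero bytes)

if __name__ == "__main__":
    print(null_offsets(PUSH_FOO))
    print(null_offsets(XOR_PUSH + PUSH_ABCD))
```

The first sequence embeds its terminator as a literal 00, while the second gets its zeros into memory from a zeroed register and from sign-extension without any 00 byte in the code itself.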
Stress-ng: Can we test RAM using stress-ng? What are the commands used to test RAM on a MIPS 32 device?
There are many memory based stressors in stress-ng:
stress-ng --class memory?
class 'memory' stressors: atomic bsearch context full heapsort hsearch
lockbus lsearch malloc matrix membarrier memcpy memfd memrate memthrash
mergesort mincore null numa oom-pipe pipe qsort radixsort remap
resources rmap stack stackmmap str stream tlb-shootdown tmpfs tsearch
vm vm-rw wcs zero zlib
Alternatively, one can use the VM based stressors:
stress-ng --class vm?
class 'vm' stressors: bigheap brk madvise malloc mlock mmap mmapfork mmapmany
mremap msync shm shm-sysv stack stackmmap tmpfs userfaultfd vm vm-rw
vm-splice
I suggest looking at the vm stressor first as this contains a large range of stressor methods that exercise memory patterns and can possibly find broken memory:
-m N, --vm N
start N workers continuously calling mmap(2)/munmap(2) and writ‐
ing to the allocated memory. Note that this can cause systems to
trip the kernel OOM killer on Linux systems if not enough physi‐
cal memory and swap is not available.
--vm-bytes N
mmap N bytes per vm worker, the default is 256MB. One can spec‐
ify the size as % of total available memory or in units of
Bytes, KBytes, MBytes and GBytes using the suffix b, k, m or g.
--vm-ops N
stop vm workers after N bogo operations.
--vm-hang N
sleep N seconds before unmapping memory, the default is zero
seconds. Specifying 0 will do an infinite wait.
--vm-keep
do not continually unmap and map memory, just keep on re-writing
to it.
--vm-locked
Lock the pages of the mapped region into memory using mmap
MAP_LOCKED (since Linux 2.5.37). This is similar to locking
memory as described in mlock(2).
--vm-madvise advice
Specify the madvise 'advice' option used on the memory mapped
regions used in the vm stressor. Non-linux systems will only
have the 'normal' madvise advice, linux systems support 'dont‐
need', 'hugepage', 'mergeable' , 'nohugepage', 'normal', 'ran‐
dom', 'sequential', 'unmergeable' and 'willneed' advice. If this
option is not used then the default is to pick random madvise
advice for each mmap call. See madvise(2) for more details.
--vm-method m
specify a vm stress method. By default, all the stress methods
are exercised sequentially, however one can specify just one
method to be used if required. Each of the vm workers have 3
phases:
1. Initialised. The anonymously memory mapped region is set to a
known pattern.
2. Exercised. Memory is modified in a known predictable way.
Some vm workers alter memory sequentially, some use small or
large strides to step along memory.
3. Checked. The modified memory is checked to see if it matches
the expected result.
The vm methods containing 'prime' in their name have a stride of
the largest prime less than 2^64, allowing them to thoroughly
step through memory and touch all locations just once while also
avoiding touching memory cells next to each other. This
strategy exercises the cache and page non-locality.
Since the memory being exercised is virtually mapped then there
is no guarantee of touching page addresses in any particular
physical order. These workers should not be used to test that
all the system's memory is working correctly either, use tools
such as memtest86 instead.
The vm stress methods are intended to exercise memory in ways to
possibly find memory issues and to try to force thermal errors.
Available vm stress methods are described as follows:
Method Description
all iterate over all the vm stress methods
as listed below.
flip sequentially work through memory 8
times, each time just one bit in memory
flipped (inverted). This will effec‐
tively invert each byte in 8 passes.
galpat-0 galloping pattern zeros. This sets all
bits to 0 and flips just 1 in 4096 bits
to 1. It then checks to see if the 1s
are pulled down to 0 by their neighbours
or if the neighbours have been pulled up
to 1.
galpat-1 galloping pattern ones. This sets all
bits to 1 and flips just 1 in 4096 bits
to 0. It then checks to see if the 0s
are pulled up to 1 by their neighbours
or if the neighbours have been pulled
down to 0.
gray fill the memory with sequential gray
codes (these only change 1 bit at a time
between adjacent bytes) and then check
if they are set correctly.
incdec work sequentially through memory twice,
the first pass increments each byte by a
specific value and the second pass
decrements each byte back to the origi‐
nal start value. The increment/decrement
value changes on each invocation of the
stressor.
inc-nybble initialise memory to a set value (that
changes on each invocation of the stres‐
sor) and then sequentially work through
each byte incrementing the bottom 4 bits
by 1 and the top 4 bits by 15.
rand-set sequentially work through memory in 64
bit chunks setting bytes in the chunk to
the same 8 bit random value. The random
value changes on each chunk. Check that
the values have not changed.
rand-sum sequentially set all memory to random
values and then summate the number of
bits that have changed from the original
set values.
read64 sequentially read memory using 32 x 64
bit reads per bogo loop. Each loop
equates to one bogo operation. This
exercises raw memory reads.
ror fill memory with a random pattern and
then sequentially rotate 64 bits of mem‐
ory right by one bit, then check the
final load/rotate/stored values.
swap fill memory in 64 byte chunks with ran‐
dom patterns. Then swap each 64 chunk
with a randomly chosen chunk. Finally,
reverse the swap to put the chunks back
to their original place and check if the
data is correct. This exercises adjacent
and random memory load/stores.
move-inv sequentially fill memory 64 bits of mem‐
ory at a time with random values, and
then check if the memory is set cor‐
rectly. Next, sequentially invert each
64 bit pattern and again check if the
memory is set as expected.
modulo-x fill memory over 23 iterations. Each
iteration starts one byte further along
from the start of the memory and steps
along in 23 byte strides. In each
stride, the first byte is set to a random
pattern and all other bytes are set
to the inverse. Then it checks to see if
the first byte contains the expected
random pattern. This exercises cache
store/reads as well as seeing if neighbouring
cells influence each other.
prime-0 iterate 8 times by stepping through memory
in very large prime strides clearing
just one bit at a time in every byte.
Then check to see if all bits are set to
zero.
prime-1 iterate 8 times by stepping through memory
in very large prime strides setting
just one bit at a time in every byte.
Then check to see if all bits are set to
one.
prime-gray-0 first step through memory in very large
prime strides clearing just one bit
(based on a gray code) in every byte.
Next, repeat this but clear the other 7
bits. Then check to see if all bits are
set to zero.
prime-gray-1 first step through memory in very large
prime strides setting just one bit (based
on a gray code) in every byte. Next,
repeat this but set the other 7 bits.
Then check to see if all bits are set to
one.
rowhammer try to force memory corruption using the
rowhammer memory stressor. This fetches
two 32 bit integers from memory and
forces a cache flush on the two
addresses multiple times. This has been
known to force bit flipping on some
hardware, especially with lower fre‐
quency memory refresh cycles.
walk-0d for each byte in memory, walk through
each data line setting them to low (and
the others are set high) and check that
the written value is as expected. This
checks if any data lines are stuck.
walk-1d for each byte in memory, walk through
each data line setting them to high (and
the others are set low) and check that
the written value is as expected. This
checks if any data lines are stuck.
walk-0a in the given memory mapping, work
through a range of specially chosen
addresses working through address lines
to see if any address lines are stuck
low. This works best with physical mem‐
ory addressing, however, exercising
these virtual addresses has some value
too.
walk-1a in the given memory mapping, work
through a range of specially chosen
addresses working through address lines
to see if any address lines are stuck
high. This works best with physical mem‐
ory addressing, however, exercising
these virtual addresses has some value
too.
write64 sequentially write memory using 32 x 64
bit writes per bogo loop. Each loop
equates to one bogo operation. This
exercises raw memory writes. Note that
memory writes are not checked at the end
of each test iteration.
zero-one set all memory bits to zero and then
check if any bits are not zero. Next,
set all the memory bits to one and check
if any bits are not one.
--vm-populate
populate (prefault) page tables for the memory mappings; this
can stress swapping. Only available on systems that support
MAP_POPULATE (since Linux 2.5.46).
So to run 1 vm stressor that uses 75% of memory, cycling through all the vm methods with verification, for 10 minutes with verbose mode enabled, use:
stress-ng --vm 1 --vm-bytes 75% --vm-method all --verify -t 10m -v
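To illustrate the initialise / exercise / check cycle that the man page describes for vm workers, here is a toy version of an incdec-style pass over a bytearray. It is only a sketch of the idea, not stress-ng's implementation, and the function name and default values are assumptions.

```python
def incdec_pass(size, delta=3, start_value=0x5A):
    """Run one init/exercise/check cycle and return the number of bad bytes."""
    # 1. Initialised: set the region to a known pattern.
    mem = bytearray([start_value]) * size

    # 2. Exercised: increment every byte, then decrement it back.
    for i in range(size):
        mem[i] = (mem[i] + delta) & 0xFF
    for i in range(size):
        mem[i] = (mem[i] - delta) & 0xFF

    # 3. Checked: every byte should be back at its start value.
    return sum(1 for b in mem if b != start_value)


if __name__ == "__main__":
    print("errors:", incdec_pass(4096))
```

On working memory the error count is zero; a non-zero count would mean a byte did not survive the round trip.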
I've been wondering about the size of the bss, data and text sections that I have, so I ran the size command.
The result is
   text    data     bss     dec     hex filename
   5461     580      24 ....
What do the numbers mean? Is the unit bits, bytes, kilobytes or megabytes?
In addition, how can I reduce the size of the bss, data and text of the file (without using the strip command)?
That command shows a list of the sections and their sizes in bytes found in an object file. The unit is decimal bytes, unless display of a different format was specified. And there most likely exists a man page for the size command too.
"reduce the size" - modify source code. Take things out.
As for the part about reducing segment size, you have some leeway in moving parts from data to bss by not initializing them. This is only an option if the program initializes the data in another way.
You can reduce data or bss by replacing arrays with dynamically allocated memory, using malloc and friends.
Note that the bss takes no space in the executable and reducing it just for the sake of having smaller numbers reported by size is probably not a good idea.
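For reference, in size's default output the dec column is the sum of text, data and bss, and hex is the same total in hexadecimal. A small sketch of that arithmetic using the three values visible in the question (the remaining columns were elided there, so the result is only what the arithmetic gives, not the quoted output):

```python
def size_totals(text, data, bss):
    """Return the total as size would report it: (decimal, hex string)."""
    dec = text + data + bss
    return dec, format(dec, "x")


if __name__ == "__main__":
    print(size_totals(5461, 580, 24))
```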
Could anyone provide some insight into the following assembly code:
More Information:
The bootloader is in fact a small 16-bit bootloader that deciphers, using XOR decryption, a bigger one: a Linux bootloader located in sectors 3 to 34. (1 sector is 512 bytes on that disk.)
The whole thing is a protection system for an exec running on embedded Linux.
The version where the protection was removed has the Linux bootloader already deciphered (we were able to reverse it using IDA), so we assume that the XOR key must consist only of zeros in the version without the protection.
If we look at offsets 0x800 to 0x8FF in the version with the protection removed, they are not filled with zeros, so this cannot be the key; otherwise this version couldn't be loaded: it would XOR plain data and load nothing but garbage.
Sectors 3 to 34 are ciphered in the original version and in clear in our version (protection removed), but the MBR code (the small pre-bootloader) is identical in both versions.
So could it be that there is a small detail in the assembly code of the MBR that slightly changes the location of the XOR key?
This is simply an exercise I am doing to understand assembly code loaders better and I found this one quite a challenge to go through. I thank you for your input so far!
Well, reading assembly code is a lot of work. I'll just stick to answering where does the "key" come from.
The BIOS loads the MBR at 0000:7C00 and sets DL to the drive from which it was loaded.
First, your MBR sets up a stack growing down from 0000:7C00.
Then, it copies itself to 0000:0600-0000:07ff and does the far jump, which I assume is to the next instruction but in the copied version, at 0000:061D. (EDIT: this copy is a pretty standard thing for an MBR to do, typically to this address; it leaves the address space above free.) (EDIT: it copies itself fully, I misread.)
Then, it tries 5 times to read sector 2 into 0000:0700-0000:08ff, and halts with an error message if it fails (loc_78).
Then, it tries 5 times to read 32 sectors starting from sector 3 into 2000:0-2000:3FFF (absolute address 20000-23FFF, that's 16kB starting at 128kB).
If all goes well we are at loc_60 and we know what we have in memory.
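The addresses in the steps above follow from real-mode arithmetic: physical address = segment * 16 + offset, and 32 sectors of 512 bytes make 16 kB. A short sketch to check the figures (the helper name is this example's own):

```python
SECTOR_SIZE = 512  # bytes per sector, as stated in the question


def phys_addr(segment, offset):
    """Real-mode segment:offset to a 20-bit physical address."""
    return ((segment << 4) + offset) & 0xFFFFF


if __name__ == "__main__":
    print(hex(phys_addr(0x0000, 0x7C00)))   # where the BIOS loads the MBR
    print(hex(phys_addr(0x2000, 0x0000)))   # start of the 32-sector buffer
    print(hex(phys_addr(0x2000, 0x3FFF)))   # last byte of that buffer
    print(hex(32 * SECTOR_SIZE))            # buffer length in bytes
```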
The loop on loc_68 will use as destination the buffer holding those 32 sectors.
It will use as source the buffer starting at 0000:0800 (which is the second 256 bytes of the buffer we read into 0:0700-0:08ff, the second 256 bytes of sector 2 of the disk).
On each iteration, LODSW adds 2 to SI.
When SI reaches 840, it is set back to 800 by the AND. Therefore, the "key" buffer goes from 800 to 83F, which is to say the 64 bytes starting at byte 256 of sector 2 of the disk.
I have the impression that the XORing falls one byte short... that CX should have been set to 4000h, not 3FFF. I think this is a bug in the code.
After the 16kB buffer at physical address 20000 has been circularly XORed with that "key" buffer, we jump into it with a far jump to 2000:0.
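A sketch of the circular-key XOR described above. The listing itself is not reproduced here, so the 0x83F mask used to wrap SI is an assumption chosen to match the described behaviour (SI going from 0x83E back to 0x800), and the function names are made up for illustration.

```python
def next_si(si, mask=0x83F):
    """Advance SI by one word (as LODSW does) and wrap it with an AND.

    The mask value is an assumption that reproduces the 0x800..0x83F window.
    """
    return (si + 2) & mask


def xor_circular(data: bytes, key: bytes) -> bytes:
    """XOR `data` with `key`, reusing the key from its start when it runs out."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


if __name__ == "__main__":
    print(hex(next_si(0x83E)))                  # wraps back to the key start
    key = bytes(range(1, 65))                   # placeholder 64-byte key
    plain = bytes(200)
    print(xor_circular(xor_circular(plain, key), key) == plain)
```

Because XOR is its own inverse, applying the same key twice restores the data, and an all-zero key leaves the data unchanged, which is the case the question assumes for the unprotected version.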
Right? Does it fit? Is the key there?