Can a single byte instruction be executed while being only partially overwritten? - multithreading

I have made an experiment in which a new thread executes a shellcode with this simple infinite loop:
NOP
JMP REL8 0xFE (-0x2)
This generates the following shellcode:
0x90, 0xEB, 0xFE
After this infinite loop there are other instructions, ending with the destination byte being overwritten back to -0x2 (to make it an infinite loop again) and an absolute jump that sends the thread back to this infinite loop.
Now I was asking myself whether it is possible for the jump instruction to be executed while its single destination byte is only partially overwritten by the other thread.
For example, let's say that the other thread overwrites the destination of the jump (0xFE, or 11111110 in binary) to 0x0 (00000000) to release the thread of this infinite loop.
Could it happen that the jump goes to let's say 0x1E (00011110) because the destination byte wasn't completely overwritten at that nanosecond?
Before asking this question here I have done the experiment myself in a C++ program, and I have let it run for some hours without it ever missing a single jump.
If you want to have a look at the code I made for this experiment, I have uploaded it to GitHub.
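For context, the experiment boils down to something like the following sketch (POSIX mmap/pthreads; this is not the exact code from the repo, and the final RET byte is my addition so the thread can return once the loop is broken):

#include <pthread.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static unsigned char *code;

static void *worker(void *arg)
{
    ((void (*)(void))code)();     /* spins in NOP; JMP -2 until the byte is patched */
    return NULL;
}

int main(void)
{
    /* 0x90 = NOP, 0xEB 0xFE = JMP rel8 -2, 0xC3 = RET (reached once patched) */
    static const unsigned char shellcode[] = { 0x90, 0xEB, 0xFE, 0xC3 };

    code = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (code == MAP_FAILED)
        return 1;
    memcpy(code, shellcode, sizeof shellcode);

    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);

    sleep(1);
    code[2] = 0x00;               /* single-byte store: JMP rel8 0 falls through to RET */

    pthread_join(t, NULL);        /* returns once the worker escapes the loop */
    return 0;
}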
According to this experiment, it seems impossible for an instruction to be executed while being only partially overwritten.
However, I have very little knowledge of assembly and of processors, which is why I am asking the question here:
Can anyone confirm my observation, please? Is it indeed impossible to have an instruction executed while being partially overwritten by another thread? Does anyone know for sure why?
Thank you very much for your help and knowledge on this; I did not know where to look for such information.

No, byte stores are always atomic on x86, even for cross-modifying code.
See Observing stale instruction fetching on x86 with self-modifying code for some links to Intel's manuals for cross modifying code. And maybe Reproducing Unexpected Behavior w/Cross-Modifying Code on x86-64 CPUs
Of course, all the recommendations for writing efficient cross-modifying code (and running code that you just JIT-compiled) involve avoiding stores into pages that other threads are currently executing.
Why are you doing this with "shellcode", anyway? Is this supposed to be part of an exploit? If not, why not just write code in asm like a normal person, with a label on the jmp instruction so you can store to it from C by assigning to extern char jmp_bytes[2]?
And if this is supposed to be an efficient cross-thread notification mechanism... it isn't. Spinning on a data load and a conditional branch with a pause loop would allow a lower latency exit from the loop than a self-modifying code machine nuke that flushes the whole pipeline right when you want it to finally be doing something useful instead of wasting CPU time. At least several times the delay of a simple branch miss.
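For comparison, spinning on a data flag looks something like this (a minimal sketch using C11 atomics and the x86 pause hint; the names are mine):

#include <stdatomic.h>
#include <immintrin.h>            /* _mm_pause() */

static _Atomic int keep_spinning = 1;

static void wait_for_release(void)
{
    while (atomic_load_explicit(&keep_spinning, memory_order_acquire))
        _mm_pause();              /* hint to the CPU that this is a spin-wait loop */
    /* ... continue with the real work ... */
}

/* The notifying thread just does:
 *     atomic_store_explicit(&keep_spinning, 0, memory_order_release);
 * which the spinning thread picks up after (at worst) a branch mispredict,
 * with no machine clear involved. */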
Even better, use an OS-supported condition variable so the thread can sleep instead of heating up your CPU (reducing the thermal headroom for the CPU to turbo above its rated clock speed when there is work to do).
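With pthreads that looks roughly like this (a sketch of the standard condition-variable pattern; the names are mine):

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool ready = false;

static void *waiter(void *arg)
{
    pthread_mutex_lock(&lock);
    while (!ready)                    /* loop guards against spurious wakeups */
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
    /* ... do the real work; the thread slept instead of burning cycles ... */
    return NULL;
}

static void notify(void)
{
    pthread_mutex_lock(&lock);
    ready = true;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}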
The mechanism used by current CPUs is that if a store near the EIP/RIP of any instruction in flight in the pipeline is detected, it does a machine clear (perf counter machine_clears.smc, aka a machine nuke). It doesn't even try to handle it "efficiently". But if you did a non-atomic store (e.g. actually two separate stores, or a store split across a cache-line boundary), the target CPU core could see it in different parts and potentially decode it with some bytes updated and other bytes not. A single byte, however, is always updated atomically, so tearing within a byte is not possible.
x86 on paper doesn't guarantee that, but as Andy Glew (one of the architects of Intel's P6 microarchitecture family) says, implementing stronger behaviour than the paper spec can actually be the most efficient way to meet all the required guarantees and run fast (and/or avoid breaking existing code in widely-used software!).

Related

How do different commands get executed in CPU x86-64 registers?

Years ago a teacher once said to the class that 'everything that gets parsed through the CPU can also be exploited'.
Back then I didn't know too much about the topic, but now the statement is nagging at me and I
lack the correct vocabulary to find an answer to this question on the internet myself, so I kindly ask you for help.
We had a lesson about 'cat', 'grep' and 'less', and she said that in the worst case even those commands can cause harm if we parse the wrong content through them.
I don't really understand how she meant that. I do know how CPU registers work; we also had to write an educational buffer overflow, so I have seen assembly code in the registers as well.
I still don't get the following:
How do commands get executed in the CPU at all? E.g. I use 'cat', so somewhere there will be a call of the command. But how does the data I enter get parsed to the CPU? If I 'cat' a .txt file which contains 'hello world' - can I find that string in HEX somewhere in the CPU registers? And if yes:
How does the CPU know that said string is NOT to be executed?
Could you think of any scenario where the above commands could get exploited? Afaik only text gets parsed through them, so how could that be exploitable? What do I have to be careful about?
Thanks a lot!
Machine code executes by being fetched by the instruction-fetch part of the CPU, at the address pointed to by RIP, the instruction-pointer. CPUs can only execute machine code from memory.
General-purpose registers get loaded with data from data load/store instructions, like mov eax, [rdi]. Having data in registers is totally unrelated to having it execute as machine code. Remember that RIP is a pointer, not actual machine-code bytes. (RIP can be set with jump instructions, including indirect jump to copy a GP register into it, or ret to pop the stack into it).
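To make that concrete, here's a small C example (hypothetical, just for illustration): the string ends up in memory and even in a register-sized variable, but nothing ever points RIP at it, so it is never decoded as instructions.

#include <stdio.h>
#include <string.h>

int main(void)
{
    char buf[32];
    strcpy(buf, "hello world");       /* the text is now just bytes in memory */

    unsigned long long reg;
    memcpy(&reg, buf, sizeof reg);    /* copy 8 of those bytes into a variable
                                         the compiler will keep in a register */
    printf("first 8 bytes as a 64-bit value: 0x%llx\n", reg);

    /* Nothing here ever jumps or calls to buf's address, so RIP never points
       at these bytes and the CPU never tries to decode them as machine code. */
    return 0;
}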
It would help to learn some basics of assembly language, because you seem to be missing some key concepts there. It's kind of hard to answer the security part of this question when the entire premise seems to be built on some misunderstanding of how computers work. (Which I don't think I can easily clear up here without writing a book on assembly language.) All I can really do is point you at CPU-architecture stuff that answers part of the title question of how instructions get executed. (Not from registers).
Related:
How does a computer distinguish between Data and Instructions?
How instructions are differentiated from data?
Modern Microprocessors: A 90-Minute Guide! covers the basic fetch/decode/execute cycle of simple pipelines. Modern CPUs might have more complex internals, but from a correctness / security POV they are equivalent (except for exploits like Spectre and Meltdown that depend on speculative execution).
https://www.realworldtech.com/sandy-bridge/3/ is a deep-dive on Intel's Sandybridge microarchitecture. That page covering instruction-fetch shows how things really work under the hood in real CPUs. (AMD Zen is fairly similar.)
You keep using the word "parse", but I think you just mean "pass". You don't "parse content through" something, but you can "pass content through" it. Anyway, no, cat usually doesn't involve copying or looking at data in user-space, unless you run cat -n to add line numbers.
See Race condition when piping through x86-64 assembly program for an x86-64 Linux asm implementation of plain cat using read and write system calls. Nothing in it is data-dependent, except for the command-line arg. The data being copied is never loaded into CPU registers in user-space.
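The same idea in C, for comparison (a minimal sketch, not the coreutils implementation): the file contents only ever pass through a buffer via read() and write().

#include <unistd.h>

int main(void)
{
    char buf[65536];
    ssize_t n;
    while ((n = read(0, buf, sizeof buf)) > 0) {   /* stdin -> buf */
        ssize_t off = 0;
        while (off < n) {                          /* buf -> stdout, handling short writes */
            ssize_t w = write(1, buf + off, n - off);
            if (w < 0)
                return 1;
            off += w;
        }
    }
    return n < 0;                                  /* nonzero if read() failed */
}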
Inside the kernel, copy_to_user inside Linux's implementation of a read() system call on x86-64 will normally use rep movsb for the copy, not a loop with separate load/store, so even in kernel the data gets copied from the page-cache, pipe buffer, or whatever, to user-space without actually being in a register. (Same for write copying it to whatever stdout is connected to.)
Other commands, like less and grep, would load data into registers, but that doesn't directly introduce any risk of it being executed as code.
Most of the things have already been answered by Peter. However, I would like to add a few things.
How do commands get executed in the CPU at all? e.g. I use 'cat' so somehwere there will be a call of the command. But how does the data I enter get parsed to the CPU? If I 'cat' a .txt file which contains 'hello world' - can I find that string in HEX somewhere in the CPU registers?
cat is not directly executed by the CPU. You could check the source code (cat.c) and get an in-depth view.
What actually happens is that the program is compiled into machine instructions, and those get executed by the CPU. The instructions themselves are not vulnerable, because all they do is move some data and flip some bits. Most vulnerabilities are due to memory management, and cat has been vulnerable in the past. Check this for more detail.
How does the CPU know that said string is NOT to be executed?
It does not. It's the job of the operating system to tell it what is to be executed and what is not.
Could you think of any scencario where the above commands could get exploited? Afaik only text gets parsed through it, how could that be exploitable? What do I have to be careful about?
You have to be careful about how you pass the text file into memory. You could even make your own interpreter that executes a .txt file; then the interpreter is what tells the CPU how to execute those 'instructions' (a toy example of that idea is sketched below).
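As a toy illustration (the file format and commands here are made up): the CPU only ever executes the interpreter's own compiled code; the text file merely selects which branch runs.

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;
    FILE *f = fopen(argv[1], "r");
    if (!f)
        return 1;
    char line[256];
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "print ", 6) == 0)       /* "print hello" -> prints hello */
            fputs(line + 6, stdout);
        else if (strncmp(line, "quit", 4) == 0)
            break;
    }
    fclose(f);
    return 0;
}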

6502 cycle timing per instruction

I am writing my first NES emulator in C. The goal is to make it easily understandable and cycle accurate (it does not necessarily have to be code-efficient), in order to play games at normal 'hardware' speed. When digging into the technical references of the 6502, it seems like instructions consume more than one CPU cycle - and also take different numbers of cycles depending on given conditions (such as branching). My plan is to create read and write functions, and also group opcodes by addressing modes using a switch.
The question is: when I have a multi-cycle instruction, such as BRK, do I need to emulate exactly what happens in each cycle:
#Method 1
cycle - action
1 - read BRK opcode
2 - read padding byte (ignored)
3 - store high byte of PC
4 - store low byte of PC
5 - store status flags with B flag set
6 - low byte of target address
7 - high byte of target address
...or can I just execute all the required operations in one 'cycle' (one switch case) and do nothing in the remaining cycles?
#Method 2
1 - read BRK opcode,
read padding byte (ignored),
store high byte of PC,
store low byte of PC,
store status flags with B flag set,
low byte of target address,
high byte of target address
2 - do nothing
3 - do nothing
4 - do nothing
5 - do nothing
6 - do nothing
7 - do nothing
Since both methods consume the desired 7 cycles, will there be no difference between the two? (accuracy-wise)
Personally I think method 1 is the way to go; however, I cannot think of a proper, easy way to implement it... (Please help!)
Do you 'need' to? It depends on the software. Imagine the simplest example:
STA ($56), Y
... which happens to hit a hardware register. If you don't do at least the write on the correct cycle then you've introduced a timing deficiency. The register you're writing to will be written to at the wrong time. What if it's something like a palette register, and the programmer is running a raster effect? Then you've just moved where the colour changes. You've changed the graphical output.
In practice, clever programmers do much smarter things than that — e.g. one might use a read-modify-write operation to read a hardware value at an exact cycle, modify it, then write it back at some other exact cycle.
So my answer is:
most software isn't written so that the difference between (1) and (2) will have any effect; but
some definitely is, because the author was very clever; and
some definitely is, just because the author experimented until they found a cool effect, regardless of whether they were cognisant of the cause; and
in any case, when you find something that doesn't work properly on your emulator, how much time do you want to spend considering all the permutations and combinations of potential causes? Every one you can factor out is one less to consider.
Most emulators used to use your method (2). What normally happened is that they worked with 90% of software. Then there were a few cases that didn't work, for which the emulator author put in a special case here, a special case there. Those usually ended up interacting poorly, and the emulator spent the rest of its life oscillating between supporting different 95% combinations of the available software until somebody wrote a better one.
So just go with method (1). It will cause some software that would otherwise be broken not to be so. Also it'll teach you more, and it'll definitely eliminate any potential motivation for special cases so it'll keep your code cleaner. It'll be marginally slower but I think your computer can probably handle it.
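For what it's worth, method (1) can be structured as a step_cycle function driven by a per-instruction cycle counter. A rough C sketch (my own hypothetical structure, with only BRK filled in):

#include <stdint.h>

typedef struct {
    uint16_t pc;
    uint8_t  s, p;        /* stack pointer, status register */
    uint8_t  opcode;      /* opcode currently being executed */
    int      cycle;       /* 0 = about to fetch a new opcode */
} Cpu6502;

static uint8_t memory[0x10000];
static uint8_t mem_read(uint16_t a)             { return memory[a]; }
static void    mem_write(uint16_t a, uint8_t v) { memory[a] = v; }

/* Advance the CPU by exactly one cycle; every cycle is one read or one write. */
static void step_cycle(Cpu6502 *c)
{
    if (c->cycle == 0) {                     /* cycle 1: fetch opcode */
        c->opcode = mem_read(c->pc++);
        c->cycle = 1;
        return;
    }
    if (c->opcode == 0x00) {                 /* BRK */
        switch (c->cycle) {
        case 1: mem_read(c->pc++);                           break; /* padding byte */
        case 2: mem_write(0x0100 + c->s--, c->pc >> 8);      break; /* push PCH */
        case 3: mem_write(0x0100 + c->s--, c->pc & 0xFF);    break; /* push PCL */
        case 4: mem_write(0x0100 + c->s--, c->p | 0x10);     break; /* push P with B set */
        case 5: c->pc = (c->pc & 0xFF00) | mem_read(0xFFFE); break; /* vector low byte */
        case 6: c->pc = (mem_read(0xFFFF) << 8) | (c->pc & 0x00FF); /* vector high byte */
                c->p |= 0x04;                /* set the I flag */
                c->cycle = -1;                               break; /* instruction done */
        }
    }
    /* other opcodes get their own per-cycle cases in the same style */
    c->cycle++;
}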
Other tips: the 6502 has only a few addressing modes, and the addressing mode entirely dictates the timing. This document is everything you need to know for perfect timing. If you want perfect cleanliness, your switch table can just pick an addressing mode and a central operation, then exit, and you can branch on addressing mode to do the main action.
If you're going to use vanilla read and write methods, which is smart on a 6502 as every single cycle is either a read or a write (so it's almost all you need to say), just be careful of the method signatures. For example, the 6502 has a SYNC pin which allows an observer to discriminate an ordinary read from an opcode read. Check whether the NES exposes that to cartridges, as it's often used for implicit paging on systems that expose it, and the main identifying characteristic of the NES is that there are hundreds of paging schemes.
EDIT: minor updates:
it's not actually completely true to say that a 6502 always reads or writes; it also has an RDY input. If the RDY input is asserted and the 6502 intends to read, it will instead halt while maintaining the intended read address. It is rarely used in practice because it's insufficient for common tasks like allowing somebody else to take possession of memory (the 6502 will write regardless of the RDY input; it's really meant to help with single-stepping), and since it seemingly isn't included on the NES cartridge pinout, you needn't implement it for that machine.
per the same pinout, the sync signal also doesn't seem to be exposed to cartridges on that system.

What is the poisoned NUL byte, in the 1998 and 2014 editions?

I have to make a 10-minute presentation about the "poisoned null-byte (glibc)".
I searched a lot about it but found nothing. I need help, please, because the Linux operating system and memory and process management aren't my thing.
Here is the original article, and here is an old article about the same problem, but another version of it.
What I want is a short and simple explanation of the old and new versions of the problem, and/or sufficient references where I can read more about this security threat.
To even begin to understand how this attack works, you will need at least a basic understanding of how a CPU works, how memory works, what the "heap" and "stack" of a process are, what pointers are, what libc is, what linked lists are, how function calls are implemented at the machine level (including calls to function pointers), what the malloc and free functions from the C library do, and so on. Hopefully you at least have some basic knowledge of C programming? (If not, you will probably not be able to complete this assignment in time.)
If you have a couple "gaps" in your knowledge of the basic topics mentioned above, hit the books and fill them in as quickly as you can. Talk to others if you need to, to make sure you understand them. Then read the following very carefully. This will not explain everything in the article you linked to, but will give you a good start. OK, ready? Let's start...
C strings are "null-terminated". That means the end of a string is marked by a zero byte. So for example, the string "abc" is represented in memory as (hex): 0x61 0x62 0x63 0x00. Notice, that 3-character string actually takes 4 bytes, due to the terminating null.
Now if you do something like this:
char *buffer = malloc(3); // not checking for error, this is just an example
strcpy(buffer, "abc");
...then that terminating null (zero byte) will go past the end of the buffer and overwrite something. We allocated a 3-byte buffer, but copied 4 bytes into it. So whatever was stored in the byte right after the end of the buffer will be replaced by a zero byte.
That was what happened in __gconv_translit_find. They had a buffer, which had been allocated with enough space to append ".so", including the terminating null byte, onto the end of a string. But they copied ".so" in starting from the wrong position. They started the copy operation one byte too far to the "right", so the terminating null byte went past the end of the buffer and overwrote something.
Now, when you call malloc to get back a dynamically allocated buffer, most implementations of malloc actually store some housekeeping data right before the buffer. For example, they might store the size of the buffer. Later, when you pass that buffer to free to release the memory, so it can be reused for something else, it will find that "hidden" data right before the beginning of the buffer, and will know how many bytes of memory you are actually freeing. malloc may also "hide" other housekeeping data in the same location. (In the 2014 article you referred to, the implementation of malloc used also stored some "flag" bits there.)
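Very roughly, that housekeeping data looks something like this (a simplified sketch of glibc's chunk header, not the exact definition in malloc/malloc.c):

/* Simplified sketch of the header malloc keeps just before each allocation. */
struct chunk_header {
    size_t prev_size;   /* size of the previous chunk, when that chunk is free */
    size_t size;        /* this chunk's size; its low bits are flag bits:
                           0x1 PREV_INUSE, 0x2 IS_MMAPPED, 0x4 NON_MAIN_ARENA */
};
/* The pointer returned by malloc points just past this header, so writing one
   byte past the end of the previous buffer can land on the low byte of the
   next chunk's size field and clear its flag bits. */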
The attack described in the article passed carefully crafted arguments to a command-line program, designed to trigger the buffer overflow error in __gconv_translit_find, in such a way that the terminating null byte would wipe out the "flag" bits stored by malloc -- not the flag bits for the buffer which overflowed, but those for another buffer which was allocated right after the one which overflowed. (Since malloc stores that extra housekeeping data before the beginning of an allocated buffer, and we are overrunning the previous buffer. You follow?)
The article shows a diagram, where 0x00000201 is stored right after the buffer which overflows. The overflowing null byte wipes out the bottom 1 and changes that into 0x00000200. That might not make sense at first, until you remember that x86 CPUs are little-endian -- if you don't understand what "little-endian" and "big-endian" CPUs are, look it up.
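If little-endian layout is new to you, this tiny program shows why a single NUL byte lands exactly on the byte holding that low bit (0x00000201 is the value from the article's diagram):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t size_field = 0x00000201;
    unsigned char *bytes = (unsigned char *)&size_field;

    /* On little-endian x86, the bytes sit in memory as: 01 02 00 00 */
    printf("%02x %02x %02x %02x\n", bytes[0], bytes[1], bytes[2], bytes[3]);

    bytes[0] = 0x00;   /* the overflowing terminating NUL hits the lowest-addressed byte */
    printf("0x%08x\n", (unsigned)size_field);   /* prints 0x00000200 */
    return 0;
}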
Later, the buffer whose flag bit was wiped out is passed to free. As it turns out, wiping out that one flag bit "confuses" free and makes it, in turn, also overwrite some other memory. (You will have to understand the implementation of malloc and free which are used by GNU libc, in order to understand why this is so.)
By carefully choosing the input arguments to the original program, you can set things up so that the memory overwritten by the "confused" free is that used for something called tls_dtor_list. This is a linked list maintained by GNU libc, which holds pointers to certain functions which it must call when the main program is exiting.
So tls_dtor_list is overwritten. The attacker has set things up just right, so that the function pointers in the overwritten tls_dtor_list will point to some code which they want to run. When the main program is exiting, some code in libc iterates over that list and calls each of the function pointers. Result: the attacker's code is executed!
Now, in this case, the attacker already has access to the target system. If all they can do is run some code with the privilege level of their own account, that doesn't get them anywhere. They want to run code with root (administrator) privileges. How is that possible? It is possible because the buggy program is a setuid program, owned by root. If you don't know what "setuid" programs in Unix are, look it up and make sure you understand it, because that is also a key to the whole exploit.
This is all about the 2014 article -- I didn't look at the one from 1998. Good luck!

Kernel Panic -- Failed copy_from_user, kmalloc?

I am writing a rootkit for my OS class (the teacher is okay with me asking for help here). My rootkit hooks the sys_read system call to hide "magic" ports from the user. When I copy the user buffer *buf (one of the arguments of sys_read) to kernel space (into a buffer called kbuf), I get a kernel panic / core dump error. It is possible that this is just because breaking read brings the system to a halt, but I wonder if anyone has any perspective on this.
The code is available online. Look at line 207: https://github.com/joshimhoff/toykit/blob/master/toykit.c
I hooked getdents and used copy_from_user to bring the getdents structs into kernel space, and this worked well! I am not sure what is different about read.
Thanks for the help!
I figured it out. I called the actual sys_read function and didn't check the return value. Sometimes it is negative to indicate an error. Instead of failing early, I asked kmalloc for a negative number of bytes (which kmalloc sees as an enormous unsigned size).
Imagine that. Allocating negative memory. That would be a crazy world.
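In code, the fix is roughly the following (hypothetical names, not the actual toykit.c source; the point is just to check the return value before using it as an allocation size):

asmlinkage long hooked_read(unsigned int fd, char __user *buf, size_t count)
{
    long ret = original_read(fd, buf, count);   /* saved pointer to the real sys_read */
    if (ret <= 0)
        return ret;             /* error or EOF: never hand a negative size to kmalloc */

    char *kbuf = kmalloc(ret, GFP_KERNEL);      /* ret is known to be positive here */
    if (!kbuf)
        return ret;
    if (copy_from_user(kbuf, buf, ret) == 0) {
        /* ... filter the "magic" ports out of kbuf and copy_to_user() it back ... */
    }
    kfree(kbuf);
    return ret;
}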

Question about cycle counting accuracy when emulating a CPU

I am planning on creating a Sega Master System emulator over the next few months, as a hobby project in Java (I know it isn't the best language for this, but I find it very comfortable to work in, and as a frequent user of both Windows and Linux I thought a cross-platform application would be great). My question regards cycle counting:
I've looked over the source code for another Z80 emulator, and for other emulators as well, and in particular the execute loop intrigues me - when it is called, an int is passed as an argument (let's say 1000 as an example). Now I get that each opcode takes a different number of cycles to execute, and that as these are executed, the number of cycles is decremented from the overall figure. Once the number of cycles remaining is <= 0, the execute loop finishes.
My question is that many of these emulators don't take account of the fact that the last instruction to be executed can push the number of cycles to a negative value - meaning that between execution loops, one may end up with say, 1002 cycles being executed instead of 1000. Is this significant? Some emulators account for this by compensating on the next execute loop and some don't - which approach is best? Allow me to illustrate my question as I'm not particularly good at putting myself across:
public void execute(int numOfCycles)
{   // this is an execution loop method, called with e.g. 1000
    while (numOfCycles > 0)
    {
        int instruction = readInstruction();
        switch (instruction)
        {
            case 0x40:
                // do whatever this opcode does, then decrement numOfCycles by 5
                numOfCycles -= 5;
                break;
            // let's say for argument's sake this case is executed when numOfCycles is 3
        }
    }
}
After the end of this particular looping example, numOfCycles would be at -2. This will only ever be a small inaccuracy, but does it matter overall, in people's experience? I'd appreciate anyone's insight on this one. I plan to interrupt the CPU after every frame, as this seems appropriate, so I know 1000 cycles is low; it's just an example, though.
Many thanks,
Phil
Most emulators/simulators deal just with CPU clock ticks.
That is fine for games etc. So you have some timer or whatever, and you run the CPU simulation until it has simulated the duration of that timer. Then it sleeps until the next timer interval occurs. This is very easy to implement. You can decrease the timing error with the approach you are asking about (a tiny sketch of that carry follows below), but as said, for games this is usually unnecessary.
This approach has one significant drawback, and that is that your code runs for just a fraction of real time. If the timer interval (timing granularity) is big enough, this can be noticeable even in games. For example, if you hit a keyboard key while the emulation sleeps, it is not detected (keys sometimes don't work). You can remedy this by using a smaller timing granularity, but on some platforms that is very hard. In that case the timing error can be more "visible" in software-generated sound (at least for those people who can hear such things and are not deaf-ish to them like me).
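The compensation the question asks about is just a one-variable carry between slices; a sketch in C (execute_one_instruction() is a hypothetical function returning the cycles the executed opcode took):

int execute_one_instruction(void);   /* hypothetical: runs one opcode, returns its cycle cost */

static int leftover = 0;             /* cycles we over-ran by in the previous slice (<= 0) */

void run_slice(int cycles_this_slice)
{
    int budget = cycles_this_slice + leftover;   /* pay back last slice's overshoot */
    while (budget > 0)
        budget -= execute_one_instruction();
    leftover = budget;                           /* zero or a small negative value */
}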
If you need something more sophisticated:
For example, if you want to connect real HW to your emulation/simulation, then you need to emulate/simulate buses. Also, things like a floating bus or system contention are very hard to add to approach #1 (doable, but with a lot of pain).
If you port the timing and emulation to machine cycles, things get much, much easier, and suddenly things like contention, HW interrupts and floating buses almost solve themselves. I ported my ZX Spectrum Z80 emulator to this kind of timing and saw the light. Many things became obvious (like errors in Z80 opcode documentation, timings, etc.). Contention also became very simple (just a few lines of code instead of horrible decoding tables with almost one entry per instruction type). The HW emulation also got pretty easy; I added things like FDC controllers and AY chip emulation to the Z80 this way (no hacks, it really runs on their original code ... even floppy formatting :)), so no more tape-loading hacks that don't work for custom loaders like turbo ones.
To make this work I created my Z80 emulation/simulation in such a way that it uses something like microcode for each instruction. As I very often corrected errors in the Z80 instruction set (there is no single 100% correct doc out there that I know of, even if some of them claim to be bug-free and complete), I came up with a way to deal with this without painfully reprogramming the emulator.
Each instruction is represented by an entry in a table, with info about timing, operands, functionality... The whole instruction set is a table of all these entries for all instructions. I then formed a MySQL database for my instruction set, and formed similar tables for each instruction set I found. Then I painfully compared all of them, selecting/repairing what was wrong and what was correct. The result is exported to a single text file which is loaded at emulation startup. It sounds horrible, but in reality it simplifies things a lot and even speeds up the emulation, as instruction decoding is now just accessing pointers. An example of the instruction set data file can be found here: What's the proper implementation for hardware emulation
A few years back I also published a paper on this (sadly, the institution that held that conference does not exist anymore, so the servers hosting those old papers are down for good; luckily I still have a copy). So here is an image from it that describes the problem:
a) Full throttle: no synchronization, just raw speed
b) #1 has big gaps, causing HW synchronization problems
c) #2 needs to sleep a lot with very small granularity (which can be problematic and slow things down), but the instructions are executed very near their real time ...
The red line is the host CPU processing speed (obviously whatever is above it takes a bit more time, so it should be cut and inserted before the next instruction, but that would be hard to draw properly).
The magenta line is the emulated/simulated CPU processing speed.
Alternating green/blue colors represent the next instruction.
Both axes are time.
[edit1] more precise image
The one above was hand painted... This one is generated by a VCL/C++ program:
generated by these parameters:
const int iset[]={4,6,7,8,10,15,21,23}; // possible timings [T]
const int n=128,m=sizeof(iset)/sizeof(iset[0]); // number of instructions to emulate, size of iset[]
const int Tps_host=25; // max possible simulation speed [T/s]
const int Tps_want=10; // wanted simulation speed [T/s]
const int T_timer=500; // simulation timer period [T]
so the host can simulate at 250% of the wanted speed, and the simulation granularity is 500 T. Instructions were generated pseudo-randomly...
There was quite an interesting article on Ars Technica recently about console emulation; it also links to quite a few emulators that might make for good research:
Accuracy takes power: one man's 3GHz quest to build a perfect SNES emulator
The relevant bit is that the author mentions, and I am inclined to agree, that most games will appear to function pretty correctly even with timing deviations of +/-20%. The issue you mention looks likely to never really introduce more than a fraction of a percent timing error, which is probably imperceptible whilst playing the final game. The authors probably didn't consider it worth dealing with.
I guess that depends on how accurate you want your emulator to be. I do not think that it has to be that accurate. Think of emulating the x86 platform: there are so many variants of processors, and each has different execution latencies and issue rates.
