8086 segment selector - protection

There's some "supervisor" bit to not let the "user space" do something like:
mov CS, 200h
?
What kind of protection has?
Thanks

On the actual 8086 CPU? I don't think so. The advanced protection features only really started appearing with the 80286. There were no restrictions on what programs could set the code segment to on the 8086.
On the 80386 in protected mode (I think that was the first one to provide this but it may have been the 80286), the values in CS (and DS, ES, and so on) changed from segment registers to selectors and they had to have entries in a descriptor table (eg: GDT, LDT).
At that point, protection became possible but I don't think it was the loading into a selector register that caused the violations. Rather it was the use of a selector above your privilege level.
Although, for CS, that would happen pretty quickly after you changed it (as you tried to execute the next instruction).
See here for more information.

Related

Changing kernel page permission for allowing user access

In x86 or x64 Linux, I am trying to make a kernel module that changes specific kernel page permission to allow user application accessing that memory. For example, if there is a readable kernel page at 0xC0001000(say it's 3:1 split), I want to change user/supervisor bit of this page and allow user applications to do something like this.
int* m = 0xC0001000;
printf("reading kernel memory from user : %08x\n", *m);
In my kernel module, I changed the access bit of corresponding kernel memory page from 0x67 to 0x63 (lower bits 111 -> 011) clearing the supervisor bit.
After that, I flushed the TLB of virtual address 0xc0001000 using invdpg instruction.
I have confirmed that the page entry I manipulated was indeed the corresponding one.
However, accessing 0xC0001000 from user application still causes me segmentation fault.
Am I missing something important here? perhaps cs segment and GDT? or is that irrelevant?
Some advice would be nice, thank you in advance :)
From your kernel module you can just change the effective user id to 0 to let it read /dev/kmem,

MOVDQU instruction + page boundary

I have a simple test program that loads an xmm register with the
movdqu instruction accessing data across a page boundary (OS = Linux).
If the following page is mapped, this works just fine. If it's not
mapped then I get a SIGSEGV, which is probably expected.
However this diminishes the usefulness of the unaligned loads quite
a bit. Additionally SSE4.2 instructions (like pcmpistri) which
allow for unaligned memory references appear to exhibit this behavior
as well.
That's all fine -- except there's many an implementation of strcmp
using pcmpistri that I've found that don't seem to address this issue
at all -- and I've been able to contrive trivial testcases that will
cause these implementations to fail, while the byte-at-a-time trivial
strcmp implementation will work just fine with the same data layout.
One more note -- it appears the the GNU C library implementation for
64-bit Linux has a __strcmp_sse42 variant that appears to use the
pcmpistri instruction in a more safe manner. The implementation of
this strcmp is fairly complex, but it appears to be carefully trying
to avoid the page boundary issue. I'm not sure if that's due to the
issue I describe above, or whether it's just a side-effect of trying to
get better performance by aligning the data.
Anyway the question I have is primarily -- where can I find out more
about this issue? I've typed in "movdqu crossing page boundary" and
every variant of that I can think of to Google, but haven't come across
anything particularly useful. If anyone can point me to further info
on this it would be greatly appreciated.
First, any algorithm which tries to access an unmapped address will cause a SegFault. If a non-AVX code flow used a 4 byte load to access the last byte of a page and the first 3 bytes of "the next page" which happened to not be mapped then it would also cause a SegFault. No? I believe that the "issue" is that the AVX(1/2/3) registers are so much bigger than "typical" that algorithms which were unsafe (but got away with it) get caught if they are trivially extended to the larger registers.
Aligned loads (MOVDQA) can never have this problem since they don't cross any boundaries of their own size or greater. Unaligned loads CAN have this problem (as you've noted) and "often" do. The reason for this is that the instruction is defined to load the full size of the target register. You need to look at the operand types in the instruction definitions quite carefully. It doesn't matter how much of the data you are interested in. It matters what the instruction is defined to do.
However...
AVX1 (Sandybridge) added a "masked move" capability which is slower than a movdqa or movdqu but will not (architecturally) access the unmapped page so long as the mask is not enabled for the portion of the access which would have fallen in that page. This is meant to address the issue. In general, moving forward, it appears that masked portions (See AVX512) of loads/stores will not cause access violations on IA either.
(It is a bummer about PCMPxSTRx behavior. Perhaps you could add 15 bytes of padding to your "string" objects?)
Facing a similar problem with a library I was writing, I got some information from a very helpful contributor.
The core of the idea is to align the 16-byte reads to the end of the string, then handle the leftover bytes at the beginning. This works because the end of the string must live in an accessible page, and you are guaranteed that the 16-byte truncated starting address must also live in an accessible page.
Since we never read past the string we cannot potentially stray into a protected page.
To handle the initial set of bytes, I chose to use the PCMPxSTRM functions, which return the bitmask of matching bytes. Then it's simply a matter of shifting the result to ignore any mask bits that occur before the true beginning of the string.

How are __addgs* used, and what is GS?

On Microsoft's site can be found some details of the
__addgsbyte ( offset, data )
__addgsword ( offset, data )
__addgsdword ( offset, data )
__addgsqword ( offset, data )
intrinsic functions. It is stated that offset is
the offset from the beginning of GS. I presume that GS refers to the processor register.
How does GS relate to the stack, if at all? Alternatively, how can I calculate an offset with respect to GS?
(And, are there any 'gotchas' relating to this and particular calling conventions, such as __fastcall?)
The GS register does not relate to the stack at all and therefore no relation to callign convensions. On x64 versions of Windows it is used to point to operating system data:
From wikipedia:
Instead of FS segment descriptor on x86 versions of the Windows NT
family, GS segment descriptor is used to point to two operating system
defined structures: Thread Information Block (NT_TIB) in user mode and
Processor Control Region (KPCR) in kernel mode. Thus, for example, in
user mode GS:0 is the address of the first member of the Thread
Information Block. Maintaining this convention made the x86-64 port
easier, but required AMD to retain the function of the FS and GS
segments in long mode — even though segmented addressing per se is not
really used by any modern operating system.
Note that those intrinsics are only available in kernel mode (e.g. device drivers). To calculate an offset, you would need to know what segment of memory GS is pointing to. So in kernel mode you would need to know the layout of the Processor Control Region.
Personally I don't know what the use of these would be.
these intrinsics, along with there fs counterparts have no real use except for accessing OS specific data, as such its most likely that these where added purely to make the windows developers lives easier (I've personally used this for inline TLS access)
The doc you linked says:
Add a value to a memory location specified by an offset relative to the beginning of the GS segment.
That means it's an intrinsic for this asm instruction: add gs:[offset], data (a normal memory-destination add with a GS segment override) with your choice of operand-size.
The compiler can presumably pick any addressing mode for the offset part, and data can be a register or immediate, thus either or both of offset and data can be runtime variables or constants.
The actual linear (virtual) address accessed will be gs_base + offset, where gs_base is set via an MSR to any address (by the OS on its own, or by you making a system call).
In user-space at least, Windows normally uses GS for TLS (thread local storage). The answer claiming this intrinsic only works in kernel code is mistaken. It doesn't add to the GS base, it adds to memory at an address relative to the existing GS base.
MS only seems to document this intrinsic for x64, but it's a valid instruction in 32-bit mode as well. IDK why they'd bother to restrict it. (Except of course the qword form: 64-bit operand size isn't available in 32-bit mode.)
Perhaps the compiler doesn't know how to generally optimize __readgsdword / (operation on that data) / __writegsdword into a memory-destination instruction with the same gs:offset address. If so, this intrinsic would just be a code-size optimization.
But perhaps it's sometimes relevant to force the compiler to do it as a single instruction to make it atomic wrt. interrupts on this core (but not wrt. accesses by other CPU cores). IDK if that's an intended use case is; this answer is just explaining what that one line of documentation means in terms of x86 asm and memory-segmentation.

How to interpret segment register accesses on x86-64?

With this function:
mov 1069833(%rip),%rax # 0x2b5c1bf9ef90 <_fini+3250648>
add %fs:0x0,%rax
retq
How do I interpret the second instruction and find out what was added to RAX?
This code:
mov 1069833(%rip),%rax # 0x2b5c1bf9ef90 <_fini+3250648>
add %fs:0x0,%rax
retq
is returning the address of a thread-local variable. %fs:0x0 is the address of the TCB (Thread Control Block), and 1069833(%rip) is the offset from there to the variable, which is known since the variable resides either in the program or on some dynamic library loaded at program's load time (libraries loaded at runtime via dlopen() need some different code).
This is explained in great detail in Ulrich Drepper's TLS document, specially §4.3 and §4.3.6.
I'm not sure they've been called segment register since the bad old days of segmented architecture. I believe the proper term is a selector (but I could be wrong).
However, I think you just need at the first quadword (64 bits) in the fs area.
The %fs:0x0 bit means the contents of the memory at fs:0. Since you've used the generic add (rather than addl for example), I think it will take the data width from the target %rax.
In terms of getting the actual value, it depends on whether you're in legacy or long mode.
In legacy mode, you'll have to get the fs value and look it up in the GDT (or possibly LDT) in order to get the base address.
In long mode, you'll need to look at the relevant model specific registers. If you're at this point, you've moved beyond my level of expertise unfortunately.

6502 CPU Emulation

It's the weekend, so I relax from spending all week programming by writing a hobby project.
I wrote the framework of a MOS 6502 CPU emulator yesterday, the registers, stack, memory and all the opcodes are implemented. (Link to source below)
I can manually run a series of operations in the debugger I wrote, but I'd like to load a NES rom and just point the program counter at its instructions, I figured that this would be the fastest way to find flawed opcodes.
I wrote a quick NES rom loader and loaded the ROM banks into the CPU memory.
The problem is that I don't know how the opcodes are encoded. I know that the opcodes themselves follow a pattern of one byte per opcode that uniquely identifies the opcode,
0 - BRK
1 - ORA (D,X)
2 - COP b
etc
However I'm not sure where I'm supposed to find the opcode argument. Is it the the byte directly following? In absolute memory, I suppose it might not be a byte but a short.
Is anyone familiar with this CPU's memory model?
EDIT: I realize that this is probably shot in the dark, but I was hoping there were some oldschool Apple and Commodore hackers lurking here.
EDIT: Thanks for your help everyone. After I implemented the proper changes to align each operation the CPU can load and run Mario Brothers. It doesn't do anything but loop waiting for Start, but its a good sign :)
I uploaded the source:
https://archive.codeplex.com/?p=cpu6502
If anyone has ever wondered how an emulator works, its pretty easy to follow. Not optimized in the least, but then again, I'm emulating a CPU that runs at 2mhz on a 2.4ghz machine :)
The opcode takes one byte, and the operands are in the following bytes. Check out the byte size column here, for instance.
If you look into references like http://www.atarimax.com/jindroush.atari.org/aopc.html, you will see that each opcode has an encoding specified as:
HEX LEN TIM
The HEX is your 1-byte opcode. Immediately following it is LEN bytes of its argument. Consult the reference to see what those arguments are. The TIM data is important for emulators - it is the number of clock cycles this instruction takes to execute. You will need this to get your timing correct.
These values (LEN, TIM) are not encoded in the opcode itself. You need to store this data in your program loader/executer. It's just a big lookup table. Or you can define a mini-language to encode the data and reader.
This book might help: http://www.atariarchives.org/mlb/
Also, try examing any other 6502 aseembler/simulator/debugger out there to see how Assembly gets coded as Machine Language.
The 6502 manuals are on the Web, at various history sites. The KIM-1 shipped with them. Maybe more in them than you need to know.
This is better - 6502 Instruction Set matrix:
https://www.masswerk.at/6502/6502_instruction_set.html
The apple II roms included a dissassembler, I think that's what it was called, and it would show you in a nice format the hex opcodes and the 3 character opcode and the operands.
So given how little memory was available, they managed to shove in the operand byte count (always 0, 1 or 2) the 3 character opcode for the entire 6502 instruction set into a really small space, because there's really not that much of it.
If you can dig up an apple II rom, you can just cut and paste from there...
The 6502 has different addressing modes, the same instruction has several different opcodes depending on it's addressing mode. Take a look at the following links which describes the different ways a 6502 can retrieve data from memory, or directly out of ROM.
http://obelisk.me.uk/6502/addressing.html#IMM

Resources