I'm writing a simple GB emulator (wow, now that's something new, isn't it), since I'm really taking my first steps in emulation.
What I don't seem to understand is how to correctly implement the CPU cycle and unconditional jumps.
Consider the instruction JP nn (unconditional jump to the memory address given by the operand), like JP 1000h. If I have a basic loop of:
increment PC
read opcode
execute command
Then after the JP opcode has been read and the command executed (read 1000h from memory and set PC = 1000h), the PC gets incremented and becomes 1001h, thus resulting in bad emulation.
tl;dr How do you emulate jumps in emulators, so that PC value stays correct, when having cpu loops that increment PC?
The PC should be incremented as an 'atomic' operation every time it is used to return a byte. This means immediate operands as well as opcodes.
In your example, the PC would be used three times, once for the opcode and twice for the two operand bytes. By the time the CPU has fetched the three bytes and is in a position to load the PC, the PC is already pointing to the next instruction opcode after the second operand but, since actually implementing the instruction reloads the PC, it doesn't matter.
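A minimal C sketch of that pattern; the names fetch_byte and mem are mine, and 0xC3 is the Game Boy opcode byte for JP nn, used here only for illustration:

#include <stdint.h>

uint8_t  mem[0x10000];   /* 64 KiB of emulated address space */
uint16_t pc;             /* program counter */

/* Every byte read through the PC bumps the PC -- opcodes and operands alike. */
static uint8_t fetch_byte(void)
{
    return mem[pc++];
}

static void step(void)
{
    uint8_t opcode = fetch_byte();          /* PC now points at the first operand */
    switch (opcode) {
    case 0xC3: {                            /* JP nn: unconditional jump */
        uint8_t lo = fetch_byte();          /* PC moves to the high byte */
        uint8_t hi = fetch_byte();          /* PC moves past the instruction */
        pc = (uint16_t)((hi << 8) | lo);    /* reload PC; the earlier increments no longer matter */
        break;
    }
    /* ... other opcodes ... */
    default:
        break;
    }
}

Because fetch_byte already advanced the PC three times, the jump simply overwrites it; there is no separate "increment PC" step left over to corrupt the target.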
Move increment PC to the end of the loop, and have it performed conditionally depending on the opcode?
I know next to nothing about emulation, but two obvious approaches spring to mind.
Instead of hardcoding PC += 1 into the main loop, let the evaluation of each opcode return the next PC value (or the offset, or a flag saying whether to increment it, etc.). Then the difference between jumps and other opcodes (their effect on the program counter) is definable along with everything else about them; see the sketch after this list.
Knowing that the main loop will always increment the PC by 1, just have the implementation of jumps set the PC to target - 1 rather than target.
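A rough sketch of the first approach, where every opcode handler returns the next PC and the main loop never increments it unconditionally; the table, handler names, and the two example opcodes are my own illustration:

#include <stdint.h>

uint8_t mem[0x10000];                        /* emulated address space */

/* Each handler gets the address of its opcode and returns the next PC. */
typedef uint16_t (*op_handler)(uint16_t pc);

static uint16_t op_nop(uint16_t pc)
{
    return pc + 1;                           /* 1-byte instruction: fall through */
}

static uint16_t op_jp_nn(uint16_t pc)        /* JP nn, opcode 0xC3 */
{
    return (uint16_t)((mem[pc + 2] << 8) | mem[pc + 1]);   /* the target *is* the next PC */
}

static op_handler handlers[256];

static void init_handlers(void)
{
    handlers[0x00] = op_nop;
    handlers[0xC3] = op_jp_nn;
    /* ... remaining opcodes ... */
}

static void run(uint16_t pc)
{
    for (;;)
        pc = handlers[mem[pc]](pc);          /* no unconditional PC++ anywhere */
}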
Summary: What is the definitive reference or reference implementation for the RISC-V user-level ISA?
Context: The RISC-V website has "The RISC-V Instruction Set Manual" which explains the user-level instructions very well, but does not give an exact specification for them. I am trying to build a user-level ISA simulator now and intend to write an FPGA implementation later, so the exact behavior is important to me.
A reference implementation would be sufficient, but should preferably be as simple as possible -- i.e. I would try to understand a pipelined implementation only as a last resort. What is important is to have an understanding of the specified ISA and not of a single CPU implementation or compiler implementation.
One example to show my problem is the AUIPC instruction: The prose explanation says that "AUIPC forms a 32-bit offset from the 20-bit U-immediate, filling in the lowest 12 bits with zeros, adds this offset to the pc, then places the result in register rd." I wanted to know whether this refers to the old or new PC, i.e. the position of the AUIPC instruction or the next instruction. I looked at the "RISCV Angel" implementation, but that seems to mask out the lower bits of the (old) PC -- not just of the immediate -- which I could not find any reason for in the spec, not even in the change history of the spec (since Angel is a bit older). Instead of an answer, I now have two questions about AUIPC. Many other instructions pose similar problems to me.
AFAICT the RISC-V Instruction Set Manual you cite is the closest thing there is to a definitive reference. If there are things that are unclear or incorrect in there then you could open issues at the Github site where that document is maintained: https://github.com/riscv/riscv-isa-manual
As far as AUIPC is concerned, the answer is implied, but not stated explicitly, by this sentence at the bottom of page 9 in the current manual:
There is one additional user-visible register: the program counter pc holds the address of the current instruction.
Based on that statement I would expect that the pc value that is seen and manipulated by the AUIPC instruction is the address of the AUIPC instruction itself.
This interpretation is supported by the discussion of the JALR instruction:
The indirect jump instruction JALR (jump and link register) uses the I-type encoding. The target address is obtained by adding the 12-bit signed I-immediate to the register rs1, then setting the least-significant bit of the result to zero. The address of the instruction following the jump (pc+4) is written to register rd.
Given that the address of the following instruction is expressed as pc+4, it seems clear that the pc value visible during the execution of JALR is the address of the JALR instruction itself.
The latest draft of the manual (at https://github.com/riscv/riscv-isa-manual/releases/download/draft-20190321-ba17106/riscv-spec.pdf) makes the situation slightly clearer. In place of this in the current manual:
AUIPC appends 12 low-order zero bits to the 20-bit U-immediate, sign-extends the result to 64 bits, then adds it to the pc and places the result in register rd.
the latest draft says:
AUIPC forms a 32-bit offset from the 20-bit U-immediate, filling in the lowest 12 bits with zeros, adds this offset to the pc of the AUIPC instruction, then places the result in register rd.
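Under that reading, the corresponding simulator steps for AUIPC and the pc+4 write-back in JALR would look roughly like this for RV32; the hart_t struct and its field names are placeholders of mine, not something from the spec:

#include <stdint.h>

typedef struct {
    uint32_t pc;        /* address of the instruction currently executing */
    uint32_t x[32];     /* integer registers, x0 hardwired to zero */
} hart_t;

/* AUIPC rd, imm20: rd = pc of this AUIPC + (imm20 << 12) */
static void exec_auipc(hart_t *h, unsigned rd, uint32_t imm20)
{
    if (rd != 0)
        h->x[rd] = h->pc + (imm20 << 12);   /* the old pc, i.e. the AUIPC's own address */
    h->pc += 4;
}

/* JALR rd, rs1, imm12: rd = pc + 4 (the next instruction),
 * then pc = (rs1 + sign-extended imm12) with bit 0 cleared. */
static void exec_jalr(hart_t *h, unsigned rd, unsigned rs1, int32_t imm12)
{
    uint32_t target = (h->x[rs1] + (uint32_t)imm12) & ~1u;  /* read rs1 before writing rd */
    if (rd != 0)
        h->x[rd] = h->pc + 4;
    h->pc = target;
}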
I was going through this link (delay in assembly) to add a delay in assembly. I want to perform some experiments by adding different delay values.
Here is the code used to generate the delay:
; start delay
mov bp, 43690
mov si, 43690
delay2:
dec bp
nop
jnz delay2
dec si
cmp si,0
jnz delay2
; end delay
What I understood from the code is that the delay is proportional to the time it spends executing the nop instructions (43690 x 43690). So on a different system with a different version of the OS, the delay will be different. Am I right?
Can anyone explain how to calculate the amount of delay, in nanoseconds, that the following assembly code generates, so that I can conclude my experiment with respect to the delay I added in my experimental setup?
This is the code I am using to generate a delay, without understanding the logic behind the use of the value 43690 (I used only one loop against the two loops in the original source code). To generate a different delay (without knowing its value), I just varied the number 43690 to 403690 or some other value.
Code on a 32-bit OS:
movl $43690, %esi    # ---> if I vary this to 4003690, then what is the delay value?
.delay2:
dec %esi
nop
jnz .delay2
How much delay is generated by this assembly code?
If I want to generate a delay of 100 nsec or 1000 nsec, or any other delay in microseconds, what initial value do I need to load into the register?
I am using Ubuntu 16.04 (both 32-bit as well as 64-bit), on an Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz and a Core i3-3470 CPU @ 3.20GHz.
Thank you in advance.
There is no very good way to get accurate and predictable timing from fixed counts for delay loops on a modern x86 PC, especially in user-space under a non-realtime OS like Linux. (But you could spin on rdtsc for very short delays; see below). You can use a simple delay-loop if you need to sleep at least long enough and it's ok to sleep longer when things go wrong.
Normally you want to sleep and let the OS wake your process, but this doesn't work for delays of only a couple microseconds on Linux. nanosleep can express it, but the kernel doesn't schedule with such precise timing. See How to make a thread sleep/block for nanoseconds (or at least milliseconds)?. On a kernel with Meltdown + Spectre mitigation enabled, a round-trip to the kernel takes longer than a microsecond anyway.
(Or are you doing this inside the kernel? I think Linux already has a calibrated delay loop. In any case, it has a standard API for delays: https://www.kernel.org/doc/Documentation/timers/timers-howto.txt, including ndelay(unsigned long nsecs) which uses the "jiffies" clock-speed estimate to sleep for at least long enough. IDK how accurate that is, or if it sometimes sleeps much longer than needed when clock speed is low, or if it updates the calibration as the CPU freq changes.)
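Back in user space, it's easy to see how coarse short sleeps are on a given machine by comparing the requested and measured times. A quick test of mine (not code from the question):

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec req = { .tv_sec = 0, .tv_nsec = 1000 };   /* ask for 1 us */
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    nanosleep(&req, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    long ns = (t1.tv_sec - t0.tv_sec) * 1000000000L + (t1.tv_nsec - t0.tv_nsec);
    printf("requested 1000 ns, slept %ld ns\n", ns);          /* usually far more than 1000 */
    return 0;
}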
Your (inner) loop is totally predictable at 1 iteration per core clock cycle on recent Intel/AMD CPUs, whether or not there's a nop in it. It's under 4 fused-domain uops, so you bottleneck on the 1-per-clock loop throughput of your CPUs. (See Agner Fog's x86 microarch guide, or time it yourself for large iteration counts with perf stat ./a.out.) Unless there's competition from another hyperthread on the same physical core...
Or unless the inner loop spans a 32-byte boundary, on Skylake or Kaby Lake (loop buffer disabled by microcode updates to work around a design bug). Then your dec / jnz loop could run at 1 per 2 cycles because it would require fetching from 2 different uop-cache lines.
I'd recommend leaving out the nop to have a better chance of it being 1 per clock on more CPUs, too. You need to calibrate it anyway, so a larger code footprint isn't helpful (so leave out extra alignment, too). (Make sure calibration happens while CPU is at max turbo, if you need to ensure a minimum delay time.)
If your inner loop wasn't quite so small (e.g. more nops), see Is performance reduced when executing loops whose uop count is not a multiple of processor width? for details on front-end throughput when the uop count isn't a multiple of 8. SKL / KBL with disabled loop buffers run from the uop cache even for tiny loops.
But x86 doesn't have a fixed clock frequency (and transitions between frequency states stop the clock for ~20k clock cycles (8.5us), on a Skylake CPU).
If running this with interrupts enabled, then interrupts are another unpredictable source of delays. (Even in kernel mode, Linux usually has interrupts enabled. An interrupts-disabled delay loop for tens of thousands of clock cycles seems like a bad idea.)
If running in user-space, then I hope you're using a kernel compiled with realtime support. But even then, Linux isn't fully designed for hard-realtime operation, so I'm not sure how good you can get.
System management mode interrupts are another source of delay that even the kernel doesn't know about. "Performance Implications of System Management Mode" from 2013 says that 150 microseconds is considered an "acceptable" latency for an SMI, according to Intel's test suite for PC BIOSes. Modern PCs are full of voodoo. I think/hope that the firmware on most motherboards doesn't have much SMM overhead, and that SMIs are very rare in normal operation, but I'm not sure. See also Evaluating SMI (System Management Interrupt) latency on Linux-CentOS/Intel machine.
Extremely low-power Skylake CPUs stop their clock with some duty-cycle, instead of clocking lower and running continuously. See this, and also Intel's IDF2015 presentation about Skylake power management.
Spin on RDTSC until the right wall-clock time
If you really need to busy-wait, spin on rdtsc waiting for the current time to reach a deadline. You need to know the reference frequency, which is not tied to the core clock, so it's fixed and nonstop (on modern CPUs; there are CPUID feature bits for invariant and nonstop TSC. Linux checks this, so you could look in /proc/cpuinfo for constant_tsc and nonstop_tsc, but really you should just check CPUID yourself on program startup and work out the RDTSC frequency (somehow...)).
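For the feature-bit check, something like this works with GCC/clang's <cpuid.h>; the bit tested is the invariant-TSC bit (CPUID.80000007H:EDX[8]), and working out the actual TSC frequency is still a separate problem:

#include <cpuid.h>
#include <stdio.h>

/* Returns 1 if the CPU reports an invariant (fixed-rate, nonstop) TSC. */
static int has_invariant_tsc(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx))
        return 0;                 /* leaf not supported: assume no invariant TSC */
    return (edx >> 8) & 1;        /* EDX bit 8 = invariant TSC */
}

int main(void)
{
    printf("invariant TSC: %s\n", has_invariant_tsc() ? "yes" : "no");
    return 0;
}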
I wrote such a loop as part of a silly-computer-tricks exercise: a stopwatch in the fewest bytes of x86 machine code. Most of the code size is for the string manipulation to increment a 00:00:00 display and print it. I hard-coded the 4GHz RDTSC frequency for my CPU.
For sleeps of less than 2^32 reference clocks, you only need to look at the low 32 bits of the counter. If you do your compare correctly, wrap-around takes care of itself. For the 1-second stopwatch, a 4.3GHz CPU would have a problem, but for nsec / usec sleeps there's no issue.
;;; Untested, NASM syntax
default rel
section .data
; RDTSC frequency in counts per 2^16 nanoseconds
; 3200000000 would be for a 3.2GHz CPU like your i3-3470
ref_freq_fixedpoint: dd 3200000000 * (1<<16) / 1000000000
; The actual integer value is 0x033333
; which represents a fixed-point value of 3.1999969482421875 GHz
; use a different shift count if you like to get more fractional bits.
; I don't think you need 64-bit operand-size
; nanodelay(unsigned nanos /*edi*/)
; x86-64 System-V calling convention
; clobbers EAX, ECX, EDX, and EDI
global nanodelay
nanodelay:
; take the initial clock sample as early as possible.
; ideally even inline rdtsc into the caller so we don't wait for I$ miss.
rdtsc ; edx:eax = current timestamp
mov ecx, eax ; ecx = start
; lea ecx, [rax-30] ; optionally bias the start time to account for overhead. Maybe make this a variable stored with the frequency.
; then calculate edi = ref counts = nsec * ref_freq
imul edi, [ref_freq_fixedpoint] ; counts * 2^16
shr edi, 16 ; actual counts, rounding down
.spinwait: ; do{
pause ; optional but recommended.
rdtsc ; edx:eax = reference cycles since boot
sub eax, ecx ; delta = now - start. This may wrap, but the result is always a correct unsigned 0..n
cmp eax, edi ; } while(delta < sleep_counts)
jb .spinwait
ret
To avoid floating-point for the frequency calculation, I used fixed-point like uint32_t ref_freq_fixedpoint = 3.2 * (1<<16);. This means we just use an integer multiply and shift inside the delay loop. Use C code to set ref_freq_fixedpoint during startup with the right value for the CPU.
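One hedged way to do that startup calibration is to time an rdtsc delta against CLOCK_MONOTONIC_RAW; the 100 ms window and the function name below are arbitrary choices of mine, not part of the answer's code:

#define _GNU_SOURCE          /* for CLOCK_MONOTONIC_RAW */
#include <stdint.h>
#include <time.h>
#include <x86intrin.h>       /* __rdtsc */

/* Returns TSC counts per 2^16 nanoseconds, the format ref_freq_fixedpoint expects. */
uint32_t calibrate_ref_freq_fixedpoint(void)
{
    struct timespec t0, t1;
    struct timespec req = { .tv_sec = 0, .tv_nsec = 100 * 1000 * 1000 };  /* ~100 ms window */

    clock_gettime(CLOCK_MONOTONIC_RAW, &t0);
    uint64_t tsc0 = __rdtsc();
    nanosleep(&req, NULL);                    /* any wait of roughly known length will do */
    uint64_t tsc1 = __rdtsc();
    clock_gettime(CLOCK_MONOTONIC_RAW, &t1);

    uint64_t ns = (uint64_t)(t1.tv_sec - t0.tv_sec) * 1000000000u
                + (uint64_t)(t1.tv_nsec - t0.tv_nsec);

    return (uint32_t)(((tsc1 - tsc0) << 16) / ns);    /* ~0x33333 for a 3.2GHz TSC */
}

A longer window gives a better estimate, and on CPUs with an invariant TSC the result doesn't depend on the core clock at the time of calibration.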
If you recompile this for each target CPU, the multiply constant can be an immediate operand for imul instead of loading from memory.
pause sleeps for ~100 clocks on Skylake, but only for ~5 clocks on previous Intel uarches. So it hurts timing precision a bit, maybe sleeping up to 100 ns past a deadline when the CPU frequency is clocked down to ~1GHz. Or at a normal ~3GHz speed, more like up to +33ns.
Running continuously, this loop heated up one core of my Skylake i7-6700k at ~3.9GHz by ~15 degrees C without pause, but only by ~9 C with pause. (From a baseline of ~30C with a big CoolerMaster Gemini II heatpipe cooler, but low airflow in the case to keep fan noise low.)
Adjusting the start-time measurement to be earlier than it really is will let you compensate for some of the extra overhead, like branch-misprediction when leaving the loop, as well as the fact that the first rdtsc doesn't sample the clock until probably near the end of its execution. Out-of-order execution can let rdtsc run early; you might use lfence, or consider rdtscp, to stop the first clock sample from happening out-of-order ahead of instructions before the delay function is called.
Keeping the offset in a variable will let you calibrate the constant offset, too. If you can do this automatically at startup, that could be good to handle variations between CPUs. But you need some high-accuracy timer for that to work, and this is already based on rdtsc.
Inlining the first RDTSC into the caller and passing the low 32 bits as another function arg would make sure the "timer" starts right away even if there's an instruction-cache miss or other pipeline stall when calling the delay function. So the I$ miss time would be part of the delay interval, not extra overhead.
The advantage of spinning on rdtsc:
If anything happens that delays execution, the loop still exits at the deadline, unless execution is currently blocked when the deadline passes (in which case you're screwed with any method).
So instead of using exactly n cycles of CPU time, you use CPU time until the current time is n * freq nanoseconds later than when you first checked.
With a simple counter delay loop, a delay that's long enough at 4GHz would make you sleep more than 4x too long at 0.8GHz (typical minimum frequency on recent Intel CPUs).
This does run rdtsc twice, so it's not appropriate for delays of only a couple nanoseconds. (rdtsc itself is ~20 uops, and has a throughput of one per 25 clocks on Skylake/Kaby Lake.) I think this is probably the least bad solution for a busy-wait of hundreds or thousands of nanoseconds, though.
Downside: a migration to another core with an unsynced TSC could result in sleeping for the wrong time. But unless your delays are very long, the migration time will be longer than the intended delay. The worst case is sleeping for the delay-time again after the migration. The way I do the compare, (now - start) < count, instead of looking for a certain target count, means that unsigned wraparound ends the loop as soon as now - start becomes a large number. You can't get stuck sleeping for nearly a whole second while the counter wraps around.
Downside: maybe you want to sleep for a certain number of core cycles, or to pause the count when the CPU is asleep.
Downside: old CPUs may not have a non-stop / invariant TSC. Check these CPUID feature bits at startup, and maybe use an alternate delay loop, or at least take it into account when calibrating. See also Get CPU cycle count? for my attempt at a canonical answer about RDTSC behaviour.
Future CPUs: use tpause on CPUs with the WAITPKG CPUID feature.
(I don't know which future CPUs are expected to have this.)
It's like pause, but puts the logical core to sleep until the TSC = the value you supply in EDX:EAX. So you could rdtsc to find out the current time, add / adc the sleep time scaled to TSC ticks to EDX:EAX, then run tpause.
Interestingly, it takes another input register where you can put a 0 for a deeper sleep (more friendly to the other hyperthread, probably drops back to single-thread mode), or 1 for faster wakeup and less power-saving.
You wouldn't want to use this to sleep for seconds; you'd want to hand control back to the OS. But you could do an OS sleep to get close to your target wakeup if it's far away, then mov ecx,1 or xor ecx,ecx / tpause ecx for whatever time is left.
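With a compiler that exposes the WAITPKG intrinsics (e.g. recent GCC/clang built with -mwaitpkg), that last step might look roughly like the sketch below; treat the _tpause usage as an assumption to check against your compiler's documentation rather than a known-good recipe:

#include <stdint.h>
#include <x86intrin.h>      /* __rdtsc, _tpause; build with -mwaitpkg */

/* Sleep the rest of the way to an absolute TSC deadline.
 * ctrl = 0 requests the deeper C0.2 state, ctrl = 1 the lighter C0.1 state. */
static void sleep_until_tsc(uint64_t deadline_tsc, unsigned ctrl)
{
    while (__rdtsc() < deadline_tsc)
        _tpause(ctrl, deadline_tsc);    /* may wake early (interrupt / OS limit), hence the loop */
}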
Semi-related (also part of the WAITPKG extension) are the even more fun umonitor / umwait, which (like privileged monitor/mwait) can have a core wake up when it sees a change to memory in an address range. For a timeout, it has the same wakeup on TSC = EDX:EAX as tpause.
I've been trying to read the implementation of a kernel module, and I'm stumbling on this piece of code.
unsigned long addr = (unsigned long) buf;
if (!IS_ALIGNED(addr, 1 << 9)) {
DMCRIT("#%s in %s is not sector-aligned. I/O buffer must be sector-aligned.", name, caller);
BUG();
}
The IS_ALIGNED macro is defined in the kernel source as follows:
#define IS_ALIGNED(x, a) (((x) & ((typeof(x))(a) - 1)) == 0)
I understand that data has to be aligned along the size of a datatype to work, but I still don't understand what the code does.
It left-shifts 1 by 9, then subtracts 1, which gives 111111111 (nine ones in binary). That value is then bitwise-ANDed with x.
Why does this code work? How is this checking for byte alignment?
In systems programming it is common to need a memory address to be aligned to a certain number of bytes -- that is, several lowest-order bits are zero.
Basically, !IS_ALIGNED(addr, 1 << 9) checks whether addr is on a 512-byte (2^9) boundary (the last 9 bits are zero). This is a common requirement when erasing flash locations because flash memory is split into large blocks which must be erased or written as a single unit.
Another application of this I ran into: I was working with a certain DMA controller which has a modulo feature. Basically, that means you can allow it to change only the last several bits of an address (the destination address in this case). This is useful for protecting memory from mistakes in the way you use a DMA controller. Problem is, I initially forgot to tell the compiler to align the DMA destination buffer to the modulo value. This caused some incredibly interesting bugs (random variables that have nothing to do with the thing using the DMA controller being overwritten... sometimes).
As far as "how does the macro code work?": if you subtract 1 from a number that ends with all zeroes, you get a number that ends with all ones. For example, 0b00010000 - 0b1 = 0b00001111. This is a way of creating a binary mask from the integer number of required-alignment bytes. The mask has ones only in the bits we are interested in checking for zero. After we AND the address with the mask containing ones in the lowest-order bits, we get 0 if and only if the lowest 9 (in this case) bits are zero.
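A tiny standalone demo of the same mask trick; it relies on GNU typeof, exactly like the kernel macro quoted above:

#include <stdio.h>

#define IS_ALIGNED(x, a) (((x) & ((typeof(x))(a) - 1)) == 0)

int main(void)
{
    unsigned long addrs[] = { 0x1000, 0x1200, 0x1201, 0x1400 };

    for (int i = 0; i < 4; i++) {
        /* 1 << 9 = 512 = 0x200, so the mask is 0x1FF: nine low-order ones */
        printf("0x%lx %s sector-aligned\n",
               addrs[i], IS_ALIGNED(addrs[i], 1 << 9) ? "is" : "is NOT");
    }
    return 0;
}

Only 0x1201 is reported as unaligned, because it is the only address whose low nine bits are not all zero.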
"Why does it need to be aligned?": This comes down to the internal makeup of flash memory. Erasing and writing flash is a much less straightforward process then reading it, and typically it requires higher-than-logic-level voltages to be supplied to the memory cells. The circuitry required to make write and erase operations possible with a one-byte granularity would waste a great deal of silicon real estate only to be used rarely. Basically, designing a flash chip is a statistics and tradeoff game (like anything else in engineering) and the statistics work out such that writing and erasing in groups gives the best bang for the buck.
At no extra charge, I will tell you that you will be seeing a lot of this type of thing if you are reading driver and kernel code. It may be helpful to familiarize yourself with the contents of this article (or at least keep it around as a reference): https://graphics.stanford.edu/~seander/bithacks.html
I am building a Commodore PET on an FPGA. I've implemented my own 6502 core in Kansas Lava (code is available at https://github.com/gergoerdi/mos6502-kansas-lava), and by putting enough IO around it (https://github.com/gergoerdi/eightbit-kansas-lava) I was able to boot the original Commodore PET ROM on it, get a blinking cursor and start typing.
However, after typing in the classic BASIC program
10 PRINT "HELLO WORLD"
20 GOTO 10
it crashes after a while (after several seconds) with
?ILLEGAL QUANTITY ERROR IN 10
Because my code has fairly reasonable per-opcode test coverage, and it passes AllSuiteA, I thought I would look into tests for more complicated behaviour, which is how I arrived at Klaus Dormann's interrupt testsuite. Running it in the Kansas Lava simulator has pointed out a ton of bugs in my original interrupt implementation:
The I flag was not set when entering the interrupt handler
The B flag was all over the place
IRQ interrupts were completely ignored unless I was unset when they arrived (the correct behaviour seems to be to queue interrupts when I is set and when it gets unset, they should still be handled)
After fixing these, I can now successfully run the Klaus Dormann test, so I was hoping that, by loading my machine back onto the real FPGA, with some luck the BASIC crash would go away.
However, the new version, with all these interrupt bugs fixed, and passing the interrupt test in a simulator, now fails to respond to keyboard input or even just blink the cursor on the real FPGA. Note that both keyboard input and cursor blinking are done in response to an external IRQ (driven by the screen VBlank signal), so this means the fixed version somehow broke all interrupt handling...
I am looking for any kind of vague suggestions what could be going wrong or how I could begin to debug this.
The full code is available at https://github.com/gergoerdi/mos6502-kansas-lava/tree/interrupt-rewrite, the offending commit (the one that fixes the test and breaks the PET) is 7a09b794af. I realize this is the exact opposite of a minimal viable reproduction, but the change itself is tiny and because I have no idea where it goes wrong, and because reproducing the problem requires a machine featureful enough to boot the stock Commodore PET ROM, I don't know how I could shrink it...
Added:
I managed to reproduce the same issue on the same hardware with a very simple (dare I say minimal) ROM instead of the stock PET ROM:
.org $C000
reset:
;; Initialize PIA
LDY #$07
STY $E813
LDA #30
STA $42
STA $8000
CLI
JMP *
irq:
CMP $E812 ; ACK irq
DEC $42
BNE endirq
LDX $8000
INX
STX $8000
LDA #30
STA $42
endirq: RTI
.res $FFFC-*
.org $FFFC
resetv: .addr reset
irqv: .addr irq
Interrupts aren't queued; the interrupt line is sampled on the penultimate cycle of each instruction, and if it is active at that point and I is clear, a jump to interrupt occurs next instead of a fetch/decode. Could the confusion be that IRQ is level-triggered, not edge-triggered, and is usually asserted for a period, not a single cycle? So clearing I will cause an interrupt to occur immediately if one is already pending. It looks like on the PET the interrupt is held active until the CPU acknowledges it?
Also notice the semantics: SEI and CLI adjust the flag in the final cycle, while the decision on whether to jump to interrupt was made the cycle before. So if a SEI is the final thing executing when an interrupt comes in, you'll enter the interrupt routine with I set. If an interrupt is active when you hit a CLI, the processor will perform the instruction after the CLI before branching.
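In per-instruction emulator terms, that rule boils down to something like the sketch below; it's a coarse approximation of mine that ignores exact cycle timing, and the struct and names are not from any datasheet:

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint16_t pc;
    uint8_t  p;            /* status register; bit 2 is the I (interrupt disable) flag */
    bool     irq_line;     /* current level of the external /IRQ input (asserted = true) */
} cpu6502_t;

/* Decide, as an instruction finishes, whether the next step is an interrupt
 * entry instead of a fetch/decode.  i_at_sample is the I flag as it stood
 * *before* the instruction's final cycle, so a SEI or CLI that just executed
 * has not taken effect yet for this decision -- the one-instruction delay
 * described above.  Because /IRQ is level-sensitive, this keeps returning
 * true until the handler acknowledges the device and the line is released. */
static bool irq_should_fire(const cpu6502_t *c, bool i_at_sample)
{
    return c->irq_line && !i_at_sample;
}

The per-instruction loop would capture I before executing, run the instruction, call this, and on true push PC and P, set I, and load the vector from $FFFE/$FFFF instead of fetching the next opcode.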
I'm on a phone so it's difficult to assess more thoroughly than to offer those platitudes; I'll try to review properly later. Is any of that helpful?
I am interested in writing emulators for the Game Boy and other handheld consoles, but I read that the first step is to emulate the instruction set. I found a link here that said beginners should emulate the Commodore 64's 8-bit microprocessor; the thing is, I don't know a thing about emulating instruction sets. I know the MIPS instruction set, so I think I can manage understanding other instruction sets, but the problem is: what is meant by emulating them?
NOTE: If someone can provide me with a step-by-step guide to instruction set emulation for beginners, I would really appreciate it.
NOTE #2: I am planning to write in C.
NOTE #3: This is my first attempt at learning the whole emulation thing.
Thanks
EDIT: I found this site that is a detailed step-by-step guide to writing an emulator which seems promising. I'll start reading it, and hope it helps other people who are looking into writing emulators too.
Emulator 101
An instruction set emulator is a software program that reads binary data from a software device and carries out the instructions that data contains as if it were a physical microprocessor accessing physical data.
The Commodore 64 used the 6510, a close variant of the 6502 microprocessor. I wrote an emulator for this processor once. The first thing you need to do is read the datasheets on the processor and learn about its behavior. What sort of opcodes does it have? How does memory addressing work? What is its method of I/O? What are its registers? How does it start executing? These are all questions you need to be able to answer before you can write an emulator.
Here is a general overview of how it would look in C (not 100% accurate):
uint8_t RAM[65536];  //Declare a memory buffer for emulated RAM (64k)
uint16_t A;          //Declare Accumulator
uint16_t X;          //Declare X register
uint16_t Y;          //Declare Y register
uint16_t PC = 0;     //Declare Program counter, start executing at address 0
uint16_t FLAGS = 0;  //Start with all flags cleared
//Return 1 if the carry flag is set, 0 otherwise; in this example the 3rd bit is
//the carry flag (not true for the actual 6502)
#define CARRY_FLAG(flags) ((0x4 & (flags)) >> 2)
#define ADC 0x69
#define LDA 0xA9
#define ADC_SIZE 2   //Opcode byte plus one immediate operand byte
#define LDA_SIZE 2
//'executing' and UpdateFlags() are assumed to be defined elsewhere
while (executing) {
    switch (RAM[PC]) {  //Grab the opcode at the program counter
        case ADC:  //Add with carry
            A = A + RAM[PC+1] + CARRY_FLAG(FLAGS);
            UpdateFlags(A);
            PC += ADC_SIZE;
            break;
        case LDA:  //Load accumulator
            A = RAM[PC+1];
            UpdateFlags(A);
            PC += LDA_SIZE;
            break;
        default:
            //Invalid opcode!
            break;
    }
}
According to this reference, ADC actually has 8 opcodes in the 6502 processor, which means you will have 8 different ADC cases in your switch statement, one for each opcode and memory addressing scheme. You will have to deal with endianness and byte order, and of course pointers. I would get a solid understanding of pointers and type casting in C if you don't already have one. To manipulate the flags register you have to have a solid understanding of bitwise operations in C. If you are clever you can make use of C macros and even function pointers to save yourself some work, as in the CARRY_FLAG example above.
Every time you execute an instruction, you must advance the program counter by the size of that instruction, which is different for each opcode. Some opcodes don't take any arguments and so their size is just 1 byte, while others take 8- or 16-bit operands, as in my LDA example above. All this should be pretty well documented.
Branch and jump instructions (JMP, BEQ, BNE, etc.) are simple: if the relevant flag in the flags register has the required value, load the target address into the PC. This is how "decisions" are made in a microprocessor, and emulating them is simply a matter of changing the PC, just as the real microprocessor would do.
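For a concrete 6502 case, BNE (opcode 0xD0) takes a signed 8-bit displacement relative to the PC after the two-byte instruction. A hedged sketch in the same style as the code above; the function name and zero_flag parameter are mine:

#include <stdint.h>

/* BNE: branch if the zero flag is clear. Returns the new program counter. */
static uint16_t exec_bne(const uint8_t *ram, uint16_t pc, int zero_flag)
{
    int8_t offset = (int8_t)ram[pc + 1];     /* operand byte, sign-extended */
    pc += 2;                                 /* step over opcode + operand first */
    if (!zero_flag)
        pc = (uint16_t)(pc + offset);        /* taken: branch relative to the *next* instruction */
    return pc;
}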
The hardest part about writing an instruction set emulator is debugging. How do you know if everything is working like it should? There are plenty of resources for helping you. People have written test codes that will help you debug every instruction. You can execute them one instruction at a time and compare the reference output. If something is different, you know you have a bug somewhere and can fix it.
This should be enough to get you started. The important thing is that you have A) A good solid understanding of the instruction set you want to emulate and B) a solid understanding of low level data manipulation in C, including type casting, pointers, bitwise operations, byte order, etc.