6502 cycle timing per instruction - emulation

I am writing my first NES emulator in C. The goal is to make it easily understandable and cycle accurate (does not necessarily have to be code-efficient though), in order to play games at normal 'hardware' speed. When digging into the technical references of the 6502, it seems like the instructions consume more than one CPU cycle - and also has different cycles depending on given conditions (such as branching). My plan is to create read and write functions, and also group opcodes by addressing modes using a switch.
The question is: When I have a multiple-cycle instruction, such as a BRK, do I need to emulate what is exactly happening in each cycle:
#Method 1
cycle - action
1 - read BRK opcode
2 - read padding byte (ignored)
3 - store high byte of PC
4 - store low byte of PC
5 - store status flags with B flag set
6 - low byte of target address
7 - high byte of target address
...or can I just execute all the required operations in one 'cycle' (one switch case) and do nothing in the remaining cycles?
#Method 2
1 - read BRK opcode,
read padding byte (ignored),
store high byte of PC,
store low byte of PC,
store status flags with B flag set,
low byte of target address,
high byte of target address
2 - do nothing
3 - do nothing
4 - do nothing
5 - do nothing
6 - do nothing
7 - do nothing
Since both methods consume the desired 7 cycles, will there be no difference between the two? (accuracy-wise)
Personally I think method 1 is the way-to-go solution, however I cannot think of a proper, easy way to implement it... (Please help!)

Do you 'need' to? It depends on the software. Imagine the simplest example:
STA ($56), Y
... which happens to hit a hardware register. If you don't do at least the write on the correct cycle then you've introduced a timing deficiency. The register you're writing to will be written to at the wrong time. What if it's something like a palette register, and the programmer is running a raster effect? Then you've just moved where the colour changes. You've changed the graphical output.
In practice, clever programmers do much smarter things than that — e.g. one might use a read-modify-write operation to read a hardware value at an exact cycle, modify it, then write it back at some other exact cycle.
So my answer is:
most software isn't written so that the difference between (1) and (2) will have any effect; but
some definitely is, because the author was very clever; and
some definitely is, just because the author experimented until they found a cool effect, regardless of whether they were cognisant of the cause; and
in any case, when you find something that doesn't work properly on your emulator, how much time do you want to spend considering all the permutations and combinations of potential causes? Every one you can factor out is one less to consider.
Most emulators used to use your method (2). What normally happens is that they work with 90% of software. Then there's a few cases that don't work, for which the emulator author puts in a special case here, a special case there. Those usually ended up interacting poorly and the emulator spent the rest of its life oscillating between supporting different 95% combinations of available software until somebody wrote a better one.
So just go with method (1). It will cause some software that would otherwise be broken not to be so. Also it'll teach you more, and it'll definitely eliminate any potential motivation for special cases so it'll keep your code cleaner. It'll be marginally slower but I think your computer can probably handle it.
Other tips: the 6502 has only a few addressing modes, and the addressing mode entirely dictates the timing. This document is everything you need to know for perfect timing. If you want perfect cleanliness, your switch table can just pick an addressing mode and a central operation, then exit and you can branch on addressing mode to do the main action.
If you're going to use vanilla read and write methods, which is smart on a 6502 as every single cycle is either a read or a write so it's almost all you need to say, just be careful of the method signatures. For example, the 6502 has a SYNC pin which allows an observer to discriminate an ordinary read from an opcode read. Check whether the NES exposes that to cartridges, as it's often used on systems that expose it for implicit paging and the main identifying characteristic of the NES is that there are hundreds of paging schemes.
EDIT: minor updates:
it's not actually completely true to say that a 6502 always reads or writes; it also has an RDY input. If the RDY input is asserted and the 6502 intends to read, it will instead halt while maintaining the intended read address. Rarely used in practice because it's insufficient for common tasks like allowing somebody else to take possession of memory — the 6502 will write regardless of the RDY input, it's really meant to help with single-stepping — and seemingly not included on the NES cartridge pinout, you needn't implement it for that machine.
per the same pinout, the sync signal also doesn't seem to be exposed to cartridges on that system.

Related

Can a single byte instruction be executed while being only partially overwritten?

I have made an experiment in which a new thread executes a shellcode with this simple infinite loop:
NOP
JMP REL8 0xFE (-0x2)
This generate the following shellcode:
0x90, 0xEB, 0xFE
After this infinite loop there are other instructions ending by the overwriting of the destination byte back to -0x2 to make it an infinite loop again, and an absolute jump to send the thread back to this infinite loop.
Now I was asking myself if it was possible that the instruction of the jump was executed while the single byte of the destination is only partially overwritten by the other thread.
For example, let's say that the other thread overwrites the destination of the jump (0xFE, or 11111110 in binary) to 0x0 (00000000) to release the thread of this infinite loop.
Could it happen that the jump goes to let's say 0x1E (00011110) because the destination byte wasn't completely overwritten at that nanosecond?
Before asking this question here I have done the experiment myself in a C++ program and I have let it run for some hours without it never missing a single jump.
If you want to have a look at the code I made for this experiment I have uploaded it to GitHub
Accordingly to this experiment, it seems to be impossible that an instruction is executed while being only partially overwritten .
However, I have very little knowledge in assembly and in processors, this is for this reason that I ask the question here:
Can anyone confirm my observation please? Is it indeed impossible to have an instruction executed while being partially overwritten by another thread? Does anyone knows why for sure?
Thank you very much for your help and knowledge on that, I did not know where to look for such an information.
No, byte stores are always atomic on x86, even for cross-modifying code.
See Observing stale instruction fetching on x86 with self-modifying code for some links to Intel's manuals for cross modifying code. And maybe Reproducing Unexpected Behavior w/Cross-Modifying Code on x86-64 CPUs
Of course, all the recommendations for writing efficient cross-modifying code (and running code that you just JIT-compiled) involve avoiding stores into pages that other threads are currently executing.
Why are you doing this with "shellcode", anyway? Is this supposed to be part of an exploit? If not, why not just write code in asm like a normal person, with a label on the jmp instruction so you can store to it from C by assigning to extern char jmp_bytes[2]?
And if this is supposed to be an efficient cross-thread notification mechanism... it isn't. Spinning on a data load and a conditional branch with a pause loop would allow a lower latency exit from the loop than a self-modifying code machine nuke that flushes the whole pipeline right when you want it to finally be doing something useful instead of wasting CPU time. At least several times the delay of a simple branch miss.
Even better, use an OS-supported condition variable so the thread can sleep instead of heating up your CPU (reducing the thermal headroom for the CPU to turbo above its rated clock speed up when there is work to do).
The mechanism used by current CPUs is that if a store near the EIP/RIP or any instruction in flight in the pipeline is detected, it does a machine clear. (perf counter machine_clears.smc, aka machine nuke.) It doesn't even try to handle it "efficiently", but if you did a non-atomic store (e.g. actually two separate stores, or a store split across a cache line boundary) the target CPU core could see it in different parts and potentially decode it with some bytes updated and other bytes not. But a single byte is always updated atomically, so tearing within a byte is not possible.
However, x86 on paper doesn't guarantee that, but as Andy Glew (one of the architects of Intel's P6 microarchitecture family) says, implementing stronger behaviour than the paper spec can actually be the most efficient way to meet all the required guarantees and run fast. (And / or avoid breaking existing code in widely-used software!)

Is there a standard constant *nix benchmark, and if not, how to make a `bogobench`?

Ordinary single-threaded *nix programs can be benchmarked with utils like time, i.e.:
# how long does `seq` take to count to 100,000,000
/usr/bin/time seq 100000000 > /dev/null
Outputs:
1.16user 0.06system 0:01.23elapsed 100%CPU (0avgtext+0avgdata 1944maxresident)k
0inputs+0outputs (0major+80minor)pagefaults 0swaps
...but numbers returned are always system dependent, which in a sense also measures the user's hardware.
Is there some non-relative benchmarking method or command-line util which would return approximately the same virtual timing numbers on any system, (or at least a reasonably large subset of systems)? Just like grep -m1 bogo /proc/cpuinfo returns a roughly approximate but stable unit, such a benchmark should also return a somewhat similar unit of duration.
Suppose for benchmarking ordinary commands we have a magic util bogobench (where "bogo" is an adjective signifying "a somewhat bogus status", but not necessarily having algorithms in common with BogoMIPs):
bogobench foo bar.data
And we run this on two physically separate systems:
a 1996 Pentium II
a 2015 Xeon
Desired output would be something like:
21 bogo-seconds
So bogobench should return about the same number in both cases, even though it probably would finish in much less time on the 2nd system.
A hardware emulator like qemu might be one approach, but not necessarily the only approach:
Insert the code to benchmark into a wrapper script bogo.sh
Copy bogo.sh to a bootable Linux disk image bootimage.iso, within a directory where bogo.sh would autorun then promptly shutdown the emulator. During which it outputs some form of timing data to parse into bogo-seconds.
Run bootimage.iso using one of qemu's more minimal -machine options:
qemu-system-i386 -machine type=isapc bootimage.iso
But I'm not sure how to make qemu use a virtual clock, rather than the host CPU's clock, and qemu itself seems like a heavy tool for a seemingly simple task. (Really MAME or MESS would be more versatile emulators than qemu for such a task -- but I'm not adept with MAME, although MAME currently has some capacity for 80486 PC emulation.)
Online we sometimes compare and contrast timing-based benchmarks made on machine X with one made on machine Y. Whereas I'd like both user X and Y to be able to do their benchmark on a virtual machine Z, with bonus points for emulating X or Y (like MAME) if need be, except with no consideration of X or Y's real run-time, (unlike MAME where emulations are often playable). In this way users could report how programs perform in interesting cases without the programmer having to worry that the results were biased by idiosyncrasies of a user's hardware, such as CPU quirks, background processes hogging resources, etc.
Indeed, even on the user's own hardware, a time based benchmark can be unreliable, as often the user can't be sure some background process, (or bug, or hardware error like a bad sector, or virus), might not be degrading some aspect of performance. Whereas a more virtual benchmark ought to be less susceptible to such influences.
The only sane way I see to implement this is with a cycle-accurate simulator for some kind of hardware design.
AFAIK, no publicly-available cycle-accurate simulators for modern x86 hardware exist, because it's extremely complex and despite a lot of stuff being known about x86 microarchitecture internals (Agner Fog's stuff, Intel's and AMD's own optimization guides, and other stuff in the x86 tag wiki), enough of the behaviour is still a black box full of CPU-design trade-secrets that it's at best possible to simulate something similar. (E.g. branch prediction is definitely one of the most secret but highly important parts).
While it should be possible to come close to simulating Intel Sandybridge or Haswell's actual pipeline and out-of-order core / ROB / RS (at far slower than realtime), nobody has done it that I know of.
But cycle-accurate simulators for other hardware designs do exist: Donald Knuth's MMIX architecture is a clean RISC design that could actually be built in silicon, but currently only exists on paper.
From that link:
Of particular interest is the MMMIX meta-simulator, which is able to do dynamic scheduling of a complex pipeline, allowing superscalar execution with any number of functional units and with many varieties of caching and branch prediction, etc., including a detailed implementation of both hard and soft interrupts.
So you could use this as a reference machine for everyone to run their benchmarks on, and everyone could get comparable results that will tell you how fast something runs on MMIX (after compiling for MMIX with gcc). But not how fast it runs on x86 (presumably also compiling with gcc), which may differ by a significant factor even for two programs that do the same job a different way.
For [fastest-code] challenges over on the Programming puzzles and Code Golf site, #orlp created the GOLF architecture with a simulator that prints timing results, designed for exactly this purpose. It's a toy architecture with stuff like print to stdout by storing to 0xffffffffffffffff, so it's not necessarily going to tell you anything about how fast something will run on any real hardware.
There isn't a full C implementation for GOLF, AFAIK, so you can only really use it with hand-written asm. This is a big difference from MMIX, which optimizing compilers do target.
One practical approach that could (maybe?) be extended to be more accurate over time is to use existing tools to measure some hardware invariant performance metric(s) for the code under test, and then apply a formula to come up with your bogoseconds score.
Unfortunately most easily measurable hardware metrics are not invariant - rather, they depend on the hardware. An obvious one that should be invariant, however, would be "instructions retired". If the code is taking the same code paths every time it is run, the instructions retired count should be the same on all hardware1.
Then you apply some kind of nominal clock speed (let's say 1 GHz) and nominal CPI (let's say 1.0) to get your bogoseconds - if you measure 15e9 instructions, you output a result of 15 bogoseconds.
The primary flaw here is that the nominal CPI may be way off from the actual CPI! While most programs hover around 1 CPI, it's easy to find examples where they can approach 0.25 or whatever the inverse of the width is, or alternately be 10 or more if there are many lengthy stalls. Of course such extreme programs may be what you'd want to benchmark - and even if not you have the issue that if you are using your benchmark to evaluate code changes, it will ignore any improvements or regressions in CPI and look only at instruction count.
Still, it satisfies your requirement in as much as it effectively emulates a machine that executes exactly 1 instruction every cycle, and maybe it's a reasonable broad-picture approach. It is pretty easy to implement with tools like perf stat -e instructions (like one-liner easy).
To patch the holes then you could try to make the formula better - let's say you could add in a factor for cache misses to account for that large source of stalls. Unfortunately, how are you going to measure cache-misses in a hardware invariant way? Performance counters won't help - they rely on the behavior and sizes of your local caches. Well, you could use cachegrind to emulate the caches in a machine-independent way. As it turns out, cachegrind even covers branch prediction. So maybe you could plug your instruction count, cache miss and branch miss numbers into a better formula (e.g., use typical L2, L3, RAM latencies, and a typical cost for branch misses).
That's about as far as this simple approach will take you, I think. After that, you might as well just rip apart any of the existing x862 emulators and add your simple machine model right in there. You don't need to cycle accurate, just pick a nominal width and model it. Probably whatever underlying emulation cachegrind is going might be a good match and you get the cache and branch prediction modeling already for free.
1 Of course, this doesn't rule out bugs or inaccuracies in the instruction counting mechanism.
2 You didn't tag your question x86 - but I'm going to assume that's your target since you mentioned only Intel chips.

How can I prove __udelay() is working correctly on my ARM embedded system?

We have an ARM9 using the 3.2 kernel -- everything seems to work fine. Recently I was asked to add some code to add a 50ms pulse on some GPIO lines at startup. The pulse code is fine; I can see the lines go down and up, as expected. What does not work the way I expected is the udelay() function. Reading the docs makes me think the units are in microseconds, but as measured in the logic analyzer it was way too short. So I finally added this code to get 50ms.
// wait 50ms to be sure PCIE reset takes
for (i=0;i<6100;i++) // measured on logic analyzer - seems wrong to me!!
{
__udelay(2000); // 2000 is max
}
I don't like it, but it works fine. There are some odd constants and instructions in the udelay code. Can someone enlighten me as to how this is supposed to work? This code is called after all the clocks are initialized, so everything else seems ok.
According to Linus in this thread:
If it's about 1% off, it's all fine. If somebody picked a delay value
that is so sensitive to small errors in the delay that they notice
that - or even notice something like 5% - then they have picked too
short of a delay.
udelay() was never really meant to be some kind of precision
instrument. Especially with CPU's running at different frequencies,
we've historically had some rather wild fluctuation. The traditional
busy loop ends up being affected not just by interrupts, but also by
things like cache alignment (we used to inline it), and then later the
TSC-based one obviously depended on TSC's being stable (which they
weren't for a while).
So historically, we've seen udelay() being really off (ie 50% off
etc), I wouldn't worry about things in the 1% range.
Linus
So it's not going to be perfect. It's going to be off. By how much is dependent on a lot of factors. Instead of using a for loop, consider using mdelay instead. It might be a bit more accurate. From the O'Reilly Linux Device Drivers book:
The udelay call should be called only for short time lapses because
the precision of loops_per_second is only eight bits, and noticeable
errors accumulate when calculating long delays. Even though the
maximum allowable delay is nearly one second (since calculations
overflow for longer delays), the suggested maximum value for udelay is
1000 microseconds (one millisecond). The function mdelay helps in
cases where the delay must be longer than one millisecond.
It's also important to remember that udelay is a busy-waiting function
(and thus mdelay is too); other tasks can't be run during the time
lapse. You must therefore be very careful, especially with mdelay, and
avoid using it unless there's no other way to meet your goal.
Currently, support for delays longer than a few microseconds and
shorter than a timer tick is very inefficient. This is not usually an
issue, because delays need to be just long enough to be noticed by
humans or by the hardware. One hundredth of a second is a suitable
precision for human-related time intervals, while one millisecond is a
long enough delay for hardware activities.
Specifically the line "the suggested maximum value for udelay is 1000 microseconds (one millisecond)" sticks out at me since you state that 2000 is the max. From this document on inserting delays:
mdelay is macro wrapper around udelay, to account for possible
overflow when passing large arguments to udelay
So it's possible you're running into an overflow error. Though I wouldn't normally consider 2000 to be a "large argument".
But if you need real accuracy in your timing, you'll need to deal with the offset like you have, roll your own or use a different kernel. For information on how to roll your own delay function using assembler or using hard real time kernels, see this article on High-resolution timing.
See also: Linux Kernel: udelay() returns too early?

Question about cycle counting accuracy when emulating a CPU

I am planning on creating a Sega Master System emulator over the next few months, as a hobby project in Java (I know it isn't the best language for this but I find it very comfortable to work in, and as a frequent user of both Windows and Linux I thought a cross-platform application would be great). My question regards cycle counting;
I've looked over the source code for another Z80 emulator, and for other emulators as well, and in particular the execute loop intrigues me - when it is called, an int is passed as an argument (let's say 1000 as an example). Now I get that each opcode takes a different number of cycles to execute, and that as these are executed, the number of cycles is decremented from the overall figure. Once the number of cycles remaining is <= 0, the execute loop finishes.
My question is that many of these emulators don't take account of the fact that the last instruction to be executed can push the number of cycles to a negative value - meaning that between execution loops, one may end up with say, 1002 cycles being executed instead of 1000. Is this significant? Some emulators account for this by compensating on the next execute loop and some don't - which approach is best? Allow me to illustrate my question as I'm not particularly good at putting myself across:
public void execute(int numOfCycles)
{ //this is an execution loop method, called with 1000.
while (numOfCycles > 0)
{
instruction = readInstruction();
switch (instruction)
{
case 0x40: dowhatever, then decrement numOfCycles by 5;
break;
//lets say for arguments sake this case is executed when numOfCycles is 3.
}
}
After the end of this particular looping example, numOfCycles would be at -2. This will only ever be a small inaccuracy but does it matter overall in peoples experience? I'd appreciate anyone's insight on this one. I plan to interrupt the CPU after every frame as this seems appropriate, so 1000 cycles is low I know, this is just an example though.
Many thanks,
Phil
most emulators/simulators dealing just with CPU Clock tics
That is fine for games etc ... So you got some timer or what ever and run the simulation of CPU until CPU simulate the duration of the timer. Then it sleeps until next timer interval occurs. This is very easy to simulate. you can decrease the timing error by the approach you are asking about. But as said here for games is this usually unnecessary.
This approach has one significant drawback and that is your code works just a fraction of a real time. If the timer interval (timing granularity) is big enough this can be noticeable even in games. For example you hit a Keyboard Key in time when emulation Sleeps then it is not detected. (keys sometimes dont work). You can remedy this by using smaller timing granularity but that is on some platforms very hard. In that case the timing error can be more "visible" in software generated Sound (at least for those people that can hear it and are not deaf-ish to such things like me).
if you need something more sophisticated
For example if you want to connect real HW to your emulation/simulation then you need to emulate/simulate BUS'es. Also things like floating bus or contention of system is very hard to add to approach #1 (it is doable but with big pain).
If you port the timings and emulation to Machine cycles things got much much easier and suddenly things like contention or HW interrupts, floating BUS'es are solving themselves almost on their own. I ported my ZXSpectrum Z80 emulator to this kind of timing and see the light. Many things get obvious (like errors in Z80 opcode documentation, timings etc). Also the contention got very simple from there (just few lines of code instead of horrible decoding tables almost per instruction type entry). The HW emulation got also pretty easy I added things like FDC controlers AY chips emulations to the Z80 in this way (no hacks it really runs on their original code ... even Floppy formating :)) so no more TAPE Loading hacks and not working for custom loaders like TURBO
To make this work I created my emulation/simulation of Z80 in a way that it uses something like microcode for each instruction. As I very often corrected errors in Z80 instruction set (as there is no single 100% correct doc out there I know of even if some of them claim that they are bug free and complete) I come with a way how to deal with it without painfully reprogramming the emulator.
Each instruction is represented by an entry in a table, with info about timing, operands, functionality... Whole instruction set is a table of all theses entries for all instructions. Then I form a MySQL database for my instruction set. and form similar tables to each instruction set I found. Then painfully compared all of them selecting/repairing what is wrong and what is correct. The result is exported to single text file which is loaded at emulation startup. It sound horrible but In reality it simplifies things a lot even speedup the emulation as the instruction decoding is now just accessing pointers. The instruction set data file example can be found here What's the proper implementation for hardware emulation
Few years back I also published paper on this (sadly institution that holds that conference does not exist anymore so servers are down for good on those old papers luckily I still got a copy) So here image from it that describes the problematics:
a) Full throtlle has no synchronization just raw speed
b) #1 has big gaps causing HW synchronization problems
c) #2 needs to sleep a lot with very small granularity (can be problematic and slow things down) But the instructions are executed very near their real time ...
Red line is the host CPU processing speed (obviously what is above it take a bit more time so it should be cut and inserted before next instruction but it would be hard to draw properly)
Magenta line is the Emulated/Simulated CPU processing speed
alternating green/blue colors represent next instruction
both axises are time
[edit1] more precise image
The one above was hand painted... This one is generated by VCL/C++ program:
generated by these parameters:
const int iset[]={4,6,7,8,10,15,21,23}; // possible timings [T]
const int n=128,m=sizeof(iset)/sizeof(iset[0]); // number of instructions to emulate, size of iset[]
const int Tps_host=25; // max possible simulation speed [T/s]
const int Tps_want=10; // wanted simulation speed [T/s]
const int T_timer=500; // simulation timer period [T]
so host can simulate at 250% of wanted speed and simulation granularity is 500T. Instructions where generated pseudo-randomly...
Was a quite interesting article on Arstechnica talking about console simulation recently, also links to quite a few simulators that might make for quite good research:
Accuracy takes power: one man's 3GHz quest to build a perfect SNES emulator
The relevant bit is that the author mentions, and I am inclined to agree, that most games will appear to function pretty correctly even with timing deviations of +/-20%. The issue you mention looks likely to never really introduce more than a fraction of a percent timing error, which is probably imperceptible whilst playing the final game. The authors probably didn't consider it worth dealing with.
I guess that depends on how accurate you want your emulator to be. I do not think that it has to be that accurate. Think emulation of x86 platform, there are so many variants of processors and each has different execution latencies and issue rates.

6502 CPU Emulation

It's the weekend, so I relax from spending all week programming by writing a hobby project.
I wrote the framework of a MOS 6502 CPU emulator yesterday, the registers, stack, memory and all the opcodes are implemented. (Link to source below)
I can manually run a series of operations in the debugger I wrote, but I'd like to load a NES rom and just point the program counter at its instructions, I figured that this would be the fastest way to find flawed opcodes.
I wrote a quick NES rom loader and loaded the ROM banks into the CPU memory.
The problem is that I don't know how the opcodes are encoded. I know that the opcodes themselves follow a pattern of one byte per opcode that uniquely identifies the opcode,
0 - BRK
1 - ORA (D,X)
2 - COP b
etc
However I'm not sure where I'm supposed to find the opcode argument. Is it the the byte directly following? In absolute memory, I suppose it might not be a byte but a short.
Is anyone familiar with this CPU's memory model?
EDIT: I realize that this is probably shot in the dark, but I was hoping there were some oldschool Apple and Commodore hackers lurking here.
EDIT: Thanks for your help everyone. After I implemented the proper changes to align each operation the CPU can load and run Mario Brothers. It doesn't do anything but loop waiting for Start, but its a good sign :)
I uploaded the source:
https://archive.codeplex.com/?p=cpu6502
If anyone has ever wondered how an emulator works, its pretty easy to follow. Not optimized in the least, but then again, I'm emulating a CPU that runs at 2mhz on a 2.4ghz machine :)
The opcode takes one byte, and the operands are in the following bytes. Check out the byte size column here, for instance.
If you look into references like http://www.atarimax.com/jindroush.atari.org/aopc.html, you will see that each opcode has an encoding specified as:
HEX LEN TIM
The HEX is your 1-byte opcode. Immediately following it is LEN bytes of its argument. Consult the reference to see what those arguments are. The TIM data is important for emulators - it is the number of clock cycles this instruction takes to execute. You will need this to get your timing correct.
These values (LEN, TIM) are not encoded in the opcode itself. You need to store this data in your program loader/executer. It's just a big lookup table. Or you can define a mini-language to encode the data and reader.
This book might help: http://www.atariarchives.org/mlb/
Also, try examing any other 6502 aseembler/simulator/debugger out there to see how Assembly gets coded as Machine Language.
The 6502 manuals are on the Web, at various history sites. The KIM-1 shipped with them. Maybe more in them than you need to know.
This is better - 6502 Instruction Set matrix:
https://www.masswerk.at/6502/6502_instruction_set.html
The apple II roms included a dissassembler, I think that's what it was called, and it would show you in a nice format the hex opcodes and the 3 character opcode and the operands.
So given how little memory was available, they managed to shove in the operand byte count (always 0, 1 or 2) the 3 character opcode for the entire 6502 instruction set into a really small space, because there's really not that much of it.
If you can dig up an apple II rom, you can just cut and paste from there...
The 6502 has different addressing modes, the same instruction has several different opcodes depending on it's addressing mode. Take a look at the following links which describes the different ways a 6502 can retrieve data from memory, or directly out of ROM.
http://obelisk.me.uk/6502/addressing.html#IMM

Resources