How do I keep RISC-V compliance?

How do I keep RISC-V compliance? - riscv

I just had a discussion with a colleague about what RISC-V compliance actually means. We discussed the following topics in detail:
As far as I understood the idea, a processor is RISC-V compliant as long as it implements a RISC-V base instruction set and optionally one or more of the standard extensions. Entirely, not just partly. One might even define and implement own instructions (as brownfield or greenfield extensions) as long as they do not touch the base instruction set or any of the standard extensions. Guaranteeing this, the machine code generated by any RISC-V compliant compiler would run on my machine. That's the whole point of it, right?
The RISC-V ISA does not intend delayed branches. My understanding is that the definition whether branches are delayed or not is already part of the ISA and not a matter of implementation. Is this correct?
Assume that one wants to use RISC-V with delayed branches. Whether this is a good idea or not, let's just focus on the compliance question. In my opinion it were no longer RISC-V compliant to define and implement some of the existing branch/jump instructions of the base instruction set as delayed branches. The compilation of a RISC-V compliant compiler would no longer work on such a machine. One would be free to define own delayed branch instructions instead. Of course, as with any self-written extension, it cannot be expected that an arbitrary compiler would use such an instruction. Am I right?
According to the RISC-V specification, "the program counter pc holds the address of the current instruction." My interpretation of this sentence is that any jump/branch instruction refers to the address at which it is stored. Again, independent of the implementation. Example: Assume an implementation where the jump/branch instruction is executed a few cycles after it has been fetched. This would mean that PC has potentially increased already. It is therefore the implementation's task to somehow store the address of the jump/branch instruction. It is not the compiler's task to know about this delay and compensate for it by modifying the immediate that is to be added to the PC. Am I summarizing this correctly?
So, in a nutshell the short version of my questions:
Does RISC-V compliance mean that base integer instruction set and standard extension must neither be changed nor stripped?
Is the information whether a branch is delayed or not already part of the ISA?
Is the PC of RISC-V considered agnostic to any pipeline delay?
I consider an ISA in general to be agnostic to any implementation specifics. The counter-argument to what I claim is that one would have to tell the compiler about implementation specifics (delayed branches, PC behavior etc.) and that this could still be considered compliant with the ISA.

I am not an expert, but have implemented a few cores during the last 20 years. The key concept in your three part-questions are completeness and user visibility. To claim completeness means, in my opinion, that no part of a standard can be changed, nor stripped. However, it is a rare standard indeed if it has no dubious points and sections that may be interpreted differently by different people. In the specific case of RISC-V I would like to point to an aid to indicate compliance, if you have not seen it already.
It would be good to have some real experts answering this question.
Does RISC-V compliance mean that base integer instruction set and standard extension must neither be changed nor stripped?
I have the same understanding as you. It does not make sense to claim a behaviour as defined in a standard, and then not honouring that standard.
Is the information whether a branch is delayed or not already part of the ISA?
Again I concur with you. Delayed branches is an exposed feature for users of a processor. Hence an ISA must specify the eventual existence of such branches, indeed, from page 15 of riscv-spec-v2.2.pdf:
"Control transfer instructions in RV32I do not have architecturally visible delay slots."
Notice the wording, as long as your implementation does not expose any delay slot to a user, you can do as you want. And with a non-standard extension you are perfectly free to design instructions that have delay slots, you may even put RV32I instructions in those slots.
Is the PC of RISC-V considered agnostic to any pipeline delay?
Yes.

Related

Can a programming language do anthing outside of what the OS provides?

I am trying to gain a more holistic, general, and high-level understanding of programs and programming languages.
I would like to understand how they actually function. I understand at the lowest level is machine code which is 0 and 1s. Then you have assembly. Then you have another high level language where every instruction/function/method/call/routine whatever you want to call it maps to some instruction or group of instructions in assembly right? The higher level language cannot provide or do anything outside of what the lower level language assembly provides correct?
Similarly, since all code runs on an OS, that code can only do things that the OS provides. It is impossible for the code to do anything outside of what the OS actually provides correct?

The computer has an instruction set, machine code, which defines what can be done on the computer. Assembly code is essentially a more convenient representation of that, so assembly code can do anything the machine can do. A higher-level programming language has to run on the machine, so it can't do anything the machine can't do, though it may well be able to express it more conveniently (e.g. print "foo" rather than several dozen machine-code instructions). It is a matter of implementation choice whether the compiler for that programming language generates machine code directly, or assembly code, or any other form that might need further processing.
This brings us to the question of whether it's possible for a program (regardless of what it's written in) to do something the OS does not explicitly provide for. I find this an odd way to express it, since the point of writing a program is to give you some capability that you didn't have before, so in some sense you only write programs for things the OS does not explicitly provide you with. The problem lies in defining what the OS "provides". If it's a general-purpose OS, then its designers likely intend to "provide" the ability for you to write a wide range of programs. The OS may choose to provide some convenient feature (say, the ability to create files) but if it did not provide such a feature, you could perhaps do it yourself, given suitable motivation (and, for the file creation example, the ability to do disk I/O - possibly requiring you to write a disk driver too).

I am trying to gain a more holistic, general, and high-level
understanding of programs and programming languages.
I would like to understand how they actually function.
I recommend working through the understanding of modern hardware to get good performance and energy efficiency with the following as example:
With subword parallelism to improve matrix multiply by a factor of 4.
Doubling the performance by unrolling the loop to demonstrate the value of instruction level parallelism.
Doubling performance again by optimizing for caches using
blocking.
Finally, A speedup of 14 from 16 processors by using thread-level parallelism.
All four optimizations in total add just 24 lines of C code to a matrix multiply example you find in
Computer Organization and Design: The Hardware Software Interface (5th ed.)
or similar books.
To make a point, it really is worth it to dig deeper than just "learning python" even if it is a good start. So understanding low level really effects high level programming in many ways and thats what you are after according to your question.
There is actually not just Assembly - there are hardware description languages as VHDL thats handled in this topic f.e.:
https://electronics.stackexchange.com/questions/132611/whats-the-motivation-in-using-verilog-or-vhdl-over-c

Why is return-to-libc much earlier than return-to-user?

I'm really new to this topic and only know some basic concepts. Nevertheless, there is a question that quite confuses me.
Solar Designer proposed the idea of return-to-libc in 1997 (http://seclists.org/bugtraq/1997/Aug/63). Return-to-user, from my understanding, didn't become popular until 2007 (http://seclists.org/dailydave/2007/q1/224).
However, return-to-user seems much easier than return-to-libc. So my question is, why did hackers spend so much effort in building a gadget chain by using libc, rather than simply using their own code in the user space when exploiting a kernel vulnerability?
I don't believe that they did not realize there are NULL pointer dereference vulnerabilities in the kernel.

Great question and thank you for taking some time to research the topic before making a post that causes it's readers to lose hope in the future of mankind (you might be able to tell I've read a few [exploit] tags tonight)
Anyway, there are two reasons
return-to-libc is generic, provided you have the offsets. There is no need to programmatically or manually build either a return chain or scrape existing user functionality.
Partially because of linker-enabled relocations and partly because of history, the executable runtime of most programs executing on a Linux system essentially demand the libc runtime, or at least a runtime that can correctly handle _start and cons. This formula still stands on Windows, just under the slightly different paradigm of return-to-kernel32/oleaut/etc but can actually be more immediately powerful, particularly for shellcode with length requirements, for reasons relating to the way in which system calls are invoked indirectly by kernel32-SSDT functions
As a side-note, If you are discussing NULL pointer dereferences in the kernel, you may be confusing return to vDSO space, which is actually subject to a different set of constraints that standard "mprotect and roll" userland does not.

Ret2libc or ROP is a technique you deploy when you cannot return to your shellcode (jmp esp/whatever) because of memory protections (DEP/NX).

They perform different goals, and you may engage with both.
Return to libc
This is a buffer overrun exploit where you have control over some input, and you know there exists a badly written program which will copy your input onto the stack and return through it. This allows you to start executing code on a machine you have not got a login to, or a very restrictive access.
Return to libc gives you the ability to control what code is executed, and would generally result in you running within the space, a more natural piece of code.
return from kernel
This is a privilege escalation attack. It takes code which is running on the machine, and exploits bugs in the kernel causing the kernel privileges to be applied to the user process.
Thus you may use return-to-libc to get your code running in a web-browser, and then return-to-user to perform restricted tasks.
Earlier shellcode
Before return-to-libc, there was straight shellcode, where the buffer overrun included the exploit code, with knowledge about where the stack would be, this was possible to run directly. This became somewhat obsolete with the NX bit in windows. (x86 hardware was capable of having execute on a segment, but not a page).
Why is return-to-libc much earlier than return-to-user?
The goal in these attacks is to own the machine, frequently, it was easy to find a vulnerability in a user-mode process which gave you access to what you needed, it is only with the hardening of the system by reducing the privilege at the boundary, and fixing the bugs in important programs, that the new return-from-kernel became necessary.

Are extended instruction sets (SSE, MMX) used in Linux kernel?

Well, they bring (should bring at least) great increase in performance, isn’t it?
So, I haven’t seen any Linux kernel sources, but ‘d love to ask: are they used somehow? (In this case – there must be some special “code-cap” for system that has no such instructions?)

The SSE and MMX instruction sets have limited value outside of audio/video and gaming work. You might find a few explicit uses in dark corners of the kernel, but I wouldn't count on it. The answer in the general case is "no, they are not used", nor are they used in most non-kernel/userspace applications.
The kernel does sometimes optionally use certain x86 instructions that are specific to certain CPUs (e.g. present on some AMD or Intel models but not all, nor vice-versa), such as syscall, but these are different from the SIMD instruction sets you're referring to, and are not part of some wider set of similarly-themed extensions.
After Mark's answer, I went looking. The only place I could easily identify them being used is in the RAID 6 library (which also has support for AltiVec, which is the PowerPC SIMD instruction set).
(Be wary just grepping the tree, there are a lot of spots where the kernel "knows" about SSE/MMX to support user-space applications, but isn't actually using it. Also a couple cases of unfortunate variable names that have absolutely nothing to do with SSE, e.g. in the SCTP implementation.)

There are severe restrictions on using vector registers and floating point registers in kernel code. See e.g. chapter 6.3 of "Calling conventions for different C++ compilers and operating systems". http://www.agner.org/optimize/#manuals

They are used in the kernel for a few things, such as
Software RAID
Encryption (possibly)
However, I believe it always checks their presence first.

"cpu simd instructions use FPU"
erm, no, not as I understand it. They're in part a modern and (much) more efficient replacement for FPU instructions, but a large part of the SIMD instruction set deals with integer operations.
I've never looked very hard at it, but I suppose (ok, hope) that SIMD code generated by a recent gcc version will not clobber any registers or state.

Bare metal cross compilers input

What are the input limitations of a bare metal cross compiler...as in does it not compile programs with pointers or mallocs......or anything that would require more than the underlying hardware....also how can 1 find these limitations..
I also wanted to ask...I built a cross compiler for target mips..i need to create a mips executable using this cross compiler...but i am not able to find where the executable is...as in there is 1 executable which i found mipsel-linux-cpp which is supposed to compile,assemble and link and then produce a.out but it is not doing so...
However the ./cc1 gives a mips assembly.......
There is an install folder which has a gcc executable which uses i386 assembly and then gives an exe...i dont understand how can the gcc exe give i386 and not mips assembly when i have specified target as mips....
please help im really not able to understand what is happ...
I followed the foll steps..
1. Installed binutils 2.19
2. configured gcc for mips..(g++,core)

I would suggest that you should have started two separate questions.
The GNU toolchain does not have any OS dependencies, but the GNU library does. Most bare-metal cross builds of GCC use the Newlib C library which provides a set of syscall stubs that you must map to your target yourself. These stubs include low-level calls necessary to implement stream I/O and heap management. They can be very simple or very complex depending on your needs. If the only I/O support is to a UART to stdin/stdout/stderr, then it is simple. You don't have to implement everything, but if you do not implement teh I/O stubs, you won't be able to use printf() for example. You must implement the sbrk()/sbrk_r() syscall is you want malloc() to work.
The GNU C++ library will work correctly with Newlib as its underlying library. If you use C++, the C runtime start-up (usually crt0.s) must include the static initialiser loop to invoke the constructors of any static objects that your code may include. The run-time start-up must also of course initialise the processor, clocks, SDRAM controller, timers, MMU etc; that is your responsibility, not the compiler's.
I have no experience of MIPS targets, but the principles are the same for all processors, there is a very useful article called "Building Bare Metal ARM with GNU" which you may find helpful, much of it will be relevant - especially porting the parts regarding implementing Newlib stubs.
Regarding your other question, if your compiler is called mipsel-linux-cpp, then it is not a 'bare-metal' build but rather a Linux build. Also this executable does not really "compile, assemble and link", it is rather a driver that separately calls the pre-processor, compiler, assembler and linker. It has to be configured correctly to invoke the cross-tools rather than the host tools. I generally invoke the linker separately in order to enforce decisions about which standard library to link (-nostdlib), and also because it makes more sense when a application is comprised of multiple execution units. I cannot offer much help other than that here since I have always used GNU-ARM tools built by people with obviously more patience than me, and moreover hosted on Windows, where there is less possibility of the host tool-chain being invoked instead (one reason why I have also avoided those tool-chains that rely on Cygwin)

EDIT
With more time available, I have rewritten my original answer in an attempt to provide something more useful.
I cannot provide a specific answer for your question. I have never tried to get code running on a MIPS machine. What I do have is plenty of experience getting a variety of "bare metal" boards up and running. All kinds of CPUs and all kinds of compilers and cross compilers. So I have an understanding of the principles that apply in all such situations. I will point out the kind of knowledge you will need to absorb before you can hope to succeed with a job like this, and hopefully I can list some links to resources to get you started on learning that knowledge.
I am worried you don't know that pointers are exactly the kind of thing a bare metal compiler can handle, they are a basic machine primitive. This tells me you are probably not an expert embedded developer who is just stuck in this particular scenario. Never mind. There isn't anything magic about programming an embedded system, and you can learn what you need to know.
The first step is getting to understand the relationship between C and the machine you wish to run code on. Basically C is a portable assembly language. This means that C is good for manipulating the basic operations of the machine. In this sense the basic operations of the machine are reading and writing memory locations, performing arithmetic and boolean operations on the data read from memory, and making branching and looping decisions based on that data. In particular the C concept of pointers allows you to manipulate data at locations in memory that you specify.
So far so good, but just doing raw computations in memory is not usually enough - you need a way to input and output data from memory. To do that you need to manipulate the hardware peripherals on your board. If the hardware peripherals are memory mapped then the machine registers used to control the peripherals look exactly like memory locations and C can manipulate them directly. Even in that case though, it is much more likely that doing useful I/O is best handled by extending the C core language with a library of routines provided just for that purpose. These library routines handle all the nasty details (timers, interrupts, non-memory mapped I/O) involved in manipulating the peripheral hardware on the board, and wrap them up with a convenient C function call interface. The idea is that you can go simply printf("hello world"); and the library call take care of the details of displaying the string.
An appropriately skilled developer knows how to adapt an existing I/O library to a new board, or how to develop new library routines to provide access to non-standard custom hardware. The classic way to develop these skills is to start with something simple, usually a LED for an output device, and a switch for an input device. Write a program that pulses a LED in a predictable way, or reads a switch and reflects in on a LED. The first time you get this working will be hugely satisfying.
Okay I have rambled enough. It is time to provide some more resources for you to study. The good news is that there's never been a better time to learn how things work at the interface between hardware and software. There is a wealth of freely available code and docs. Stackoverflow is a great resource as you know. Good luck! Links follow;
Embedded systems overview
Knowing the C language well is fundamental
Why not get your code working on a simulator before you try real hardware
Another emulated environment
Linux device drivers - an overlapping subject
Another book about bare metal programming

How do emulators work and how are they written? [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 9 years ago.
Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
How do emulators work? When I see NES/SNES or C64 emulators, it astounds me.
Do you have to emulate the processor of those machines by interpreting its particular assembly instructions? What else goes into it? How are they typically designed?
Can you give any advice for someone interested in writing an emulator (particularly a game system)?

Emulation is a multi-faceted area. Here are the basic ideas and functional components. I'm going to break it into pieces and then fill in the details via edits. Many of the things I'm going to describe will require knowledge of the inner workings of processors -- assembly knowledge is necessary. If I'm a bit too vague on certain things, please ask questions so I can continue to improve this answer.
Basic idea:
Emulation works by handling the behavior of the processor and the individual components. You build each individual piece of the system and then connect the pieces much like wires do in hardware.
Processor emulation:
There are three ways of handling processor emulation:
Interpretation
Dynamic recompilation
Static recompilation
With all of these paths, you have the same overall goal: execute a piece of code to modify processor state and interact with 'hardware'. Processor state is a conglomeration of the processor registers, interrupt handlers, etc for a given processor target. For the 6502, you'd have a number of 8-bit integers representing registers: A, X, Y, P, and S; you'd also have a 16-bit PC register.
With interpretation, you start at the IP (instruction pointer -- also called PC, program counter) and read the instruction from memory. Your code parses this instruction and uses this information to alter processor state as specified by your processor. The core problem with interpretation is that it's very slow; each time you handle a given instruction, you have to decode it and perform the requisite operation.
With dynamic recompilation, you iterate over the code much like interpretation, but instead of just executing opcodes, you build up a list of operations. Once you reach a branch instruction, you compile this list of operations to machine code for your host platform, then you cache this compiled code and execute it. Then when you hit a given instruction group again, you only have to execute the code from the cache. (BTW, most people don't actually make a list of instructions but compile them to machine code on the fly -- this makes it more difficult to optimize, but that's out of the scope of this answer, unless enough people are interested)
With static recompilation, you do the same as in dynamic recompilation, but you follow branches. You end up building a chunk of code that represents all of the code in the program, which can then be executed with no further interference. This would be a great mechanism if it weren't for the following problems:
Code that isn't in the program to begin with (e.g. compressed, encrypted, generated/modified at runtime, etc) won't be recompiled, so it won't run
It's been proven that finding all the code in a given binary is equivalent to the Halting problem
These combine to make static recompilation completely infeasible in 99% of cases. For more information, Michael Steil has done some great research into static recompilation -- the best I've seen.
The other side to processor emulation is the way in which you interact with hardware. This really has two sides:
Processor timing
Interrupt handling
Processor timing:
Certain platforms -- especially older consoles like the NES, SNES, etc -- require your emulator to have strict timing to be completely compatible. With the NES, you have the PPU (pixel processing unit) which requires that the CPU put pixels into its memory at precise moments. If you use interpretation, you can easily count cycles and emulate proper timing; with dynamic/static recompilation, things are a /lot/ more complex.
Interrupt handling:
Interrupts are the primary mechanism that the CPU communicates with hardware. Generally, your hardware components will tell the CPU what interrupts it cares about. This is pretty straightforward -- when your code throws a given interrupt, you look at the interrupt handler table and call the proper callback.
Hardware emulation:
There are two sides to emulating a given hardware device:
Emulating the functionality of the device
Emulating the actual device interfaces
Take the case of a hard-drive. The functionality is emulated by creating the backing storage, read/write/format routines, etc. This part is generally very straightforward.
The actual interface of the device is a bit more complex. This is generally some combination of memory mapped registers (e.g. parts of memory that the device watches for changes to do signaling) and interrupts. For a hard-drive, you may have a memory mapped area where you place read commands, writes, etc, then read this data back.
I'd go into more detail, but there are a million ways you can go with it. If you have any specific questions here, feel free to ask and I'll add the info.
Resources:
I think I've given a pretty good intro here, but there are a ton of additional areas. I'm more than happy to help with any questions; I've been very vague in most of this simply due to the immense complexity.
Obligatory Wikipedia links:
Emulator
Dynamic recompilation
General emulation resources:
Zophar -- This is where I got my start with emulation, first downloading emulators and eventually plundering their immense archives of documentation. This is the absolute best resource you can possibly have.
NGEmu -- Not many direct resources, but their forums are unbeatable.
RomHacking.net -- The documents section contains resources regarding machine architecture for popular consoles
Emulator projects to reference:
IronBabel -- This is an emulation platform for .NET, written in Nemerle and recompiles code to C# on the fly. Disclaimer: This is my project, so pardon the shameless plug.
BSnes -- An awesome SNES emulator with the goal of cycle-perfect accuracy.
MAME -- The arcade emulator. Great reference.
6502asm.com -- This is a JavaScript 6502 emulator with a cool little forum.
dynarec'd 6502asm -- This is a little hack I did over a day or two. I took the existing emulator from 6502asm.com and changed it to dynamically recompile the code to JavaScript for massive speed increases.
Processor recompilation references:
The research into static recompilation done by Michael Steil (referenced above) culminated in this paper and you can find source and such here.
Addendum:
It's been well over a year since this answer was submitted and with all the attention it's been getting, I figured it's time to update some things.
Perhaps the most exciting thing in emulation right now is libcpu, started by the aforementioned Michael Steil. It's a library intended to support a large number of CPU cores, which use LLVM for recompilation (static and dynamic!). It's got huge potential, and I think it'll do great things for emulation.
emu-docs has also been brought to my attention, which houses a great repository of system documentation, which is very useful for emulation purposes. I haven't spent much time there, but it looks like they have a lot of great resources.
I'm glad this post has been helpful, and I'm hoping I can get off my arse and finish up my book on the subject by the end of the year/early next year.

A guy named Victor Moya del Barrio wrote his thesis on this topic. A lot of good information on 152 pages. You can download the PDF here.
If you don't want to register with scribd, you can google for the PDF title, "Study of the techniques for emulation programming". There are a couple of different sources for the PDF.

Emulation may seem daunting but is actually quite easier than simulating.
Any processor typically has a well-written specification that describes states, interactions, etc.
If you did not care about performance at all, then you could easily emulate most older processors using very elegant object oriented programs. For example, an X86 processor would need something to maintain the state of registers (easy), something to maintain the state of memory (easy), and something that would take each incoming command and apply it to the current state of the machine. If you really wanted accuracy, you would also emulate memory translations, caching, etc., but that is doable.
In fact, many microchip and CPU manufacturers test programs against an emulator of the chip and then against the chip itself, which helps them find out if there are issues in the specifications of the chip, or in the actual implementation of the chip in hardware. For example, it is possible to write a chip specification that would result in deadlocks, and when a deadline occurs in the hardware it's important to see if it could be reproduced in the specification since that indicates a greater problem than something in the chip implementation.
Of course, emulators for video games usually care about performance so they don't use naive implementations, and they also include code that interfaces with the host system's OS, for example to use drawing and sound.
Considering the very slow performance of old video games (NES/SNES, etc.), emulation is quite easy on modern systems. In fact, it's even more amazing that you could just download a set of every SNES game ever or any Atari 2600 game ever, considering that when these systems were popular having free access to every cartridge would have been a dream come true.

I know that this question is a bit old, but I would like to add something to the discussion. Most of the answers here center around emulators interpreting the machine instructions of the systems they emulate.
However, there is a very well-known exception to this called "UltraHLE" (WIKIpedia article). UltraHLE, one of the most famous emulators ever created, emulated commercial Nintendo 64 games (with decent performance on home computers) at a time when it was widely considered impossible to do so. As a matter of fact, Nintendo was still producing new titles for the Nintendo 64 when UltraHLE was created!
For the first time, I saw articles about emulators in print magazines where before, I had only seen them discussed on the web.
The concept of UltraHLE was to make possible the impossible by emulating C library calls instead of machine level calls.

Something worth taking a look at is Imran Nazar's attempt at writing a Gameboy emulator in JavaScript.

Having created my own emulator of the BBC Microcomputer of the 80s (type VBeeb into Google), there are a number of things to know.
You're not emulating the real thing as such, that would be a replica. Instead, you're emulating State. A good example is a calculator, the real thing has buttons, screen, case etc. But to emulate a calculator you only need to emulate whether buttons are up or down, which segments of LCD are on, etc. Basically, a set of numbers representing all the possible combinations of things that can change in a calculator.
You only need the interface of the emulator to appear and behave like the real thing. The more convincing this is the closer the emulation is. What goes on behind the scenes can be anything you like. But, for ease of writing an emulator, there is a mental mapping that happens between the real system, i.e. chips, displays, keyboards, circuit boards, and the abstract computer code.
To emulate a computer system, it's easiest to break it up into smaller chunks and emulate those chunks individually. Then string the whole lot together for the finished product. Much like a set of black boxes with inputs and outputs, which lends itself beautifully to object oriented programming. You can further subdivide these chunks to make life easier.
Practically speaking, you're generally looking to write for speed and fidelity of emulation. This is because software on the target system will (may) run more slowly than the original hardware on the source system. That may constrain the choice of programming language, compilers, target system etc.
Further to that you have to circumscribe what you're prepared to emulate, for example its not necessary to emulate the voltage state of transistors in a microprocessor, but its probably necessary to emulate the state of the register set of the microprocessor.
Generally speaking the smaller the level of detail of emulation, the more fidelity you'll get to the original system.
Finally, information for older systems may be incomplete or non-existent. So getting hold of original equipment is essential, or at least prising apart another good emulator that someone else has written!

Yes, you have to interpret the whole binary machine code mess "by hand". Not only that, most of the time you also have to simulate some exotic hardware that doesn't have an equivalent on the target machine.
The simple approach is to interpret the instructions one-by-one. That works well, but it's slow. A faster approach is recompilation - translating the source machine code to target machine code. This is more complicated, as most instructions will not map one-to-one. Instead you will have to make elaborate work-arounds that involve additional code. But in the end it's much faster. Most modern emulators do this.

When you develop an emulator you are interpreting the processor assembly that the system is working on (Z80, 8080, PS CPU, etc.).
You also need to emulate all peripherals that the system has (video output, controller).
You should start writing emulators for the simpe systems like the good old Game Boy (that use a Z80 processor, am I not not mistaking) OR for C64.

Emulator are very hard to create since there are many hacks (as in unusual
effects), timing issues, etc that you need to simulate.
For an example of this, see http://queue.acm.org/detail.cfm?id=1755886.
That will also show you why you ‘need’ a multi-GHz CPU for emulating a 1MHz one.

Also check out Darek Mihocka's Emulators.com for great advice on instruction-level optimization for JITs, and many other goodies on building efficient emulators.

I've never done anything so fancy as to emulate a game console but I did take a course once where the assignment was to write an emulator for the machine described in Andrew Tanenbaums Structured Computer Organization. That was fun an gave me a lot of aha moments. You might want to pick that book up before diving in to writing a real emulator.

Advice on emulating a real system or your own thing?
I can say that emulators work by emulating the ENTIRE hardware. Maybe not down to the circuit (as moving bits around like the HW would do. Moving the byte is the end result so copying the byte is fine). Emulator are very hard to create since there are many hacks (as in unusual effects), timing issues, etc that you need to simulate. If one (input) piece is wrong the entire system can do down or at best have a bug/glitch.

The Shared Source Device Emulator contains buildable source code to a PocketPC/Smartphone emulator (Requires Visual Studio, runs on Windows). I worked on V1 and V2 of the binary release.
It tackles many emulation issues:
- efficient address translation from guest virtual to guest physical to host virtual
- JIT compilation of guest code
- simulation of peripheral devices such as network adapters, touchscreen and audio
- UI integration, for host keyboard and mouse
- save/restore of state, for simulation of resume from low-power mode

To add the answer provided by #Cody Brocious
In the context of virtualization where you are emulating a new system(CPU , I/O etc ) to a virtual machine we can see the following categories of emulators.
Interpretation: bochs is an example of interpreter , it is a x86 PC emulator,it takes each instruction from guest system translates it in another set of instruction( of the host ISA) to produce the intended effect.Yes it is very slow , it doesn't cache anything so every instruction goes through the same cycle.
Dynamic emalator: Qemu is a dynamic emulator. It does on the fly translation of guest instruction also caches results.The best part is that executes as many instructions as possible directly on the host system so that emulation is faster. Also as mentioned by Cody, it divides the code into blocks ( 1 single flow of execution).
Static emulator: As far I know there are no static emulator that can be helpful in virtualization.

How I would start emulation.
1.Get books based around low level programming, you'll need it for the "pretend" operating system of the Nintendo...game boy...
2.Get books on emulation specifically, and maybe os development. (you won't be making an os, but the closest to it.
3.look at some open source emulators, especially ones of the system you want to make an emulator for.
4.copy snippets of the more complex code into your IDE/compliler. This will save you writing out long code. This is what I do for os development, use a district of linux

I wrote an article about emulating the Chip-8 system in JavaScript.
It's a great place to start as the system isn't very complicated, but you still learn how opcodes, the stack, registers, etc work.
I will be writing a longer guide soon for the NES.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string