For emulating the Gameboy, why does it matter that the memory is broken up into different areas?

So I'm writing a Gameboy emulator, and I'm not 100% sure why other projects took the time to break the memory up into proper categories. I don't know if there is a major technical dilemma I'm missing (maybe handling illegal operands in instructions?), but it seems like the only thing that matters is that the value written to an address by a write instruction is retrievable by the corresponding read. As a sub-question: if I'm working under the assumption that the assembly is perfectly legal (meaning nothing tries to read or write where it can't), can I just make one big array and read and write to it?
Note that this is a conceptual question and that I am aware a big array would be a memory hog. I'm not necessarily looking for the best way to do it; I'm simply trying to learn how it works and why other emulator developers did it the way they did.

You are going to have read-only memory, read/write memory, and memory-mapped I/O (peripherals, etc.). So you need to decode the address to some extent to break it into those major categories; then, for the peripherals, you have to emulate each of those, so your address decoding has to get very detailed.
For the peripherals you need to detect a read or write to a specific address, which you cannot do by simply landing the writes in an array (two writes of the same value, for example, make a difference; you can't just scan some array looking for changes, you have to trigger on the reads and writes themselves and perform the hardware action).
If you wish to be cycle-accurate you will also need to know the timings for the RAMs and ROMs in order to mimic them, and depending on how many banks there are of each, or whether the timing depends on that, you will need to decode the address further.
The hardware decodes these addresses to the same level, so if you are emulating hardware then you need to... emulate hardware... and do the same amount of address decoding.

I'm going to be Gameboy-specific here. Look at the Gameboy's address space map. The address space itself is divided; it's not something emulators invented. The hardware itself operates that way.
Here are some of the regions that can't be implemented as just an array (a minimal dispatch sketch follows the list):
0x0000-0x3FFF. First bank of ROM. It's read-only, but not quite; read the next entry.
0x4000-0x7FFF. Switchable ROM bank; also not quite read-only. Cartridges that don't fit into the Gameboy's address space contain a memory bank controller. The ROM writes to specific "read-only" ROM regions to select which ROM bank is mapped into the 0x4000-0x7FFF range, so you have to detect these writes and then redirect reads into the selected ROM bank.
0xA000-0xBFFF. Switchable RAM bank. The same thing as with switchable ROM banks, but now for RAM. Cartridges may contain additional RAM that is mapped into the Gameboy's address space. Which bank of RAM is mapped is controlled, again, by writes to specific read-only regions.
0xFF00-0xFF4B. I/O ports. Here you have hardware registers mapped into the address space. The Gameboy has several hardware components, each with its own registers and even memory (display controller, sound processor, timers, etc.). To control that hardware, the ROM reads and writes the I/O ports. You obviously have to detect these accesses so you can emulate the hardware they correspond to. It's not just the CPU and memory you have to emulate; I would even say that's the smallest and easiest part. For example, it's much harder to get the display controller and sound channels right. They have complicated logic, bugs and very tricky behaviour that is not documented very well but is crucial to accurate emulation. The wave channel in particular gave me a hard time.
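To make the dispatch concrete, here is a minimal sketch in C, assuming an MBC1-style cartridge. The names (gb_t, gb_read8, gb_write8) are illustrative rather than taken from any particular emulator, and echo RAM, OAM, RAM enable and almost all I/O behaviour are deliberately left out:

```c
#include <stdint.h>

typedef struct {
    uint8_t  rom[2 * 1024 * 1024];  /* full cartridge ROM image                */
    uint8_t  vram[0x2000];
    uint8_t  wram[0x2000];
    uint8_t  cart_ram[0x8000];      /* external (cartridge) RAM, 4 banks of 8K */
    uint8_t  io[0x80];
    uint8_t  hram[0x7F];
    unsigned rom_bank;              /* bank currently mapped at 0x4000-0x7FFF  */
    unsigned ram_bank;              /* bank currently mapped at 0xA000-0xBFFF  */
} gb_t;

uint8_t gb_read8(gb_t *gb, uint16_t addr)
{
    if (addr < 0x4000) return gb->rom[addr];                                    /* fixed bank 0 */
    if (addr < 0x8000) return gb->rom[gb->rom_bank * 0x4000 + (addr - 0x4000)]; /* switchable   */
    if (addr < 0xA000) return gb->vram[addr - 0x8000];
    if (addr < 0xC000) return gb->cart_ram[gb->ram_bank * 0x2000 + (addr - 0xA000)];
    if (addr < 0xE000) return gb->wram[addr - 0xC000];
    if (addr >= 0xFF00 && addr < 0xFF80) return gb->io[addr - 0xFF00];   /* real code calls timer/PPU/APU handlers */
    if (addr >= 0xFF80 && addr < 0xFFFF) return gb->hram[addr - 0xFF80];
    return 0xFF;                    /* regions not modeled in this sketch */
}

void gb_write8(gb_t *gb, uint16_t addr, uint8_t val)
{
    if (addr >= 0x2000 && addr < 0x4000) {          /* "ROM" write: selects the ROM bank  */
        gb->rom_bank = (val & 0x1F) ? (val & 0x1F) : 1;
    } else if (addr >= 0x4000 && addr < 0x6000) {   /* selects the RAM bank (simplified)  */
        gb->ram_bank = val & 0x03;
    } else if (addr >= 0x8000 && addr < 0xA000) {
        gb->vram[addr - 0x8000] = val;
    } else if (addr >= 0xA000 && addr < 0xC000) {
        gb->cart_ram[gb->ram_bank * 0x2000 + (addr - 0xA000)] = val;
    } else if (addr >= 0xC000 && addr < 0xE000) {
        gb->wram[addr - 0xC000] = val;
    } else if (addr >= 0xFF00 && addr < 0xFF80) {
        gb->io[addr - 0xFF00] = val;                /* real code dispatches to the matching peripheral */
    } else if (addr >= 0xFF80 && addr < 0xFFFF) {
        gb->hram[addr - 0xFF80] = val;
    }
    /* writes to 0x0000-0x1FFF (RAM enable), echo RAM, OAM, etc. omitted */
}
```

The point is that the banking registers and I/O ports only work because every access goes through a function that inspects the address; a flat array could never react to the write in 0x2000-0x3FFF that changes what later reads of 0x4000-0x7FFF return.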

Access PCI memory BAR with low latency (Linux)

Background:
I have a PCI card, which is basically a clock. It gets the time by GPS and saves the current time in a certain register.
Goal:
I want to read a limited number of registers/bytes (for example the current time) over and over again, with the lowest possible latency. (The clock provides very high precision, and I think I will lose precision as the latency grows.) The operating system is RedHat. The programming language is C/C++. I also want to write to the device memory; for writes, latency is not an issue.
Possible Ways to go:
I see these ways. If you see another, please tell me:
1. Writing a Linux kernel module driver that creates a character device (or one character device per register to read). A user-space application can then do a read() on the /dev/ file(s).
2. DMA.
3. mmap'ing the sysfs resourceX file into user space from a user-space application (via the mmap system call).
4. Writing a Linux kernel module driver that implements an mmap file operation.
Questions:
Which is the way with the lowest latency when it comes to the actual reading of the register? I am aware that mmap causes a lot of overhead in the kernel, but as far as I understand that is only for initialisation.
Is way 3 a legit way to go? It looks like a hack to me. How can I determine the /sys/ path automatically from the application?
Is there a difference between ways 3 and 4? I am new to PCI driver programming and I don't think I have really understood how way 4 works. I read this (and other chapters of that book), but maybe you can give me a hint or an example. I would appreciate that.
Method 3 or 4 should work fine. There's no difference between them with respect to latency, which would be on the order of 100 ns.
Method 4 would be needed if you need to initialize the device, or control which applications are allowed to access it, or enforce one reader at a time, etc. Method 3 does seem like a bit of a hack because it skips all of this. But it is simpler if you don’t need such things.
A character device is definitely higher latency, because it requires a kernel transition each time the device is read.
The latency of a DMA method depends entirely on how frequently the device writes the time to memory. It is lower latency for the CPU to access memory than MMIO, but if the device only does DMA once a millisecond, then that would be your latency. Also, that method generates a lot of useless DMA traffic, since the CPU would read the value far less often than it is written.
Adding to #prl's answer...
Method 3 seems perfectly legit to me. That's what it's for. You may want to take a look at the kernel documentation file: https://www.kernel.org/doc/Documentation/filesystems/sysfs-pci.txt
You can also use the /sys filesystem to find your device. First, note the vendor ID and device ID for your clock card (and subsystem vendor / device if necessary), then you can easily walk the /sys/devices hierarchy, looking for a matching device (using the vendor, device, etc. special files). Once you've found it, you presumably know which resourceN file to open from the device's data sheet, then mmap it at the appropriate offset and you're done.
That all assumes that your device is already configured and enabled. Typically a PCI device is not enabled to do anything when the system boots. Some driver needs to claim the device and initialize / configure it. Once that is done, if the time is accessible just by reading a register or two, you can go with method 3. (I'm not sure: it may be possible for a PCI device to be self-initializing, but I've never seen one. I think something probably needs to enable its memory space at the very least. That could likely be done from user space if the setup is small / simple enough.)
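As a rough illustration of method 3, here is a user-space sketch; the BDF "0000:03:00.0", the use of resource0, and the 0x10 register offset are hypothetical and would come from walking /sys/bus/pci/devices and from the card's data sheet:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define BAR_PATH        "/sys/bus/pci/devices/0000:03:00.0/resource0" /* hypothetical BDF */
#define TIME_REG_OFFSET 0x10                                          /* hypothetical, from data sheet */
#define MAP_SIZE        4096

int main(void)
{
    int fd = open(BAR_PATH, O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    volatile uint32_t *bar = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) { perror("mmap"); return 1; }

    /* From here on, each read is a plain load from the BAR: no system call,
     * no kernel transition, just the MMIO round-trip. */
    uint32_t t = bar[TIME_REG_OFFSET / sizeof(uint32_t)];
    printf("time register: 0x%08x\n", t);

    munmap((void *)bar, MAP_SIZE);
    close(fd);
    return 0;
}
```

The mmap overhead is paid once at startup; the per-read cost afterwards is just the uncached load from the device.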
The primary difference with method 4 is that the driver controlling the device would provide support for allowing the area to be mmap'd explicitly. For the user-space application, there is little difference between the two methods aside from the device name used. For method 4, the driver's probably going to provide a symbolic device name /dev/clock0 or something like that for use by the user-space application (and presumably the application then doesn't need to go find the device, it would just know the device file name to open).
From user-space, you will do the mmap operation in much the same way with either method. In method 4, the driver internally supplies the physical address to map -- and possibly the offset -- instead of the generic PCI subsystem doing so, but either way, it's just open + mmap.
Linux driver programming is not terribly difficult, but there's a significant learning curve there if you haven't done it before, so I definitely wouldn't go with method 4 unless there were a real need to do so.

How does an OS kernel define all the inputs and outputs?

I want to know how an OS kernel defines its own inputs and outputs to make the computer run. Of course you need the right hardware for it to work, but how can you simply make some variable and call it USB_PORT_1 or something? Does it have to do with firmware as well? Assigning arbitrary values does nothing on its own, so there must be something I am missing in the interaction between the hardware and software when you plug a 1-terabyte HDD into your USB 3.0 slot that the kernel has labelled USB3_PORT_0. At this point there is obviously some stuff going on in the firmware, so what is it?
Reason: I'm making one.
To truly understand the interaction between hardware and software you have to understand how things work at a low level. What is a variable? In a programming language, variables may be assigned values that can later be modified, etc. But where is this stored physically in the machine? The truth is that it can be stored in several places. It could be in one of the processor's registers, it could be in RAM, or it could be somewhere else entirely.
When the kernel wishes to communicate with hardware, it may sometimes go through what you call firmware, but for the most part it doesn't have to. Hardware exposes itself to the kernel in a variety of ways, but the simplest way to think about it is as RAM. RAM is accessible with an address, so 0x1000 is a memory address somewhere in RAM. Generally speaking, there is no reason that any particular address has to be mapped to RAM. Suppose I have a USB controller. I can map some address (let's call it 0xDEADBEEF) to this controller. So, if I read from 0xDEADBEEF it might tell me how many devices are connected to the system. Another adjacent address may tell me which port is in use, and so on. Every device does this differently, so we have device drivers that tell the kernel how the device is accessed. The kernel then doesn't have to worry about specific memory addresses or anything; it simply abstracts everything into something called "USB3_PORT_0". The kernel and software use this name to refer to the device, and the device driver translates it into a set of memory accesses, interrupts, etc.
It is impossible for me to enumerate all the ways hardware and software can interact, but this should give you an idea of how it is done.
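To make the memory-mapped-register idea concrete, here is a small C sketch. The addresses and register layout are entirely made up for illustration; on real hardware they come from the SoC or device documentation (or from PCI BAR assignment), and kernel code would use accessors like readl()/writel() rather than raw pointers:

```c
#include <stdint.h>

/* Hypothetical register addresses for an imaginary USB controller. */
#define USB_CTRL_BASE        0xDEADB000u
#define USB_DEVICE_COUNT_REG (USB_CTRL_BASE + 0x0)
#define USB_PORT0_STATUS_REG (USB_CTRL_BASE + 0x4)

/* 'volatile' tells the compiler every access really must reach the device,
 * so reads and writes are not cached in registers or optimized away. */
static inline uint32_t mmio_read32(uintptr_t addr)
{
    return *(volatile uint32_t *)addr;
}

static inline void mmio_write32(uintptr_t addr, uint32_t val)
{
    *(volatile uint32_t *)addr = val;
}

/* "USB3_PORT_0" ends up being nothing more than a friendly name wrapped
 * around reads and writes of the right addresses. */
uint32_t usb3_port0_status(void)
{
    return mmio_read32(USB_PORT0_STATUS_REG);
}
```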

How to work with physical addresses in a network driver, and how does DMA connect to it?

I am working on a network driver and am somewhat confused with the memory management.
On the TX path, I receive an skb. Since the lower layer expects to get only physical addresses, I think I need to call *virt_to_phys* and pass the return value down to the lower layer.
(Does that make sense?)
Now, I know there are the functions *dma_map_single* and *dma_unmap_single*, but I am still not sure how they come into the picture here. So the lower layer wants to work with DMA... Does that mean I need to call these functions (at the appropriate time) before dispatching the packet to the lower layer?
I am also not sure I understand this part of the description of dma_map_single:
Ensure that any data held in the cache is appropriately discarded or written back.
Would appreciate your help.
The files DMA-API.txt and DMA-API-HOWTO.txt in the Documentation directory of the Linux sources document how to use these functions.
You can't use virt_to_phys() and the like to get DMA addresses. That worked a long time ago in simpler times. Linux supports a wide range of hardware architectures and buses and for many of those, the address space the devices see does not map 1:1 to the physical address space of the CPU. Then there are also IOMMUs that can change that mapping dynamically. All of this necessitates the use of the DMA API.
Physical addresses are not necessarily the same as I/O bus addresses, so you must always use the dma_map_ functions.
On many systems, DMA accesses go only to main memory, which would break if there were a copy of that memory's contents in the CPU cache.
On these systems, the dma_ functions take the appropriate architecture-dependent action to ensure that such conflicts do not happen.
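As a rough sketch of what a TX path looks like with the DMA API (the struct my_nic context and the my_hw_post_tx() hook are hypothetical stand-ins for the real driver's data structures and descriptor-posting code):

```c
#include <linux/dma-mapping.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Hypothetical driver context. */
struct my_nic {
    struct device *dev;   /* the PCI/platform device that performs the DMA */
    /* ... rings, registers, ... */
};

void my_hw_post_tx(struct my_nic *nic, dma_addr_t addr, unsigned int len); /* hypothetical HW hook */

int my_nic_start_xmit(struct my_nic *nic, struct sk_buff *skb)
{
    unsigned int len = skb_headlen(skb);
    dma_addr_t dma;

    /* dma_map_single() returns a bus/DMA address usable by the device and
     * performs any cache writeback/invalidation the architecture needs.
     * This is what replaces virt_to_phys(). */
    dma = dma_map_single(nic->dev, skb->data, len, DMA_TO_DEVICE);
    if (dma_mapping_error(nic->dev, dma))
        return -ENOMEM;

    my_hw_post_tx(nic, dma, len);   /* hand the bus address to the hardware */

    /* Later, in the TX-completion handler, once the device is done:
     *     dma_unmap_single(nic->dev, dma, len, DMA_TO_DEVICE);
     *     dev_kfree_skb(skb);
     */
    return 0;
}
```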

Contiguous physical memory from userspace

Is there a way to allocate contiguous physical memory from userspace in Linux? At least a few guaranteed contiguous memory pages. One huge page isn't the answer.
No, there is not. You need to do this from kernel space.
If you say "we need to do this from user space" without anything going on in kernel space, it makes little sense, because a user-space program has no way of controlling or even knowing whether the underlying memory is contiguous.
The only case where you would need to do this is if you were working in conjunction with a piece of hardware, or some other low-level (i.e. kernel) service, that has this requirement. So again, you would have to deal with it at that level.
So the answer isn't just "you can't" - but "you should never need to".
I have written memory managers that allow me to do this, but it was always because of some underlying issue at the kernel level, which had to be addressed at the kernel level. Generally it was because some other agent on the bus (a PCI card, the BIOS, or even another computer over an RDMA interface) had the physically contiguous memory requirement. Again, all of this had to be addressed in kernel space.
When you talk about "cache lines" - you don't need to worry. You can be assured that each page of your user-space memory is contiguous, and each page is much larger than a cache-line (no matter what architecture you're talking about).
Yes, if all you need is a few pages, this may indeed be possible.
The file /proc/[pid]/pagemap now allows programs to inspect the mapping of their virtual memory to physical memory.
While you cannot explicitly modify the mapping, you can allocate a virtual page, lock it into memory with a call to mlock, record its physical address via a lookup in /proc/self/pagemap, and repeat until you happen to get enough blocks touching each other to form a large enough contiguous region. Then unlock and free the excess blocks.
It's hackish, clunky and potentially slow, but it's worth a try. On the other hand, there's a decently large chance that this isn't actually what you really need.
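A minimal sketch of the pagemap lookup used by that trick (run it as root, since the kernel hides the PFN from unprivileged readers; the virt_to_phys_user() name is just for illustration):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Return the physical address backing vaddr, or 0 on failure. */
static uint64_t virt_to_phys_user(void *vaddr)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0)
        return 0;

    uint64_t entry = 0;
    off_t offset = ((uintptr_t)vaddr / pagesize) * sizeof(entry); /* one 64-bit entry per page */
    ssize_t n = pread(fd, &entry, sizeof(entry), offset);
    close(fd);

    if (n != sizeof(entry) || !(entry & (1ULL << 63)))  /* bit 63: page present */
        return 0;

    uint64_t pfn = entry & ((1ULL << 55) - 1);          /* bits 0-54: page frame number */
    return pfn * pagesize + ((uintptr_t)vaddr % pagesize);
}

int main(void)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;
    mlock(p, pagesize);        /* pin the page so its frame doesn't move */
    memset(p, 0, pagesize);    /* touch it so it is actually present */
    printf("virt %p -> phys 0x%llx\n", p,
           (unsigned long long)virt_to_phys_user(p));
    return 0;
}
```

Building a contiguous region then means repeating the allocation and keeping only the pages whose physical addresses turn out to be adjacent.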
The DPDK library's memory allocator uses the approach @Wallacoloo described; see eal_memory.c. The code is BSD-licensed.
If a specific device driver exports a DMA buffer that is physically contiguous, user space can access it through the dma-buf APIs. So a user task can access such memory, but it cannot allocate it directly. That is because the physically-contiguous constraint comes not from user applications but from the device, so only device drivers should care.
