RAM performance comparison of User mode vs Kernel mode


Unlike kernel mode, user mode uses address translation because of virtual memory, so it seems there must be a cost for translating memory addresses on every access (even when there is no TLB miss).
Since kernel mode accesses RAM directly without any address translation, is there any RAM performance gain if we run code in kernel mode rather than user mode?

Is there any performance gain in RAM if we run a code in Kernel mode rather than User mode?
Probably not. The MMU is always used; it is just configured differently for kernel mode and user mode (so the "address translation" for kernel code might be some "identity" function).
And CPU cache considerations matter much more than the MMU: a cache miss can cost hundreds of cycles (to fetch data from your RAM modules). Also, context switches are costly.
But you need to benchmark. See the closing "Answers" section of P. Norvig's page.
(Indeed, the kernel address space does not have any major page faults; I guess, but don't really know, that on some hardware it could have minor page faults.)
Read also about the Unikernel approach.
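To put rough numbers on the cache argument above, here is a minimal sketch of the usual "average access time" estimate. The function name and every figure in the usage comment are illustrative assumptions, not measurements; only a real benchmark on your hardware can tell you the actual values.

```python
def effective_access_ns(hit_ns, miss_rate, miss_penalty_ns):
    """Average cost of one memory access: the hit cost plus the expected miss cost."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_ns + miss_rate * miss_penalty_ns


# Illustrative (made-up) figures: a 1 ns cache hit and a 100 ns trip to DRAM.
# Even a modest miss rate dominates the average:
#   effective_access_ns(1.0, 0.5, 100.0)
```

The same formula applies whether the code runs in user mode or kernel mode, which is why the cache miss rate, rather than the privilege level, tends to decide the result.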

Related

How to read stale values on x86

My goal is to read in stale and outdated values of memory without cache-coherence. I have attempted to use prefetchnta to perform a non-temporal load, but it failed to fetch outdated values. I am looking into performing some kind of Streaming Memory-to-Memory Direct-Memory-Access, but am having a little trouble due to the overwhelming amount of background knowledge required to proceed with my current project. Currently I am attempting to mess around with udmabuf but even that is going slowly. It should be noted that ideally I would like to ignore the contents of all CPU caches, including the current CPU.
To provide my reasoning: I am developing software that can be used to prove the correctness of programs written for non-volatile memory. Because the CPU's write-back cache is volatile, the arbitrary order in which cache lines are written back to memory needs to be observed.
I would sincerely appreciate it if someone could give me some pointers on how to proceed. I do not mind digging into the Linux kernel (in fact I am doing that now), nor do I mind modifying it; I just need a little guidance in the right direction.

I haven't played around with this, but my understanding from the docs is that for loads (unlike NT stores) nothing can bypass cache or override the strong ordering of memory types like the normal WB (write-back). And even NT stores evict already-cached data, so they can't break coherence for this or another core that has cached data for the line you're writing.
You can do weakly-ordered loads from WC (write-combining) memory regions (with prefetchnta or SSE4 movntdqa), but they're probably still coherent at the physical address level.
@MargaretBloom commented:
"IIRC Intel warns the developer about multiple mapping with different cache types, which may indeed be good in this case."
So maybe you could actually bypass cache coherence with multiple virtual mappings of the same physical page.
I don't know if it's possible to do non-coherent DMA with a PCI / PCIe device, but that might be your only hope for getting actual DRAM contents without going through cache.
Normally (always?) DMA on modern x86 systems is cache-coherent, which is good for performance. To maintain backwards compat with 386 and earlier CPUs without caches, the first x86 CPUs with caches had cache-coherent DMA, not introducing cache-control instructions until later generations, since existing OSes didn't use them. In modern systems, memory controllers are built-in to the CPU. So on Intel CPUs, the system agent can snoop L3 tags to see if a line is cached anywhere on-chip in parallel with sending the request to the memory controller. Or a Xeon can DMA right into L3 cache without data having to bounce through DRAM, good for high bandwidth NICs.
There's an INVD instruction which invalidates all caches without doing write-back first, but I think that includes the shared L3 cache, and probably the private caches of all other cores. So you can't practically use it on a Linux system where other cores are potentially in the middle of doing stuff; you'd potentially corrupt kernel data structures by using it, as well as simulating power failure on a machine with NVDIMMs for the process you were interested in.
Maybe if you somehow offlined all the other CPU cores, and disabled interrupts on the one core that was still up, you could:
wbinvd (write-back + invalidate) to flush all caches,
then run some code under test,
then invd and see what made it to DRAM.
Then re-enable interrupts. Interrupt handlers could end up with some kernel data cached and some in memory, or get device drivers out of sync with hardware, if any interrupts are handled between the wbinvd and the invd.
Update: someone did actually attempt this:
How to run "invd" instruction with disabled SMP support?
How to explicitly load a structure into L1d cache? Weird results with INVD with CR0.CD = 1 on isolated core with/without hyperthreading - invd worked so well it nuked some of the stores done by printk in the mis-designed attempt to log something about it.

What does ASLR (address space layout randomization) do?

I read that it is a security measure to protect against common attacks. The idea is that it keeps randomizing the virtual memory space, which I believe will require periodic updates to the page table and the TLB. Am I correct?
My other question is: does it randomize the physical location of pages in physical memory at all? I ask because I have been looking into the behavior of physical memory with and without ASLR, and the behavior is different.

ASLR is a security feature that's supposed to randomise the virtual address space of a program on each run. It doesn't hot-swap the virtual address space of a program while it's running (that would be a disaster).
The operating system will usually have to make periodic updates to the page table simply as part of scheduling. Whether ASLR is enabled is largely irrelevant to page table updates.
As to your other question, the layout of physical memory is essentially random whether you have ASLR enabled or not under Linux. Pages get swapped in and out of memory very often and, furthermore, the physical memory layout can quickly become fragmented. The only semi-predictable parts of physical memory would probably be memory reserved for DMA.
I'm still unsure how you managed to "look into the behaviour of physical memory" and conclude that the behaviour is significantly different between ASLR enabled and not.
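One way to see the "randomised on each run, not while running" behaviour for yourself is to compare the virtual address of the stack across two separate runs of a program. The sketch below only parses text in the /proc/<pid>/maps format; the helper names are my own, and the sample line in the comment is made up for illustration.

```python
def parse_maps_line(line):
    """Split one /proc/<pid>/maps line into (start, end, perms, path)."""
    fields = line.split(maxsplit=5)
    start_s, end_s = fields[0].split("-")
    path = fields[5].strip() if len(fields) > 5 else ""
    return int(start_s, 16), int(end_s, 16), fields[1], path


def region_start(maps_text, name):
    """Return the start address of the first mapping whose path equals name."""
    for line in maps_text.splitlines():
        if not line.strip():
            continue
        start, _end, _perms, path = parse_maps_line(line)
        if path == name:
            return start
    return None


# On Linux, in two separate runs of the interpreter:
#   with open("/proc/self/maps") as f:
#       print(hex(region_start(f.read(), "[stack]")))
# Example of the line format (addresses invented):
#   7ffd1c2e4000-7ffd1c305000 rw-p 00000000 00:00 0    [stack]
```

With ASLR enabled (see /proc/sys/kernel/randomize_va_space), the printed address would normally differ between runs, while it stays fixed for the lifetime of any one process. Note that this shows virtual addresses only and says nothing about where the pages sit in physical memory.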

Linux Page Table Management and MMU

I have a question about the relationship between the Linux kernel and the MMU.
I now understand that the Linux kernel manages the page table that maps virtual memory addresses to physical memory addresses.
At the same time, the x86 architecture has an MMU that manages the page table between virtual and physical memory addresses.
If the MMU sits next to the CPU, does the kernel still need to take care of the page table?
This question may be stupid, but my other question is: if the MMU takes care of the memory space, who manages high memory and low memory? I believe the kernel receives the size of virtual memory from the MMU (4GB on 32-bit) and then distinguishes between user space and kernel space in the virtual address space.
Am I correct? or completely wrong?
Thanks a lot in advance!

The OS and MMU page management responsibilities are 2 sides of the same mechanism, that lives on the boundary between architecture and micro-architecture.
The first side defines the "contract" between the hardware and the software that runs on it (in this case, the OS): if you want to use virtual memory, you need to build and maintain a page table as described in that contract.
The MMU side, on the other hand, is a hardware unit that's responsible for performing the HW tasks of the address translation. This may or may not include hardware optimizations, these are usually hidden and may be implemented in various ways to run under the hood, as long as it maintains the hardware side of the contract.
In theory, the MMU may decide to issue a set of memory accesses for each translation (a page walk) in order to achieve the required behavior. However, since it's a performance-critical element, most MMUs optimize this by caching the results of previous page walks inside the TLB, just like a cache stores the results of previous accesses (actually, on some implementations, the caches themselves may also store some of the accesses to the page table, since it usually resides in cacheable memory). The MMU can manage multiple TLBs (most implementations separate the ones for data and code pages, and some have second-level TLBs) and provide the translation from there without you noticing anything except the faster access time.
It should also be noted that the hardware must guard against many corner cases that can harm the coherency of such TLB "caching" of previous translations, for example page aliasing or remaps during usage. On some machines, the nastier cases even require a massive flush flow called TLB shootdown.
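The split of responsibilities described above can be sketched with a toy model: the OS side fills in the page table, and the MMU side walks it and caches the result in a TLB. This is only an illustration, assuming the classic non-PAE 32-bit x86 layout (10-bit directory index, 10-bit table index, 12-bit offset); the class and method names are invented for the example, and real hardware does all of this in silicon.

```python
PAGE_SHIFT = 12                       # 4 KiB pages
INDEX_BITS = 10                       # 10-bit directory index + 10-bit table index
INDEX_MASK = (1 << INDEX_BITS) - 1
OFFSET_MASK = (1 << PAGE_SHIFT) - 1


class ToyMMU:
    def __init__(self):
        self.page_directory = {}      # dir index -> {table index -> frame number}
        self.tlb = {}                 # virtual page number -> frame number
        self.walks = 0                # how many page walks were needed

    def map_page(self, vaddr, frame):
        """OS side of the contract: install a translation in the page table."""
        dir_idx = (vaddr >> (PAGE_SHIFT + INDEX_BITS)) & INDEX_MASK
        tbl_idx = (vaddr >> PAGE_SHIFT) & INDEX_MASK
        self.page_directory.setdefault(dir_idx, {})[tbl_idx] = frame
        # A remap must drop the stale cached translation (a tiny "TLB flush").
        self.tlb.pop(vaddr >> PAGE_SHIFT, None)

    def translate(self, vaddr):
        """MMU side: use the TLB if possible, otherwise walk the page table."""
        vpn = vaddr >> PAGE_SHIFT
        frame = self.tlb.get(vpn)
        if frame is None:
            self.walks += 1
            dir_idx = (vaddr >> (PAGE_SHIFT + INDEX_BITS)) & INDEX_MASK
            tbl_idx = vpn & INDEX_MASK
            try:
                frame = self.page_directory[dir_idx][tbl_idx]
            except KeyError:
                raise LookupError(f"page fault at {vaddr:#x}") from None
            self.tlb[vpn] = frame
        return (frame << PAGE_SHIFT) | (vaddr & OFFSET_MASK)
```

A second access to the same page is served from the dictionary standing in for the TLB, so `walks` does not increase; remapping the page invalidates the cached entry, which is the single-CPU version of the coherency problem mentioned above.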

Userspace vs kernel space driver

I am looking to write a PWM driver. I know that there are two ways we can control a hardware device:
User space driver.
Kernel space driver.
In general (not considering the PWM driver case), if we have to decide between a user space and a kernel space driver, what factors do we have to take into consideration apart from these?
A user space driver can directly mmap() /dev/mem into its virtual address space and needs no context switching.
A user space driver cannot implement interrupt handlers (it has to poll for interrupts).
A user space driver cannot perform DMA (as DMA-capable memory can only be allocated from kernel space).

Of the three factors that you have listed, only the first one is actually correct. As for the rest: not really. It is possible for user space code to perform DMA operations, no problem with that. There are many hardware appliance companies who employ this technique in their products. It is also possible to have an interrupt-driven user space application, even when all of the I/O is done with a full kernel bypass. Of course, it is not as easy as simply doing an mmap() on /dev/mem.
You would have to have a minimal portion of your driver in the kernel. That is needed in order to provide your user space with the bare minimum that it needs from the kernel (because if you think about it, /dev/mem is also backed by a character device driver).
For DMA, it is actually too darn easy: all you have to do is handle the mmap request and map a DMA buffer into user space. For interrupts, it is a little bit more tricky. The interrupt must be handled by the kernel no matter what; however, the kernel may not do any work and just wake up the process that calls, say, epoll_wait(). Another approach is to deliver a signal to the process, as done by DOSEMU, but that is very slow and is not recommended.
As for your actual question, one factor that you should take into consideration is resource sharing. As long as you don't have to share a device across multiple applications and there is nothing that you cannot do in user space, go for user space. You will probably save tons of time during the development cycle, as writing user space code is extremely easy. When, however, two or more applications need to share the device (or its resources), then chances are that you will spend a tremendous amount of time making it possible: just imagine multiple processes forking, crashing, mapping (the same?) memory concurrently, etc. And after all, IPC is generally done through the kernel, so if applications need to start "talking" to each other, the performance might degrade greatly. This is still done in real life for certain performance-critical applications, though, but I don't want to go into those details.
Another factor is the kernel infrastructure. Let's say you want to write a network device driver. It's not a problem to do that in user space. However, if you do, you'd need to write a full network stack too, as it won't be possible to use Linux's default one that lives in the kernel.
I'd say go for user space if it is possible and the amount of effort to make things work is less than writing a kernel driver, and keeping in mind that one day it might be necessary to move code into the kernel. In fact, this is a common practice to have the same code being compiled for both user space and kernel space depending on whether some macro is defined or not, because testing in user space is a lot more pleasant.

Another consideration: it is far easier to debug user-space drivers. You can use gdb, valgrind, etc. Heck, you don't even have to write your driver in C.
There's a third option beyond just user space or kernel space drivers: some of both. You can do just the kernel-space-only stuff in a kernel driver and do everything else in user space. You might not even have to write the kernel space driver if you use the Linux UIO driver framework (see https://www.kernel.org/doc/html/latest/driver-api/uio-howto.html).
I've had luck writing a DMA-capable driver almost completely in user space. UIO provides the infrastructure so you can just read/select/epoll on a file to wait on an interrupt.
You should be cognizant of the security implications of programming the DMA descriptors from user space: unless you have some protection in the device itself or an IOMMU, the user space driver can cause the device to read from or write to any address in physical memory.
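To illustrate the "read on a file to wait on an interrupt" part: a blocking read of 4 bytes from a UIO device node returns the total interrupt count as a 32-bit integer. A minimal sketch, assuming the device shows up as /dev/uio0 on your system (that path, and the function name, are assumptions for the example):

```python
import os
import struct


def wait_for_interrupt(fd):
    """Block until the next interrupt and return the device's total interrupt count."""
    data = os.read(fd, 4)              # UIO hands back one native 32-bit integer
    if len(data) != 4:
        raise OSError("short read from UIO file descriptor")
    (count,) = struct.unpack("=i", data)
    return count


# Hypothetical usage on a machine with a UIO device bound:
#   fd = os.open("/dev/uio0", os.O_RDONLY)
#   while True:
#       n = wait_for_interrupt(fd)
#       ...handle the event, e.g. inspect registers mapped with mmap()...
```

Because the interrupt arrives through an ordinary file descriptor, the same fd can be handed to select or epoll alongside sockets and timers instead of blocking in read.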

Force Linux to use only memory over 4G?

I have a Linux device driver that interfaces to a device that, in theory, can perform DMA using 64-bit addresses. I'd like to test to see that this actually works.
Is there a simple way that I can force a Linux machine not to use any memory below physical address 4G? It's OK if the kernel image is in low memory; I just want to be able to force a situation where I know all my dynamically allocated buffers, and any kernel or user buffers allocated for me are not addressable in 32 bits. This is a little brute force, but would be more comprehensive than anything else I can think of.
This should help me catch (1) hardware that wasn't configured correctly or loaded with the full address (or is just plain broken) as well as (2) accidental and unnecessary use of bounce buffers (because there's nowhere to bounce to).
clarification: I'm running x86_64, so I don't care about most of the old 32-bit addressing issues. I just want to test that a driver can correctly interface with multitudes of buffers using 64-bit physical addresses.

From /usr/src/linux/Documentation/kernel-parameters.txt:
memmap=exactmap [KNL,X86] Enable setting of an exact
    E820 memory map, as specified by the user.
    Such memmap=exactmap lines can be constructed based on
    BIOS output or other requirements. See the memmap=nn@ss
    option description.
memmap=nn[KMG]@ss[KMG]
    [KNL] Force usage of a specific region of memory
    Region of memory to be used, from ss to ss+nn.
memmap=nn[KMG]#ss[KMG]
    [KNL,ACPI] Mark specific memory as ACPI data.
    Region of memory to be used, from ss to ss+nn.
memmap=nn[KMG]$ss[KMG]
    [KNL,ACPI] Mark specific memory as reserved.
    Region of memory to be used, from ss to ss+nn.
    Example: Exclude memory from 0x18690000-0x1869ffff
        memmap=64K$0x18690000
        or
        memmap=0x10000$0x18690000
If you add memmap=4G$0 to the kernel's boot parameters, the lower 4GB of physical memory will no longer be accessible. Also, your system will no longer boot... but some variation hereof (memmap=3584M$512M?) may allow for enough memory below 4GB for the system to boot but not enough that your driver's DMA buffers will be allocated there.
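To double-check which physical range a given memmap=nn$ss value covers before rebooting with it, the arithmetic can be sketched as below. This is a small helper of my own, not anything shipped with the kernel, and it only handles the K/M/G suffixes and decimal or 0x-prefixed numbers shown in the documentation excerpt.

```python
_SUFFIX = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text):
    """Parse a number such as '64K', '3584M', '4G' or '0x10000' into bytes."""
    text = text.strip()
    mult = _SUFFIX.get(text[-1:].upper(), 1)
    digits = text[:-1] if mult != 1 else text
    return int(digits, 0) * mult       # base 0 accepts decimal and 0x-prefixed hex


def reserved_range(param):
    """Turn 'memmap=nn$ss' into (start, end), where end = ss + nn is exclusive."""
    value = param.split("=", 1)[1]
    size_s, start_s = value.split("$", 1)
    start = parse_size(start_s)
    return start, start + parse_size(size_s)


# reserved_range("memmap=3584M$512M") covers 512 MiB up to exactly 4 GiB,
# leaving only the first 512 MiB of low memory usable.
```

So the suggested memmap=3584M$512M ends precisely at the 4 GiB boundary, which is what you want if everything above it should stay available for the driver's buffers.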

IIRC there's an option in the kernel configuration to use PAE extensions, which will enable you to use more than 4GB (I am a bit rusty on the kernel config; the last kernel I recompiled was 2.6.4, so please excuse my lack of recall). You do know how to trigger a kernel config:
make clean && make menuconfig
Hope this helps,
Best regards,
Tom.
