Here is the problem I have:
Packets are received/transmitted in a kernel driver. A user-space program needs to access each of these packets, so there is a huge amount of data transfer between kernel and user space. (Data stream: kernel rx -> user space process -> kernel tx.)
Throughput is the KPI.
I decided to use shared memory/mmap to avoid data copies. Although I haven't tested it yet, others have told me that TLB misses will be a problem.
The system I use is a mips32 system (mips74kc, single core), kernel 2.6.32, with the default 4 KB page size.
A 4 KB page only fits one data packet, so during the data transfer there will be lots of TLB misses that hurt throughput.
I found that huge pages might be a solution, but it seems that only mips64 supports hugetlbfs currently.
https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt
https://www.linux-mips.org/archives/linux-mips/2009-05/msg00429.html
So, my question is: how can I use hugetlbfs on mips32? Or is there another way to solve the throughput problem? (I must do the data processing part in user space.)
According to ddaney's patch,
Currently the patch only works for 64-bit kernels because the value of
PTRS_PER_PTE in 32-bit kernels is such that it is impossible to have a
valid PageMask. It is thought that by adjusting the page allocation
scheme, 32-bit kernels could be supported in the future.
It seems possible. Could someone give me a hint about what needs to be modified in order to enable hugetlb?
Thank you!
Does the documentation of your core list support for non-4KB pages in its TLB? If it is not supported, you would have to change the CPU (replace it with one that supports larger pages, or redesign the CPU and make a new chip).
But most probably you are on the wrong track: TLB misses are not yet proven to be the problem (and a 2 MB huge page is the wrong solution for 8 KB or 15 KB packets).
I will point you to "zero-copy" and/or user-space networking (netmap, snabb, PF_RING, DPDK, network stacks in user space), or a user-space network driver, or a kernel-based data handler. But many of these tools only exist for newer kernels.
Related
My goal is to read stale, outdated values from memory without cache coherence. I have attempted to use prefetchnta to perform a non-temporal load, but it failed to fetch outdated values. I am looking into performing some kind of streaming memory-to-memory DMA, but am having a little trouble due to the overwhelming amount of background knowledge required to proceed with my current project. Currently I am attempting to mess around with udmabuf, but even that is going slowly. It should be noted that ideally I would like to ignore the contents of all CPU caches, including those of the current CPU.
To provide my reasoning as to why: I am developing software that can be used to prove correctness of programs written for non-volatile memory. As the CPU Cache is volatile, the CPU's write-back cache will still be volatile and the arbitrary nature of how they are written back to memory needs to be observed.
I would sincerely appreciate it if someone could give me some pointers on how to proceed. I do not mind digging into the Linux kernel (in fact I am doing that now), nor do I mind modifying it; I just need a little guidance in the right direction.
I haven't played around with this, but my understanding from the docs is that for loads (unlike NT stores) nothing can bypass cache or override the strong ordering of memory types like the normal WB (write-back). And even NT stores evict already-cached data, so they can't break coherence for this or another core that has cached data for the line you're writing.
You can do weakly-ordered loads from WC (write-combining) memory regions (with prefetchnta or SSE4 movntdqa), but they're probably still coherent at the physical address level.
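For reference, a small sketch of those instructions via intrinsics (this is not a workaround, only an illustration of what the text above describes; on ordinary WB memory the result is still coherent):

    /* The instructions mentioned above, via intrinsics. On ordinary WB memory
     * the result is still cache-coherent, so this does NOT return stale data;
     * the weak ordering only applies to WC-mapped regions. Build with -msse4.1. */
    #include <immintrin.h>

    static __m128i nt_load_16(void *p)
    {
        _mm_prefetch((const char *)p, _MM_HINT_NTA);  /* prefetchnta */
        return _mm_stream_load_si128((__m128i *)p);   /* movntdqa */
    }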
MargaretBloom commented:
IIRC Intel warns developers about mapping the same page with multiple different cache types, which may indeed be useful in this case:
maybe you could actually bypass cache coherence with multiple virtual mappings of the same physical page.
I don't know if it's possible to do non-coherent DMA with a PCI / PCIe device, but that might be your only hope for getting actual DRAM contents without going through cache.
Normally (always?) DMA on modern x86 systems is cache-coherent, which is good for performance. To maintain backwards compat with 386 and earlier CPUs without caches, the first x86 CPUs with caches had cache-coherent DMA, not introducing cache-control instructions until later generations, since existing OSes didn't use them. In modern systems, memory controllers are built-in to the CPU. So on Intel CPUs, the system agent can snoop L3 tags to see if a line is cached anywhere on-chip in parallel with sending the request to the memory controller. Or a Xeon can DMA right into L3 cache without data having to bounce through DRAM, good for high bandwidth NICs.
There's an INVD instruction which invalidates all caches without doing write-back first, but I think that includes the shared L3 cache, and probably the private caches of all other cores. So you can't practically use it on a Linux system where other cores are potentially in the middle of doing stuff; you'd potentially corrupt kernel data structures by using it, as well as simulating power failure on a machine with NVDIMMs for the process you were interested in.
Maybe if you somehow offlined all the other CPU cores, and disabled interrupts on the one core that was still up:
you could wbinvd (write-back+invalidate) to flush all caches
then run some code under test
then invd and see what made it to DRAM
Then re-enable interrupts. Interrupt handlers could end up with some kernel data cached and some in memory, or get device drivers out of sync with hardware, if any interrupts are handled between the wbinvd and the invd.
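If someone does try this, the two instructions themselves are just privileged inline assembly; the following is only a sketch of the sequence above, not something to run on a live system:

    /* DANGEROUS sketch of the sequence above, only meant to show the two
     * instructions. Must run in ring 0 (e.g. from a kernel module), with the
     * other cores offline and interrupts disabled, or it will corrupt kernel
     * state; it also throws away every dirty cache line machine-wide. */
    static void flush_run_discard(void (*code_under_test)(void))
    {
        asm volatile("wbinvd" ::: "memory");  /* write back + invalidate all caches */
        code_under_test();                    /* its stores may or may not reach DRAM */
        asm volatile("invd" ::: "memory");    /* discard caches WITHOUT write-back */
        /* whatever is visible in DRAM now is what "survived" */
    }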
Update: someone did actually attempt this:
How to run "invd" instruction with disabled SMP support?
How to explicitly load a structure into L1d cache? Weird results with INVD with CR0.CD = 1 on isolated core with/without hyperthreading - invd worked so well it nuked some of the stores done by printk in the mis-designed attempt to log something about it.
I know that some processors these days support 2MB and 1GB page sizes. Is it possible to compile the Linux kernel to natively use 2MB pages instead of the standard 4KB?
Thanks.
Well, I can say yes and no.
The page size is fixed. But it depends on your patience with the errors and issues that you will encounter.
The page size is known and determined by the MMU hardware, so the operating system takes that into account. However, note that some Linux systems (and hardware!) have hugetlbpage support, and Linux mmap(2) may accept MAP_HUGETLB (but your code should handle the case of processors or kernels without huge page support, e.g. by calling mmap again without MAP_HUGETLB when the first mmap with MAP_HUGETLB has failed).
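A minimal sketch of that fallback pattern (an anonymous mapping is assumed; huge page reservation and alignment requirements are glossed over):

    #define _GNU_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>

    /* Try to map 'len' bytes backed by huge pages; fall back to normal 4 KB
     * pages when the kernel or CPU has no huge page support. */
    static void *map_buffer(size_t len)
    {
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED)
            p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return p == MAP_FAILED ? NULL : p;
    }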
You may find these links interesting:
https://www.cloudibee.com/linux-hugepages/
https://forums.opensuse.org/showthread.php/437078-changing-pagesize-kernel
https://linuxgazette.net/155/krishnakumar.html
https://lwn.net/Articles/375096/
What is the difference between "zero-copy networking" and "kernel bypass"? Are they two phrases meaning the same thing, or different? Is kernel bypass a technique used within "zero copy networking" and this is the relationship?
TL;DR - They are different concepts, but it is quite likely that zero copy is supported within kernel bypass API/framework.
User Bypass
This mode of communication should also be considered. It may be possible to have DMA-to-DMA transactions which do not involve the CPU at all. The idea is to use splice() or similar functions to bypass user space entirely. Note that with splice(), the entire data stream does not need to bypass user space: headers can be read in user space and the payload streamed directly to disk. The most common downfall of this is that splice() doesn't do checksum offloading.
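For illustration, a rough splice()-based forwarding loop (Linux-specific; the pipe acts as the in-kernel staging buffer, and error handling is trimmed):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* Forward data from in_fd to out_fd without copying the payload into
     * user-space buffers; the pipe acts as the in-kernel staging buffer. */
    static int forward(int in_fd, int out_fd)
    {
        int pipefd[2];
        ssize_t n;

        if (pipe(pipefd) < 0)
            return -1;
        while ((n = splice(in_fd, NULL, pipefd[1], NULL, 65536,
                           SPLICE_F_MOVE | SPLICE_F_MORE)) > 0) {
            if (splice(pipefd[0], NULL, out_fd, NULL, n,
                       SPLICE_F_MOVE | SPLICE_F_MORE) < 0)
                break;
        }
        close(pipefd[0]);
        close(pipefd[1]);
        return 0;
    }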
Zero copy
The zero-copy concept is only that the network buffers are fixed in place and are not moved around. In many cases, this is not really beneficial. Most modern network hardware supports scatter-gather, also known as buffer descriptors, etc. The idea is that the network hardware understands physical pointers. A buffer descriptor typically consists of:
Data pointer
Length
Next buffer descriptor
The benefit is that the network headers do not need to sit side-by-side with the payload: the IP, TCP, and application headers can reside physically separate from the application data.
If a controller doesn't support this, then the TCP/IP headers must precede the user data so that they can be filled in before sending to the network controller.
Zero copy also implies some kernel-user MMU setup so that pages are shared.
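As a purely hypothetical illustration of such a descriptor (real controllers define their own layouts, ownership bits, and alignment rules):

    #include <stdint.h>

    /* Hypothetical scatter-gather buffer descriptor; field names and widths
     * are illustrative only, real NICs document their own formats. */
    struct buf_desc {
        uint64_t data_phys;  /* physical (bus) address of this fragment */
        uint32_t length;     /* fragment length in bytes */
        uint32_t flags;      /* e.g. end-of-packet, hardware-owned */
        uint64_t next_phys;  /* physical address of the next descriptor */
    };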
Kernel Bypass
Of course, you can bypass the kernel. This is what pcap and other sniffer software has been doing for some time. However, pcap does not prevent the normal kernel processing; but the concept is similar to what a kernel bypass framework would allow, i.e., delivering packets directly to user space, where header processing would happen.
However, it is difficult to see a case where user space will have a definite win unless it is tied to the particular hardware. Some network controllers may have scatter gather supported in the controller and others may not.
There are various incarnations of kernel interfaces to accomplish kernel bypass. A difficulty is what happens with the received data and how the data for transmission is produced. Often this involves other devices, and so there are many solutions.
To put this together...
Are they two phrases meaning the same thing, or different?
They are different as above hopefully explains.
Is kernel bypass a technique used within "zero copy networking" and this is the relationship?
It is the opposite. Kernel bypass can use zero copy and most likely will support it, as the buffers are completely under the control of the application. Also, there is no memory sharing between the kernel and user space (meaning no need for MMU shared pages and whatever cache/TLB effects that may cause). So if you are using kernel bypass, it will often be advantageous to support zero copy; this is why the two may seem the same at first.
If scatter-gather DMA is available (most modern controllers), either user space or the kernel can use it. Zero copy is not as useful in this case.
Reference:
Technical reference on OnLoad, a high-bandwidth kernel-bypass system.
PF_RING, as of 2.6.32, if configured.
Linux kernel network buffer management by David Miller. This gives an idea of how the protocol headers/trailers are managed in the kernel.
Zero-copy networking
You're doing zero-copy networking when you never copy the data between user space and kernel space (I mean memory space). For example:
C language
recv(fd, buffer, BUFFER_SIZE, 0);
By default the data are copied:
The kernel gets the data from the network stack
The kernel copies this data to the buffer, which is in the user-space.
With a zero-copy method, the data are not copied and reach user space directly from the network stack.
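One concrete Linux mechanism along these lines is the AF_PACKET mmap'ed ring (PACKET_RX_RING, a.k.a. PACKET_MMAP); a trimmed-down sketch, with error handling and the poll()/consume loop omitted:

    #include <arpa/inet.h>
    #include <linux/if_ether.h>
    #include <linux/if_packet.h>
    #include <sys/mman.h>
    #include <sys/socket.h>

    /* Map a TPACKET_V1 receive ring: the kernel writes received frames into
     * memory that is mmap'ed into the process, so no recv() copy is needed.
     * Needs CAP_NET_RAW; error handling and the poll()/consume loop omitted. */
    static void *setup_rx_ring(int *sock_out, struct tpacket_req *req)
    {
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

        req->tp_block_size = 4096;                 /* one page per block */
        req->tp_frame_size = 2048;                 /* two frames per block */
        req->tp_block_nr   = 64;
        req->tp_frame_nr   = (4096 / 2048) * 64;

        setsockopt(fd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req));

        *sock_out = fd;
        return mmap(NULL, (size_t)req->tp_block_size * req->tp_block_nr,
                    PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    }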
Kernel Bypass
Kernel bypass is when you manage the network stack and hardware stuff yourself, in user space. It is hard, but you will gain a lot of performance (there is zero copy, since all the data stay in user space). This link could be interesting if you want more information.
ZERO-COPY:
When transmitting and receiving packets, all packet data normally has to be copied from user-space buffers to kernel-space buffers for transmitting, and vice versa for receiving. A zero-copy driver avoids this by having user space and the driver share packet buffer memory directly.
Instead of having the transmit and receive descriptors point to buffers in kernel space that would later have to be copied, a region of memory in user space is allocated and mapped to a given region of physical memory, shared between the kernel buffers and the user-space buffers; each descriptor buffer then points to its corresponding place in the newly allocated memory.
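A user-space sketch of that pattern, assuming a hypothetical /dev/pktbuf device whose driver implements mmap over its packet buffer region:

    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Hypothetical /dev/pktbuf device whose driver maps its packet buffers
     * into user space; the hardware descriptors then point into this same
     * physical region, so rx/tx needs no copy. */
    static void *map_packet_buffers(size_t len)
    {
        int fd = open("/dev/pktbuf", O_RDWR);  /* device name is made up */
        void *p;

        if (fd < 0)
            return NULL;
        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);                              /* the mapping stays valid */
        return p == MAP_FAILED ? NULL : p;
    }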
Other examples of kernel bypass and zero copy are DPDK and RDMA. When an application uses DPDK it bypasses the kernel TCP/IP stack: the application creates the Ethernet frames and the NIC grabs those frames with DMA directly from user-space memory, so it is zero copy because there is no copy from user space to kernel space. Applications can do similar things with RDMA: the application writes to queue pairs that the NIC accesses and transmits directly. RDMA libibverbs is used inside the kernel as well, so when iSER uses RDMA it is not kernel bypass, but it is zero copy.
http://dpdk.org/
https://www.openfabrics.org/index.php/openfabrics-software.html
I have a Linux device driver that interfaces to a device that, in theory, can perform DMA using 64-bit addresses. I'd like to test to see that this actually works.
Is there a simple way that I can force a Linux machine not to use any memory below physical address 4G? It's OK if the kernel image is in low memory; I just want to be able to force a situation where I know all my dynamically allocated buffers, and any kernel or user buffers allocated for me are not addressable in 32 bits. This is a little brute force, but would be more comprehensive than anything else I can think of.
This should help me catch (1) hardware that wasn't configured correctly or loaded with the full address (or is just plain broken) as well as (2) accidental and unnecessary use of bounce buffers (because there's nowhere to bounce to).
clarification: I'm running x86_64, so I don't care about most of the old 32-bit addressing issues. I just want to test that a driver can correctly interface with multitudes of buffers using 64-bit physical addresses.
/usr/src/linux/Documentation/kernel-parameters.txt
memmap=exactmap [KNL,X86] Enable setting of an exact
E820 memory map, as specified by the user.
Such memmap=exactmap lines can be constructed based on
BIOS output or other requirements. See the memmap=nn@ss
option description.
memmap=nn[KMG]@ss[KMG]
[KNL] Force usage of a specific region of memory
Region of memory to be used, from ss to ss+nn.
memmap=nn[KMG]#ss[KMG]
[KNL,ACPI] Mark specific memory as ACPI data.
Region of memory to be used, from ss to ss+nn.
memmap=nn[KMG]$ss[KMG]
[KNL,ACPI] Mark specific memory as reserved.
Region of memory to be used, from ss to ss+nn.
Example: Exclude memory from 0x18690000-0x1869ffff
memmap=64K$0x18690000
or
memmap=0x10000$0x18690000
If you add memmap=4G$0 to the kernel's boot parameters, the lower 4GB of physical memory will no longer be accessible. Also, your system will no longer boot... but some variation hereof (memmap=3584M$512M?) may allow for enough memory below 4GB for the system to boot but not enough that your driver's DMA buffers will be allocated there.
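On the driver side, a hedged sketch of how you might confirm the buffers really land above 4 GiB (function and helper names here are illustrative, and the exact DMA API spelling varies by kernel version):

    #include <linux/dma-mapping.h>
    #include <linux/pci.h>

    /* Illustrative only: request a 64-bit DMA mask, allocate one buffer and
     * log its bus address. Booted with memmap=4G$0 (or a variation), the
     * printed address should be >= 0x100000000 and no bouncing should occur. */
    static int mydev_test_high_dma(struct pci_dev *pdev)
    {
        void *cpu;
        dma_addr_t bus;

        if (dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64)))
            return -EIO;                       /* platform refuses 64-bit DMA */

        cpu = dma_alloc_coherent(&pdev->dev, PAGE_SIZE, &bus, GFP_KERNEL);
        if (!cpu)
            return -ENOMEM;

        dev_info(&pdev->dev, "test buffer at bus address %pad\n", &bus);
        dma_free_coherent(&pdev->dev, PAGE_SIZE, cpu, bus);
        return 0;
    }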
IIRC there's an option within the kernel configuration to use PAE extensions, which will enable you to use more than 4GB (I am a bit rusty on kernel config - the last kernel I recompiled was 2.6.4 - so please excuse my lack of recall). You do know how to trigger a kernel config?
make clean && make menuconfig
Hope this helps,
Best regards,
Tom.
I know that information exchange can happen via the following interfaces between kernel and user-space programs:
system calls
ioctls
/proc & /sys
netlink
I want to find out:
Whether I have missed any other interface.
Which one of them is the fastest way to exchange large amounts of data?
(and if there is any document/mail/explanation supporting such a claim that I can refer to)
Which one is the recommended way to communicate? (I think it's netlink, but I would still love to hear opinions.)
The fastest way to exchange vast amounts of data is memory mapping. The mmap call can be used on a device file, and the corresponding kernel driver can then decide to map kernel memory into the user address space. A good example of this is the Video For Linux drivers, and I suppose the frame buffer driver works the same way. For a good explanation of how the V4L2 driver works, you have:
The lwn.net article about streaming I/O
The V4L2 spec itself
You can't beat memory mapping for large amounts of data, because there is no memcpy-like operation involved; the underlying physical memory is effectively shared between kernel and user space. Of course, as with all shared memory mechanisms, you have to provide some synchronisation so that kernel and user space don't think they have ownership at the same time.
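For the driver side, a minimal sketch of such an mmap handler using remap_pfn_range (the buffer fields and names are hypothetical; a real driver like V4L2 does considerably more bookkeeping):

    #include <linux/fs.h>
    #include <linux/mm.h>
    #include <linux/module.h>

    static phys_addr_t buf_phys;  /* physical address of the driver's buffer (hypothetical) */
    static size_t      buf_size;  /* its size, set at init time */

    /* .mmap handler of a character device: expose a physically contiguous
     * kernel buffer to user space instead of copying it. */
    static int mydrv_mmap(struct file *filp, struct vm_area_struct *vma)
    {
        unsigned long size = vma->vm_end - vma->vm_start;

        if (size > buf_size)
            return -EINVAL;
        return remap_pfn_range(vma, vma->vm_start, buf_phys >> PAGE_SHIFT,
                               size, vma->vm_page_prot);
    }

    static const struct file_operations mydrv_fops = {
        .owner = THIS_MODULE,
        .mmap  = mydrv_mmap,
    };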
Shared memory between kernel and userspace is doable.
See http://kerneltrap.org/node/14326
for instructions/examples.
You can also use a named pipe, which is pretty fast.
All this really depends on what data you are sharing, whether it is accessed concurrently, and how the data is structured. Calls may be enough for simple data.
Linux kernel /proc FIFO/pipe
Might also help
Good luck!
You may also consider relay (formerly relayfs):
"Basically relayfs is just a bunch of per-cpu kernel buffers that can be efficiently written into from kernel code. These buffers are represented as files which can be mmap'ed and directly read from in user space. The purpose of this setup is to provide the simplest possible mechanism allowing potentially large amounts of data to be logged in the kernel and 'relayed' to user space."
http://relayfs.sourceforge.net/
You can obviously do shared memory with copy_from_user etc., and you can easily set up a character device driver (basically all you have to do is create a file_operations structure), but this is by far not the fastest way.
I have no benchmarks, but system calls on modern systems should be the fastest. My reasoning is that it's what has been optimized the most. It used to be that to get from user to kernel mode one had to raise an interrupt, which would then go to the interrupt table (an array), locate the interrupt handler (0x80), and then switch to kernel mode. This was really slow, and then came the sysenter instruction, which makes this process really fast. Without going into details, sysenter loads CS:EIP directly from machine registers, and the transition is quite fast.
Shared memory, on the contrary, requires writing to and reading from memory, which is far more expensive than reading from a register.
Here is a possible compilation of all the possible interfaces, although in some ways they overlap one another (e.g., sockets and system calls both effectively use system calls):
Procfs
Sysfs
Configfs
Debugfs
Sysctl
devfs (eg, Character Devices)
TCP/UDP Sockets
Netlink Sockets
Ioctl
Kernel System Calls
Signals
Mmap
As for shared memory, I've found that even with NUMA, two threads running on two different cores communicating through shared memory still require reads/writes through the L3 cache, which, if you are lucky (both cores on one socket), is about 2x slower than a syscall, and, if not on one socket, about 5x or more slower than a syscall. I think the syscall's hardware mechanism helps here.