Converting EFI memory map to E820 map - Linux

I am new to Linux and learning about how Linux comes to know about the available physical memory. I came to know there is a BIOS system call, int 0x15, which gives you the E820 memory map.
Now I found a piece of code which says it is the definition for converting an EFI memory map to an E820 memory map. What does this mean?
Does it mean that the underlying motherboard firmware is EFI-based, but since this code runs on x86 we need to convert it to an E820 memory map?
If so, does x86 only know about E820 memory maps?
What is the difference between E820 and EFI memory maps?
Looking forward to a detailed answer on this.

In both cases, what you have is your firmware (BIOS or EFI) which is responsible for detecting what memory (and how much) is actually physically plugged in, and the operating system which needs to know this information in some format.
Does it mean that the underlying motherboard firmware is EFI-based, but since this code runs on x86 we need to convert it to an E820 memory map?
Your confusion here seems to be the idea that EFI and x86 are incompatible - they aren't. EFI firmware has its own mechanisms for reporting available memory - specifically, you can use the GetMemoryMap boot service (before you invoke ExitBootServices) to retrieve the memory map from the firmware. However, critically, this memory map is in the format the EFI firmware wishes to report (EFI_MEMORY_DESCRIPTOR) rather than E820. In this scenario, you would not also attempt int 15h, since you already have the information you need.
I suspect what the Linux kernel does is to use the E820 format as its internal representation of memory on the x86 architecture. However, when booting EFI, the kernel must use the EFI firmware boot services, but chooses to convert the answer it gets back into the E820 format.
This is not a necessary thing for a kernel you are writing to do. You simply need to know how memory is mapped.
It is also the case that some bootloaders will provide this information for you, for example GRUB. Part of the multiboot specification allows you to instruct the bootloader that it must provide this information to your kernel.
For more on this, the ever-useful osdev wiki has code samples etc. The relevant sections for getting memory maps from grub are here.
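To give a feel for what that looks like, here is a minimal, self-contained sketch of walking the memory map a Multiboot (v1) compliant bootloader such as GRUB passes to your kernel. The structure layouts follow the Multiboot specification, but the type names (mb_info, mb_mmap_entry) are shorthand for this example rather than the official multiboot.h definitions:
#include <stdint.h>

struct mb_mmap_entry {
    uint32_t size;          /* size of this entry, excluding this field */
    uint64_t addr;          /* start of the region */
    uint64_t len;           /* length of the region in bytes */
    uint32_t type;          /* 1 = usable RAM, anything else = reserved */
} __attribute__((packed));

struct mb_info {
    uint32_t flags;
    uint32_t mem_lower, mem_upper;
    uint32_t boot_device, cmdline, mods_count, mods_addr;
    uint32_t syms[4];
    uint32_t mmap_length;   /* length of the map in bytes, not entries */
    uint32_t mmap_addr;     /* physical address of the first entry */
    /* ... remaining fields omitted ... */
};

void walk_multiboot_mmap(const struct mb_info *mbi)
{
    /* bit 6 of flags says the mmap_* fields are valid */
    if (!(mbi->flags & (1 << 6)))
        return;

    uint32_t cur = mbi->mmap_addr;
    uint32_t end = mbi->mmap_addr + mbi->mmap_length;

    while (cur < end) {
        const struct mb_mmap_entry *e = (const struct mb_mmap_entry *)(uintptr_t)cur;
        /* consume e->addr, e->len and e->type here */
        cur += e->size + sizeof(e->size);   /* size excludes its own field */
    }
}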
Further points:
The OS needs to understand what memory is mapped where for several reasons. One is to avoid using physical memory where firmware services reside, but the other is for communication with devices who share memory with the CPU. The video buffer is a common example of this.
Secondly, listing the memory map in EFI is not too difficult. If you haven't already discovered it, the UEFI shell that comes with some firmware has a memmap command to display the memory map. If you want to implement this yourself, a quick and dirty way to do that looks like this:
EFI_STATUS EFIAPI PrintMemoryMap(EFI_SYSTEM_TABLE* SystemTable)
{
    EFI_STATUS status = EFI_SUCCESS;
    UINTN MemMapSize = sizeof(EFI_MEMORY_DESCRIPTOR)*16;
    UINTN MemMapSizeOut = MemMapSize;
    UINTN MemMapKey = 0; UINTN MemMapDescriptorSize = 0;
    UINT32 MemMapDescriptorVersion = 0;
    UINTN DescriptorCount = 0;
    UINTN i = 0;
    uint8_t* buffer = NULL;
    EFI_MEMORY_DESCRIPTOR* MemoryDescriptorPtr = NULL;

    do
    {
        buffer = AllocatePool(MemMapSize);
        if ( buffer == NULL ) break;

        /* tell GetMemoryMap how big the buffer we just allocated really is;
           on EFI_BUFFER_TOO_SMALL the firmware writes the required size back here */
        MemMapSizeOut = MemMapSize;
        status = gBS->GetMemoryMap(&MemMapSizeOut, (EFI_MEMORY_DESCRIPTOR*)buffer,
                                   &MemMapKey, &MemMapDescriptorSize, &MemMapDescriptorVersion);
        Print(L"MemoryMap: Status %x\n", status);
        if ( status != EFI_SUCCESS )
        {
            FreePool(buffer);
            MemMapSize += sizeof(EFI_MEMORY_DESCRIPTOR)*16;
        }
    } while ( status != EFI_SUCCESS );

    if ( buffer != NULL )
    {
        DescriptorCount = MemMapSizeOut / MemMapDescriptorSize;
        MemoryDescriptorPtr = (EFI_MEMORY_DESCRIPTOR*)buffer;
        Print(L"MemoryMap: DescriptorCount %d\n", DescriptorCount);

        /* step by the descriptor size the firmware reported, not sizeof(EFI_MEMORY_DESCRIPTOR) */
        for ( i = 0; i < DescriptorCount; i++ )
        {
            MemoryDescriptorPtr = (EFI_MEMORY_DESCRIPTOR*)(buffer + (i*MemMapDescriptorSize));
            Print(L"Type: %d PhysicalStart: %lx VirtualStart: %lx NumberofPages: %d Attribute %lx\n",
                  MemoryDescriptorPtr->Type, MemoryDescriptorPtr->PhysicalStart,
                  MemoryDescriptorPtr->VirtualStart, MemoryDescriptorPtr->NumberOfPages,
                  MemoryDescriptorPtr->Attribute);
        }
        FreePool(buffer);
    }
    return status;
}
This is a reasonably straightforward function. GetMemoryMap complains bitterly if you don't pass in a large enough buffer, so we keep incrementing the buffer size until we have enough space. Then we loop and print. Be aware that sizeof(EFI_MEMORY_DESCRIPTOR) is in fact not the difference between structs in the output buffer - use the returned size calculation shown above, or you'll end up with a much larger table than you really have (and the address spaces will all look wrong).
It wouldn't be massively difficult to convert this table into a common format with E820.
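To illustrate, here is a rough sketch of what such a conversion might look like. The e820_entry layout matches the classic BIOS format, and the type mapping is a simplification of what the Linux EFI boot stub does (the real code also merges and sanitises entries), so treat this as an outline rather than the kernel's actual implementation:
/* E820-style entry as used by the classic BIOS interface. */
struct e820_entry {
    UINT64 addr;
    UINT64 size;
    UINT32 type;   /* 1=RAM, 2=reserved, 3=ACPI reclaim, 4=ACPI NVS, 5=unusable */
} __attribute__((packed));

static UINT32 efi_type_to_e820(UINT32 efi_type)
{
    switch (efi_type) {
    case EfiConventionalMemory:
    case EfiLoaderCode:
    case EfiLoaderData:
    case EfiBootServicesCode:
    case EfiBootServicesData:
        return 1;               /* usable RAM once boot services have exited */
    case EfiACPIReclaimMemory:
        return 3;
    case EfiACPIMemoryNVS:
        return 4;
    case EfiUnusableMemory:
        return 5;
    default:
        return 2;               /* runtime services, MMIO, etc. -> reserved */
    }
}

static UINTN efi_map_to_e820(UINT8* buffer, UINTN map_size, UINTN desc_size,
                             struct e820_entry* out)
{
    UINTN count = map_size / desc_size;
    UINTN i, n = 0;

    for (i = 0; i < count; i++) {
        EFI_MEMORY_DESCRIPTOR* d = (EFI_MEMORY_DESCRIPTOR*)(buffer + i*desc_size);

        out[n].addr = d->PhysicalStart;
        out[n].size = d->NumberOfPages * 4096;   /* EFI pages are 4KiB */
        out[n].type = efi_type_to_e820(d->Type);
        n++;
        /* a real implementation would also merge adjacent entries of the same type */
    }
    return n;
}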

Related

Zero-copy user-space TCP send of dma_mmap_coherent() mapped memory

I'm running Linux 5.1 on a Cyclone V SoC, which is an FPGA with two ARMv7 cores in one chip. My goal is to gather lots of data from an external interface and stream (part of) this data out through a TCP socket. The challenge here is that the data rate is very high and could come close to saturating the GbE interface. I have a working implementation that just uses write() calls to the socket, but it tops out at 55MB/s; roughly half the theoretical GbE limit. I'm now trying to get zero-copy TCP transmission to work to increase the throughput, but I'm hitting a wall.
To get the data out of the FPGA into Linux user-space, I've written a kernel driver. This driver uses a DMA block in the FPGA to copy a large amount of data from an external interface into DDR3 memory attached to the ARMv7 cores. The driver allocates this memory as a bunch of contiguous 1MB buffers when probed using dma_alloc_coherent() with GFP_USER, and exposes these to the userspace application by implementing mmap() on a file in /dev/ and returning an address to the application using dma_mmap_coherent() on the preallocated buffers.
So far so good; the user-space application is seeing valid data and the throughput is more than enough at >360MB/s with room to spare (the external interface is not fast enough to really see what the upper bound is).
To implement zero-copy TCP networking, my first approach was to use SO_ZEROCOPY on the socket:
sent_bytes = send(fd, buf, len, MSG_ZEROCOPY);
if (sent_bytes < 0) {
    perror("send");
    return -1;
}
However, this results in send: Bad address.
After googling for a bit, my second approach was to use a pipe and splice() followed by vmsplice():
ssize_t sent_bytes;
int pipes[2];
struct iovec iov = {
    .iov_base = buf,
    .iov_len = len
};

pipe(pipes);

sent_bytes = vmsplice(pipes[1], &iov, 1, 0);
if (sent_bytes < 0) {
    perror("vmsplice");
    return -1;
}

sent_bytes = splice(pipes[0], 0, fd, 0, sent_bytes, SPLICE_F_MOVE);
if (sent_bytes < 0) {
    perror("splice");
    return -1;
}
However, the result is the same: vmsplice: Bad address.
Note that if I replace the call to vmsplice() or send() with a function that just prints the data pointed to by buf (or a send() without MSG_ZEROCOPY), everything works just fine; so the data is accessible to userspace, but the vmsplice()/send(..., MSG_ZEROCOPY) calls seem unable to handle it.
What am I missing here? Is there any way of using zero-copy TCP sending with a user-space address obtained from a kernel driver through dma_mmap_coherent()? Is there another approach I could use?
UPDATE
So I dove a bit deeper into the sendmsg() MSG_ZEROCOPY path in the kernel, and the call that eventually fails is get_user_pages_fast(). This call returns -EFAULT because check_vma_flags() finds the VM_PFNMAP flag set in the vma. This flag is apparently set when the pages are mapped into user space using remap_pfn_range() or dma_mmap_coherent(). My next approach is to find another way to mmap these pages.
As I posted in an update in my question, the underlying problem is that zerocopy networking does not work for memory that has been mapped using remap_pfn_range() (which dma_mmap_coherent() happens to use under the hood as well). The reason is that this type of memory (with the VM_PFNMAP flag set) does not have metadata in the form of struct page* associated with each page, which it needs.
The solution then is to allocate the memory in a way that struct page*s are associated with the memory.
The workflow that now works for me to allocate the memory is:
Use struct page* page = alloc_pages(GFP_USER, page_order); to allocate a block of contiguous physical memory, where the number of contiguous pages that will be allocated is given by 2**page_order.
Split the high-order/compound page into 0-order pages by calling split_page(page, page_order);. This now means that struct page* page has become an array with 2**page_order entries (see the sketch below).
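A minimal sketch of those two allocation steps, using a helper name of my own choosing and leaving out everything except the happy path:
#include <linux/gfp.h>
#include <linux/mm.h>

/* Allocate 2^page_order physically contiguous pages and split the compound
   page into independent 0-order pages, so that each one has its own usable
   struct page for vm_insert_page()/get_user_pages(). */
static struct page *alloc_split_buffer(unsigned int page_order)
{
    struct page *page = alloc_pages(GFP_USER, page_order);

    if (!page)
        return NULL;

    split_page(page, page_order);
    /* page[0] .. page[(1 << page_order) - 1] are now ordinary 0-order pages */
    return page;
}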
Now to submit such a region to the DMA (for data reception):
dma_addr = dma_map_page(dev, page, 0, length, DMA_FROM_DEVICE);
dma_desc = dmaengine_prep_slave_single(dma_chan, dma_addr, length, DMA_DEV_TO_MEM, 0);
dmaengine_submit(dma_desc);
When we get a callback from the DMA that the transfer has finished, we need to unmap the region to transfer ownership of this block of memory back to the CPU, which takes care of caches to make sure we're not reading stale data:
dma_unmap_page(dev, dma_addr, length, DMA_FROM_DEVICE);
Now, when we want to implement mmap(), all we really have to do is call vm_insert_page() repeatedly for all of the 0-order pages that we pre-allocated:
static int my_mmap(struct file *file, struct vm_area_struct *vma) {
    int res;
    ...
    /* one vm_insert_page() per 0-order page; 2**page_order of them in total */
    for (i = 0; i < (1 << page_order); ++i) {
        if ((res = vm_insert_page(vma, vma->vm_start + i*PAGE_SIZE, &page[i])) < 0) {
            break;
        }
    }
    vma->vm_flags |= VM_LOCKED | VM_DONTCOPY | VM_DONTEXPAND | VM_DENYWRITE;
    ...
    return res;
}
When the file is closed, don't forget to free the pages:
for (i = 0; i < (1 << page_order); ++i) {
    __free_page(&dev->shm[i].pages[i]);
}
Implementing mmap() this way now allows a socket to use this buffer for sendmsg() with the MSG_ZEROCOPY flag.
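For completeness, the user-space side of that sendmsg()/MSG_ZEROCOPY path looks roughly like the sketch below, assuming a sufficiently recent kernel and libc for SO_ZEROCOPY. A real application must also read the completion notifications from the socket's error queue before reusing the buffer, which this sketch skips:
#include <stdio.h>
#include <sys/socket.h>

/* fd is a connected TCP socket; buf/len come from mmap() on the driver node. */
static int send_zerocopy(int fd, const void *buf, size_t len)
{
    int one = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0) {
        perror("setsockopt(SO_ZEROCOPY)");
        return -1;
    }

    ssize_t sent = send(fd, buf, len, MSG_ZEROCOPY);
    if (sent < 0) {
        perror("send(MSG_ZEROCOPY)");
        return -1;
    }

    /* The buffer must stay untouched until the corresponding completion
       shows up on the socket's MSG_ERRQUEUE. */
    return 0;
}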
Although this works, there are two things that don't sit well with me with this approach:
You can only allocate power-of-2-sized buffers with this method, although you could implement logic to call alloc_pages as many times as needed with decreasing orders to get any size buffer made up of sub-buffers of varying sizes. This will then require some logic to tie these buffers together in the mmap() and to DMA them with scatter-gather (sg) calls rather than single.
split_page() says in its documentation:
* Note: this is probably too low level an operation for use in drivers.
* Please consult with lkml before using this in your driver.
These issues would be easily solved if there were some interface in the kernel to allocate an arbitrary number of contiguous physical pages. I don't know why there isn't, but I don't find the above issues so important as to go digging into why this isn't available / how to implement it :-)
Maybe this will help you understand why alloc_pages requires a power-of-2 number of pages.
To optimize the page allocation process (and decrease external fragmentation), which is exercised very frequently, the Linux kernel developed the per-CPU page cache and the buddy allocator to allocate memory (there is another allocator, slab, to serve memory allocations that are smaller than a page).
The per-CPU page cache serves single-page allocation requests, while the buddy allocator keeps 11 lists, containing 2^0 through 2^10 physical pages respectively. These lists perform well when allocating and freeing pages, and of course the premise is that you are requesting a power-of-2-sized buffer.
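As a practical aside, if you have a byte count rather than an order, the kernel's get_order() helper does the power-of-2 rounding for you (at the cost of over-allocating up to the next power of two). A tiny sketch, with a made-up helper name:
#include <linux/gfp.h>
#include <asm/page.h>   /* get_order(), PAGE_SIZE */

/* Round an arbitrary byte count up to the order alloc_pages() expects,
   e.g. 5 * PAGE_SIZE becomes order 3 (8 pages). */
static struct page *alloc_for_size(size_t nbytes)
{
    unsigned int order = get_order(nbytes);

    return alloc_pages(GFP_USER, order);
}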

Why shouldn't I use ioremap on system memory for ARMv6+?

I need to reserve a large buffer of physically contiguous RAM from the kernel and be able to guarantee that the buffer will always use a specific, hard-coded physical address. This buffer should remain reserved for the kernel's entire lifetime. I have written a chardev driver as an interface for accessing this buffer in userspace. My platform is an embedded system with ARMv7 architecture running a 2.6 Linux kernel.
Chapter 15 of Linux Device Drivers, Third Edition has the following to say on the topic (page 443):
Reserving the top of RAM is accomplished by passing a mem= argument to the kernel at boot time. For example, if you have 256 MB, the argument mem=255M keeps the kernel from using the top megabyte. Your module could later use the following code to gain access to such memory:
dmabuf = ioremap (0xFF00000 /* 255M */, 0x100000 /* 1M */);
I've done that plus a couple of other things:
I'm using the memmap bootarg in addition to the mem one. The kernel boot parameters documentation suggests always using memmap whenever you use mem to avoid address collisions.
I used request_mem_region before calling ioremap and, of course, I check that it succeeds before moving ahead.
This is what the system looks like after I've done all that:
# cat /proc/cmdline
root=/dev/mtdblock2 console=ttyS0,115200 init=/sbin/preinit earlyprintk debug mem=255M memmap=1M$255M
# cat /proc/iomem
08000000-0fffffff : PCIe Outbound Window, Port 0
08000000-082fffff : PCI Bus 0001:01
08000000-081fffff : 0001:01:00.0
08200000-08207fff : 0001:01:00.0
18000300-18000307 : serial
18000400-18000407 : serial
1800c000-1800cfff : dmu_regs
18012000-18012fff : pcie0
18013000-18013fff : pcie1
18014000-18014fff : pcie2
19000000-19000fff : cru_regs
1e000000-1fffffff : norflash
40000000-47ffffff : PCIe Outbound Window, Port 1
40000000-403fffff : PCI Bus 0002:01
40000000-403fffff : 0002:01:00.0
40400000-409fffff : PCI Bus 0002:01
40400000-407fffff : 0002:01:00.0
40800000-40807fff : 0002:01:00.0
80000000-8fefffff : System RAM
80052000-8045dfff : Kernel text
80478000-80500143 : Kernel data
8ff00000-8fffffff : foo
Everything so far looks good, and my driver is working perfectly. I'm able to read and write directly to the specific physical address I've chosen.
However, during bootup, a big scary warning (™) was triggered:
BUG: Your driver calls ioremap() on system memory. This leads
to architecturally unpredictable behaviour on ARMv6+, and ioremap()
will fail in the next kernel release. Please fix your driver.
------------[ cut here ]------------
WARNING: at arch/arm/mm/ioremap.c:211 __arm_ioremap_pfn_caller+0x8c/0x144()
Modules linked in:
[] (unwind_backtrace+0x0/0xf8) from [] (warn_slowpath_common+0x4c/0x64)
[] (warn_slowpath_common+0x4c/0x64) from [] (warn_slowpath_null+0x1c/0x24)
[] (warn_slowpath_null+0x1c/0x24) from [] (__arm_ioremap_pfn_caller+0x8c/0x144)
[] (__arm_ioremap_pfn_caller+0x8c/0x144) from [] (__arm_ioremap_caller+0x50/0x58)
[] (__arm_ioremap_caller+0x50/0x58) from [] (foo_init+0x204/0x2b0)
[] (foo_init+0x204/0x2b0) from [] (do_one_initcall+0x30/0x19c)
[] (do_one_initcall+0x30/0x19c) from [] (kernel_init+0x154/0x218)
[] (kernel_init+0x154/0x218) from [] (kernel_thread_exit+0x0/0x8)
---[ end trace 1a4cab5dbc05c3e7 ]---
Triggered from: arch/arm/mm/ioremap.c
/*
 * Don't allow RAM to be mapped - this causes problems with ARMv6+
 */
if (pfn_valid(pfn)) {
    printk(KERN_WARNING "BUG: Your driver calls ioremap() on system memory. This leads\n"
           KERN_WARNING "to architecturally unpredictable behaviour on ARMv6+, and ioremap()\n"
           KERN_WARNING "will fail in the next kernel release. Please fix your driver.\n");
    WARN_ON(1);
}
What problems, exactly, could this cause? Can they be mitigated? What are my alternatives?
So I've done exactly that, and it's working.
Provide the kernel command line (e.g. /proc/cmdline) and the resulting memory map (i.e. /proc/iomem) to verify this.
What problems, exactly, could this cause?
The problem with using ioremap() on system memory is that you end up assigning conflicting attributes to the memory which causes "unpredictable" behavior.
See the article "ARM's multiply-mapped memory mess", which provides a history to the warning that you are triggering.
The ARM kernel maps RAM as normal memory with writeback caching; it's also marked non-shared on uniprocessor systems. The ioremap() system call, used to map I/O memory for CPU use, is different: that memory is mapped as device memory, uncached, and, maybe, shared. These different mappings give the expected behavior for both types of memory. Where things get tricky is when somebody calls ioremap() to create a new mapping for system RAM.
The problem with these multiple mappings is that they will have differing attributes. As of version 6 of the ARM architecture, the specified behavior in that situation is "unpredictable."
Note that "system memory" is the RAM that is managed by the kernel.
The fact that you trigger the warning indicates that your code is generating multiple mappings for a region of memory.
Can they be mitigated?
You have to ensure that the RAM you want to ioremap() is not "system memory", i.e. managed by the kernel.
See also this answer.
ADDENDUM
This warning that concerns you is the result of pfn_valid(pfn) returning TRUE rather than FALSE.
Based on the Linux cross-reference link that you provided for version 2.6.37,
pfn_valid() is simply returning the result of
memblock_is_memory(pfn << PAGE_SHIFT);
which in turn is simply returning the result of
memblock_search(&memblock.memory, addr) != -1;
I suggest that the kernel code be hacked so that the conflict is revealed.
Before the call to ioremap(), assign TRUE to the global variable memblock_debug.
The following patch should display the salient information about the memory conflict.
(The memblock list is ordered by base-address, so memblock_search() performs a binary search on this list, hence the use of mid as the index.)
static int __init_memblock memblock_search(struct memblock_type *type, phys_addr_t addr)
{
    unsigned int left = 0, right = type->cnt;

    do {
        unsigned int mid = (right + left) / 2;

        if (addr < type->regions[mid].base)
            right = mid;
        else if (addr >= (type->regions[mid].base +
                          type->regions[mid].size))
            left = mid + 1;
-       else
+       else {
+           if (memblock_debug)
+               pr_info("MATCH for 0x%x: m=0x%x b=0x%x s=0x%x\n",
+                       addr, mid,
+                       type->regions[mid].base,
+                       type->regions[mid].size);
            return mid;
+       }
    } while (left < right);
    return -1;
}
If you want to see all the memory blocks, then call memblock_dump_all() with the variable memblock_debug set to TRUE.
[Interesting that this is essentially a programming question, yet we haven't seen any of your code.]
ADDENDUM 2
Since you're probably using ATAGs (instead of Device Tree), and you want to dedicate a memory region, fix up the ATAG_MEM to reflect this smaller size of physical memory.
Assuming you have made zero changes to your boot code, the ATAG_MEM is still specifying the full RAM, so perhaps this could be the source of the system memory conflict that causes the warning.
See this answer about ATAGs and this related answer.

Accessing memory pointers in hardware registers

I'm working on enhancing the stock ahci driver provided in Linux in order to perform some needed tasks. I'm at the point of attempting to issue commands to the AHCI HBA for the hard drive to process. However, whenever I do so, my system locks up and reboots. Explaining the process of issuing a command to an AHCI drive is far too much for this question. If needed, reference this link for the full discussion (the process is spread across several pieces; however, ch 4 has the data structures necessary).
Essentially, one writes the appropriate structures into memory regions defined by either the BIOS or the OS. The first memory region I should write to is the Command List Base Address contained in the register PxCLB (and PxCLBU if 64-bit addressing applies). My system is 64-bit and so I'm trying to read both 32-bit registers. My code is essentially this:
void __iomem * pbase = ahci_port_base(ap);
u32 __iomem *temp = (u32*)(pbase + PORT_LST_ADDR);
struct ahci_cmd_hdr *cmd_hdr = NULL;
cmd_hdr = (struct ahci_cmd_hdr*)(u64)
((u64)(*(temp + PORT_LST_ADDR_HI)) << 32 | *temp);
pr_info("%s:%d cmd_list is %p\n", __func__, __LINE__, cmd_hdr);
// problems with this next line, makes the system reboot
//pr_info("%s:%d cl[0]:0x%08x\n", __func__, __LINE__, cmd_hdr->opts);
The function ahci_port_base() is found in the ahci driver (at least it is for CentOS 6.x). Basically, it returns the proper address for that port in the AHCI memory region. PORT_LST_ADDR and PORT_LST_ADDR_HI are both macros defined in that driver. The address that I get after getting both the high and low addresses is usually something like 0x0000000037900000. Is this memory address in a space that I cannot simply dereference it?
I'm hitting my head against the wall at this point because this link shows that accessing it in this manner is essentially how it's done.
The address that I get after getting both the high and low addresses is usually something like 0x0000000037900000. Is this memory address in a space that I cannot simply dereference it?
Yes, you are correct - that's a bus address, and you can't just dereference it because paging is enabled. (You shouldn't be just dereferencing the iomapped addresses either - you should be using readl() / writel() for those, but the breakage here is more subtle).
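For the register reads themselves, a small sketch of what that looks like with readl(), reusing pbase and the PORT_LST_ADDR/PORT_LST_ADDR_HI offsets from the question (the point being that the value you end up with is a bus address for the HBA, not a pointer you can follow):
/* Read the command-list base out of the port's MMIO registers with readl()
   instead of dereferencing the __iomem pointer directly. */
u32 clb_lo = readl(pbase + PORT_LST_ADDR);
u32 clb_hi = readl(pbase + PORT_LST_ADDR_HI);
u64 clb_bus = ((u64)clb_hi << 32) | clb_lo;

/* This is a bus/physical address for the device's use - not dereferenceable. */
pr_info("%s: command list bus address 0x%016llx\n", __func__,
        (unsigned long long)clb_bus);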
It looks like the right way to access the ahci_cmd_hdr in that driver is:
struct ahci_port_priv *pp = ap->private_data;
cmd_hdr = pp->cmd_slot;

Writing out DMA buffers into memory mapped file

I need to write incoming DMA buffers to an HD partition, as the raw device /dev/sda1, as fast as possible in embedded Linux (2.6.37). The buffers are aligned as required and are each 512KB long. The process may continue for a very long time and fill as much as, for example, 256GB of data.
I need to use the memory-mapped file technique (O_DIRECT not applicable), but can't understand the exact way how to do this.
So, in pseudo code "normal" writing:
fd = open("/dev/sda1", O_WRONLY);
while(1) {
    p = GetVirtualPointerToNewBuffer();
    if (InputStopped())
        break;
    write(fd, p, BLOCK512KB);
}
Now, I will be very thankful for the similar pseudo/real code example of how to utilize memory-mapped technique for this writing.
UPDATE2:
Thanks to kestasx the latest working test code looks like following:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define KB (1024)
#define TSIZE (64*KB)

void* TBuf;

int main(int argc, char **argv) {
    int fdi = open("input.dat", O_RDONLY);
    //int fdo = open("/dev/sdb2", O_RDWR);
    int fdo = open("output.dat", O_RDWR);
    int i, offs = 0;
    void* addr;

    i = posix_memalign(&TBuf, TSIZE, TSIZE);
    if ((fdo < 1) || (fdi < 1)) {
        printf("Error in files\n");
        return -1;
    }
    while(1) {
        addr = mmap((void*)TBuf, TSIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fdo, offs);
        if ((unsigned int)addr == 0xFFFFFFFFUL) {   /* MAP_FAILED on this 32-bit target */
            printf("Error MMAP=%d, %s\n", errno, strerror(errno));
            return -1;
        }
        i = read(fdi, TBuf, TSIZE);
        if (i != TSIZE) {
            printf("End of data\n");
            return 0;
        }
        i = munmap(addr, TSIZE);
        offs += TSIZE;
        sleep(1);
    };
}
UPDATE3:
1. To precisely imitate the DMA's work, I need to move the read() call before mmap(), because when the DMA finishes it provides me with the address where it has put the data. So, in pseudo code:
while(1) {
    read(fdi, TBuf, TSIZE);
    addr = mmap((void*)TBuf, TSIZE, PROT_READ|PROT_WRITE, MAP_FIXED|MAP_SHARED, fdo, offs);
    munmap(addr, TSIZE);
    offs += TSIZE;
}
This variant fails after(!) the first loop - read() says BAD ADDRESS on TBuf.
Without understanding exactly what I do, I substituted munmap() with msync(). This worked perfectly.
So, the question here - why does unmapping addr affect TBuf?
2. With the previous example working, I went to the real system with the DMA. The loop is the same, except that instead of the read() call there is a call which waits for a DMA buffer to be ready and provides its virtual address.
There are no errors, the code runs, BUT nothing is recorded (!).
My thought was that Linux does not see that the area was updated and therefore does not sync() a thing.
To test this, I removed the read() call from the working example - and yes, nothing was recorded either.
So, the question here - how can I tell Linux that the mapped region contains new data, please, flush it!
Thanks a lot!!!
If I understand correctly, it makes sense if you mmap() the file (I'm not sure whether you can mmap() a raw partition/block device) and the data from the DMA is written directly to this memory region.
For this to work you need to be able to control p (where the new buffer is placed) or the address where the file is mapped. If you can't, you'll have to copy the memory contents (and will lose some of the benefits of mmap).
So the pseudo code would be:
truncate("data.bin", 256GB);
fd = open( "data.bin", O_RDWR );
p = GetVirtualPointerToNewBuffer();
adr = mmap( p, 1GB, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset_in_file );
startDMA();
waitDMAfinish();
munmap( adr, 1GB );
This is only a first step and I'm not completely sure it will work with DMA (I have no such experience).
I assume it is a 32-bit system, but even then a 1GB mapped file size may be too big (if your RAM is smaller you'll be swapping).
If this setup works, the next step would be to make a loop that maps regions of the file at different offsets and unmaps the already filled ones.
Most likely you'll need to align addr to a 4KB boundary.
When you unmap a region, its data will be synced to disk. So you'll need some testing to select an appropriate mapped region size (while the next region is filled by DMA, there must be enough time to unmap/write the previous one).
UPDATE:
What exactly happens when you fill an mmap'ed region via DMA I simply don't know (I'm not sure exactly how dirty pages are detected: what is done by hardware, and what must be done by software).
UPDATE2: To the best of my knowledge:
DMA works the following way:
the CPU arranges the DMA transfer (the address in RAM where the transferred data is to be written);
the DMA controller does the actual work, while the CPU can do its own work in parallel;
once the DMA transfer is complete, the DMA controller signals the CPU via an IRQ line (interrupt), so the CPU can handle the result.
This seems simple as long as virtual memory is not involved: DMA should work independently of the running process (the actual VM table in use by the CPU). Yet there must be some mechanism to invalidate the CPU cache for physical RAM pages modified by DMA (I don't know whether the CPU needs to do something, or whether it is done automatically by hardware).
mmap() works the following way:
after a successful call to mmap(), the file on disk is attached to a range of process memory (most likely some data structure is filled in the OS kernel to hold this info);
I/O (reading or writing) on the mmap'ed range triggers a page fault, which is handled by the kernel loading the appropriate blocks from the attached file;
writes to the mmap'ed range are handled by hardware (I don't know exactly how: maybe writes to previously unmodified pages trigger some fault, which is handled by the kernel marking these pages dirty; or maybe this marking is done entirely in hardware and the info is available to the kernel when it needs to flush modified pages to disk);
modified (dirty) pages are written to disk by the OS (as it sees fit), or this can be forced via msync() or munmap().
In theory it should be possible to do DMA transfers to an mmap'ed range, but you need to find out how exactly pages are marked dirty (whether you need to do something to inform the kernel which pages need to be written to disk).
UPDATE3:
Even if pages modified by DMA are not marked dirty, you should be able to trigger the marking by rewriting (reading and then writing back the same) at least one value in each page (most likely every 4KB) transferred. Just make sure this rewriting is not removed (optimised out) by the compiler.
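A small user-space sketch of that rewrite trick, assuming the same addr/TSIZE mapping as in the test code above and a 4KB page size; the volatile qualifier is what keeps the compiler from optimising the read-then-write away:
#include <stddef.h>
#include <sys/mman.h>

/* Touch one byte in every page of the DMA-filled mapping so the kernel
   sees the pages as dirty, then force them out to the file/partition. */
static void flush_dma_region(void *addr, size_t size)
{
    volatile unsigned char *p = addr;
    size_t off;

    for (off = 0; off < size; off += 4096)
        p[off] = p[off];            /* read and write back the same value */

    msync(addr, size, MS_SYNC);     /* or rely on munmap() to write it back */
}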
UPDATE4:
It seems a file opened O_WRONLY can't be mmap'ed (see the question comments; my experiments confirm this too). This is a logical conclusion of the mmap() workings described above. The same is confirmed here (with reference to the POSIX standard requirement to ensure the file is readable regardless of the mapping protection flags).
Unless there is some way around this, it actually means that by using mmap() you can't avoid reading the results file (an unnecessary step in your case).
Regarding DMA transfers to the mapped range, I think it will be a requirement to ensure the mapped pages are preallocated before the DMA starts (so there is real memory assigned to both the DMA and the mapped region). On Linux there is the MAP_POPULATE mmap flag, but from the manual it seems it works with MAP_PRIVATE mappings only (changes are not written to disk), so most likely it is unsuitable. Likely you'll have to trigger page faults manually by accessing each mapped page. This should trigger reading of the results file.
If you still wish to use mmap and DMA together, but avoid reading of the results file, you'll have to modify kernel internals to allow mmap to use O_WRONLY files (for example by zero-filling the triggered pages, instead of reading them from disk).

Linux kernel device driver to DMA from a device into user-space memory

I want to get data from a DMA enabled, PCIe hardware device into user-space as quickly as possible.
Q: How do I combine "direct I/O to user-space with/and/via a DMA transfer"
Reading through LDD3, it seems that I need to perform a few different types of IO operations!?
dma_alloc_coherent gives me the physical address that I can pass to the hardware device.
But I would need to set up get_user_pages and perform a copy_to_user-type call when the transfer completes. This seems a waste, asking the device to DMA into kernel memory (acting as a buffer) and then transferring it again to user-space.
LDD3 p453: /* Only now is it safe to access the buffer, copy to user, etc. */
What I ideally want is some memory that:
I can use in user-space (Maybe request driver via a ioctl call to create DMA'able memory/buffer?)
I can get a physical address from to pass to the device so that all user-space has to do is perform a read on the driver
the read method would activate the DMA transfer, block waiting for the DMA complete interrupt and release the user-space read afterwards (user-space is now safe to use/read memory).
Do I need single-page streaming mappings, a setup mapping, and user-space buffers mapped with get_user_pages and dma_map_page?
My code so far sets up get_user_pages at the given address from user-space (I call this the Direct I/O part). Then, dma_map_page with a page from get_user_pages. I give the device the return value from dma_map_page as the DMA physical transfer address.
I am using some kernel modules as reference: drivers/scsi/st.c and drivers/net/sh_eth.c. I would look at the infiniband code, but can't find which one is the most basic!
Many thanks in advance.
I'm actually working on exactly the same thing right now and I'm going the ioctl() route. The general idea is for user space to allocate the buffer which will be used for the DMA transfer and an ioctl() will be used to pass the size and address of this buffer to the device driver. The driver will then use scatter-gather lists along with the streaming DMA API to transfer data directly to and from the device and user-space buffer.
The implementation strategy I'm using is that the ioctl() in the driver enters a loop that DMA's the userspace buffer in chunks of 256k (which is the hardware-imposed limit for how many scatter/gather entries it can handle). This is isolated inside a function that blocks until each transfer is complete (see below). When all bytes are transferred, or the incremental transfer function returns an error, the ioctl() exits and returns to userspace.
Pseudo code for the ioctl()
/* serialize all DMA transfers to/from the device */
if (mutex_lock_interruptible( &device_ptr->mtx ))
    return -EINTR;

chunk_data = (unsigned long) user_space_addr;
while( *transferred < total_bytes && !ret ) {
    chunk_bytes = total_bytes - *transferred;
    if (chunk_bytes > HW_DMA_MAX)
        chunk_bytes = HW_DMA_MAX; /* 256kb limit imposed by my device */
    ret = transfer_chunk(device_ptr, chunk_data, chunk_bytes, transferred);
    chunk_data += chunk_bytes;
    chunk_offset += chunk_bytes;
}
mutex_unlock(&device_ptr->mtx);
mutex_unlock(&device_ptr->mtx);
Pseudo code for incremental transfer function:
/* Assuming the userspace pointer is passed as an unsigned long, */
/* calculate the first, last, and number of pages being transferred via */
first_page = (udata & PAGE_MASK) >> PAGE_SHIFT;
last_page = ((udata+nbytes-1) & PAGE_MASK) >> PAGE_SHIFT;
first_page_offset = udata & PAGE_MASK;
npages = last_page - first_page + 1;

/* Ensure that all userspace pages are locked in memory for the */
/* duration of the DMA transfer */
down_read(&current->mm->mmap_sem);
ret = get_user_pages(current,
                     current->mm,
                     udata,
                     npages,
                     is_writing_to_userspace,
                     0,
                     &pages_array,
                     NULL);
up_read(&current->mm->mmap_sem);

/* Map a scatter-gather list to point at the userspace pages */
/* first */
sg_set_page(&sglist[0], pages_array[0], PAGE_SIZE - fp_offset, fp_offset);
/* middle */
for(i=1; i < npages-1; i++)
    sg_set_page(&sglist[i], pages_array[i], PAGE_SIZE, 0);
/* last */
if (npages > 1) {
    sg_set_page(&sglist[npages-1], pages_array[npages-1],
                nbytes - (PAGE_SIZE - fp_offset) - ((npages-2)*PAGE_SIZE), 0);
}
/* Do the hardware specific thing to give it the scatter-gather list
   and tell it to start the DMA transfer */

/* Wait for the DMA transfer to complete; assuming dma_wait is an embedded
   wait_queue_head_t and flag_dma_done an int, the macro wants the queue and
   the condition themselves, not their addresses */
ret = wait_event_interruptible_timeout( device_ptr->dma_wait,
                                        device_ptr->flag_dma_done, HZ*2 );

if (ret == 0)
    /* DMA operation timed out */
else if (ret == -ERESTARTSYS )
    /* DMA operation interrupted by signal */
else {
    /* DMA success */
    *transferred += nbytes;
    return 0;
}
The interrupt handler is exceptionally brief:
/* Do hardware specific thing to make the device happy */
/* Wake the thread waiting for this DMA operation to complete */
device_ptr->flag_dma_done = 1;
wake_up_interruptible(&device_ptr->dma_wait);
Please note that this is just a general approach, I've been working on this driver for the last few weeks and have yet to actually test it... So please, don't treat this pseudo code as gospel and be sure to double check all logic and parameters ;-).
You basically have the right idea: in 2.1, you can just have userspace allocate any old memory. You do want it page-aligned, so posix_memalign() is a handy API to use.
Then have userspace pass in the userspace virtual address and size of this buffer somehow; ioctl() is a good quick and dirty way to do this. In the kernel, allocate an appropriately sized buffer array of struct page* -- user_buf_size/PAGE_SIZE entries -- and use get_user_pages() to get a list of struct page* for the userspace buffer.
Once you have that, you can allocate an array of struct scatterlist that is the same size as your page array and loop through the list of pages doing sg_set_page(). After the sg list is set up, you do dma_map_sg() on the array of scatterlist and then you can get the sg_dma_address and sg_dma_len for each entry in the scatterlist (note you have to use the return value of dma_map_sg() because you may end up with fewer mapped entries because things might get merged by the DMA mapping code).
That gives you all the bus addresses to pass to your device, and then you can trigger the DMA and wait for it however you want. The read()-based scheme you have is probably fine.
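A hedged sketch of that mapping step, assuming sglist and the page array have already been filled in with sg_set_page() as described, and that pdev is your PCI device (error handling kept minimal):
#include <linux/pci.h>
#include <linux/scatterlist.h>
#include <linux/dma-mapping.h>

/* Map the scatterlist for the device and walk the entries the mapping
   actually produced. dma_map_sg() may merge adjacent pages, so always use
   its return value rather than npages when iterating. */
static int map_and_program_sg(struct pci_dev *pdev, struct scatterlist *sglist,
                              int npages)
{
    int i, nents;

    nents = dma_map_sg(&pdev->dev, sglist, npages, DMA_FROM_DEVICE);
    if (nents == 0)
        return -ENOMEM;

    for (i = 0; i < nents; i++) {
        dma_addr_t bus = sg_dma_address(&sglist[i]);
        unsigned int len = sg_dma_len(&sglist[i]);

        /* hand (bus, len) to the device's scatter-gather descriptor here */
    }

    /* after the DMA completes, unmap with the original npages: */
    /* dma_unmap_sg(&pdev->dev, sglist, npages, DMA_FROM_DEVICE); */
    return 0;
}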
You can refer to drivers/infiniband/core/umem.c, specifically ib_umem_get(), for some code that builds up this mapping, although the generality that that code needs to deal with may make it a bit confusing.
Alternatively, if your device doesn't handle scatter/gather lists too well and you want contiguous memory, you could use get_free_pages() to allocate a physically contiguous buffer and use dma_map_page() on that. To give userspace access to that memory, your driver just needs to implement an mmap method instead of the ioctl as described above.
At some point I wanted to allow a user-space application to allocate DMA buffers, get them mapped to user space, and get the physical address in order to control my device and do DMA transactions (bus mastering) entirely from user space, totally bypassing the Linux kernel. I have used a little bit different approach though. First I started with a minimal kernel module that was initializing/probing the PCIe device and creating a character device. That driver then allowed a user-space application to do two things:
Map PCIe device's I/O bar into user-space using remap_pfn_range() function.
Allocate and free DMA buffers, map them to user space and pass a physical bus address to user-space application.
Basically, it boils down to a custom implementation of the mmap() call (through file_operations). The one for the I/O BAR is easy:
struct vm_operations_struct a2gx_bar_vma_ops = {
};

static int a2gx_cdev_mmap_bar2(struct file *filp, struct vm_area_struct *vma)
{
    struct a2gx_dev *dev;
    size_t size;

    size = vma->vm_end - vma->vm_start;
    if (size != 134217728)
        return -EIO;

    dev = filp->private_data;
    vma->vm_ops = &a2gx_bar_vma_ops;
    vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
    vma->vm_private_data = dev;

    if (remap_pfn_range(vma, vma->vm_start,
                        vmalloc_to_pfn(dev->bar2),
                        size, vma->vm_page_prot))
    {
        return -EAGAIN;
    }

    return 0;
}
And another one that allocates DMA buffers using pci_alloc_consistent() is a little bit more complicated:
static void a2gx_dma_vma_close(struct vm_area_struct *vma)
{
    struct a2gx_dma_buf *buf;
    struct a2gx_dev *dev;

    buf = vma->vm_private_data;
    dev = buf->priv_data;

    pci_free_consistent(dev->pci_dev, buf->size, buf->cpu_addr, buf->dma_addr);
    buf->cpu_addr = NULL; /* Mark this buffer data structure as unused/free */
}

struct vm_operations_struct a2gx_dma_vma_ops = {
    .close = a2gx_dma_vma_close
};

static int a2gx_cdev_mmap_dma(struct file *filp, struct vm_area_struct *vma)
{
    struct a2gx_dev *dev;
    struct a2gx_dma_buf *buf;
    size_t size;
    unsigned int i;

    /* Obtain a pointer to our device structure and calculate the size
       of the requested DMA buffer */
    dev = filp->private_data;
    size = vma->vm_end - vma->vm_start;
    if (size < sizeof(unsigned long))
        return -EINVAL; /* Something fishy is happening */

    /* Find a structure where we can store extra information about this
       buffer to be able to release it later. */
    for (i = 0; i < A2GX_DMA_BUF_MAX; ++i) {
        buf = &dev->dma_buf[i];
        if (buf->cpu_addr == NULL)
            break;
    }

    if (buf->cpu_addr != NULL)
        return -ENOBUFS; /* Oops, hit the limit of allowed number of
                            allocated buffers. Change A2GX_DMA_BUF_MAX and
                            recompile? */

    /* Allocate consistent memory that can be used for DMA transactions */
    buf->cpu_addr = pci_alloc_consistent(dev->pci_dev, size, &buf->dma_addr);
    if (buf->cpu_addr == NULL)
        return -ENOMEM; /* Out of juice */

    /* There is no way to pass extra information to the user. And I am too lazy
       to implement this mmap() call using ioctl(). So we simply tell the user
       the bus address of this buffer by copying it to the allocated buffer
       itself. Hacks, hacks everywhere. */
    memcpy(buf->cpu_addr, &buf->dma_addr, sizeof(buf->dma_addr));
    buf->size = size;
    buf->priv_data = dev;

    vma->vm_ops = &a2gx_dma_vma_ops;
    vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
    vma->vm_private_data = buf;

    /*
     * Map this DMA buffer into user space.
     */
    if (remap_pfn_range(vma, vma->vm_start,
                        vmalloc_to_pfn(buf->cpu_addr),
                        size, vma->vm_page_prot))
    {
        /* Out of luck, rollback... */
        pci_free_consistent(dev->pci_dev, buf->size, buf->cpu_addr,
                            buf->dma_addr);
        buf->cpu_addr = NULL;
        return -EAGAIN;
    }

    return 0; /* All good! */
}
Once those are in place, user space application can pretty much do everything — control the device by reading/writing from/to I/O registers, allocate and free DMA buffers of arbitrary size, and have the device perform DMA transactions. The only missing part is interrupt-handling. I was doing polling in user space, burning my CPU, and had interrupts disabled.
Hope it helps. Good Luck!
I'm getting confused with the direction to implement. I want to...
Consider the application when designing a driver.
What is the nature of data movement, frequency, size and what else might be going on in the system?
Is the traditional read/write API sufficient?
Is direct mapping the device into user space OK?
Is a reflective (semi-coherent) shared memory desirable?
Manually manipulating data (read/write) is a pretty good option if the data lends itself to being well understood. Using general purpose VM and read/write may be sufficient with an inline copy. Direct mapping non-cacheable accesses to the peripheral is convenient, but can be clumsy. If the access is the relatively infrequent movement of large blocks, it may make sense to use regular memory, have the driver pin, translate addresses, DMA and release the pages. As an optimization, the pages (maybe huge) can be pre-pinned and translated; the driver then can recognize the prepared memory and avoid the complexities of dynamic translation. If there are lots of little I/O operations, having the driver run asynchronously makes sense. If elegance is important, the VM dirty page flag can be used to automatically identify what needs to be moved, and a (meta_sync()) call can be used to flush pages. Perhaps a mixture of the above works...
Too often people don't look at the larger problem, before digging into the details. Often the simplest solutions are sufficient. A little effort constructing a behavioral model can help guide what API is preferable.
first_page_offset = udata & PAGE_MASK;
It seems wrong. It should be either:
first_page_offset = udata & ~PAGE_MASK;
or
first_page_offset = udata & (PAGE_SIZE - 1)
It is worth mentioning that a driver with scatter-gather DMA support and user-space memory allocation is the most efficient and gives the highest performance. However, in cases where we don't need high performance, or we want to develop a driver under simplified conditions, we can use some tricks.
Give up the zero-copy design. It is worth considering when the data throughput is not too big. In such a design, data can be copied to the user by
copy_to_user(user_buffer, kernel_dma_buffer, count);
user_buffer might be, for example, the buffer argument in the character device read() system call implementation. We still need to take care of kernel_dma_buffer allocation. It might be memory obtained from a dma_alloc_coherent() call, for example.
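A minimal sketch of such a non-zero-copy read() implementation; struct my_dev, its fields and the completion are made-up names for this example, not part of any existing driver:
#include <linux/fs.h>
#include <linux/uaccess.h>
#include <linux/completion.h>

struct my_dev {
    void *kernel_dma_buffer;      /* from dma_alloc_coherent() */
    size_t dma_len;               /* number of bytes the DMA engine filled */
    struct completion dma_done;   /* completed by the DMA interrupt handler */
};

/* Simplified character-device read(): wait for the DMA to fill the coherent
   buffer, then copy the data out to the caller. */
static ssize_t my_read(struct file *filp, char __user *user_buffer,
                       size_t count, loff_t *ppos)
{
    struct my_dev *dev = filp->private_data;

    if (wait_for_completion_interruptible(&dev->dma_done))
        return -ERESTARTSYS;

    if (count > dev->dma_len)
        count = dev->dma_len;

    if (copy_to_user(user_buffer, dev->kernel_dma_buffer, count))
        return -EFAULT;

    return count;
}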
Another trick is to limit the system memory at boot time and then use the rest as a huge contiguous DMA buffer. This is especially useful during driver and FPGA DMA controller development, and rather not recommended in production environments. Let's say the PC has 32GB of RAM. If we add mem=20GB to the kernel boot parameter list, we can use 12GB as a huge contiguous DMA buffer. To map this memory to user space, simply implement mmap() as
remap_pfn_range(vma,
                vma->vm_start,
                (0x500000000 >> PAGE_SHIFT) + vma->vm_pgoff,
                vma->vm_end - vma->vm_start,
                vma->vm_page_prot)
Of course, this 12GB is completely ignored by the OS and can be used only by the process which has mapped it into its address space. We can try to avoid this by using the Contiguous Memory Allocator (CMA).
Again, the above tricks will not replace a full scatter-gather, zero-copy DMA driver, but they are useful during development or on less performance-critical platforms.
