Accessing memory pointers in hardware registers - linux

I'm working on enhancing the stock ahci driver provided in Linux in order to perform some needed tasks. I'm at the point of attempting to issue commands to the AHCI HBA for the hard drive to process. However, whenever I do so, my system locks up and reboots. Trying to explain the process of issuing a command to an AHCI drive is far to much for this question. If needed, reference this link for the full discussion (the process is rather defined all over because there are several pieces, however, ch 4 has the data structures necessary).
Essentially, one writes the appropriate structures into memory regions defined by either the BIOS or the OS. The first memory region I should write to is the Command List Base Address contained in the register PxCLB (and PxCLBU if 64-bit addressing applies). My system is 64 bits and so I'm trying to getting both 32-bit registers. My code is essentially this:
void __iomem * pbase = ahci_port_base(ap);
u32 __iomem *temp = (u32*)(pbase + PORT_LST_ADDR);
struct ahci_cmd_hdr *cmd_hdr = NULL;
cmd_hdr = (struct ahci_cmd_hdr*)(u64)
((u64)(*(temp + PORT_LST_ADDR_HI)) << 32 | *temp);
pr_info("%s:%d cmd_list is %p\n", __func__, __LINE__, cmd_hdr);
// problems with this next line, makes the system reboot
//pr_info("%s:%d cl[0]:0x%08x\n", __func__, __LINE__, cmd_hdr->opts);
The function ahci_port_base() is found in the ahci driver (at least it is for CentOS 6.x). Basically, it returns the proper address for that port in the AHCI memory region. PORT_LST_ADDR and PORT_LST_ADDR_HI are both macros defined in that driver. The address that I get after getting both the high and low addresses is usually something like 0x0000000037900000. Is this memory address in a space that I cannot simply dereference it?
I'm hitting my head against the wall at this point because this link shows that accessing it in this manner is essentially how it's done.

The address that I get after getting both the high and low addresses
is usually something like 0x0000000037900000. Is this memory address
in a space that I cannot simply dereference it?
Yes, you are correct - that's a bus address, and you can't just dereference it because paging is enabled. (You shouldn't be just dereferencing the iomapped addresses either - you should be using readl() / writel() for those, but the breakage here is more subtle).
It looks like the right way to access the ahci_cmd_hdr in that driver is:
struct ahci_port_priv *pp = ap->private_data;
cmd_hdr = pp->cmd_slot;

Related

Zero-copy user-space TCP send of dma_mmap_coherent() mapped memory

I'm running Linux 5.1 on a Cyclone V SoC, which is an FPGA with two ARMv7 cores in one chip. My goal is to gather lots of data from an external interface and stream (part of) this data out through a TCP socket. The challenge here is that the data rate is very high and could come close to saturating the GbE interface. I have a working implementation that just uses write() calls to the socket, but it tops out at 55MB/s; roughly half the theoretical GbE limit. I'm now trying to get zero-copy TCP transmission to work to increase the throughput, but I'm hitting a wall.
To get the data out of the FPGA into Linux user-space, I've written a kernel driver. This driver uses a DMA block in the FPGA to copy a large amount of data from an external interface into DDR3 memory attached to the ARMv7 cores. The driver allocates this memory as a bunch of contiguous 1MB buffers when probed using dma_alloc_coherent() with GFP_USER, and exposes these to the userspace application by implementing mmap() on a file in /dev/ and returning an address to the application using dma_mmap_coherent() on the preallocated buffers.
So far so good; the user-space application is seeing valid data and the throughput is more than enough at >360MB/s with room to spare (the external interface is not fast enough to really see what the upper bound is).
To implement zero-copy TCP networking, my first approach was to use SO_ZEROCOPY on the socket:
sent_bytes = send(fd, buf, len, MSG_ZEROCOPY);
if (sent_bytes < 0) {
perror("send");
return -1;
}
However, this results in send: Bad address.
After googling for a bit, my second approach was to use a pipe and splice() followed by vmsplice():
ssize_t sent_bytes;
int pipes[2];
struct iovec iov = {
.iov_base = buf,
.iov_len = len
};
pipe(pipes);
sent_bytes = vmsplice(pipes[1], &iov, 1, 0);
if (sent_bytes < 0) {
perror("vmsplice");
return -1;
}
sent_bytes = splice(pipes[0], 0, fd, 0, sent_bytes, SPLICE_F_MOVE);
if (sent_bytes < 0) {
perror("splice");
return -1;
}
However, the result is the same: vmsplice: Bad address.
Note that if I replace the call to vmsplice() or send() to a function that just prints the data pointed to by buf (or a send() without MSG_ZEROCOPY), everything is working just fine; so the data is accessible to userspace, but the vmsplice()/send(..., MSG_ZEROCOPY) calls seem unable to handle it.
What am I missing here? Is there any way of using zero-copy TCP sending with a user-space address obtained from a kernel driver through dma_mmap_coherent()? Is there another approach I could use?
UPDATE
So I dove a bit deeper into the sendmsg() MSG_ZEROCOPY path in the kernel, and the call that eventually fails is get_user_pages_fast(). This call returns -EFAULT because check_vma_flags() finds the VM_PFNMAP flag set in the vma. This flag is apparently set when the pages are mapped into user space using remap_pfn_range() or dma_mmap_coherent(). My next approach is to find another way to mmap these pages.
As I posted in an update in my question, the underlying problem is that zerocopy networking does not work for memory that has been mapped using remap_pfn_range() (which dma_mmap_coherent() happens to use under the hood as well). The reason is that this type of memory (with the VM_PFNMAP flag set) does not have metadata in the form of struct page* associated with each page, which it needs.
The solution then is to allocate the memory in a way that struct page*s are associated with the memory.
The workflow that now works for me to allocate the memory is:
Use struct page* page = alloc_pages(GFP_USER, page_order); to allocate a block of contiguous physical memory, where the number of contiguous pages that will be allocated is given by 2**page_order.
Split the high-order/compound page into 0-order pages by calling split_page(page, page_order);. This now means that struct page* page has become an array with 2**page_order entries.
Now to submit such a region to the DMA (for data reception):
dma_addr = dma_map_page(dev, page, 0, length, DMA_FROM_DEVICE);
dma_desc = dmaengine_prep_slave_single(dma_chan, dma_addr, length, DMA_DEV_TO_MEM, 0);
dmaengine_submit(dma_desc);
When we get a callback from the DMA that the transfer has finished, we need to unmap the region to transfer ownership of this block of memory back to the CPU, which takes care of caches to make sure we're not reading stale data:
dma_unmap_page(dev, dma_addr, length, DMA_FROM_DEVICE);
Now, when we want to implement mmap(), all we really have to do is call vm_insert_page() repeatedly for all of the 0-order pages that we pre-allocated:
static int my_mmap(struct file *file, struct vm_area_struct *vma) {
int res;
...
for (i = 0; i < 2**page_order; ++i) {
if ((res = vm_insert_page(vma, vma->vm_start + i*PAGE_SIZE, &page[i])) < 0) {
break;
}
}
vma->vm_flags |= VM_LOCKED | VM_DONTCOPY | VM_DONTEXPAND | VM_DENYWRITE;
...
return res;
}
When the file is closed, don't forget to free the pages:
for (i = 0; i < 2**page_order; ++i) {
__free_page(&dev->shm[i].pages[i]);
}
Implementing mmap() this way now allows a socket to use this buffer for sendmsg() with the MSG_ZEROCOPY flag.
Although this works, there are two things that don't sit well with me with this approach:
You can only allocate power-of-2-sized buffers with this method, although you could implement logic to call alloc_pages as many times as needed with decreasing orders to get any size buffer made up of sub-buffers of varying sizes. This will then require some logic to tie these buffers together in the mmap() and to DMA them with scatter-gather (sg) calls rather than single.
split_page() says in its documentation:
* Note: this is probably too low level an operation for use in drivers.
* Please consult with lkml before using this in your driver.
These issues would be easily solved if there was some interface in the kernel to allocate an arbitrary amount of contiguous physical pages. I don't know why there isn't, but I don't find the above issues so important as to go digging into why this isn't available / how to implement it :-)
Maybe this will help you to understand why alloc_pages requires a power-of-2 page number.
To optimize the page allocation process(and decrease external fragmentations), which is frequently engaged, Linux kernel developed per-cpu page cache and buddy-allocator to allocate memory(there is another allocator, slab, to serve memory allocations that are smaller than a page).
Per-cpu page cache serve the one-page allocation request, while buddy-allocator keeps 11 lists, each containing 2^{0-10} physical pages respectively. These lists perform well when allocate and free pages, and of course, the premise is you are requesting a power-of-2-sized buffer.

Why shouldn't I use ioremap on system memory for ARMv6+?

I need to reserve a large buffer of physically contiguous RAM from the kernel and be able to gaurantee that the buffer will always use a specific, hard-coded physical address. This buffer should remain reserved for the kernel's entire lifetime. I have written a chardev driver as an interface for accessing this buffer in userspace. My platform is an embedded system with ARMv7 architecture running a 2.6 Linux kernel.
Chapter 15 of Linux Device Drivers, Third Edition has the following to say on the topic (page 443):
Reserving the top of RAM is accomplished by passing a mem= argument to the kernel at boot time. For example, if you have 256 MB, the argument mem=255M keeps the kernel from using the top megabyte. Your module could later use the following code to gain access to such memory:
dmabuf = ioremap (0xFF00000 /* 255M */, 0x100000 /* 1M */);
I've done that plus a couple of other things:
I'm using the memmap bootarg in addition to the mem one. The kernel boot parameters documentation suggests always using memmap whenever you use mem to avoid address collisions.
I used request_mem_region before calling ioremap and, of course, I check that it succeeds before moving ahead.
This is what the system looks like after I've done all that:
# cat /proc/cmdline
root=/dev/mtdblock2 console=ttyS0,115200 init=/sbin/preinit earlyprintk debug mem=255M memmap=1M$255M
# cat /proc/iomem
08000000-0fffffff : PCIe Outbound Window, Port 0
08000000-082fffff : PCI Bus 0001:01
08000000-081fffff : 0001:01:00.0
08200000-08207fff : 0001:01:00.0
18000300-18000307 : serial
18000400-18000407 : serial
1800c000-1800cfff : dmu_regs
18012000-18012fff : pcie0
18013000-18013fff : pcie1
18014000-18014fff : pcie2
19000000-19000fff : cru_regs
1e000000-1fffffff : norflash
40000000-47ffffff : PCIe Outbound Window, Port 1
40000000-403fffff : PCI Bus 0002:01
40000000-403fffff : 0002:01:00.0
40400000-409fffff : PCI Bus 0002:01
40400000-407fffff : 0002:01:00.0
40800000-40807fff : 0002:01:00.0
80000000-8fefffff : System RAM
80052000-8045dfff : Kernel text
80478000-80500143 : Kernel data
8ff00000-8fffffff : foo
Everything so far looks good, and my driver is working perfectly. I'm able to read and write directly to the specific physical address I've chosen.
However, during bootup, a big scary warning (™) was triggered:
BUG: Your driver calls ioremap() on system memory. This leads
to architecturally unpredictable behaviour on ARMv6+, and ioremap()
will fail in the next kernel release. Please fix your driver.
------------[ cut here ]------------
WARNING: at arch/arm/mm/ioremap.c:211 __arm_ioremap_pfn_caller+0x8c/0x144()
Modules linked in:
[] (unwind_backtrace+0x0/0xf8) from [] (warn_slowpath_common+0x4c/0x64)
[] (warn_slowpath_common+0x4c/0x64) from [] (warn_slowpath_null+0x1c/0x24)
[] (warn_slowpath_null+0x1c/0x24) from [] (__arm_ioremap_pfn_caller+0x8c/0x144)
[] (__arm_ioremap_pfn_caller+0x8c/0x144) from [] (__arm_ioremap_caller+0x50/0x58)
[] (__arm_ioremap_caller+0x50/0x58) from [] (foo_init+0x204/0x2b0)
[] (foo_init+0x204/0x2b0) from [] (do_one_initcall+0x30/0x19c)
[] (do_one_initcall+0x30/0x19c) from [] (kernel_init+0x154/0x218)
[] (kernel_init+0x154/0x218) from [] (kernel_thread_exit+0x0/0x8)
---[ end trace 1a4cab5dbc05c3e7 ]---
Triggered from: arc/arm/mm/ioremap.c
/*
* Don't allow RAM to be mapped - this causes problems with ARMv6+
*/
if (pfn_valid(pfn)) {
printk(KERN_WARNING "BUG: Your driver calls ioremap() on system memory. This leads\n"
KERN_WARNING "to architecturally unpredictable behaviour on ARMv6+, and ioremap()\n"
KERN_WARNING "will fail in the next kernel release. Please fix your driver.\n");
WARN_ON(1);
}
What problems, exactly, could this cause? Can they be mitigated? What are my alternatives?
So I've done exactly that, and it's working.
Provide the kernel command line (e.g. /proc/cmdline) and the resulting memory map (i.e. /proc/iomem) to verify this.
What problems, exactly, could this cause?
The problem with using ioremap() on system memory is that you end up assigning conflicting attributes to the memory which causes "unpredictable" behavior.
See the article "ARM's multiply-mapped memory mess", which provides a history to the warning that you are triggering.
The ARM kernel maps RAM as normal memory with writeback caching; it's also marked non-shared on uniprocessor systems. The ioremap() system call, used to map I/O memory for CPU use, is different: that memory is mapped as device memory, uncached, and, maybe, shared. These different mappings give the expected behavior for both types of memory. Where things get tricky is when somebody calls ioremap() to create a new mapping for system RAM.
The problem with these multiple mappings is that they will have differing attributes. As of version 6 of the ARM architecture, the specified behavior in that situation is "unpredictable."
Note that "system memory" is the RAM that is managed by the kernel.
The fact that you trigger the warning indicates that your code is generating multiple mappings for a region of memory.
Can they be mitigated?
You have to ensure that the RAM you want to ioremap() is not "system memory", i.e. managed by the kernel.
See also this answer.
ADDENDUM
This warning that concerns you is the result of pfn_valid(pfn) returning TRUE rather than FALSE.
Based on the Linux cross-reference link that you provided for version 2.6.37,
pfn_valid() is simply returning the result of
memblock_is_memory(pfn << PAGE_SHIFT);
which in turn is simply returning the result of
memblock_search(&memblock.memory, addr) != -1;
I suggest that the kernel code be hacked so that the conflict is revealed.
Before the call to ioremap(), assign TRUE to the global variable memblock_debug.
The following patch should display the salient information about the memory conflict.
(The memblock list is ordered by base-address, so memblock_search() performs a binary search on this list, hence the use of mid as the index.)
static int __init_memblock memblock_search(struct memblock_type *type, phys_addr_t addr)
{
unsigned int left = 0, right = type->cnt;
do {
unsigned int mid = (right + left) / 2;
if (addr < type->regions[mid].base)
right = mid;
else if (addr >= (type->regions[mid].base +
type->regions[mid].size))
left = mid + 1;
- else
+ else {
+ if (memblock_debug)
+ pr_info("MATCH for 0x%x: m=0x%x b=0x%x s=0x%x\n",
+ addr, mid,
+ type->regions[mid].base,
+ type->regions[mid].size);
return mid;
+ }
} while (left < right);
return -1;
}
If you want to see all the memory blocks, then call memblock_dump_all() with the variable memblock_debug is TRUE.
[Interesting that this is essentially a programming question, yet we haven't seen any of your code.]
ADDENDUM 2
Since you're probably using ATAGs (instead of Device Tree), and you want to dedicate a memory region, fix up the ATAG_MEM to reflect this smaller size of physical memory.
Assuming you have made zero changes to your boot code, the ATAG_MEM is still specifying the full RAM, so perhaps this could be the source of the system memory conflict that causes the warning.
See this answer about ATAGs and this related answer.

Linux DMA API: specifying address increment behavior?

I am writing a driver for Altera Soc Developement Kit and need to support two modes of data transfer to/from a FPGA:
FIFO transfers: When writing to (or reading from) an FPGA FIFO, the destination (or source) address must not be incremented by the DMA controller.
non-FIFO transfers: These are normal (RAM-like) transfers where both the source and destination addresses require an increment for each word transferred.
The particular DMA controller I am using is the CoreLink DMA-330 DMA Controller and its Linux driver is pl330.c (drivers/dma/pl330.c). This DMA controller does provide a mechanism to switch between "Fixed-address burst" and "Incrementing-address burst" (these are synonymous with my "FIFO transfers" and "non-FIFO transfers"). The pl330 driver specifies which behavior it wants by setting the appropriate bits in the CCRn register
#define CC_SRCINC (1 << 0)
#define CC_DSTINC (1 << 14)
My question: it is not at all clear to me how clients of the pl330 (my driver, for example) should specify the address-incrementing behavior.
The DMA engine client API says nothing about how to specify this while the DMA engine provider API simply states:
Addresses pointing to RAM are typically incremented (or decremented)
after each transfer. In case of a ring buffer, they may loop
(DMA_CYCLIC). Addresses pointing to a device's register (e.g. a FIFO)
are typically fixed.
without giving any detail as to how the address types are communicated to providers (in my case the pl300 driver).
The in the pl330_prep_slave_sg method it does:
if (direction == DMA_MEM_TO_DEV) {
desc->rqcfg.src_inc = 1;
desc->rqcfg.dst_inc = 0;
desc->req.rqtype = MEMTODEV;
fill_px(&desc->px,
addr, sg_dma_address(sg), sg_dma_len(sg));
} else {
desc->rqcfg.src_inc = 0;
desc->rqcfg.dst_inc = 1;
desc->req.rqtype = DEVTOMEM;
fill_px(&desc->px,
sg_dma_address(sg), addr, sg_dma_len(sg));
}
where later, the desc->rqcfg.src_inc, and desc->rqcfg.dst_inc are used by the driver to specify the address-increment behavior.
This implies the following:
Specifying a direction = DMA_MEM_TO_DEV means the client wishes to pull data from a FIFO into RAM. And presumably DMA_DEV_TO_MEM means the client wishes to push data from RAM into a FIFO.
Scatter-gather DMA operations (for the pl300 at least) is restricted to cases where either the source or destination end point is a FIFO. What if I wanted to do a scatter-gather operation from system RAM into FPGA (non-FIFO) memory?
Am I misunderstanding and/or overlooking something? Does the DMA engine already provide a (undocumented) mechanism to specify address-increment behavior?
Look at this
pd->device_prep_dma_memcpy = pl330_prep_dma_memcpy;
pd->device_prep_dma_cyclic = pl330_prep_dma_cyclic;
pd->device_prep_slave_sg = pl330_prep_slave_sg;
It means you have different approaches like you have read in documentation. RAM-like transfers could be done, I suspect, via device_prep_dma_memcpy().
It appears to me (after looking to various drivers in a kernel) the only DMA transfer styles which allows you (indirectly) control auto-increment behavior is the ones which have enum dma_transfer_direction in its corresponding device_prep_... function.
And this parameter declared only for device_prep_slave_sg and device_prep_dma_cyclic, according to include/linux/dmaengine.h
Another option should be to use and struct dma_interleaved_template which allows you to specify increment behaviour directly. But support for this method is limited (only i.MX DMA driver does support it in 3.8 kernel, for example. And even this support seems to be limited)
So I think, we are stuck with device_prep_slave_sg case with all sg-related complexities for a some while.
That is that I am doing at the moment (although it is for accessing of some EBI-connected device on Atmel SAM9 SOC)
Another thing to consider is a device's bus width. memcopy-variant can perform different bus-width transfers, depending on a source and target addresses and sizes. And this may not match size of FIFO element.

Writing out DMA buffers into memory mapped file

I need to write in embedded Linux(2.6.37) as fast as possible incoming DMA buffers to HD partition as raw device /dev/sda1. Buffers are aligned as required and are of equal 512KB length. The process may continue for a very long time and fill as much as, for example, 256GB of data.
I need to use the memory-mapped file technique (O_DIRECT not applicable), but can't understand the exact way how to do this.
So, in pseudo code "normal" writing:
fd=open(/dev/sda1",O_WRONLY);
while(1) {
p = GetVirtualPointerToNewBuffer();
if (InputStopped())
break;
write(fd, p, BLOCK512KB);
}
Now, I will be very thankful for the similar pseudo/real code example of how to utilize memory-mapped technique for this writing.
UPDATE2:
Thanks to kestasx the latest working test code looks like following:
#define TSIZE (64*KB)
void* TBuf;
int main(int argc, char **argv) {
int fdi=open("input.dat", O_RDONLY);
//int fdo=open("/dev/sdb2", O_RDWR);
int fdo=open("output.dat", O_RDWR);
int i, offs=0;
void* addr;
i = posix_memalign(&TBuf, TSIZE, TSIZE);
if ((fdo < 1) || (fdi < 1)) {
printf("Error in files\n");
return -1; }
while(1) {
addr = mmap((void*)TBuf, TSIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fdo, offs);
if ((unsigned int)addr == 0xFFFFFFFFUL) {
printf("Error MMAP=%d, %s\n", errno, strerror(errno));
return -1; }
i = read(fdi, TBuf, TSIZE);
if (i != TSIZE) {
printf("End of data\n");
return 0; }
i = munmap(addr, TSIZE);
offs += TSIZE;
sleep(1);
};
}
UPDATE3:
1. To precisely imitate the DMA work, I need to move read() call before mmp(), because when the DMA finishes it provides me with the address where it has put data. So, in pseudo code:
while(1) {
read(fdi, TBuf, TSIZE);
addr = mmap((void*)TBuf, TSIZE, PROT_READ|PROT_WRITE, MAP_FIXED|MAP_SHARED, fdo, offs);
munmap(addr, TSIZE);
offs += TSIZE; }
This variant fails after(!) the first loop - read() says BAD ADDRESS on TBuf.
Without understanding exactly what I do, I substituted munmap() with msync(). This worked perfectly.
So, the question here - why unmapping the addr influenced on TBuf?
2.With the previous example working I went to the real system with the DMA. The same loop, just instead of read() call is the call which waits for a DMA buffer to be ready and its virtual address provided.
There are no error, the code runs, BUT nothing is recorded (!).
My thought was that Linux does not see that the area was updated and therefore does not sync() a thing.
To test this, I eliminated in the working example the read() call - and yes, nothing was recorded too.
So, the question here - how can I tell Linux that the mapped region contains new data, please, flush it!
Thanks a lot!!!
If I correctly understand, it makes sense if You mmap() file (not sure if it You can mmap() raw partition/block-device) and data via DMA is written directly to this memory region.
For this to work You need to be able to control p (where new buffer is placed) or address where file is maped. If You don't - You'll have to copy memory contents (and will lose some benefits of mmap).
So psudo code would be:
truncate("data.bin", 256GB);
fd = open( "data.bin", O_RDWR );
p = GetVirtualPointerToNewBuffer();
adr = mmap( p, 1GB, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset_in_file );
startDMA();
waitDMAfinish();
munmap( adr, 1GB );
This is first step only and I'm not completely sure if it will work with DMA (have no such experience).
I assume it is 32bit system, but even then 1GB mapped file size may be too big (if Your RAM is smaller You'll be swaping).
If this setup will work, next step would be to make loop to map regions of file at different offsets and unmap already filled ones.
Most likely You'll need to align addr to 4KB boundary.
When You'll unmap region, it's data will be synced to disk. So You'll need some testing to select appropriate mapped region size (while next region is filled by DMA, there must be enough time to unmap/write previous one).
UPDATE:
What exactly happens when You fill mmap'ed region via DMA I simply don't know (not sure how exactly dirty pages are detected: what is done by hardware, and what must be done by software).
UPDATE2: To my best knowledge:
DMA works the following way:
CPU arranges DMA transfer (address where to write transfered data in RAM);
DMA controller does the actual work, while CPU can do it's own work in parallel;
once DMA transfer is complete - DMA controller signals CPU via IRQ line (interrupt), so CPU can handle the result.
This seems simple while virtual memory is not involved: DMA should work independently from runing process (actual VM table in use by CPU). Yet it should be some mehanism to invalidate CPU cache for modified by DMA physical RAM pages (don't know if CPU needs to do something, or it is done authomatically by hardware).
mmap() forks the following way:
after successfull call of mmap(), file on disk is attached to process memory range (most likely some data structure is filled in OS kernel to hold this info);
I/O (reading or writing) from mmaped range triggers pagefault, which is handled by kernel loading appropriate blocks from atached file;
writes to mmaped range are handled by hardware (don't know how exactly: maybe writes to previously unmodified pages triger some fault, which is handled by kernel marking these pages dirty; or maybe this marking is done entirely in hardware and this info is available to kernel when it needs to flush modified pages to disk).
modified (dirty) pages are written to disk by OS (as it sees appropriate) or can be forced via msync() or munmap()
In theory it should be possible to do DMA transfers to mmaped range, but You need to find out, how exactly pages ar marked dirty (if You need to do something to inform kernel which pages need to be written to disk).
UPDATE3:
Even if modified by DMA pages are not marked dirty, You should be able to triger marking by rewriting (reading ant then writing the same) at least one value in each page (most likely each 4KB) transfered. Just make sure this rewriting is not removed (optimised out) by compiler.
UPDATE4:
It seems file opened O_WRONLY can't be mmap'ed (see question comments, my experimets confirm this too). It is logical conclusion of mmap() workings described above. The same is confirmed here (with reference to POSIX standart requirement to ensure file is readable regardless of maping protection flags).
Unless there is some way around, it actually means that by using mmap() You can't avoid reading of results file (unnecessary step in Your case).
Regarding DMA transfers to mapped range, I think it will be a requirement to ensure maped pages are preloalocated before DMA starts (so there is real memory asigned to both DMA and maped region). On Linux there is MAP_POPULATE mmap flag, but from manual it seams it works with MAP_PRIVATE mapings only (changes are not writen to disk), so most likely it is usuitable. Likely You'll have to triger pagefaults manually by accessing each maped page. This should triger reading of results file.
If You still wish to use mmap and DMA together, but avoid reading of results file, You'll have to modify kernel internals to allow mmap to use O_WRONLY files (for example by zero-filling trigered pages, instead of reading them from disk).

writing to an ioport resulting in segfaults

I'm writing for an atmel at91sam9260 arm 9 cored single board computer [glomation gesbc9260]
Using request_mem_region(0xFFFFFC00,0x100,"name"); //port range runs from fc00 to fcff
that works fine and shows up in /proc/iomem
then i try to write to the last bit of the port at fc20 with
writel(0x1, 0xFFFFFC20);
and i segfault...specifically "unable to handle kernel paging request at virtual address fffffc20."
I'm of the mind that i'm not allocating the right memory space...
any helpful insight would be great...
You need to ioremap the mem region you requested. ioremap maps a virtual address to a physical one.
writel works with virtual addresses, not with physical ones.
/* request mem_region */
...
base = ioremap(0xFFFFFC00, 0x100);
if(base == NULL)
release_mem_region(...);
/* now you can use base */
writel(0x1, base + 20)
...
What you probably need is to write your driver as a platform_driver, and declare a platform device in your board_file
An example of a relatively simple platform_driver can be found here
In fact, navigating through the kernel sources using lxr is probably the best way to learn how to
stuff like this.

Resources