How to Verify Atomic Writes?

How to Verify Atomic Writes? - linux

I have searched diligently (both within the S[O|F|U] network and elsewhere) and believe this to be an uncommon question. I am working with an Atmel AT91SAM9263-EK development board (ARM926EJ-S core, ARMv5 instruction set) running Debian Linux 2.6.28-4. I am writing using (I believe) the tty driver to talk to an RS-485 serial controller. I need to ensure that writes and reads are atomic. Several lines of source code (listed below the end of this post relative to the kernel source installation directory) either imply or implicitly state this.
Is there any way I can verify that writing/reading to/from this device is actually an atomic operation? Or, is the /dev/ttyXX device considered a FIFO and the argument ends there? It seems not enough to simply trust that the code is enforcing this claim it makes - as recently as February of this year freebsd was demonstrated to lack atomic writes for small lines. Yes I realize that freebsd is not exactly the same as Linux, but my point is that it doesn't hurt to be carefully sure. All I can think of is to keep sending data and look for a permutation - I was hoping for something a little more scientific and, ideally, deterministic. Unfortunately, I remember precisely nothing from my concurrent programming classes in the college days of yore. I would thoroughly appreciate a slap or a shove in the right direction. Thank you in advance should you choose to reply.
Kind regards,
Jayce
drivers/char/tty_io.c:1087
void tty_write_message(struct tty_struct *tty, char *msg)
{
lock_kernel();
if (tty) {
mutex_lock(&tty->atomic_write_lock);
if (tty->ops->write && !test_bit(TTY_CLOSING, &tty->flags))
tty->ops->write(tty, msg, strlen(msg));
tty_write_unlock(tty);
}
unlock_kernel();
return;
}
arch/arm/include/asm/bitops.h:37
static inline void ____atomic_set_bit(unsigned int bit, volatile unsigned long *p)
{
unsigned long flags;
unsigned long mask = 1UL << (bit & 31);
p += bit >> 5;
raw_local_irq_save(flags);
*p |= mask;
raw_local_irq_restore(flags);
}
drivers/serial/serial_core.c:2376
static int
uart_write(struct tty_struct *tty, const unsigned char *buf, int count)
{
struct uart_state *state = tty->driver_data;
struct uart_port *port;
struct circ_buf *circ;
unsigned long flags;
int c, ret = 0;
/*
* This means you called this function _after_ the port was
* closed. No cookie for you.
*/
if (!state || !state->info) {
WARN_ON(1);
return -EL3HLT;
}
port = state->port;
circ = &state->info->xmit;
if (!circ->buf)
return 0;
spin_lock_irqsave(&port->lock, flags);
while (1) {
c = CIRC_SPACE_TO_END(circ->head, circ->tail, UART_XMIT_SIZE);
if (count < c)
c = count;
if (c <= 0)
break;
memcpy(circ->buf + circ->head, buf, c);
circ->head = (circ->head + c) & (UART_XMIT_SIZE - 1);
buf += c;
count -= c;
ret += c;
}
spin_unlock_irqrestore(&port->lock, flags);
uart_start(tty);
return ret;
}
Also, from the man write(3) documentation:
An attempt to write to a pipe or FIFO has several major characteristics:
Atomic/non-atomic: A write is atomic if the whole amount written in one operation is not interleaved with data from any other process. This is useful when there are multiple writers sending data to a single reader. Applications need to know how large a write request can be expected to be performed atomically. This maximum is called {PIPE_BUF}. This volume of IEEE Std 1003.1-2001 does not say whether write requests for more than {PIPE_BUF} bytes are atomic, but requires that writes of {PIPE_BUF} or fewer bytes shall be atomic.

I think that, technically, devices are not FIFOs, so it's not at all clear that the guarantees you quote are supposed to apply.
Are you concerned about partial writes and reads within a process, or are you actually reading and/or writing the same device from different processes? Assuming the latter, you might be better off implementing a proxy process of some sort. The proxy owns the device exclusively and performs all reads and writes, thus avoiding the multi-process atomicity problem entirely.
In short, I advise not attempting to verify that "reading/writing from this device is actually an atomic operation". It will be difficult to do with confidence, and leave you with an application that is subject to subtle failures if a later version of linux (or different o/s altogether) fails to implement atomicity the way you need.

I think PIPE_BUF is the right thing. Now, writes of less than PIPE_BUF bytes may not be atomic, but if they aren't it's an OS bug. I suppose you could ask here if an OS has known bugs. But really, if it has a bug like that, it just ought to be immediately fixed.
If you want to write more than PIPE_BUF atomically, I think you're out of luck. I don't think there is any way outside of application coordination and cooperation to make sure that writes of larger sizes happen atomically.
One solution to this problem is to put your own process in front of the device and make sure everybody who wants to write to the device contacts the process and sends the data to it instead. Then you can do whatever makes sense for your application in terms of atomicity guarantees.

Related

Linux UART slower than specified Baudrate

I'm trying to communicate between two Linux systems via UART.
I want to send large chunks of data. With the specified Baudrate it should take around 5 seconds, but it takes nearly 10 times the expected time.
As I'm sending more than the buffer can handle at once it is send in small parts and I'm draining the buffer in between. If I measure the time needed for the drain and the number of bytes written to the buffer I calculate a Baudrate nearly 10 times lower than the specified Baudrate.
I would expect a slower transmission as the optimal, but not this much.
Did I miss something while setting the UART or while writing? Or is this normal?
The code used for setup:
int bus = open(interface.c_str(), O_RDWR | O_NOCTTY | O_NDELAY); // <- also tryed blocking
if (bus < 0) {
return;
}
struct termios options;
memset (&options, 0, sizeof options);
if(tcgetattr(bus, &options) != 0){
close(bus);
bus = -1;
return;
}
cfsetspeed (&options, B230400);
cfmakeraw(&options); // <- also tried this manually. did not make a difference
if(tcsetattr(bus, TCSANOW, &options) != 0)
{
close(bus);
bus = -1;
return;
}
tcflush(bus, TCIFLUSH);
The code used to send:
int32_t res = write(bus, data, dataLength);
while (res < dataLength){
tcdrain(bus); // <- taking way longer than expected
int32_t r = write(bus, &data[res], dataLength - res);
if(r == 0)
break;
if(r == -1){
break;
}
res += r;
}

B230400
The docs are contradictory. cfsetspeed is documented as requiring a speed_t type, while the note says you need to use one of the "B" constants like "B230400." Have you tried using an actual speed_t type?
In any case, the speed you're supplying is the baud rate, which in this case should get you approximately 23,000 bytes/second, assuming there is no throttling.
The speed is dependent on hardware and link limitations. Also the serial protocol allows pausing the transmission.
FWIW, according to the time and speed you listed, if everything works perfectly, you'll get about 1 MB in 50 seconds. What speed are you actually getting?
Another "also" is the options structure. It's been years since I've had to do any serial I/O, but IIRC, you need to actually set the options that you want and are supported by your hardware, like CTS/RTS, XON/XOFF, etc.
This might be helpful.

As I'm sending more than the buffer can handle at once it is send in small parts and I'm draining the buffer in between.
You have only provided code snippets (rather than a minimal, complete, and verifiable example), so your data size is unknown.
But the Linux kernel buffer size is known. What do you think it is?
(FYI it's 4KB.)
If I measure the time needed for the drain and the number of bytese written to the buffer I calculate a Baudrate nearly 10 times lower than the specified Baudrate.
You're confusing throughput with baudrate.
The maximum throughput (of just payload) of an asynchronous serial link will always be less than the baudrate due to framing overhead per character, which could be two of the ten bits of the frame (assuming 8N1). Since your termios configuration is incomplete, the overhead could actually be three of the eleven bits of the frame (assuming 8N2).
In order to achieve the maximum throughput, the tranmitting UART must saturate the line with frames and never let the line go idle.
The userspace program must be able to supply data fast enough, preferably by one large write() to reduce syscall overhead.
Did I miss something while setting the UART or while writing?
With Linux, you have limited access to the UART hardware.
From userspace your program accesses a serial terminal.
Your program accesses the serial terminal in a sub-optinal manner.
Your termios configuration appears to be incomplete.
It leaves both hardware and software flow-control untouched.
The number of stop bits is untouched.
The Ignore modem control lines and Enable receiver flags are not enabled.
For raw reading, the VMIN and VTIME values are not assigned.
Or is this normal?
There are ways to easily speed up the transfer.
First, your program combines non-blocking mode with non-canonical mode. That's a degenerate combination for receiving, and suboptimal for transmitting.
You have provided no reason for using non-blocking mode, and your program is not written to properly utilize it.
Therefore your program should be revised to use blocking mode instead of non-blocking mode.
Second, the tcdrain() between write() syscalls can introduce idle time on the serial link. Use of blocking mode eliminates the need for this delay tactic between write() syscalls.
In fact with blocking mode only one write() syscall should be needed to transmit the entire dataLength. This would also minimize any idle time introduced on the serial link.
Note that the first write() does not properly check the return value for a possible error condition, which is always possible.
Bottom line: your program would be simpler and throughput would be improved by using blocking I/O.

In general, on ucLinux, is ioctl faster than writing to /sys filesystem?

I have an embedded system I'm working with, and it currently uses the sysfs to control certain features.
However, there is function that we would like to speed up, if possible.
I discovered that this subsystem also supports and ioctl interface, but before rewriting the code, I decided to search to see which is a faster interface (on ucLinux) in general: sysfs or ioctl.
Does anybody understand both implementations well enough to give me a rough idea of the difference in overhead for each? I'm looking for generic info, such as "ioctl is faster because you've removed the file layer from the function calls". Or "they are roughly the same because sysfs has a very simple interface".
Update 10/24/2013:
The specific case I'm currently doing is as follows:
int fd = open("/sys/power/state",O_WRONLY);
write( fd, "standby", 7 );
close( fd );
In kernel/power/main.c, the code that handles this write looks like:
static ssize_t state_store(struct kobject *kobj, struct kobj_attribute *attr,
const char *buf, size_t n)
{
#ifdef CONFIG_SUSPEND
suspend_state_t state = PM_SUSPEND_STANDBY;
const char * const *s;
#endif
char *p;
int len;
int error = -EINVAL;
p = memchr(buf, '\n', n);
len = p ? p - buf : n;
/* First, check if we are requested to hibernate */
if (len == 7 && !strncmp(buf, "standby", len)) {
error = enter_standby();
goto Exit;
((( snip )))
Can this be sped up by moving to a custom ioctl() where the code to handle the ioctl call looks something like:
case SNAPSHOT_STANDBY:
if (!data->frozen) {
error = -EPERM;
break;
}
error = enter_standby();
break;
(so the ioctl() calls the same low-level function that the sysfs function did).

If by sysfs you mean the sysfs() library call, notice this in man 2 sysfs:
NOTES
This System-V derived system call is obsolete; don't use it. On systems with /proc, the same information can be obtained via
/proc/filesystems; use that interface instead.
I can't recall noticing stuff that had an ioctl() and a sysfs interface, but probably they exist. I'd use the proc or sys handle anyway, since that tends to be less cryptic and more flexible.
If by sysfs you mean accessing files in /sys, that's the preferred method.
I'm looking for generic info, such as "ioctl is faster because you've removed the file layer from the function calls".
Accessing procfs or sysfs files does not entail an I/O bottleneck because they are not real files -- they are kernel interfaces. So no, accessing this stuff through "the file layer" does not affect performance. This is a not uncommon misconception in linux systems programming, I think. Programmers can be squeamish about system calls that aren't well, system calls, and paranoid that opening a file will be somehow slower. Of course, file I/O in the ABI is just system calls anyway. What makes a normal (disk) file read slow is not the calls to open, read, write, whatever, it's the hardware bottleneck.
I always use low level descriptor based functions (open(), read()) instead of high level streams when doing this because at some point some experience led me to believe they were more reliable for this specifically (reading from /proc). I can't say whether that's definitively true.

So, the question was interesting, I built a couple of modules, one for ioctl and one for sysfs, the ioctl implementing only a 4 bytes copy_from_user and nothing more, and the sysfs having nothing in its write interface.
Then a couple of userspace test up to 1 million iterations, here the results:
time ./sysfs /sys/kernel/kobject_example/bar
real 0m0.427s
user 0m0.056s
sys 0m0.368s
time ./ioctl /run/temp
real 0m0.236s
user 0m0.060s
sys 0m0.172s
edit
I agree with #goldilocks answer, HW is the real bottleneck, in a Linux environment with a well written driver choosing ioctl or sysfs doesn't make a big difference, but if you are using uClinux probably in your HW even few cpu cycles can make a difference.
The test I've done is for Linux not uClinux and it never wanted to be an absolute reference profiling the two interfaces, my point is that you can write a book about how fast is one or another but only testing will let you know, took me few minutes to setup the thing.

Are file descriptors for linux sockets always in increasing order

I have a socket server in C/linux. Each time I create a new socket it is assigned a file descriptor. I want to use these FD's as uniqueID's for each client. If they are guaranteed to always be assigned in increasing order (which is the case for the Ubuntu that I am running) then I could just use them as array indices.
So the question: Are the file descriptors that are assigned from linux sockets guaranteed to always be in increasing order?

Let's look at how this works internally (I'm using kernel 4.1.20). The way file descriptors are allocated in Linux is with __alloc_fd. When you do a open syscall, do_sys_open is called. This routine gets a free file descriptor from get_unused_fd_flags:
long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
{
...
fd = get_unused_fd_flags(flags);
if (fd >= 0) {
struct file *f = do_filp_open(dfd, tmp, &op);
get_unused_d_flags calls __alloc_fd setting minimum and maximum fd:
int get_unused_fd_flags(unsigned flags)
{
return __alloc_fd(current->files, 0, rlimit(RLIMIT_NOFILE), flags);
}
__alloc_fd gets the file descriptor table for the process, and gets the fd as next_fd, which is actually set from the previous time it ran:
int __alloc_fd(struct files_struct *files,
unsigned start, unsigned end, unsigned flags)
{
...
fd = files->next_fd;
...
if (start <= files->next_fd)
files->next_fd = fd + 1;
So you can see how file descriptors indeed grow monotonically... up to certain point. When the fd reaches the maximum, __alloc_fd will try to find the smallest unused file descriptor:
if (fd < fdt->max_fds)
fd = find_next_zero_bit(fdt->open_fds, fdt->max_fds, fd);
At this point the file descriptors will not be growing monotonically anymore, but instead will jump trying to find free file descriptors. After this, if the table gets full, it will be expanded:
error = expand_files(files, fd);
At which point they will grow again monotonically.
Hope this helps

FD's are guaranteed to be unique, for the lifetime of the socket. So yes, in theory, you could probably use the FD as an index into an array of clients. However, I'd caution against this for at least a couple of reasons:
As has already been said, there is no guarantee that FDs will be allocated monotonically. accept() would be within its rights to return a highly-numbered FD, which would then make your array inefficient. So short answer to your question: no, they are not guaranteed to be monotonic.
Your server is likely to end up with lots of other open FDs - stdin, stdout and stderr to name but three - so again, your array is wasting space.
I'd recommend some other way of mapping from FDs to clients. Indeed, unless you're going to be dealing with thousands of clients, searching through a list of clients should be fine - it's not really an operation that you should need to do a huge amount.

Do not depend on the monotonicity of file descriptors. Always refer to the remote system via a address:port pair.

Is there a way to check whether the processor cache has been flushed recently?

On i386 linux. Preferably in c/(c/posix std libs)/proc if possible. If not is there any piece of assembly or third party library that can do this?
Edit: I'm trying to develop test whether a kernel module clear a cache line or the whole proccesor(with wbinvd()). Program runs as root but I'd prefer to stay in user space if possible.

Cache coherent systems do their utmost to hide such things from you. I think you will have to observe it indirectly, either by using performance counting registers to detect cache misses or by carefully measuring the time to read a memory location with a high resolution timer.
This program works on my x86_64 box to demonstrate the effects of clflush. It times how long it takes to read a global variable using rdtsc. Being a single instruction tied directly to the CPU clock makes direct use of rdtsc ideal for this.
Here is the output:
took 81 ticks
took 81 ticks
flush: took 387 ticks
took 72 ticks
You see 3 trials: The first ensures i is in the cache (which it is, because it was just zeroed as part of BSS), the second is a read of i that should be in the cache. Then clflush kicks i out of the cache (along with its neighbors) and shows that re-reading it takes significantly longer. A final read verifies it is back in the cache. The results are very reproducible and the difference is substantial enough to easily see the cache misses. If you cared to calibrate the overhead of rdtsc() you could make the difference even more pronounced.
If you can't read the memory address you want to test (although even mmap of /dev/mem should work for these purposes) you may be able to infer what you want if you know the cacheline size and associativity of the cache. Then you can use accessible memory locations to probe the activity in the set you're interested in.
Source code:
#include <stdio.h>
#include <stdint.h>
inline void
clflush(volatile void *p)
{
asm volatile ("clflush (%0)" :: "r"(p));
}
inline uint64_t
rdtsc()
{
unsigned long a, d;
asm volatile ("rdtsc" : "=a" (a), "=d" (d));
return a | ((uint64_t)d << 32);
}
volatile int i;
inline void
test()
{
uint64_t start, end;
volatile int j;
start = rdtsc();
j = i;
end = rdtsc();
printf("took %lu ticks\n", end - start);
}
int
main(int ac, char **av)
{
test();
test();
printf("flush: ");
clflush(&i);
test();
test();
return 0;
}

I dont know of any generic command to get the the cache state, but there are ways:
I guess this is the easiest: If you got your kernel module, just disassemble it and look for cache invalidation / flushing commands (atm. just 3 came to my mind: WBINDVD, CLFLUSH, INVD).
You just said it is for i386, but I guess you dont mean a 80386. The problem is that there are many different with different extension and features. E.g. the newest Intel series has some performance/profiling registers for the cache system included, which you can use to evalute cache misses/hits/number of transfers and similar.
Similar to 2, very depending on the system you got. But when you have a multiprocessor configuration you could watch the first cache coherence protocol (MESI) with the 2nd.
You mentioned WBINVD - afaik that will always flush complete, i.e. all, cache lines

It may not be an answer to your specific question, but have you tried using a cache profiler such as Cachegrind? It can only be used to profile userspace code, but you might be able to use it nonetheless, by e.g. moving the code of your function to userspace if it does not depend on any kernel-specific interfaces.
It might actually be more effective than trying to ask the processor for information that may or may not exist and that will be probably affected by your mere asking about it - yes, Heisenberg was way before his time :-)

Linux kernel device driver to DMA from a device into user-space memory

I want to get data from a DMA enabled, PCIe hardware device into user-space as quickly as possible.
Q: How do I combine "direct I/O to user-space with/and/via a DMA transfer"
Reading through LDD3, it seems that I need to perform a few different types of IO operations!?
dma_alloc_coherent gives me the physical address that I can pass to the hardware device.
But would need to have setup get_user_pages and perform a copy_to_user type call when the transfer completes. This seems a waste, asking the Device to DMA into kernel memory (acting as buffer) then transferring it again to user-space.
LDD3 p453: /* Only now is it safe to access the buffer, copy to user, etc. */
What I ideally want is some memory that:
I can use in user-space (Maybe request driver via a ioctl call to create DMA'able memory/buffer?)
I can get a physical address from to pass to the device so that all user-space has to do is perform a read on the driver
the read method would activate the DMA transfer, block waiting for the DMA complete interrupt and release the user-space read afterwards (user-space is now safe to use/read memory).
Do I need single-page streaming mappings, setup mapping and user-space buffers mapped with get_user_pages dma_map_page?
My code so far sets up get_user_pages at the given address from user-space (I call this the Direct I/O part). Then, dma_map_page with a page from get_user_pages. I give the device the return value from dma_map_page as the DMA physical transfer address.
I am using some kernel modules as reference: drivers_scsi_st.c and drivers-net-sh_eth.c. I would look at infiniband code, but cant find which one is the most basic!
Many thanks in advance.

I'm actually working on exactly the same thing right now and I'm going the ioctl() route. The general idea is for user space to allocate the buffer which will be used for the DMA transfer and an ioctl() will be used to pass the size and address of this buffer to the device driver. The driver will then use scatter-gather lists along with the streaming DMA API to transfer data directly to and from the device and user-space buffer.
The implementation strategy I'm using is that the ioctl() in the driver enters a loop that DMA's the userspace buffer in chunks of 256k (which is the hardware imposed limit for how many scatter/gather entries it can handle). This is isolated inside a function that blocks until each transfer is complete (see below). When all bytes are transfered or the incremental transfer function returns an error the ioctl() exits and returns to userspace
Pseudo code for the ioctl()
/*serialize all DMA transfers to/from the device*/
if (mutex_lock_interruptible( &device_ptr->mtx ) )
return -EINTR;
chunk_data = (unsigned long) user_space_addr;
while( *transferred < total_bytes && !ret ) {
chunk_bytes = total_bytes - *transferred;
if (chunk_bytes > HW_DMA_MAX)
chunk_bytes = HW_DMA_MAX; /* 256kb limit imposed by my device */
ret = transfer_chunk(device_ptr, chunk_data, chunk_bytes, transferred);
chunk_data += chunk_bytes;
chunk_offset += chunk_bytes;
}
mutex_unlock(&device_ptr->mtx);
Pseudo code for incremental transfer function:
/*Assuming the userspace pointer is passed as an unsigned long, */
/*calculate the first,last, and number of pages being transferred via*/
first_page = (udata & PAGE_MASK) >> PAGE_SHIFT;
last_page = ((udata+nbytes-1) & PAGE_MASK) >> PAGE_SHIFT;
first_page_offset = udata & PAGE_MASK;
npages = last_page - first_page + 1;
/* Ensure that all userspace pages are locked in memory for the */
/* duration of the DMA transfer */
down_read(&current->mm->mmap_sem);
ret = get_user_pages(current,
current->mm,
udata,
npages,
is_writing_to_userspace,
0,
&pages_array,
NULL);
up_read(&current->mm->mmap_sem);
/* Map a scatter-gather list to point at the userspace pages */
/*first*/
sg_set_page(&sglist[0], pages_array[0], PAGE_SIZE - fp_offset, fp_offset);
/*middle*/
for(i=1; i < npages-1; i++)
sg_set_page(&sglist[i], pages_array[i], PAGE_SIZE, 0);
/*last*/
if (npages > 1) {
sg_set_page(&sglist[npages-1], pages_array[npages-1],
nbytes - (PAGE_SIZE - fp_offset) - ((npages-2)*PAGE_SIZE), 0);
}
/* Do the hardware specific thing to give it the scatter-gather list
and tell it to start the DMA transfer */
/* Wait for the DMA transfer to complete */
ret = wait_event_interruptible_timeout( &device_ptr->dma_wait,
&device_ptr->flag_dma_done, HZ*2 );
if (ret == 0)
/* DMA operation timed out */
else if (ret == -ERESTARTSYS )
/* DMA operation interrupted by signal */
else {
/* DMA success */
*transferred += nbytes;
return 0;
}
The interrupt handler is exceptionally brief:
/* Do hardware specific thing to make the device happy */
/* Wake the thread waiting for this DMA operation to complete */
device_ptr->flag_dma_done = 1;
wake_up_interruptible(device_ptr->dma_wait);
Please note that this is just a general approach, I've been working on this driver for the last few weeks and have yet to actually test it... So please, don't treat this pseudo code as gospel and be sure to double check all logic and parameters ;-).

You basically have the right idea: in 2.1, you can just have userspace allocate any old memory. You do want it page-aligned, so posix_memalign() is a handy API to use.
Then have userspace pass in the userspace virtual address and size of this buffer somehow; ioctl() is a good quick and dirty way to do this. In the kernel, allocate an appropriately sized buffer array of struct page* -- user_buf_size/PAGE_SIZE entries -- and use get_user_pages() to get a list of struct page* for the userspace buffer.
Once you have that, you can allocate an array of struct scatterlist that is the same size as your page array and loop through the list of pages doing sg_set_page(). After the sg list is set up, you do dma_map_sg() on the array of scatterlist and then you can get the sg_dma_address and sg_dma_len for each entry in the scatterlist (note you have to use the return value of dma_map_sg() because you may end up with fewer mapped entries because things might get merged by the DMA mapping code).
That gives you all the bus addresses to pass to your device, and then you can trigger the DMA and wait for it however you want. The read()-based scheme you have is probably fine.
You can refer to drivers/infiniband/core/umem.c, specifically ib_umem_get(), for some code that builds up this mapping, although the generality that that code needs to deal with may make it a bit confusing.
Alternatively, if your device doesn't handle scatter/gather lists too well and you want contiguous memory, you could use get_free_pages() to allocate a physically contiguous buffer and use dma_map_page() on that. To give userspace access to that memory, your driver just needs to implement an mmap method instead of the ioctl as described above.

At some point I wanted to allow user-space application to allocate DMA buffers and get it mapped to user-space and get the physical address to be able to control my device and do DMA transactions (bus mastering) entirely from user-space, totally bypassing the Linux kernel. I have used a little bit different approach though. First I started with a minimal kernel module that was initializing/probing PCIe device and creating a character device. That driver then allowed a user-space application to do two things:
Map PCIe device's I/O bar into user-space using remap_pfn_range() function.
Allocate and free DMA buffers, map them to user space and pass a physical bus address to user-space application.
Basically, it boils down to a custom implementation of mmap() call (though file_operations). One for I/O bar is easy:
struct vm_operations_struct a2gx_bar_vma_ops = {
};
static int a2gx_cdev_mmap_bar2(struct file *filp, struct vm_area_struct *vma)
{
struct a2gx_dev *dev;
size_t size;
size = vma->vm_end - vma->vm_start;
if (size != 134217728)
return -EIO;
dev = filp->private_data;
vma->vm_ops = &a2gx_bar_vma_ops;
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
vma->vm_private_data = dev;
if (remap_pfn_range(vma, vma->vm_start,
vmalloc_to_pfn(dev->bar2),
size, vma->vm_page_prot))
{
return -EAGAIN;
}
return 0;
}
And another one that allocates DMA buffers using pci_alloc_consistent() is a little bit more complicated:
static void a2gx_dma_vma_close(struct vm_area_struct *vma)
{
struct a2gx_dma_buf *buf;
struct a2gx_dev *dev;
buf = vma->vm_private_data;
dev = buf->priv_data;
pci_free_consistent(dev->pci_dev, buf->size, buf->cpu_addr, buf->dma_addr);
buf->cpu_addr = NULL; /* Mark this buffer data structure as unused/free */
}
struct vm_operations_struct a2gx_dma_vma_ops = {
.close = a2gx_dma_vma_close
};
static int a2gx_cdev_mmap_dma(struct file *filp, struct vm_area_struct *vma)
{
struct a2gx_dev *dev;
struct a2gx_dma_buf *buf;
size_t size;
unsigned int i;
/* Obtain a pointer to our device structure and calculate the size
of the requested DMA buffer */
dev = filp->private_data;
size = vma->vm_end - vma->vm_start;
if (size < sizeof(unsigned long))
return -EINVAL; /* Something fishy is happening */
/* Find a structure where we can store extra information about this
buffer to be able to release it later. */
for (i = 0; i < A2GX_DMA_BUF_MAX; ++i) {
buf = &dev->dma_buf[i];
if (buf->cpu_addr == NULL)
break;
}
if (buf->cpu_addr != NULL)
return -ENOBUFS; /* Oops, hit the limit of allowed number of
allocated buffers. Change A2GX_DMA_BUF_MAX and
recompile? */
/* Allocate consistent memory that can be used for DMA transactions */
buf->cpu_addr = pci_alloc_consistent(dev->pci_dev, size, &buf->dma_addr);
if (buf->cpu_addr == NULL)
return -ENOMEM; /* Out of juice */
/* There is no way to pass extra information to the user. And I am too lazy
to implement this mmap() call using ioctl(). So we simply tell the user
the bus address of this buffer by copying it to the allocated buffer
itself. Hacks, hacks everywhere. */
memcpy(buf->cpu_addr, &buf->dma_addr, sizeof(buf->dma_addr));
buf->size = size;
buf->priv_data = dev;
vma->vm_ops = &a2gx_dma_vma_ops;
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
vma->vm_private_data = buf;
/*
* Map this DMA buffer into user space.
*/
if (remap_pfn_range(vma, vma->vm_start,
vmalloc_to_pfn(buf->cpu_addr),
size, vma->vm_page_prot))
{
/* Out of luck, rollback... */
pci_free_consistent(dev->pci_dev, buf->size, buf->cpu_addr,
buf->dma_addr);
buf->cpu_addr = NULL;
return -EAGAIN;
}
return 0; /* All good! */
}
Once those are in place, user space application can pretty much do everything — control the device by reading/writing from/to I/O registers, allocate and free DMA buffers of arbitrary size, and have the device perform DMA transactions. The only missing part is interrupt-handling. I was doing polling in user space, burning my CPU, and had interrupts disabled.
Hope it helps. Good Luck!

I'm getting confused with the direction to implement. I want to...
Consider the application when designing a driver.
What is the nature of data movement, frequency, size and what else might be going on in the system?
Is the traditional read/write API sufficient?
Is direct mapping the device into user space OK?
Is a reflective (semi-coherent) shared memory desirable?
Manually manipulating data (read/write) is a pretty good option if the data lends itself to being well understood. Using general purpose VM and read/write may be sufficient with an inline copy. Direct mapping non cachable accesses to the peripheral is convenient, but can be clumsy. If the access is the relatively infrequent movement of large blocks, it may make sense to use regular memory, have the drive pin, translate addresses, DMA and release the pages. As an optimization, the pages (maybe huge) can be pre pinned and translated; the drive then can recognize the prepared memory and avoid the complexities of dynamic translation. If there are lots of little I/O operations, having the drive run asynchronously makes sense. If elegance is important, the VM dirty page flag can be used to automatically identify what needs to be moved and a (meta_sync()) call can be used to flush pages. Perhaps a mixture of the above works...
Too often people don't look at the larger problem, before digging into the details. Often the simplest solutions are sufficient. A little effort constructing a behavioral model can help guide what API is preferable.

first_page_offset = udata & PAGE_MASK;
It seems wrong. It should be either:
first_page_offset = udata & ~PAGE_MASK;
or
first_page_offset = udata & (PAGE_SIZE - 1)

It is worth mention that driver with Scatter-Gather DMA support and user space memory allocation is most efficient and has highest performance. However in case we don't need high performance or we want to develop a driver in some simplified conditions we can use some tricks.
Give up zero copy design. It is worth to consider when data throughput is not too big. In such a design data can by copied to user by
copy_to_user(user_buffer, kernel_dma_buffer, count);
user_buffer might be for example buffer argument in character device read() system call implementation. We still need to take care of kernel_dma_buffer allocation. It might by memory obtained from dma_alloc_coherent() call for example.
The another trick is to limit system memory at the boot time and then use it as huge contiguous DMA buffer. It is especially useful during driver and FPGA DMA controller development and rather not recommended in production environments. Lets say PC has 32GB of RAM. If we add mem=20GB to kernel boot parameters list we can use 12GB as huge contiguous dma buffer. To map this memory to user space simply implement mmap() as
remap_pfn_range(vma,
vma->vm_start,
(0x500000000 >> PAGE_SHIFT) + vma->vm_pgoff,
vma->vm_end - vma->vm_start,
vma->vm_page_prot)
Of course this 12GB is completely omitted by OS and can be used only by process which has mapped it into its address space. We can try to avoid it by using Contiguous Memory Allocator (CMA).
Again above tricks will not replace full Scatter-Gather, zero copy DMA driver, but are useful during development time or in some less performance platforms.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string