vmsplice() and TCP

vmsplice() and TCP - linux

In the original vmsplice() implementation, it was suggested that if you had a user-land buffer 2x the maximum number of pages that could fit in a pipe, a successful vmsplice() on the second half of the buffer would guarantee that the kernel was done using the first half of the buffer.
But that was not true after all, and particularly for TCP, the kernel pages would be kept until receiving ACK from the other side. Fixing this was left as future work, and thus for TCP, the kernel would still have to copy the pages from the pipe.
vmsplice() has the SPLICE_F_GIFT option that sort-of deals with this, but the problem is that this exposes two other problems - how to efficiently get fresh pages from the kernel, and how to reduce cache trashing. The first issue is that mmap requires the kernel to clear the pages, and the second issue is that although mmap might use the fancy kscrubd feature in the kernel, that increases the working set of the process (cache trashing).
Based on this, I have these questions:
What is the current state for notifying userland about the safe re-use of pages? I am especially interested in pages splice()d onto a socket (TCP). Did anything happen during the last 5 years?
Is mmap / vmsplice / splice / munmap the current best practice for zero-copying in a TCP server or have we better options today?

Yes, due to the TCP socket holding on to the pages for an indeterminate time you cannot use the double-buffering scheme mentioned in the example code. Also, in my use case the pages come from circular buffer so I cannot gift the pages to the kernel and alloc fresh pages. I can verify that I am seeing data corruption in the received data.
I resorted to polling the level of the TCP socket's send queue until it drains to 0. This fixes data corruption but is suboptimal because draining the send queue to 0 affects throughput.
n = ::vmsplice(mVmsplicePipe.fd.w, &iov, 1, 0);
while (n) {
// splice pipe to socket
m = ::splice(mVmsplicePipe.fd.r, NULL, mFd, NULL, n, 0);
n -= m;
}
while(1) {
int outsize=0;
int result;
usleep(20000);
result = ::ioctl(mFd, SIOCOUTQ, &outsize);
if (result == 0) {
LOG_NOISE("outsize %d", outsize);
} else {
LOG_ERR_PERROR("SIOCOUTQ");
break;
}
//if (outsize <= (bufLen >> 1)) {
if (outsize == 0) {
LOG("outsize %d <= %u", outsize, bufLen>>1);
break;
}
};

Related

Zero-copy user-space TCP send of dma_mmap_coherent() mapped memory

I'm running Linux 5.1 on a Cyclone V SoC, which is an FPGA with two ARMv7 cores in one chip. My goal is to gather lots of data from an external interface and stream (part of) this data out through a TCP socket. The challenge here is that the data rate is very high and could come close to saturating the GbE interface. I have a working implementation that just uses write() calls to the socket, but it tops out at 55MB/s; roughly half the theoretical GbE limit. I'm now trying to get zero-copy TCP transmission to work to increase the throughput, but I'm hitting a wall.
To get the data out of the FPGA into Linux user-space, I've written a kernel driver. This driver uses a DMA block in the FPGA to copy a large amount of data from an external interface into DDR3 memory attached to the ARMv7 cores. The driver allocates this memory as a bunch of contiguous 1MB buffers when probed using dma_alloc_coherent() with GFP_USER, and exposes these to the userspace application by implementing mmap() on a file in /dev/ and returning an address to the application using dma_mmap_coherent() on the preallocated buffers.
So far so good; the user-space application is seeing valid data and the throughput is more than enough at >360MB/s with room to spare (the external interface is not fast enough to really see what the upper bound is).
To implement zero-copy TCP networking, my first approach was to use SO_ZEROCOPY on the socket:
sent_bytes = send(fd, buf, len, MSG_ZEROCOPY);
if (sent_bytes < 0) {
perror("send");
return -1;
}
However, this results in send: Bad address.
After googling for a bit, my second approach was to use a pipe and splice() followed by vmsplice():
ssize_t sent_bytes;
int pipes[2];
struct iovec iov = {
.iov_base = buf,
.iov_len = len
};
pipe(pipes);
sent_bytes = vmsplice(pipes[1], &iov, 1, 0);
if (sent_bytes < 0) {
perror("vmsplice");
return -1;
}
sent_bytes = splice(pipes[0], 0, fd, 0, sent_bytes, SPLICE_F_MOVE);
if (sent_bytes < 0) {
perror("splice");
return -1;
}
However, the result is the same: vmsplice: Bad address.
Note that if I replace the call to vmsplice() or send() to a function that just prints the data pointed to by buf (or a send() without MSG_ZEROCOPY), everything is working just fine; so the data is accessible to userspace, but the vmsplice()/send(..., MSG_ZEROCOPY) calls seem unable to handle it.
What am I missing here? Is there any way of using zero-copy TCP sending with a user-space address obtained from a kernel driver through dma_mmap_coherent()? Is there another approach I could use?
UPDATE
So I dove a bit deeper into the sendmsg() MSG_ZEROCOPY path in the kernel, and the call that eventually fails is get_user_pages_fast(). This call returns -EFAULT because check_vma_flags() finds the VM_PFNMAP flag set in the vma. This flag is apparently set when the pages are mapped into user space using remap_pfn_range() or dma_mmap_coherent(). My next approach is to find another way to mmap these pages.

As I posted in an update in my question, the underlying problem is that zerocopy networking does not work for memory that has been mapped using remap_pfn_range() (which dma_mmap_coherent() happens to use under the hood as well). The reason is that this type of memory (with the VM_PFNMAP flag set) does not have metadata in the form of struct page* associated with each page, which it needs.
The solution then is to allocate the memory in a way that struct page*s are associated with the memory.
The workflow that now works for me to allocate the memory is:
Use struct page* page = alloc_pages(GFP_USER, page_order); to allocate a block of contiguous physical memory, where the number of contiguous pages that will be allocated is given by 2**page_order.
Split the high-order/compound page into 0-order pages by calling split_page(page, page_order);. This now means that struct page* page has become an array with 2**page_order entries.
Now to submit such a region to the DMA (for data reception):
dma_addr = dma_map_page(dev, page, 0, length, DMA_FROM_DEVICE);
dma_desc = dmaengine_prep_slave_single(dma_chan, dma_addr, length, DMA_DEV_TO_MEM, 0);
dmaengine_submit(dma_desc);
When we get a callback from the DMA that the transfer has finished, we need to unmap the region to transfer ownership of this block of memory back to the CPU, which takes care of caches to make sure we're not reading stale data:
dma_unmap_page(dev, dma_addr, length, DMA_FROM_DEVICE);
Now, when we want to implement mmap(), all we really have to do is call vm_insert_page() repeatedly for all of the 0-order pages that we pre-allocated:
static int my_mmap(struct file *file, struct vm_area_struct *vma) {
int res;
...
for (i = 0; i < 2**page_order; ++i) {
if ((res = vm_insert_page(vma, vma->vm_start + i*PAGE_SIZE, &page[i])) < 0) {
break;
}
}
vma->vm_flags |= VM_LOCKED | VM_DONTCOPY | VM_DONTEXPAND | VM_DENYWRITE;
...
return res;
}
When the file is closed, don't forget to free the pages:
for (i = 0; i < 2**page_order; ++i) {
__free_page(&dev->shm[i].pages[i]);
}
Implementing mmap() this way now allows a socket to use this buffer for sendmsg() with the MSG_ZEROCOPY flag.
Although this works, there are two things that don't sit well with me with this approach:
You can only allocate power-of-2-sized buffers with this method, although you could implement logic to call alloc_pages as many times as needed with decreasing orders to get any size buffer made up of sub-buffers of varying sizes. This will then require some logic to tie these buffers together in the mmap() and to DMA them with scatter-gather (sg) calls rather than single.
split_page() says in its documentation:
* Note: this is probably too low level an operation for use in drivers.
* Please consult with lkml before using this in your driver.
These issues would be easily solved if there was some interface in the kernel to allocate an arbitrary amount of contiguous physical pages. I don't know why there isn't, but I don't find the above issues so important as to go digging into why this isn't available / how to implement it :-)

Maybe this will help you to understand why alloc_pages requires a power-of-2 page number.
To optimize the page allocation process(and decrease external fragmentations), which is frequently engaged, Linux kernel developed per-cpu page cache and buddy-allocator to allocate memory(there is another allocator, slab, to serve memory allocations that are smaller than a page).
Per-cpu page cache serve the one-page allocation request, while buddy-allocator keeps 11 lists, each containing 2^{0-10} physical pages respectively. These lists perform well when allocate and free pages, and of course, the premise is you are requesting a power-of-2-sized buffer.

Linux UART slower than specified Baudrate

I'm trying to communicate between two Linux systems via UART.
I want to send large chunks of data. With the specified Baudrate it should take around 5 seconds, but it takes nearly 10 times the expected time.
As I'm sending more than the buffer can handle at once it is send in small parts and I'm draining the buffer in between. If I measure the time needed for the drain and the number of bytes written to the buffer I calculate a Baudrate nearly 10 times lower than the specified Baudrate.
I would expect a slower transmission as the optimal, but not this much.
Did I miss something while setting the UART or while writing? Or is this normal?
The code used for setup:
int bus = open(interface.c_str(), O_RDWR | O_NOCTTY | O_NDELAY); // <- also tryed blocking
if (bus < 0) {
return;
}
struct termios options;
memset (&options, 0, sizeof options);
if(tcgetattr(bus, &options) != 0){
close(bus);
bus = -1;
return;
}
cfsetspeed (&options, B230400);
cfmakeraw(&options); // <- also tried this manually. did not make a difference
if(tcsetattr(bus, TCSANOW, &options) != 0)
{
close(bus);
bus = -1;
return;
}
tcflush(bus, TCIFLUSH);
The code used to send:
int32_t res = write(bus, data, dataLength);
while (res < dataLength){
tcdrain(bus); // <- taking way longer than expected
int32_t r = write(bus, &data[res], dataLength - res);
if(r == 0)
break;
if(r == -1){
break;
}
res += r;
}

B230400
The docs are contradictory. cfsetspeed is documented as requiring a speed_t type, while the note says you need to use one of the "B" constants like "B230400." Have you tried using an actual speed_t type?
In any case, the speed you're supplying is the baud rate, which in this case should get you approximately 23,000 bytes/second, assuming there is no throttling.
The speed is dependent on hardware and link limitations. Also the serial protocol allows pausing the transmission.
FWIW, according to the time and speed you listed, if everything works perfectly, you'll get about 1 MB in 50 seconds. What speed are you actually getting?
Another "also" is the options structure. It's been years since I've had to do any serial I/O, but IIRC, you need to actually set the options that you want and are supported by your hardware, like CTS/RTS, XON/XOFF, etc.
This might be helpful.

As I'm sending more than the buffer can handle at once it is send in small parts and I'm draining the buffer in between.
You have only provided code snippets (rather than a minimal, complete, and verifiable example), so your data size is unknown.
But the Linux kernel buffer size is known. What do you think it is?
(FYI it's 4KB.)
If I measure the time needed for the drain and the number of bytese written to the buffer I calculate a Baudrate nearly 10 times lower than the specified Baudrate.
You're confusing throughput with baudrate.
The maximum throughput (of just payload) of an asynchronous serial link will always be less than the baudrate due to framing overhead per character, which could be two of the ten bits of the frame (assuming 8N1). Since your termios configuration is incomplete, the overhead could actually be three of the eleven bits of the frame (assuming 8N2).
In order to achieve the maximum throughput, the tranmitting UART must saturate the line with frames and never let the line go idle.
The userspace program must be able to supply data fast enough, preferably by one large write() to reduce syscall overhead.
Did I miss something while setting the UART or while writing?
With Linux, you have limited access to the UART hardware.
From userspace your program accesses a serial terminal.
Your program accesses the serial terminal in a sub-optinal manner.
Your termios configuration appears to be incomplete.
It leaves both hardware and software flow-control untouched.
The number of stop bits is untouched.
The Ignore modem control lines and Enable receiver flags are not enabled.
For raw reading, the VMIN and VTIME values are not assigned.
Or is this normal?
There are ways to easily speed up the transfer.
First, your program combines non-blocking mode with non-canonical mode. That's a degenerate combination for receiving, and suboptimal for transmitting.
You have provided no reason for using non-blocking mode, and your program is not written to properly utilize it.
Therefore your program should be revised to use blocking mode instead of non-blocking mode.
Second, the tcdrain() between write() syscalls can introduce idle time on the serial link. Use of blocking mode eliminates the need for this delay tactic between write() syscalls.
In fact with blocking mode only one write() syscall should be needed to transmit the entire dataLength. This would also minimize any idle time introduced on the serial link.
Note that the first write() does not properly check the return value for a possible error condition, which is always possible.
Bottom line: your program would be simpler and throughput would be improved by using blocking I/O.

Writing out DMA buffers into memory mapped file

I need to write in embedded Linux(2.6.37) as fast as possible incoming DMA buffers to HD partition as raw device /dev/sda1. Buffers are aligned as required and are of equal 512KB length. The process may continue for a very long time and fill as much as, for example, 256GB of data.
I need to use the memory-mapped file technique (O_DIRECT not applicable), but can't understand the exact way how to do this.
So, in pseudo code "normal" writing:
fd=open(/dev/sda1",O_WRONLY);
while(1) {
p = GetVirtualPointerToNewBuffer();
if (InputStopped())
break;
write(fd, p, BLOCK512KB);
}
Now, I will be very thankful for the similar pseudo/real code example of how to utilize memory-mapped technique for this writing.
UPDATE2:
Thanks to kestasx the latest working test code looks like following:
#define TSIZE (64*KB)
void* TBuf;
int main(int argc, char **argv) {
int fdi=open("input.dat", O_RDONLY);
//int fdo=open("/dev/sdb2", O_RDWR);
int fdo=open("output.dat", O_RDWR);
int i, offs=0;
void* addr;
i = posix_memalign(&TBuf, TSIZE, TSIZE);
if ((fdo < 1) || (fdi < 1)) {
printf("Error in files\n");
return -1; }
while(1) {
addr = mmap((void*)TBuf, TSIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fdo, offs);
if ((unsigned int)addr == 0xFFFFFFFFUL) {
printf("Error MMAP=%d, %s\n", errno, strerror(errno));
return -1; }
i = read(fdi, TBuf, TSIZE);
if (i != TSIZE) {
printf("End of data\n");
return 0; }
i = munmap(addr, TSIZE);
offs += TSIZE;
sleep(1);
};
}
UPDATE3:
1. To precisely imitate the DMA work, I need to move read() call before mmp(), because when the DMA finishes it provides me with the address where it has put data. So, in pseudo code:
while(1) {
read(fdi, TBuf, TSIZE);
addr = mmap((void*)TBuf, TSIZE, PROT_READ|PROT_WRITE, MAP_FIXED|MAP_SHARED, fdo, offs);
munmap(addr, TSIZE);
offs += TSIZE; }
This variant fails after(!) the first loop - read() says BAD ADDRESS on TBuf.
Without understanding exactly what I do, I substituted munmap() with msync(). This worked perfectly.
So, the question here - why unmapping the addr influenced on TBuf?
2.With the previous example working I went to the real system with the DMA. The same loop, just instead of read() call is the call which waits for a DMA buffer to be ready and its virtual address provided.
There are no error, the code runs, BUT nothing is recorded (!).
My thought was that Linux does not see that the area was updated and therefore does not sync() a thing.
To test this, I eliminated in the working example the read() call - and yes, nothing was recorded too.
So, the question here - how can I tell Linux that the mapped region contains new data, please, flush it!
Thanks a lot!!!

If I correctly understand, it makes sense if You mmap() file (not sure if it You can mmap() raw partition/block-device) and data via DMA is written directly to this memory region.
For this to work You need to be able to control p (where new buffer is placed) or address where file is maped. If You don't - You'll have to copy memory contents (and will lose some benefits of mmap).
So psudo code would be:
truncate("data.bin", 256GB);
fd = open( "data.bin", O_RDWR );
p = GetVirtualPointerToNewBuffer();
adr = mmap( p, 1GB, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset_in_file );
startDMA();
waitDMAfinish();
munmap( adr, 1GB );
This is first step only and I'm not completely sure if it will work with DMA (have no such experience).
I assume it is 32bit system, but even then 1GB mapped file size may be too big (if Your RAM is smaller You'll be swaping).
If this setup will work, next step would be to make loop to map regions of file at different offsets and unmap already filled ones.
Most likely You'll need to align addr to 4KB boundary.
When You'll unmap region, it's data will be synced to disk. So You'll need some testing to select appropriate mapped region size (while next region is filled by DMA, there must be enough time to unmap/write previous one).
UPDATE:
What exactly happens when You fill mmap'ed region via DMA I simply don't know (not sure how exactly dirty pages are detected: what is done by hardware, and what must be done by software).
UPDATE2: To my best knowledge:
DMA works the following way:
CPU arranges DMA transfer (address where to write transfered data in RAM);
DMA controller does the actual work, while CPU can do it's own work in parallel;
once DMA transfer is complete - DMA controller signals CPU via IRQ line (interrupt), so CPU can handle the result.
This seems simple while virtual memory is not involved: DMA should work independently from runing process (actual VM table in use by CPU). Yet it should be some mehanism to invalidate CPU cache for modified by DMA physical RAM pages (don't know if CPU needs to do something, or it is done authomatically by hardware).
mmap() forks the following way:
after successfull call of mmap(), file on disk is attached to process memory range (most likely some data structure is filled in OS kernel to hold this info);
I/O (reading or writing) from mmaped range triggers pagefault, which is handled by kernel loading appropriate blocks from atached file;
writes to mmaped range are handled by hardware (don't know how exactly: maybe writes to previously unmodified pages triger some fault, which is handled by kernel marking these pages dirty; or maybe this marking is done entirely in hardware and this info is available to kernel when it needs to flush modified pages to disk).
modified (dirty) pages are written to disk by OS (as it sees appropriate) or can be forced via msync() or munmap()
In theory it should be possible to do DMA transfers to mmaped range, but You need to find out, how exactly pages ar marked dirty (if You need to do something to inform kernel which pages need to be written to disk).
UPDATE3:
Even if modified by DMA pages are not marked dirty, You should be able to triger marking by rewriting (reading ant then writing the same) at least one value in each page (most likely each 4KB) transfered. Just make sure this rewriting is not removed (optimised out) by compiler.
UPDATE4:
It seems file opened O_WRONLY can't be mmap'ed (see question comments, my experimets confirm this too). It is logical conclusion of mmap() workings described above. The same is confirmed here (with reference to POSIX standart requirement to ensure file is readable regardless of maping protection flags).
Unless there is some way around, it actually means that by using mmap() You can't avoid reading of results file (unnecessary step in Your case).
Regarding DMA transfers to mapped range, I think it will be a requirement to ensure maped pages are preloalocated before DMA starts (so there is real memory asigned to both DMA and maped region). On Linux there is MAP_POPULATE mmap flag, but from manual it seams it works with MAP_PRIVATE mapings only (changes are not writen to disk), so most likely it is usuitable. Likely You'll have to triger pagefaults manually by accessing each maped page. This should triger reading of results file.
If You still wish to use mmap and DMA together, but avoid reading of results file, You'll have to modify kernel internals to allow mmap to use O_WRONLY files (for example by zero-filling trigered pages, instead of reading them from disk).

force socket disconnect without forging RST, Linux

I have a network client which is stuck in recvfrom a server not under my control which, after 24+ hours, is probably never going to respond. The program has processed a great deal of data, so I don't want to kill it; I want it to abandon the current connection and proceed. (It will do so correctly if recvfrom returns EOF or -1.) I have already tried several different programs that purport to be able to disconnect stale TCP channels by forging RSTs (tcpkill, cutter, killcx); none had any effect, the program remained stuck in recvfrom. I have also tried taking the network interface down; again, no effect.
It seems to me that there really should be a way to force a disconnect at the socket-API level without forging network packets. I do not mind horrible hacks, up to and including poking kernel data structures by hand; this is a disaster-recovery situation. Any suggestions?
(For clarity, the TCP channel at issue here is in ESTABLISHED state according to lsof.)

I do not mind horrible hacks
That's all you have to say. I am guessing the tools you tried didn't work because they sniff traffic to get an acceptable ACK number to kill the connection. Without traffic flowing they have no way to get hold of it.
Here are things you can try:
Probe all the sequence numbers
Where those tools failed you can still do it. Make a simple python script and with scapy, for each sequence number send a RST segment with the correct 4-tuple (ports and addresses). There's at most 4 billion (actually fewer assuming a decent window - you can find out the window for free using ss -i).
Make a kernel module to get hold of the socket
Make a kernel module getting a list of TCP sockets: look for sk_nulls_for_each(sk, node, &tcp_hashinfo.ehash[i].chain)
Identify your victim sk
At this point you intimately have access to your socket. So
You can call tcp_reset or tcp_disconnect on it. You won't be able to call tcp_reset directly (since it doesn't have EXPORT_SYMBOL) but you should be able to mimic it: most of the functions it calls are exported
Or you can get the expected ACK number from tcp_sk(sk) and directly forge a RST packet with scapy
Here is function I use to print established sockets - I scrounged bits and pieces from the kernel to make it some time ago:
#include <net/inet_hashtables.h>
#define NIPQUAD(addr) \
((unsigned char *)&addr)[0], \
((unsigned char *)&addr)[1], \
((unsigned char *)&addr)[2], \
((unsigned char *)&addr)[3]
#define NIPQUAD_FMT "%u.%u.%u.%u"
extern struct inet_hashinfo tcp_hashinfo;
/* Decides whether a bucket has any sockets in it. */
static inline bool empty_bucket(int i)
{
return hlist_nulls_empty(&tcp_hashinfo.ehash[i].chain);
}
void print_tcp_socks(void)
{
int i = 0;
struct inet_sock *inet;
/* Walk hash array and lock each if not empty. */
printk("Established ---\n");
for (i = 0; i <= tcp_hashinfo.ehash_mask; i++) {
struct sock *sk;
struct hlist_nulls_node *node;
spinlock_t *lock = inet_ehash_lockp(&tcp_hashinfo, i);
/* Lockless fast path for the common case of empty buckets */
if (empty_bucket(i))
continue;
spin_lock_bh(lock);
sk_nulls_for_each(sk, node, &tcp_hashinfo.ehash[i].chain) {
if (sk->sk_family != PF_INET)
continue;
inet = inet_sk(sk);
printk(NIPQUAD_FMT":%hu ---> " NIPQUAD_FMT
":%hu\n", NIPQUAD(inet->inet_saddr),
ntohs(inet->inet_sport), NIPQUAD(inet->inet_daddr),
ntohs(inet->inet_dport));
}
spin_unlock_bh(lock);
}
}
You should be able to pop this into a simple "Hello World" module and after insmoding it, in dmesg you will see sockets (much like ss or netstat).

I understand that what you want to do it's to automatize the process to make a test. But if you just want to check the correct handling of the recvfrom error, you could attach with the GDB and close the fd with close() call.
Here you could see an example.
Another option is to use scapy for crafting propper RST packets (which is not in your list). This is the way I tested the connections RST in a bridged system (IMHO is the best option), you could also implement a graceful shutdown.
Here an example of the scapy script.

How to Verify Atomic Writes?

I have searched diligently (both within the S[O|F|U] network and elsewhere) and believe this to be an uncommon question. I am working with an Atmel AT91SAM9263-EK development board (ARM926EJ-S core, ARMv5 instruction set) running Debian Linux 2.6.28-4. I am writing using (I believe) the tty driver to talk to an RS-485 serial controller. I need to ensure that writes and reads are atomic. Several lines of source code (listed below the end of this post relative to the kernel source installation directory) either imply or implicitly state this.
Is there any way I can verify that writing/reading to/from this device is actually an atomic operation? Or, is the /dev/ttyXX device considered a FIFO and the argument ends there? It seems not enough to simply trust that the code is enforcing this claim it makes - as recently as February of this year freebsd was demonstrated to lack atomic writes for small lines. Yes I realize that freebsd is not exactly the same as Linux, but my point is that it doesn't hurt to be carefully sure. All I can think of is to keep sending data and look for a permutation - I was hoping for something a little more scientific and, ideally, deterministic. Unfortunately, I remember precisely nothing from my concurrent programming classes in the college days of yore. I would thoroughly appreciate a slap or a shove in the right direction. Thank you in advance should you choose to reply.
Kind regards,
Jayce
drivers/char/tty_io.c:1087
void tty_write_message(struct tty_struct *tty, char *msg)
{
lock_kernel();
if (tty) {
mutex_lock(&tty->atomic_write_lock);
if (tty->ops->write && !test_bit(TTY_CLOSING, &tty->flags))
tty->ops->write(tty, msg, strlen(msg));
tty_write_unlock(tty);
}
unlock_kernel();
return;
}
arch/arm/include/asm/bitops.h:37
static inline void ____atomic_set_bit(unsigned int bit, volatile unsigned long *p)
{
unsigned long flags;
unsigned long mask = 1UL << (bit & 31);
p += bit >> 5;
raw_local_irq_save(flags);
*p |= mask;
raw_local_irq_restore(flags);
}
drivers/serial/serial_core.c:2376
static int
uart_write(struct tty_struct *tty, const unsigned char *buf, int count)
{
struct uart_state *state = tty->driver_data;
struct uart_port *port;
struct circ_buf *circ;
unsigned long flags;
int c, ret = 0;
/*
* This means you called this function _after_ the port was
* closed. No cookie for you.
*/
if (!state || !state->info) {
WARN_ON(1);
return -EL3HLT;
}
port = state->port;
circ = &state->info->xmit;
if (!circ->buf)
return 0;
spin_lock_irqsave(&port->lock, flags);
while (1) {
c = CIRC_SPACE_TO_END(circ->head, circ->tail, UART_XMIT_SIZE);
if (count < c)
c = count;
if (c <= 0)
break;
memcpy(circ->buf + circ->head, buf, c);
circ->head = (circ->head + c) & (UART_XMIT_SIZE - 1);
buf += c;
count -= c;
ret += c;
}
spin_unlock_irqrestore(&port->lock, flags);
uart_start(tty);
return ret;
}
Also, from the man write(3) documentation:
An attempt to write to a pipe or FIFO has several major characteristics:
Atomic/non-atomic: A write is atomic if the whole amount written in one operation is not interleaved with data from any other process. This is useful when there are multiple writers sending data to a single reader. Applications need to know how large a write request can be expected to be performed atomically. This maximum is called {PIPE_BUF}. This volume of IEEE Std 1003.1-2001 does not say whether write requests for more than {PIPE_BUF} bytes are atomic, but requires that writes of {PIPE_BUF} or fewer bytes shall be atomic.

I think that, technically, devices are not FIFOs, so it's not at all clear that the guarantees you quote are supposed to apply.
Are you concerned about partial writes and reads within a process, or are you actually reading and/or writing the same device from different processes? Assuming the latter, you might be better off implementing a proxy process of some sort. The proxy owns the device exclusively and performs all reads and writes, thus avoiding the multi-process atomicity problem entirely.
In short, I advise not attempting to verify that "reading/writing from this device is actually an atomic operation". It will be difficult to do with confidence, and leave you with an application that is subject to subtle failures if a later version of linux (or different o/s altogether) fails to implement atomicity the way you need.

I think PIPE_BUF is the right thing. Now, writes of less than PIPE_BUF bytes may not be atomic, but if they aren't it's an OS bug. I suppose you could ask here if an OS has known bugs. But really, if it has a bug like that, it just ought to be immediately fixed.
If you want to write more than PIPE_BUF atomically, I think you're out of luck. I don't think there is any way outside of application coordination and cooperation to make sure that writes of larger sizes happen atomically.
One solution to this problem is to put your own process in front of the device and make sure everybody who wants to write to the device contacts the process and sends the data to it instead. Then you can do whatever makes sense for your application in terms of atomicity guarantees.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string