linux socket: lifetime of ancillary data for sendmsg - linux

I use cmsg to activate timestamping on linux socket tx.
ssize_t sendWithOptions
(int sd, std::vector<uint8_t> &payload, uint32_t destIP, int flags)
{
msghdr msg { };
.... // filling standard
std::array<uint8_t, CMSG_LEN(sizeof(__u32))> buf;
msg.msg_control = buf.data();
msg.msg_controlen = buf.size();
auto cmsg { CMSG_FIRSTHDR ( &msg ) };
cmsg->cmsg_level = SOL_SOCKET;
cmsg->cmsg_type = SO_TIMESTAMPING;
cmsg->cmsg_len = buf.size();
*(reinterpret_cast<__u32>(CMSG_DATA (cmsg)) = static_cast<__u32>(flags);
return sendmsg ( sd, &msg, MSG_DONTWAIT );
}
Leaving the function, "buf" is automatically destroyed, but does sendmsg need this buffer to live longer?
Do I have a guarantee that the function does not need this buffer once it has returned the number of bytes sent.

Except for specific interfaces, it is generally the case that operating system calls do not rely on user-space to maintain data structures affecting their operation after they are finished. The exceptions will be spelled out in the manual pages.
With sendmsg, in particular, you can rely on the call to complete immediately - whether successful or not. It's fine therefore to use a dynamically allocated buffer as you're doing, and destroy it immediately after the call.
As an example of one exception, aio_write(2) is specifically intended to allow user-space to queue a write operation that will be completed asynchronously. For this call, the data is not consumed until it can be successfully written. Hence, you must not modify the data structures provided in the call until you have confirmed it is complete. That caveat is called out in the NOTES section of the manual page:
... The control block must not be changed while the write operation is in progress. The buffer area being written out must not be accessed during the operation or undefined results may occur. The memory areas involved must remain valid.
In summary: check the manual page for the system call. But most of the time, you don't need to worry about it.

Related

register_kprobe() returns EINVAL without additional memory on containing struct

I've written a kernel module (a character device) that registers new KProbes whenever I write to the module.
I have a structure that contains struct kprobe. When I call register_kprobe(), it returns -EINVAL. But when I add a dummy character array to the (possibly some other data types as well), the KProbe registration succeeds.
Probe Registration
struct my_struct *container = kmalloc(sizeof(struct my_struct));
(container->probe).addr = (kprobe_opcode_t *) kallsyms_lookup_name("my_exported_fn"); /* my_exported_fn is in code section */
(container->probe).pre_handler = Pre_Handler;
(container->probe).post_handler = Post_Handler;
register_probe(&container->probe);
/* Returns -EINVAL if my_struct contains only `struct kprobe`. */
Not working:
struct my_struct {
struct kprobe probe;
}
Working:
struct my_struct {
char dummy[512]; /* At 512, it gets consistently registered. At 256, sometimes (maybe one out of 5 - 10 times get registered) */
struct kprobe probe;
}
Why does it need this extra bit of memory to be present in the struct?
This could be unaligned memory access or not, but in this particular case (I mean your original code before the edit) I suspect that the data is not properly initialised. Namely, register_kprobe() calls kprobe_addr() function which in turn implies the following check:
if ((symbol_name && addr) || (!symbol_name && !addr))
goto invalid;
...
invalid:
return ERR_PTR(-EINVAL);
So, if you indeed initialise addr and don't initialise symbol_name, the latter could be a garbage pointer under certain circumstances. Namely, kmalloc() doesn't zeroise allocated memory and, furthermore, depending on requested size, it may take memory object of a suitable size from a different pool (there are different pools to provide objects of different sizes), and when you artificially increase the size of the struct, kmalloc() has to allocate a larger object from a suitable pool. From this perspective, the probability is that such an object may not contain garbage by occasion (since larger chunks are requested less often).
All in all, I suggest zeroising the memory chunk or using kzalloc().

Atomicity of writev() system call in Linux

I've looked in the kernel source for linux kernel 4.4.0-57-generic and don't see any locks in the writev() source. Is there something I'm missing? I don't see how writev() is atomic or thread-safe.
Not a kernel expert here, but I'll share my point of view anyway. Feel free to spot any mistakes.
Browsing the kernel (v4.9 though I wouldn't expect it to be so different), and trying to trace the writev(2) system call, I can observe subsequent function calls that create the following path:
SYSCALL_DEFINE3(writev, ..)
do_writev(..)
vfs_writev(..)
do_readv_writev(..)
Now the path branches, depending on whether a write_iter method is implemented and hooked on the struct file_operations field of the struct file that the system call is referring to.
If it's not NULL, the path is:
5a. do_iter_readv_writev(..), which calls the method filp->f_op->write_iter(..) at this point.
If it is NULL, the path is:
5b. do_loop_readv_writev(..), which calls repeatedly in a loop the method filp->f_op->write at this point.
So, as far as I understand, the writev() system call is as thread safe as the underlying write() (or write_iter()) is, which of course can be implemented in various ways, e.g. in a device driver, and may or may not use locks according to its needs and its design.
EDIT:
In kernel v4.4 the paths look pretty similar:
SYSCALL_DEFINE3(writev, ..)
vfs_writev(..)
do_readv_writev(..)
and then it depends on whether the write_iter method as a field in struct file_operations of the struct file is NULL or not, just like the case in v4.9, described above.
VFS (Virtual File System) by itself doesn't garantee atomicity of writev() call. It just calls filesystem-specific .write_iter method of struct file_operations.
It is responsibility of specific filesystem implementation for make method atomically write to the file.
For example, in ext4 filesystem function ext4_file_write_iter uses
mutex_lock(&inode->i_mutex);
for make writting atomic.
Found it in fs.h:
static inline void file_start_write(struct file *file)
{
if (!S_ISREG(file_inode(file)->i_mode))
return;
__sb_start_write(file_inode(file)->i_sb, SB_FREEZE_WRITE, true);
}
and then in super.c:
/*
* This is an internal function, please use sb_start_{write,pagefault,intwrite}
* instead.
*/
int __sb_start_write(struct super_block *sb, int level, bool wait)
{
bool force_trylock = false;
int ret = 1;
#ifdef CONFIG_LOCKDEP
/*
* We want lockdep to tell us about possible deadlocks with freezing
* but it's it bit tricky to properly instrument it. Getting a freeze
* protection works as getting a read lock but there are subtle
* problems. XFS for example gets freeze protection on internal level
* twice in some cases, which is OK only because we already hold a
* freeze protection also on higher level. Due to these cases we have
* to use wait == F (trylock mode) which must not fail.
*/
if (wait) {
int i;
for (i = 0; i < level - 1; i++)
if (percpu_rwsem_is_held(sb->s_writers.rw_sem + i)) {
force_trylock = true;
break;
}
}
#endif
if (wait && !force_trylock)
percpu_down_read(sb->s_writers.rw_sem + level-1);
else
ret = percpu_down_read_trylock(sb->s_writers.rw_sem + level-1);
WARN_ON(force_trylock & !ret);
return ret;
}
EXPORT_SYMBOL(__sb_start_write);
Thanks again.

ping pong DMA in kernel

I am using a SoC to DMA data from the programmable logic side (PL) to the processor side (PS). My overall scheme is to allocate two separate dma buffers and ping-pong data to these buffers to the user-space application can access data without collisions. I have successfully done this using single dma transactions in a loop (in a kthread) with each transaction either on the first or second buffer. I was able to notify a poll() method of either the first or second buffer.
Now I would like to investigate scatter-gather and/or cyclic DMA.
struct scatterlist sg[2]
struct dma_async_tx_descriptor *chan_desc;
struct dma_slave_config config;
sg_init_table(sg, ARRAY_SIZE(sg));
addr_0 = (unsigned int *)dma_alloc_coherent(NULL, dma_size, &handle_0, GFP_KERNEL);
addr_1 = (unsigned int *)dma_alloc_coherent(NULL, dma_size, &handle_1, GFP_KERNEL);
sg_dma_address(&sg[0]) = handle_0;
sg_dma_len(&sg[0]) = dma_size;
sg_dma_address(&sg[1]) = handle_1;
sg_dma_len(&sg[1]) = dma_size;
memset(&config, 0, sizeof(config));
config.direction = DMA_DEV_TO_MEM;
config.dst_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
config.src_maxburst = DMA_MAX_BURST_SIZE / DMA_SLAVE_BUSWIDTH_4_BYTES;
dmaengine_slave_config(dma_dev->chan, &config);
chan_desc = dmaengine_prep_slave_sg(dma_dev->chan, &sg[0], 2, DMA_DEV_TO_MEM, DMA_CTRL_ACK | DMA_PREP_INTERRUPT);
chan_desc->callback = sync_callback;
chan_desc->callback_param = dma_dev->cmp;
init_completion(dma_dev->cmp);
dmaengine_submit(chan_desc);
dma_async_issue_pending(dma_dev->chan);
dma_free_coherent(NULL, dma_size, addr_0, handle_0);
dma_free_coherent(NULL, dma_size, addr_1, handle_1);
This works fine for a single run through the scatterlist and then calls the call_back at sync_callback. My thought was to chain the scatterlist in a loop but I won't get the callback.
IS THERE A WAY to have a callback for each descriptor in the scatterlist? I wondered if I could use dmaengine_prep_slave_cyclic (calls callback after every transaction) but it looks to me like this is for a single buffer when reviewing dmaengine.h. Looking at DMA Engine API Guide it looks like there is another option dmaengine_prep_interleaved_dma using a dma_interleaved_template that sounds interesting but hard to find info about.
In the end I just want to signal the user space in some manner as to what buffer is ready.

Unclear logic behind pl011_tx_chars() in amba-pl011 Linux kernel module

I'm trying to understand how Linux driver for AMBA serial port (amba-pl011.c) sends characters in non-DMA mode. For port operations, this driver registers only following callbacks:
static struct uart_ops amba_pl011_pops = {
.tx_empty = pl011_tx_empty,
.set_mctrl = pl011_set_mctrl,
.get_mctrl = pl011_get_mctrl,
.stop_tx = pl011_stop_tx,
.start_tx = pl011_start_tx,
.stop_rx = pl011_stop_rx,
.enable_ms = pl011_enable_ms,
.break_ctl = pl011_break_ctl,
.startup = pl011_startup,
.shutdown = pl011_shutdown,
.flush_buffer = pl011_dma_flush_buffer,
.set_termios = pl011_set_termios,
.type = pl011_type,
.release_port = pl011_release_port,
.request_port = pl011_request_port,
.config_port = pl011_config_port,
.verify_port = pl011_verify_port,
.poll_init = pl011_hwinit,
.poll_get_char = pl011_get_poll_char,
.poll_put_char = pl011_put_poll_char };
As you can see, there's no character sending operation among them, namely, pl011_tx_chars() function is not listed there. Since pl011_tx_chars() is declared static, it is not exposed outside the module. I found that within the module it is called only from pl011_int() function which is an interrupt handler. It is called whenever UART011_TXIS occurs:
if (status & UART011_TXIS) pl011_tx_chars(uap);
The function pl011_tx_chars() itself writes characters from circular buffer to UART01x_DR port until the fifo queue size is reached (function returns then so more data will be written at the next interrupt) or until circular buffer is empty (pl011_stop_tx() is called then). As we can see, pl011_start_tx() and pl011_stop_tx() are listed in AMBA port operations (so they can be called as callbacks despite their local static declaration). Seems reasonable, thing is, these two function do something very simple:
static void pl011_stop_tx(struct uart_port *port)
{
struct uart_amba_port *uap = (struct uart_amba_port *)port;
uap->im &= ~UART011_TXIM;
writew(uap->im, uap->port.membase + UART011_IMSC);
pl011_dma_tx_stop(uap);
}
static void pl011_start_tx(struct uart_port *port)
{
struct uart_amba_port *uap = (struct uart_amba_port *)port;
if (!pl011_dma_tx_start(uap)) {
uap->im |= UART011_TXIM;
writew(uap->im, uap->port.membase + UART011_IMSC);
}
}
Since I don't have CONFIG_DMA_ENGINE set, pl011_dma_tx_start() and pl011_dma_tx_stop() are just stubs:
static inline void pl011_dma_tx_stop(struct uart_amba_port *uap)
{
}
static inline bool pl011_dma_tx_start(struct uart_amba_port *uap)
{
return false;
}
Seems like the only thing that pl011_start_tx() does is to arm UART011_TXIM interrupt while the only thing that pl011_stop_tx() does is to disarm it. Nothing initiates the transmission!
I looked at serial_core.c - it's the only file where start_tx operation is invoked, in four places (by the registered callback). The most promissing place is uart_write() function. It fills circular buffer with data and calls local static uart_start() function which is very simple:
static void __uart_start(struct tty_struct *tty)
{
struct uart_state *state = tty->driver_data;
struct uart_port *port = state->uart_port;
if (!uart_circ_empty(&state->xmit) && state->xmit.buf &&
!tty->stopped && !tty->hw_stopped)
port->ops->start_tx(port);
}
static void uart_start(struct tty_struct *tty)
{
struct uart_state *state = tty->driver_data;
struct uart_port *port = state->uart_port;
unsigned long flags;
spin_lock_irqsave(&port->lock, flags);
__uart_start(tty);
spin_unlock_irqrestore(&port->lock, flags);
}
As you can see, no one sends initial characters to the UART port, circular buffer is filled and everything is waiting for UART011_TXIS interrupt.
Is it possible that arming UART011_TXIM interrupt instantly emits UART011_TXIS? I looked into DDI0183.pdf (PrimeCell® UART (PL011) Technical Referecne Manual), Chapter 3: Programmers Model, section 3.4: Interrupts, subsection 3.4.3 UARTTXINTR. What it says is:
....
The transmit interrupt changes state when one of the following events occurs:
• If the FIFOs are enabled and the transmit FIFO reaches the programmed trigger
level. When this happens, the transmit interrupt is asserted HIGH. The transmit
interrupt is cleared by writing data to the transmit FIFO until it becomes greater
than the trigger level, or by clearing the interrupt.
• If the FIFOs are disabled (have a depth of one location) and there is no data
present in the transmitters single location, the transmit interrupt is asserted HIGH.
It is cleared by performing a single write to the transmit FIFO, or by clearing the
interrupt.
....
The note below is even more interesting:
....
The transmit interrupt is based on a transition through a level, rather than on the level
itself. When the interrupt and the UART is enabled before any data is written to the
transmit FIFO the interrupt is not set. The interrupt is only set once written data leaves
the single location of the transmit FIFO and it becomes empty.
....
The emphasis above is mine. I don't know if my English is not sufficient, but from the words above I can't find where it states that unlocking transmit interrupt can be used for triggering transmit routine. What am I missing?
The ARM docs say that the PL011 is a "16550-ish" UART. This sort of gets them off the hook for fully specifying its behavior and instead sends you to the 16550 docs, which state in the "FIFO interrupt mode operation" section...
When the XMIT FIFO and transmitter interrupts are enabled (FCR0e1,
IER1e1), XMIT interrupts will occur as follows: A. The transmitter
holding register interrupt (02) occurs when the XMIT FIFO is empty; it
is cleared as soon as the transmitter holding register is written to
(1 to 16 characters may be written to the XMIT FIFO while servicing
this interrupt) or the IIR is read.
So, it appears that if the FIFO and TX holding register are empty and you enable TX interrupts, you should immediately see a TX interrupt that kickstarts the sending process and fills the holding register and then the FIFO. Once those drain back down below the FIFO trigger, then another interrupt will be generated to keep the process going for as long as there is more buffered data to be sent.

Looking for a lock-free RT-safe single-reader single-writer structure

I'm looking for a lock-free design conforming to these requisites:
a single writer writes into a structure and a single reader reads from this structure (this structure exists already and is safe for simultaneous read/write)
but at some time, the structure needs to be changed by the writer, which then initialises, switches and writes into a new structure (of the same type but with new content)
and at the next time the reader reads, it switches to this new structure (if the writer multiply switches to a new lock-free structure, the reader discards these structures, ignoring their data).
The structures must be reused, i.e. no heap memory allocation/free is allowed during write/read/switch operation, for RT purposes.
I have currently implemented a ringbuffer containing multiple instances of these structures; but this implementation suffers from the fact that when the writer has used all the structures present in the ringbuffer, there is no more place to change from structure... But the rest of the ringbuffer contains some data which don't have to be read by the reader but can't be re-used by the writer. As a consequence, the ringbuffer does not fit this purpose.
Any idea (name or pseudo-implementation) of a lock-free design? Thanks for having considered this problem.
Here's one. The keys are that there are three buffers and the reader reserves the buffer it is reading from. The writer writes to one of the other two buffers. The risk of collision is minimal. Plus, this expands. Just make your member arrays one element longer than the number of readers plus the number of writers.
class RingBuffer
{
RingBuffer():lastFullWrite(0)
{
//Initialize the elements of dataBeingRead to false
for(unsigned int i=0; i<DATA_COUNT; i++)
{
dataBeingRead[i] = false;
}
}
Data read()
{
// You may want to check to make sure write has been called once here
// to prevent read from grabbing junk data. Else, initialize the elements
// of dataArray to something valid.
unsigned int indexToRead = lastFullWriteIndex;
Data dataCopy;
dataBeingRead[indexToRead] = true;
dataCopy = dataArray[indexToRead];
dataBeingRead[indexToRead] = false;
return dataCopy;
}
void write( const Data& dataArg )
{
unsigned int writeIndex(0);
//Search for an unused piece of data.
// It's O(n), but plenty fast enough for small arrays.
while( true == dataBeingRead[writeIndex] && writeIndex < DATA_COUNT )
{
writeIndex++;
}
dataArray[writeIndex] = dataArg;
lastFullWrite = &dataArray[writeIndex];
}
private:
static const unsigned int DATA_COUNT;
unsigned int lastFullWrite;
Data dataArray[DATA_COUNT];
bool dataBeingRead[DATA_COUNT];
};
Note: The way it's written here, there are two copies to read your data. If you pass your data out of the read function through a reference argument, you can cut that down to one copy.
You're on the right track.
Lock free communication of fixed messages between threads/processes/processors
fixed size ring buffers can be used in lock free communications between threads, processes or processors if there is one producer and one consumer. Some checks to perform:
head variable is written only by producer (as an atomic action after writing)
tail variable is written only by consumer (as an atomic action after reading)
Pitfall: introduction of a size variable or buffer full/empty flag; these are typically written by both producer and consumer and hence will give you an issue.
I generally use ring buffers for this purpoee. Most important lesson I've learned is that a ring buffer of can never contain more than elements. This way a head and tail variable are written by producer respectively consumer.
Extension for large/variable size blocks
To use buffers in a real time environment, you can either use memory pools (often available in optimized form in real time operating systems) or decouple allocation from usage. The latter fits to the question, I believe.
If you need to exchange large blocks, I suggest to use a pool with buffer blocks and communicate pointers to buffers using a queue. So use a 3rd queue with buffer pointers. This way the allocates can be done in application (background) and you real time portion has access to a variable amount of memory.
Application
while (blockQueue.full != true)
{
buf = allocate block of memory from heap or buffer pool
msg = { .... , buf };
blockQueue.Put(msg)
}
Producer:
pBuf = blockQueue.Get()
pQueue.Put()
Consumer
if (pQueue.Empty == false)
{
msg=pQueue.Get()
// use info in msg, with buf pointer
// optionally indicate that buf is no longer used
}

Resources