ping pong DMA in kernel

I am using a SoC to DMA data from the programmable logic side (PL) to the processor side (PS). My overall scheme is to allocate two separate DMA buffers and ping-pong data between them, so that the user-space application can access data without collisions. I have successfully done this using single DMA transactions in a loop (in a kthread), with each transaction targeting either the first or the second buffer. I was able to notify a poll() method of which buffer, first or second, had been filled.
Now I would like to investigate scatter-gather and/or cyclic DMA.
struct scatterlist sg[2];
struct dma_async_tx_descriptor *chan_desc;
struct dma_slave_config config;
sg_init_table(sg, ARRAY_SIZE(sg));
addr_0 = (unsigned int *)dma_alloc_coherent(NULL, dma_size, &handle_0, GFP_KERNEL);
addr_1 = (unsigned int *)dma_alloc_coherent(NULL, dma_size, &handle_1, GFP_KERNEL);
sg_dma_address(&sg[0]) = handle_0;
sg_dma_len(&sg[0]) = dma_size;
sg_dma_address(&sg[1]) = handle_1;
sg_dma_len(&sg[1]) = dma_size;
memset(&config, 0, sizeof(config));
config.direction = DMA_DEV_TO_MEM;
config.dst_addr_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
config.src_maxburst = DMA_MAX_BURST_SIZE / DMA_SLAVE_BUSWIDTH_4_BYTES;
dmaengine_slave_config(dma_dev->chan, &config);
chan_desc = dmaengine_prep_slave_sg(dma_dev->chan, &sg[0], 2, DMA_DEV_TO_MEM, DMA_CTRL_ACK | DMA_PREP_INTERRUPT);
chan_desc->callback = sync_callback;
chan_desc->callback_param = dma_dev->cmp;
init_completion(dma_dev->cmp);
dmaengine_submit(chan_desc);
dma_async_issue_pending(dma_dev->chan);
dma_free_coherent(NULL, dma_size, addr_0, handle_0);
dma_free_coherent(NULL, dma_size, addr_1, handle_1);
This works fine for a single run through the scatterlist, and sync_callback is then called once at the end. My thought was to chain the scatterlist in a loop, but then I would never get the callback.
Is there a way to have a callback for each descriptor in the scatterlist? I wondered if I could use dmaengine_prep_dma_cyclic (calls the callback after every transaction), but when reviewing dmaengine.h it looks to me like this is for a single buffer. Looking at the DMA Engine API Guide, there seems to be another option, dmaengine_prep_interleaved_dma using a dma_interleaved_template, which sounds interesting but is hard to find information about.
In the end I just want to signal the user space in some manner as to what buffer is ready.
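One way to think about the cyclic option, offered as a sketch rather than a confirmed answer: as far as I understand the DMA Engine API, dmaengine_prep_dma_cyclic() takes one buffer address, a total length and a period length, and the callback runs after each period. A single allocation of 2 * dma_size with a period of dma_size would then behave like two ping-pong halves. The user-space model below only illustrates the bookkeeping; PingPongModel and its members are hypothetical names, and nothing here is kernel code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model: one buffer made of two periods (halves).
struct PingPongModel {
    std::size_t period_len;          // bytes per half (assumed equal to dma_size)
    std::uint64_t periods_done = 0;  // number of completed periods so far

    // Stand-in for the per-period completion callback.
    // Returns the index (0 or 1) of the half that has just been filled.
    unsigned on_period_complete() {
        const unsigned ready = static_cast<unsigned>(periods_done % 2);
        ++periods_done;
        return ready;
    }

    // Byte offset of a half inside the single buffer.
    std::size_t ready_offset(unsigned half) const { return half * period_len; }
};
```

In a driver, the returned index is what would be stored for poll() to report, so user space knows which half is safe to read while the other half is being filled.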

Related

linux socket: lifetime of ancillary data for sendmsg

I use cmsg to activate timestamping on linux socket tx.
ssize_t sendWithOptions
(int sd, std::vector<uint8_t> &payload, uint32_t destIP, int flags)
{
    msghdr msg { };
    .... // filling standard
    // Control buffer: sized with CMSG_SPACE and aligned for cmsghdr
    alignas(cmsghdr) std::array<uint8_t, CMSG_SPACE(sizeof(__u32))> buf { };
    msg.msg_control = buf.data();
    msg.msg_controllen = buf.size();
    auto cmsg { CMSG_FIRSTHDR ( &msg ) };
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SO_TIMESTAMPING;
    cmsg->cmsg_len = CMSG_LEN(sizeof(__u32));
    *(reinterpret_cast<__u32 *>(CMSG_DATA (cmsg))) = static_cast<__u32>(flags);
    return sendmsg ( sd, &msg, MSG_DONTWAIT );
}
On leaving the function, "buf" is automatically destroyed, but does sendmsg need this buffer to live longer?
Do I have a guarantee that the function does not need this buffer once it has returned the number of bytes sent?

Except for specific interfaces, it is generally the case that operating system calls do not rely on user-space to maintain data structures affecting their operation after they are finished. The exceptions will be spelled out in the manual pages.
With sendmsg, in particular, the kernel copies the message header and the control (ancillary) data during the call, so nothing refers to your buffer once the call has returned, whether it succeeded or not. It's fine therefore to use a local buffer as you're doing, and let it be destroyed immediately after the call.
As an example of one exception, aio_write(2) is specifically intended to allow user-space to queue a write operation that will be completed asynchronously. For this call, the data is not consumed until it can be successfully written. Hence, you must not modify the data structures provided in the call until you have confirmed it is complete. That caveat is called out in the NOTES section of the manual page:
... The control block must not be changed while the write operation is in progress. The buffer area being written out must not be accessed during the operation or undefined results may occur. The memory areas involved must remain valid.
In summary: check the manual page for the system call. But most of the time, you don't need to worry about it.
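To make the lifetime point concrete, here is a minimal sketch, assuming Linux and a socket type that accepts SO_TIMESTAMPING as ancillary data. The names TimestampingControl, attachTimestamping and sendWithTimestamping are made up for illustration. The control buffer and the iovec are both locals, and both go out of scope right after sendmsg() returns.

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Control buffer with room for one 32-bit option, aligned for cmsghdr.
struct TimestampingControl {
    alignas(cmsghdr) unsigned char buf[CMSG_SPACE(sizeof(std::uint32_t))];
};

// Fills `ctl` with a single SO_TIMESTAMPING control message and points `msg` at it.
void attachTimestamping(msghdr& msg, TimestampingControl& ctl, std::uint32_t flags) {
    std::memset(ctl.buf, 0, sizeof(ctl.buf));
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);
    cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SO_TIMESTAMPING;
    cmsg->cmsg_len = CMSG_LEN(sizeof(flags));
    std::memcpy(CMSG_DATA(cmsg), &flags, sizeof(flags));  // avoids aliasing casts
}

// Sends `len` bytes with the timestamping flags attached as ancillary data.
ssize_t sendWithTimestamping(int sd, const void* data, std::size_t len,
                             std::uint32_t flags) {
    msghdr msg{};
    iovec iov{const_cast<void*>(data), len};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    TimestampingControl ctl;
    attachTimestamping(msg, ctl, flags);
    // ctl and iov are only needed for the duration of this call.
    return sendmsg(sd, &msg, MSG_DONTWAIT);
}
```

Using memcpy for the payload instead of a reinterpret_cast is a design choice here, not a requirement; it simply sidesteps alignment and aliasing questions.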

Linux: How to mmap a sequence of physically contiguous areas into user space?

In my driver I have certain number of physically contiguous DMA buffers (e.g. 4MB long each) to receive data from a device. They are handled by hardware using the SG list. As the received data will be subjected to intensive processing, I don't want to switch off cache and I will use dma_sync_single_for_cpu after each buffer is filled by DMA.
To simplify data processing, I want those buffers to appear as a single huge, contiguous, circular buffer in the user space.
In case of a single buffer I simply use remap_pfn_range or dma_mmap_coherent. However, I can't use those functions multiple times to map consecutive buffers.
Of course, I can implement the fault operation in the vm_operations so that it finds the pfn of the corresponding page in the right buffer, and inserts it into the vma with vm_insert_pfn.
The acquisition will be really fast, so I can't handle mapping when the real data arrive. But this can be solved easily. To have all mapping ready before the data acquisition starts, I can simply read the whole mmapped buffer in my application before starting the acquisition, so that all pages are already inserted when the first data arrive.
The fault-based trick should work, but maybe there is something more elegant? Just a single function that may be called multiple times to build the whole mapping incrementally?
Additional difficulty is that the solution should be applicable (with minimal adjustments) to kernels starting from 2.6.32 to the newest one.
PS. I have seen that annoying post. Is there a danger that if the application attempts to write something to the mmapped buffer (just doing the in place processing of data), my carefully built mapping will be destroyed by COW?

Below is my solution that works for buffers allocated with dmam_alloc_noncoherent.
Allocation of the buffers:
[...]
for(i=0;i<DMA_NOFBUFS;i++) {
my_dev->buf_addr[i] = dmam_alloc_noncoherent(&my_dev->dev, DMA_BUFLEN, &my_dev->buf_dma_t[i],GFP_USER);
if(my_dev->buf_addr[i] == NULL) {
res = -ENOMEM;
goto err1;
}
//Make buffer ready for filling by the device
dma_sync_single_range_for_device(&my_dev->dev, my_dev->buf_dma_t[i],0,DMA_BUFLEN,DMA_FROM_DEVICE);
}
[...]
Mapping of the buffers:
void swz_mmap_open(struct vm_area_struct *vma)
{
}
void swz_mmap_close(struct vm_area_struct *vma)
{
}
static int swz_mmap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
long offset;
char * buffer = NULL;
int buf_num = 0;
//Calculate the offset (according to info in https://lxr.missinglinkelectronics.com/linux+v2.6.32/drivers/gpu/drm/i915/i915_gem.c#L1195 it is better not to use the vmf->pgoff )
offset = (unsigned long)vmf->virtual_address - vma->vm_start;
buf_num = offset/DMA_BUFLEN;
//Valid buffer numbers are 0 .. DMA_NOFBUFS-1
if(buf_num >= DMA_NOFBUFS) {
printk(KERN_ERR "Access outside the buffer\n");
//A fault handler must return a VM_FAULT_* code, not a negative errno
return VM_FAULT_SIGBUS;
}
offset = offset - buf_num * DMA_BUFLEN;
buffer = my_dev->buf_addr[buf_num];
vm_insert_pfn(vma,(unsigned long)(vmf->virtual_address),virt_to_phys(&buffer[offset]) >> PAGE_SHIFT);
return VM_FAULT_NOPAGE;
}
struct vm_operations_struct swz_mmap_vm_ops =
{
.open = swz_mmap_open,
.close = swz_mmap_close,
.fault = swz_mmap_fault,
};
static int char_sgdma_wz_mmap(struct file *file, struct vm_area_struct *vma)
{
vma->vm_ops = &swz_mmap_vm_ops;
vma->vm_flags |= VM_IO | VM_RESERVED | VM_CAN_NONLINEAR | VM_PFNMAP;
swz_mmap_open(vma);
return 0;
}
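The address arithmetic in the fault handler can be checked in isolation. The following is a user-space sketch of that calculation only, with assumed values for the buffer length and count (kBufLen and kNumBufs stand in for DMA_BUFLEN and DMA_NOFBUFS); BufPos and locate are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kBufLen = 4u * 1024u * 1024u;  // assumed DMA_BUFLEN (4 MB)
constexpr std::size_t kNumBufs = 4;                  // assumed DMA_NOFBUFS

// Result of translating an offset in the virtual circular buffer.
struct BufPos {
    bool valid;          // false if the offset lies past the last buffer
    std::size_t buf;     // index of the physical buffer
    std::size_t offset;  // byte offset inside that buffer
};

// Maps an offset within the mmapped area to (buffer index, offset in buffer).
BufPos locate(std::size_t fault_offset) {
    const std::size_t buf = fault_offset / kBufLen;
    if (buf >= kNumBufs)  // valid indices are 0 .. kNumBufs-1
        return {false, 0, 0};
    return {true, buf, fault_offset % kBufLen};
}
```

For example, an access one page into the second buffer resolves to buffer 1 at offset 4096, and the first byte past the last buffer is rejected.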

Custom SPI driver to implement lseek

I am trying to implement a SPI driver for custom hardware. I have started with a copy of the spidev driver, which has support for almost everything I need.
We're using a protocol that has three parts: a command bit (read / write) an address, and an arbitrary amount of data.
I had assumed that simply adding lseek capabilities would be the best way to do this. "Seek" to the desired address, then read or write any number of bytes. I created a custom .llseek in the new driver's file_operations, but I have never seen that function even be called. I have tried using fseek(), lseek(), and pread() and none of those functions seem to call the new my_lseek() function. Every call reports "errno 29 ESPIPE Illegal Seek"
The device is defined in the board.c file:
static struct spi_board_info my_spi_board_info[] __initdata = {
[0] = {
.modalias = "myspi",
.bus_num = 1,
.chip_select = 0,
.max_speed_hz = 3000000,
.mode = SPI_MODE_0,
.controller_data = &spidev_mcspi_config,
}, ...
I suspect there might be something with the way that the dev files get created, mainly because the example that I found references filp->f_pos
static loff_t myspi_llseek(struct file *filp, loff_t off, int whence)
{
...
newpos = filp->f_pos + off;
...
}
So my questions are: Is there a way to have this driver (lightly modified spidev) support the "seek" call? At what point does this get defined to return errno 29? Will I have to start from a new driver and not be able to rely on the spi_board_info() and spi_register_board_info() setup?
Only one driver in the /drivers/spi directory (spi-dw) references lseek, and they use the default_llseek implementation. There are a couple of "hacks" that we've come up with to get everything up and running, but I tend to be a person who wants to learn to get it done the right way.
Any suggestions are greatly appreciated! (PS, the kernel version is 3.4.48 for an OMAP Android system)

An SPI driver does not support any llseek or fseek functionality by itself. struct spi_driver only has these callback functions:
struct spi_driver {
const struct spi_device_id *id_table;
int (*probe)(struct spi_device *spi);
int (*remove)(struct spi_device *spi);
void (*shutdown)(struct spi_device *spi);
int (*suspend)(struct spi_device *spi, pm_message_t mesg);
int (*resume)(struct spi_device *spi);
struct device_driver driver;
};
drivers/spi/spi-dw.c, on the other hand, creates a file in the debugfs filesystem (debugfs_create_file("registers", S_IFREG | S_IRUGO, dws->debugfs, (void *)dws, &dw_spi_regs_ops);), and it is the file_operations of that file that provide the llseek callback:
static const struct file_operations dw_spi_regs_ops = {
.owner = THIS_MODULE,
.open = simple_open,
.read = dw_spi_show_regs,
.llseek = default_llseek,
};
The file_operations structure is defined in linux/fs.h and holds pointers to functions defined by the driver that perform various operations on the device. Each field of the structure corresponds to the address of some function defined by the driver to handle a requested operation.
lseek: a system call that is used to change the location of the read/write pointer of a file descriptor.
SPI: The "Serial Peripheral Interface" (SPI) is a synchronous four-wire serial link used to connect microcontrollers to sensors, memory, and peripherals. SPI itself cannot provide any lseek or fseek functionality.
There are two types of SPI driver (https://www.kernel.org/doc/Documentation/spi/spi-summary):
Controller drivers ... controllers may be built into System-On-Chip
processors, and often support both Master and Slave roles.
These drivers touch hardware registers and may use DMA.
Or they can be PIO bitbangers, needing just GPIO pins.
Protocol drivers ... these pass messages through the controller
driver to communicate with a Slave or Master device on the
other side of an SPI link.
If you want to use read, write and llseek, you will have to register a character driver on top of SPI. Then you will be able to meet your requirement.
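As a rough illustration of what such a character driver's llseek handler has to compute, here is a user-space sketch of the position arithmetic. The function name seekPosition and the `size` parameter (the size of the device's address space) are assumptions for the example, not part of any kernel API. It may also be worth checking whether the copied spidev open() still calls nonseekable_open(); as far as I know that makes lseek() fail with ESPIPE before any .llseek handler is reached.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <cstdio>  // SEEK_SET, SEEK_CUR, SEEK_END

// Returns the new file position, or a negative errno value on error.
// `cur` is the current position (f_pos), `size` the addressable range.
std::int64_t seekPosition(std::int64_t cur, std::int64_t off, int whence,
                          std::int64_t size) {
    std::int64_t newpos = 0;
    switch (whence) {
    case SEEK_SET: newpos = off;        break;  // absolute address
    case SEEK_CUR: newpos = cur + off;  break;  // relative to current address
    case SEEK_END: newpos = size + off; break;  // relative to end of range
    default:       return -EINVAL;
    }
    if (newpos < 0 || newpos > size)
        return -EINVAL;  // outside the device's address space
    return newpos;
}
```

In a driver, the result would be stored in filp->f_pos and later used by read/write as the address part of the SPI protocol.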

Unclear logic behind pl011_tx_chars() in amba-pl011 Linux kernel module

I'm trying to understand how Linux driver for AMBA serial port (amba-pl011.c) sends characters in non-DMA mode. For port operations, this driver registers only following callbacks:
static struct uart_ops amba_pl011_pops = {
.tx_empty = pl011_tx_empty,
.set_mctrl = pl011_set_mctrl,
.get_mctrl = pl011_get_mctrl,
.stop_tx = pl011_stop_tx,
.start_tx = pl011_start_tx,
.stop_rx = pl011_stop_rx,
.enable_ms = pl011_enable_ms,
.break_ctl = pl011_break_ctl,
.startup = pl011_startup,
.shutdown = pl011_shutdown,
.flush_buffer = pl011_dma_flush_buffer,
.set_termios = pl011_set_termios,
.type = pl011_type,
.release_port = pl011_release_port,
.request_port = pl011_request_port,
.config_port = pl011_config_port,
.verify_port = pl011_verify_port,
.poll_init = pl011_hwinit,
.poll_get_char = pl011_get_poll_char,
.poll_put_char = pl011_put_poll_char };
As you can see, there's no character sending operation among them, namely, pl011_tx_chars() function is not listed there. Since pl011_tx_chars() is declared static, it is not exposed outside the module. I found that within the module it is called only from pl011_int() function which is an interrupt handler. It is called whenever UART011_TXIS occurs:
if (status & UART011_TXIS) pl011_tx_chars(uap);
The function pl011_tx_chars() itself writes characters from the circular buffer to the UART01x_DR port until the FIFO queue size is reached (the function then returns, so more data will be written at the next interrupt) or until the circular buffer is empty (pl011_stop_tx() is then called). As we can see, pl011_start_tx() and pl011_stop_tx() are listed in the AMBA port operations (so they can be called as callbacks despite their local static declaration). Seems reasonable; the thing is, these two functions do something very simple:
static void pl011_stop_tx(struct uart_port *port)
{
struct uart_amba_port *uap = (struct uart_amba_port *)port;
uap->im &= ~UART011_TXIM;
writew(uap->im, uap->port.membase + UART011_IMSC);
pl011_dma_tx_stop(uap);
}
static void pl011_start_tx(struct uart_port *port)
{
struct uart_amba_port *uap = (struct uart_amba_port *)port;
if (!pl011_dma_tx_start(uap)) {
uap->im |= UART011_TXIM;
writew(uap->im, uap->port.membase + UART011_IMSC);
}
}
Since I don't have CONFIG_DMA_ENGINE set, pl011_dma_tx_start() and pl011_dma_tx_stop() are just stubs:
static inline void pl011_dma_tx_stop(struct uart_amba_port *uap)
{
}
static inline bool pl011_dma_tx_start(struct uart_amba_port *uap)
{
return false;
}
Seems like the only thing that pl011_start_tx() does is to arm UART011_TXIM interrupt while the only thing that pl011_stop_tx() does is to disarm it. Nothing initiates the transmission!
I looked at serial_core.c - it's the only file where the start_tx operation is invoked, in four places (through the registered callback). The most promising place is the uart_write() function. It fills the circular buffer with data and calls the local static uart_start() function, which is very simple:
static void __uart_start(struct tty_struct *tty)
{
struct uart_state *state = tty->driver_data;
struct uart_port *port = state->uart_port;
if (!uart_circ_empty(&state->xmit) && state->xmit.buf &&
!tty->stopped && !tty->hw_stopped)
port->ops->start_tx(port);
}
static void uart_start(struct tty_struct *tty)
{
struct uart_state *state = tty->driver_data;
struct uart_port *port = state->uart_port;
unsigned long flags;
spin_lock_irqsave(&port->lock, flags);
__uart_start(tty);
spin_unlock_irqrestore(&port->lock, flags);
}
As you can see, no one sends initial characters to the UART port, circular buffer is filled and everything is waiting for UART011_TXIS interrupt.
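The flow described so far can be written down as a small user-space model. This is only a sketch of the sequence as described above, not of the real driver: UartModel and its members are hypothetical, and `burst` is an assumed per-interrupt budget. It deliberately does not model whether unmasking the interrupt raises it; it only shows that nothing reaches the wire until a TX interrupt is delivered.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

struct UartModel {
    explicit UartModel(std::size_t b) : burst(b) {}

    std::deque<char> xmit;        // stands in for the circular buffer
    std::string wire;             // characters written to the data register
    bool tx_irq_enabled = false;  // stands in for the TX interrupt mask bit
    std::size_t burst;            // assumed number of characters per interrupt

    void start_tx() { tx_irq_enabled = true; }   // only unmasks the interrupt
    void stop_tx()  { tx_irq_enabled = false; }  // only masks the interrupt

    // Stand-in for the interrupt handler calling the tx_chars routine.
    void tx_interrupt() {
        if (!tx_irq_enabled)
            return;
        for (std::size_t i = 0; i < burst && !xmit.empty(); ++i) {
            wire.push_back(xmit.front());
            xmit.pop_front();
        }
        if (xmit.empty())
            stop_tx();  // nothing left to send
    }
};
```

In this model, calling start_tx() alone leaves `wire` empty, which is exactly the situation the rest of the question asks about.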
Is it possible that arming the UART011_TXIM interrupt instantly emits UART011_TXIS? I looked into DDI0183.pdf (PrimeCell® UART (PL011) Technical Reference Manual), Chapter 3: Programmers Model, section 3.4: Interrupts, subsection 3.4.3 UARTTXINTR. What it says is:
....
The transmit interrupt changes state when one of the following events occurs:
• If the FIFOs are enabled and the transmit FIFO reaches the programmed trigger
level. When this happens, the transmit interrupt is asserted HIGH. The transmit
interrupt is cleared by writing data to the transmit FIFO until it becomes greater
than the trigger level, or by clearing the interrupt.
• If the FIFOs are disabled (have a depth of one location) and there is no data
present in the transmitters single location, the transmit interrupt is asserted HIGH.
It is cleared by performing a single write to the transmit FIFO, or by clearing the
interrupt.
....
The note below is even more interesting:
....
The transmit interrupt is based on a transition through a level, rather than on the level
itself. When the interrupt and the UART is enabled before any data is written to the
transmit FIFO the interrupt is not set. The interrupt is only set once written data leaves
the single location of the transmit FIFO and it becomes empty.
....
The emphasis above is mine. I don't know if my English is not sufficient, but from the words above I can't find where it states that unlocking transmit interrupt can be used for triggering transmit routine. What am I missing?

The ARM docs say that the PL011 is a "16550-ish" UART. This sort of gets them off the hook for fully specifying its behavior and instead sends you to the 16550 docs, which state in the "FIFO interrupt mode operation" section...
When the XMIT FIFO and transmitter interrupts are enabled (FCR0=1,
IER1=1), XMIT interrupts will occur as follows: A. The transmitter
holding register interrupt (02) occurs when the XMIT FIFO is empty; it
is cleared as soon as the transmitter holding register is written to
(1 to 16 characters may be written to the XMIT FIFO while servicing
this interrupt) or the IIR is read.
So, it appears that if the FIFO and TX holding register are empty and you enable TX interrupts, you should immediately see a TX interrupt that kickstarts the sending process and fills the holding register and then the FIFO. Once those drain back down below the FIFO trigger, then another interrupt will be generated to keep the process going for as long as there is more buffered data to be sent.

Looking for a lock-free RT-safe single-reader single-writer structure

I'm looking for a lock-free design conforming to these requisites:
a single writer writes into a structure and a single reader reads from this structure (this structure exists already and is safe for simultaneous read/write)
but at some time, the structure needs to be changed by the writer, which then initialises, switches and writes into a new structure (of the same type but with new content)
and at the next time the reader reads, it switches to this new structure (if the writer multiply switches to a new lock-free structure, the reader discards these structures, ignoring their data).
The structures must be reused, i.e. no heap memory allocation/free is allowed during write/read/switch operation, for RT purposes.
I have currently implemented a ringbuffer containing multiple instances of these structures; but this implementation suffers from the fact that when the writer has used all the structures present in the ringbuffer, there is no more place to change from structure... But the rest of the ringbuffer contains some data which don't have to be read by the reader but can't be re-used by the writer. As a consequence, the ringbuffer does not fit this purpose.
Any idea (name or pseudo-implementation) of a lock-free design? Thanks for having considered this problem.

Here's one. The keys are that there are three buffers and the reader reserves the buffer it is reading from. The writer writes to one of the other two buffers. The risk of collision is minimal. Plus, this expands. Just make your member arrays one element longer than the number of readers plus the number of writers.
#include <atomic>

// Data is the payload type; it must be default-constructible and copyable.
class RingBuffer
{
public:
    RingBuffer() : lastFullWriteIndex(0)
    {
        //Initialize the elements of dataBeingRead to false
        for(unsigned int i=0; i<DATA_COUNT; i++)
        {
            dataBeingRead[i] = false;
        }
    }
    Data read()
    {
        // You may want to check to make sure write has been called once here
        // to prevent read from grabbing junk data. Else, initialize the elements
        // of dataArray to something valid.
        unsigned int indexToRead = lastFullWriteIndex;
        Data dataCopy;
        dataBeingRead[indexToRead] = true;
        dataCopy = dataArray[indexToRead];
        dataBeingRead[indexToRead] = false;
        return dataCopy;
    }
    void write( const Data& dataArg )
    {
        unsigned int writeIndex(0);
        //Search for an unused piece of data: one that is neither being read
        // nor the most recently published one.
        // It's O(n), but plenty fast enough for small arrays.
        while( writeIndex < DATA_COUNT - 1 &&
               ( dataBeingRead[writeIndex] || writeIndex == lastFullWriteIndex ) )
        {
            writeIndex++;
        }
        dataArray[writeIndex] = dataArg;
        lastFullWriteIndex = writeIndex;
    }
private:
    static const unsigned int DATA_COUNT = 3;
    std::atomic<unsigned int> lastFullWriteIndex;
    Data dataArray[DATA_COUNT];
    std::atomic<bool> dataBeingRead[DATA_COUNT];
};
Note: The way it's written here, there are two copies to read your data. If you pass your data out of the read function through a reference argument, you can cut that down to one copy.
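A related design that removes the remaining window between reading the index and setting the flag is the classic triple buffer, where the slot hand-over is a single atomic exchange. The sketch below is my own illustration of that technique, not the code from this answer; TripleBuffer and its member names are made up. It assumes exactly one writer thread and one reader thread, and it never allocates after construction.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstdint>

// Single-writer, single-reader triple buffer: the reader always gets the
// most recently published value and silently skips older ones.
template <typename T>
class TripleBuffer {
public:
    // Writer side: fill the private back slot, then publish it.
    void write(const T& value) {
        slots_[back_] = value;
        // Swap the back slot with the shared slot and mark it as fresh.
        const std::uint8_t old = shared_.exchange(
            static_cast<std::uint8_t>(back_ | kFresh), std::memory_order_acq_rel);
        back_ = static_cast<std::uint8_t>(old & kIndexMask);
    }

    // Reader side: copies the newest value into `out` and returns true if
    // something was published since the last successful read.
    bool read(T& out) {
        if ((shared_.load(std::memory_order_acquire) & kFresh) == 0)
            return false;
        // Take the fresh slot and hand the old front slot back for reuse.
        const std::uint8_t old = shared_.exchange(front_, std::memory_order_acq_rel);
        front_ = static_cast<std::uint8_t>(old & kIndexMask);
        out = slots_[front_];
        return true;
    }

private:
    static constexpr std::uint8_t kIndexMask = 0x3;  // low bits: slot index
    static constexpr std::uint8_t kFresh = 0x4;      // set when unread data waits

    std::array<T, 3> slots_{};
    std::uint8_t back_ = 0;                // touched only by the writer
    std::uint8_t front_ = 1;               // touched only by the reader
    std::atomic<std::uint8_t> shared_{2};  // slot index plus the kFresh bit
};
```

The three indices always form a permutation of 0, 1 and 2, so the writer and the reader never hold the same slot. If the writer publishes several times between two reads, the intermediate values are overwritten and the reader only sees the latest one.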

You're on the right track.
Lock free communication of fixed messages between threads/processes/processors
Fixed-size ring buffers can be used for lock-free communication between threads, processes or processors if there is one producer and one consumer. Some checks to perform:
the head variable is written only by the producer (as an atomic action after writing)
the tail variable is written only by the consumer (as an atomic action after reading)
Pitfall: introducing a size variable or a buffer full/empty flag; these are typically written by both the producer and the consumer and hence will give you an issue.
I generally use ring buffers for this purpose. The most important lesson I've learned is that a ring buffer of size n can never contain more than n-1 elements. This way the head and tail variables are written only by the producer and the consumer respectively.
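The rules above can be sketched as follows, assuming C++11 atomics; SpscRing is a hypothetical name. The producer is the only writer of head_, the consumer the only writer of tail_, and one slot is always left unused so that "full" and "empty" can be told apart without a shared size variable.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Single-producer, single-consumer ring buffer holding at most N-1 items.
template <typename T, std::size_t N>
class SpscRing {
public:
    // Producer only. Returns false if the ring is full.
    bool push(const T& item) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t next = (head + 1) % N;
        if (next == tail_.load(std::memory_order_acquire))
            return false;  // full: one slot always stays unused
        buf_[head] = item;
        head_.store(next, std::memory_order_release);  // publish after writing
        return true;
    }

    // Consumer only. Returns false if the ring is empty.
    bool pop(T& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return false;  // empty
        out = buf_[tail];
        tail_.store((tail + 1) % N, std::memory_order_release);  // release slot
        return true;
    }

private:
    std::array<T, N> buf_{};
    std::atomic<std::size_t> head_{0};  // written only by the producer
    std::atomic<std::size_t> tail_{0};  // written only by the consumer
};
```

With N = 4 the ring accepts three items and rejects the fourth until the consumer has popped one.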
Extension for large/variable size blocks
To use buffers in a real time environment, you can either use memory pools (often available in optimized form in real time operating systems) or decouple allocation from usage. The latter fits to the question, I believe.
If you need to exchange large blocks, I suggest using a pool of buffer blocks and communicating pointers to the buffers through a queue. So use a third queue that carries buffer pointers. This way the allocations can be done in the application (background) and your real-time portion has access to a variable amount of memory.
Application:
while (blockQueue.full != true)
{
    buf = allocate block of memory from heap or buffer pool
    msg = { .... , buf };
    blockQueue.Put(msg)
}
Producer:
msg = blockQueue.Get()
// fill the block that msg.buf points to
pQueue.Put(msg)
Consumer:
if (pQueue.Empty == false)
{
    msg = pQueue.Get()
    // use info in msg, with buf pointer
    // optionally indicate that buf is no longer used
}
