How to combine multiple struct BIOs into a single struct request? - linux

I'm working on Linux kernel version 2.6.39.1, and am developing a block device driver. In this regard, I want to combine multiple struct bios into a single struct request, which is then added to the request_queue for processing by the device driver, namely -- scsi_request_fn().
I tried using the ->bi_next field of struct bio to link multiple struct bios that I have composed, thereby creating a linked list of struct bios. When I call submit_bio() to submit a bio to the block device layer for I/O, this BUG_ON() is triggered because the code expects bio->bi_next to be NULL.
Is there a way to link several struct bios into a single struct request before sending it to lower layers for servicing?
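For reference, here is a simplified sketch of what I was attempting (bdev, start_sector, pages[] and my_end_io() are placeholders, kernel 2.6.39 API):

/* Simplified sketch of what I tried: chaining bios via bi_next and
 * submitting the head. The block layer expects bi_next to be NULL on
 * entry (it uses the field internally), so the BUG_ON() fires. */
static void submit_chained_bios(struct block_device *bdev,
                                sector_t start_sector,
                                struct page **pages, int nr_bios)
{
        struct bio *head = NULL, *tail = NULL;
        int i;

        for (i = 0; i < nr_bios; i++) {
                struct bio *bio = bio_alloc(GFP_NOIO, 1);

                bio->bi_bdev   = bdev;
                bio->bi_sector = start_sector + i * (PAGE_SIZE >> 9);
                bio_add_page(bio, pages[i], PAGE_SIZE, 0);
                bio->bi_end_io = my_end_io;     /* placeholder completion */

                if (!head)
                        head = bio;
                else
                        tail->bi_next = bio;    /* the offending link */
                tail = bio;
        }

        submit_bio(WRITE, head);                /* BUG_ON(bio->bi_next) */
}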

I'm not sure how to string multiple struct bio together, but you might want to take a look at the "task collector" implementation in libsas and the aic94xx driver for an alternate approach. There isn't much documentation, but the libsas documentation describes it as
Some hardware (e.g. aic94xx) has the capability to DMA more
than one task at a time (interrupt) from host memory. Task
Collector Mode is an optional feature for HAs which support
this in their hardware. (Again, it is completely optional
even if your hardware supports it.)
In Task Collector Mode, the SAS Layer would do natural
coalescing of tasks and at the appropriate moment it would
call your driver to DMA more than one task in a single HA
interrupt. DMBS may want to use this by insmod/modprobe
setting the lldd_max_execute_num to something greater than 1.
Effectively, this lets the block layer (a.k.a. BIO) remain unchanged, but multiple requests are accumulated at the driver layer and submitted together.

Thanks for the reply, @ctuffli. I've decided to use a structure similar to the one described here. Basically, I allocate a struct packet_data which contains pointers to all struct bios that should be merged to form one single struct bio (and later on, one single struct request). In addition, I store some driver-related information in this struct packet_data as well. Next, I allocate a new struct bio (let's call it "merged_bio"), copy all the pages from the list of original BIOs, and then make merged_bio->bi_private point to the struct packet_data. This last hack allows me to keep track of the list of original BIOs, and also to call bio_endio() to end I/O on all individual BIOs once the merged_bio has been successfully transferred.
Not sure if this is the smartest way to do this, but it does what I intended! :^)
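A rough sketch of the idea, with placeholder names (illustrative only, not the exact driver code; it assumes each original bio carries a single page):

/* Illustrative sketch -- names like packet_data and MAX_MERGED are
 * placeholders based on the description above. */
#define MAX_MERGED 16

struct packet_data {
        struct bio *orig_bios[MAX_MERGED];  /* the original bios */
        int         nr_bios;
        /* ... driver-private bookkeeping ... */
};

static void merged_end_io(struct bio *merged_bio, int err)
{
        struct packet_data *pkt = merged_bio->bi_private;
        int i;

        /* End I/O on every original bio once the merged bio completes. */
        for (i = 0; i < pkt->nr_bios; i++)
                bio_endio(pkt->orig_bios[i], err);

        bio_put(merged_bio);
        kfree(pkt);
}

static void submit_merged(struct packet_data *pkt, struct block_device *bdev,
                          sector_t sector, int rw)
{
        struct bio *merged_bio = bio_alloc(GFP_NOIO, pkt->nr_bios);
        int i;

        merged_bio->bi_bdev   = bdev;
        merged_bio->bi_sector = sector;

        /* Copy the page of each original bio into the merged bio
         * (assumes one page per original bio). */
        for (i = 0; i < pkt->nr_bios; i++) {
                struct bio_vec *bv = bio_iovec(pkt->orig_bios[i]);
                bio_add_page(merged_bio, bv->bv_page, bv->bv_len, bv->bv_offset);
        }

        merged_bio->bi_private = pkt;
        merged_bio->bi_end_io  = merged_end_io;
        submit_bio(rw, merged_bio);
}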

Related

Linux PCIe DMA Driver (Xilinx XDMA)

I am currently working with the Xilinx XDMA driver (see here for source code: XDMA Source), and am attempting to get it to run (before you ask: I have contacted my technical support point of contact and the Xilinx forum is riddled with people having the same issue). However, I may have found a snag in Xilinx's code that might be a deal breaker for me. I am hoping there is something that I'm not considering.
First off, there are two primary modes of the driver, AXI-Memory Mapped (AXI-MM) and AXI-Streaming (AXI-ST). For my particular application, I require AXI-ST, since data will continuously be flowing from the device.
The driver is written to take advantage of scatter-gather lists. In AXI-MM mode, this works because reads are rather random events (i.e., there isn't a flow of data out of the device; instead, the userspace application simply requests data when it needs to). As such, the DMA transfer is built up, the data is transferred, and the transfer is then torn down. This is a combination of get_user_pages(), pci_map_sg(), and pci_unmap_sg().
For AXI-ST, things get weird, and the source code is far from orthodox. The driver allocates a circular buffer that data is meant to flow into continuously. This buffer is generally sized to be somewhat large (mine is set on the order of 32MB), since you want to be able to handle transient events where the userspace application forgets about the driver for a while and can later work off the incoming data.
Here's where things get wonky... the circular buffer is allocated using vmalloc32() and the pages from that allocation are mapped in the same way as the userspace buffer is in AXI-MM mode (i.e., using the pci_map_sg() interface). As a result, because the circular buffer is shared between the device and CPU, every read() call requires me to call pci_dma_sync_sg_for_cpu() and pci_dma_sync_sg_for_device(), which absolutely destroys my performance (I cannot keep up with the device!), since this works on the entire buffer. Funny enough, Xilinx never included these sync calls in their code, so I first knew I had a problem when I edited their test script to attempt more than one DMA transfer before exiting and the resulting data buffer was corrupted.
As a result, I'm wondering how I can fix this. I've considered rewriting the code to build up my own buffer allocated using pci_alloc_consistent()/dma_alloc_coherent(), but this is easier said than done. Namely, the code is architected to assume using scatter-gather lists everywhere (there appears to be a strange, proprietary mapping between the scatter-gather list and the memory descriptors that the FPGA understands).
Are there any other API calls I should be aware of? Can I use the "single" variants (i.e., pci_dma_sync_single_for_cpu()) via some translation mechanism to avoid syncing the entire buffer? Alternatively, is there perhaps some function that can make the circular buffer allocated with vmalloc() coherent?
Alright, I figured it out.
Basically, my assumptions and/or understanding of the kernel documentation regarding the sync API were totally incorrect. Namely, I was wrong on two key assumptions:
If the buffer is never written to by the CPU, you don't need to sync for the device. Removing this call doubled my read() throughput.
You don't need to sync the entire scatterlist. Instead, now in my read() call, I figure out what pages are going to be affected by the copy_to_user() call (i.e., what is going to be copied out of the circular buffer) and only sync those pages that I care about. Basically, I can call something like pci_dma_sync_sg_for_cpu(lro->pci_dev, &transfer->sgm->sgl[sgl_index], pages_to_sync, DMA_FROM_DEVICE) where sgl_index is where I figured the copy will start and pages_to_sync is how large the data is in number of pages.
With the above two changes my code now meets my throughput requirements.
I think XDMA was originally written for x86, in which case the sync functions do nothing.
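Here is a minimal sketch of that partial sync, assuming one PAGE_SIZE page per scatterlist entry and ignoring wrap-around at the end of the ring (lro and transfer follow the driver's naming used above; read_offset and bytes_to_copy are illustrative):

/* Sync only the pages that the upcoming copy_to_user() will touch. */
size_t start  = read_offset;     /* offset into the ring where the copy begins */
size_t length = bytes_to_copy;   /* how many bytes will be copied out          */

int sgl_index     = start / PAGE_SIZE;
int pages_to_sync = DIV_ROUND_UP((start % PAGE_SIZE) + length, PAGE_SIZE);

pci_dma_sync_sg_for_cpu(lro->pci_dev,
                        &transfer->sgm->sgl[sgl_index],
                        pages_to_sync,
                        DMA_FROM_DEVICE);

/* ... copy_to_user() from the circular buffer ... */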
It does not seem likely that you can use the single sync variants unless you modify the circular buffer. Replacing the circular buffer with a list of buffers to send seems like a good idea to me. You pre-allocate a number of such buffers and have a list of buffers to send and a free list for your app to reuse.
If you're using a Zynq FPGA, you could connect the DMA engine to the ACP port so that FPGA memory access will be coherent. Alternatively, you can map the memory regions as uncached/buffered instead of cached.
Finally, in my FPGA applications, I map the control registers and buffers into the application process and only implement mmap() and poll() in the driver, to give apps more flexibility in how they do DMA. I generally implement my own DMA engines.
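For what it's worth, a minimal sketch of that style of driver, assuming the buffer comes from dma_alloc_coherent() (the mydrv names and mydrv_poll() are placeholders):

/* Illustrative mmap(): hand a coherent DMA buffer to userspace so the
 * application can read it directly, with no sync calls needed. */
static int mydrv_mmap(struct file *filp, struct vm_area_struct *vma)
{
        struct mydrv *drv = filp->private_data;
        size_t size = vma->vm_end - vma->vm_start;

        if (size > drv->buf_size)
                return -EINVAL;

        return dma_mmap_coherent(drv->dev, vma, drv->buf_virt,
                                 drv->buf_dma, size);
}

static const struct file_operations mydrv_fops = {
        .owner = THIS_MODULE,
        .mmap  = mydrv_mmap,
        .poll  = mydrv_poll,     /* wake readers when new data arrives */
};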
Pete, I am the original developer of the driver code (before the X of XDMA came into place).
The ringbuffer was always an unorthodox thing and indeed meant for cache-coherent systems, and it is disabled by default. Its initial purpose was to get rid of the DMA (re)start latency; even with full asynchronous I/O support (even with zero-latency descriptor chaining in some cases) we had use cases where this could not be guaranteed, and where a true hardware ringbuffer/cyclic/loop mode was required.
There is no equivalent to a ringbuffer API in Linux, so it's open-coded a bit.
I am happy to re-think the IP/driver design.
Can you share your fix?

Vulkan: Concurrent host-writes and device reads to separate parts of same VkMemory

To transfer my static data into the GPU, I'm thinking of having a single staging VkMemory object (ballpark 64MB), and using it as a rotating queue. However, I have multiple threads producing content (eg: rendering glyphs, loading files, procedural) and I'd like it if they could upload their data entirely by themselves (i.e. write plus submit Vulkan transfer commands).
I'm intending to keep the entire staging VkMemory permanently mapped (if this is dumb please say so) at least during loading (but perhaps longer if I want to stream data).
To achieve the above, once a thread's data is fully written/flushed to staging I'd like it to be able to immediately submit GPU transfer commands.
However, that means the GPU will be reading from one part of the VkMemory while other threads may be writing/flushing to it.
AFAIK I will also need to use image memory barriers for the transition from VK_IMAGE_LAYOUT_PREINITIALIZED to VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL.
I couldn't find anything in the spec explicitly saying this was legal or illegal, only that care should be taken to ensure synchronization. However, I didn't find enough detail for me to be sure one way or the other.
NOTE: The staging queue will need to ensure transfers have been completed before overwriting anything; I intend to keep a complementary queue of VkFences for this.
Questions:
Is this OK?
Do I need to align each separate object to a page boundary? Or something else.
Am I correct in assuming that the image memory barrier (above) won't require the device to write to the staging memory?
Yes, the spec says that the regions being read from and written to must be synchronized.
If the memory is not coherent, then you must align the blocks being read from or written to to nonCoherentAtomSize.
Source: the Vulkan spec, in the note after the declaration of vkMapMemory:
vkMapMemory does not check whether the device memory is currently in
use before returning the host-accessible pointer. The application must
guarantee that any previously submitted command that writes to this
range has completed before the host reads from or writes to that
range, and that any previously submitted command that reads from that
range has completed before the host writes to that region (see here
for details on fulfilling such a guarantee). If the device memory was
allocated without the VK_MEMORY_PROPERTY_HOST_COHERENT_BIT set, these
guarantees must be made for an extended range: the application must
round down the start of the range to the nearest multiple of
VkPhysicalDeviceLimits::nonCoherentAtomSize, and round the end of the
range up to the nearest multiple of
VkPhysicalDeviceLimits::nonCoherentAtomSize.
A layout transition may write to the memory; however, barriers do their own synchronization with regard to previous and subsequent memory accesses.
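For the non-coherent case, here is a sketch of rounding a flushed range out to nonCoherentAtomSize before submitting the transfer commands (writeOffset, writeSize, memorySize, stagingMemory and limits are illustrative names):

/* Flush only the sub-range this thread wrote, rounded out to
 * nonCoherentAtomSize as the quoted spec note requires. */
VkDeviceSize atom  = limits.nonCoherentAtomSize;
VkDeviceSize start = (writeOffset / atom) * atom;           /* round down */
VkDeviceSize size  = writeOffset + writeSize - start;
size = ((size + atom - 1) / atom) * atom;                   /* round up   */
if (start + size > memorySize)
        size = VK_WHOLE_SIZE;   /* spec allows flushing to the end of the mapping */

VkMappedMemoryRange range = {
        .sType  = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE,
        .memory = stagingMemory,
        .offset = start,
        .size   = size,
};
vkFlushMappedMemoryRanges(device, 1, &range);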

How to transfer data via DMA from RAM to RAM?

I want to write a kernel module that can transfer data via DMA from RAM to RAM. There are some posts that discuss this, but I don't really get it: some say it is possible, others say it isn't.
If I understood LDD3 right, RAM-to-RAM copying isn't possible with the Linux DMA API, but drivers/dma/dmaengine.c provides a DMA_MEMCPY flag for a "DMA Transfer Type", so there should be a way.
Is this correct? Can I use a DMA engine to transfer data from one RAM address to another?
If it is hardware dependent, how can I determine whether my system supports DMA memcpy?
As you correctly pointed out, DMA_MEMCPY should be used to perform a RAM-to-RAM copy. It is described in Documentation/dmaengine/provider.txt. Here is just a related excerpt; please see the whole file for more details:
Supported transaction types
The next thing you need is to set which transaction types your device
(and driver) supports.
Our dma_device structure has a field called cap_mask that holds the
various types of transaction supported, and you need to modify this
mask using the dma_cap_set function, with various flags depending on
transaction types you support as an argument.
All those capabilities are defined in the dma_transaction_type enum,
in include/linux/dmaengine.h
Currently, the types available are:
DMA_MEMCPY
The device is able to do memory to memory copies
Just to summarize:
It depends on your DMA controller. Some are able to do RAM-to-RAM transactions, some aren't.
For example, on OMAP-based SoCs, the DMA controller does this (in drivers/dma/omap-dma.c, in the omap_dma_probe() function):
dma_cap_set(DMA_MEMCPY, od->ddev.cap_mask);
This way you can later check in your driver whether your DMA controller is capable of RAM-to-RAM transactions. See how it's done in drivers/dma/dmatest.c, in the dmatest_add_channel() function:
if (dma_has_cap(DMA_MEMCPY, dma_dev->cap_mask)) {
If you need an example on how to use DMA API to perform RAM-to-RAM transaction, please see drivers/dma/dmatest.c.
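For reference, a minimal sketch of a dmaengine client doing a RAM-to-RAM copy (error handling trimmed; dmatest.c shows the full pattern; src_dma/dst_dma are dma_addr_t handles obtained elsewhere, e.g. via dma_map_single()):

/* Sketch of a dmaengine client performing a RAM-to-RAM memcpy. */
static int ram_to_ram_copy(dma_addr_t dst_dma, dma_addr_t src_dma, size_t len)
{
        dma_cap_mask_t mask;
        struct dma_chan *chan;
        struct dma_async_tx_descriptor *tx;
        dma_cookie_t cookie;

        dma_cap_zero(mask);
        dma_cap_set(DMA_MEMCPY, mask);

        /* Ask the framework for any channel whose controller set DMA_MEMCPY. */
        chan = dma_request_channel(mask, NULL, NULL);
        if (!chan)
                return -ENODEV;

        tx = dmaengine_prep_dma_memcpy(chan, dst_dma, src_dma, len,
                                       DMA_PREP_INTERRUPT);
        if (!tx) {
                dma_release_channel(chan);
                return -EIO;
        }

        cookie = dmaengine_submit(tx);
        dma_async_issue_pending(chan);

        /* Busy-wait for completion; a real driver would use a callback. */
        while (dma_async_is_tx_complete(chan, cookie, NULL, NULL) != DMA_COMPLETE)
                cpu_relax();

        dma_release_channel(chan);
        return 0;
}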

SPI: Linux driver model

I'm new to SPI; the Linux kernel provides an API for declaring SPI busses and devices, and for managing them according to the standard Linux driver model.
You can find the description of the struct spi_master here: https://www.kernel.org/doc/htmldocs/device-drivers/API-struct-spi-master.html
The description at the link above says that "Each device may be configured to use a different clock rate, since those shared signals are ignored unless the chip is selected". To put the sentence in context, I have to say that by "device" they mean an SPI slave device, and by "those shared signals" they mean the MOSI, MISO and SCK signals.
In fact, in the struct spi_device (https://www.kernel.org/doc/htmldocs/device-drivers/API-struct-spi-device.html) there is an attribute named max_speed_hz that is not present in the struct spi_master. So I can understand the first part of the statement above: "Each device may be configured to use a different clock rate".
But what does the second part mean? Does "since those shared signals are ignored unless the chip is selected" mean that I'm allowed to use different clock rates, but only one at a time, by enabling/disabling the slaves with different rates?
Thank you for your help! Regards,
--
Matteo
@Matteo M.: I think you actually are not allowed to simultaneously set SS1, SS2 and SS3 to zero and in this way enable all three SPI slaves at the same moment in time. The reason is that the SPI slaves, while receiving data on the MOSI line, simultaneously send back data on the MISO line. If all three slaves actually put data on the (shared) MISO line, really bad things could happen, both regarding the data and regarding the electrical side.
SPI is a very loose "standard": there are not many rules to follow, which is good (and bad, I guess). It is good because it is flexible. It is bad because it can be implemented differently depending on the specific hardware you are dealing with. Some devices support only half-duplex communication, which as you know requires coordination of when the bus can be driven. Select lines (chip enable, slave select, whatever you want to call them) provide a handy way to do this without using bits to identify which slave should get a message off the bus.
In the full-duplex mode, where data is dropped onto the bus from the master and from the slave on each clock pulse, the select line could be very much required to prevent bad things, as Wolfgang stated. I want to stress could be required; it is completely reasonable to have perhaps a master processor communicating with other processors that only drive the bus when responding to some specific bit pattern (like for instance, an "address")... More software/firmware? Yeah, but it is not stopping you.
So, if your 8-bit slaves are, say, 8-bit DACs, you could indeed write the values out as chunks of the master data register. Independent select lines enable you to do this without all those slaves driving the bus at once. Yeah, you have to shift the values from each slave into the master register one at a time, but that too is a completely reasonable design.
Unlike some more complex serial protocols, SPI is actually really flexible, because it doesn't lock you into a maximum word size or require the data you write to the bus to be made up of things like addresses and offsets.
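To make the clock-rate point concrete, here is a sketch of a protocol driver giving its slave its own rate; the controller programs SCK with whichever rate belongs to the device whose chip select is asserted for that transfer (the mychip names and numbers are illustrative):

/* Sketch: each spi_device carries its own clock rate; other slaves on
 * the same bus keep theirs. */
static int mychip_probe(struct spi_device *spi)
{
        u8 tx[2] = { 0x80, 0x01 }, rx[2];
        struct spi_transfer t = {
                .tx_buf   = tx,
                .rx_buf   = rx,
                .len      = sizeof(tx),
                .speed_hz = 500000,       /* optional per-transfer override */
        };
        struct spi_message m;
        int ret;

        spi->max_speed_hz  = 1000000;     /* 1 MHz for this slave only */
        spi->mode          = SPI_MODE_0;
        spi->bits_per_word = 8;

        ret = spi_setup(spi);             /* validate against the master */
        if (ret)
                return ret;

        spi_message_init(&m);
        spi_message_add_tail(&t, &m);
        return spi_sync(spi, &m);
}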

Application counters in Linux? (and OSX?)

I'm trying to figure out if there is a library that gives me something near the equivalent of Windows custom performance counters (described here http://geekswithblogs.net/.NETonMyMind/archive/2006/08/20/88549.aspx)
Basically, I'm looking for something that can be used to both track global counters within an application, and (ideally) something that presents that information via a well-defined interface to other applications/users. These are application statistics; stuff like memory and disk can be captured in other ways, but I'm looking to expose throughput/transactions/"widgets" handled during the lifetime of my application.
I've seen this question:
Concept of "Performance Counters" in Linux/Unix
and this one
Registry level counters in Linux accessible from Java
but neither is quite what I'm looking for. I don't want to write a static file (this is dynamic information after all; I should be able to get at it even if the disk is full etc.), and would rather avoid a homegrown set of code if at all possible. Ideally, at least on Linux, this data would (I think) be surfaced through /proc in some manner, though it's not clear to me if that can be done from userland (this is less important, as long as it is surfaced in some way to clients.)
But back to the crux of the question: is there any built-in or suitable 3rd-party library that gives me custom global (thread-safe, performant) counters suitable for application metrics that I can use on Linux and other *NIXy operating systems? (And can be interfaced from C/C++?)
In addition to @user964970's comment/solution, I suggest making it OS agnostic.
Use an OS-agnostic API, like ACE or Boost, to create your own library, supplying a counter protected by a named semaphore and placed inside a named shared-memory segment.
This should be your library's API:
long *createCounter(const char *name); // Create a counter: creates a named
                                       // semaphore and a named shared-memory
                                       // segment holding the counter value,
                                       // and returns a pointer to the counter
long *getCounter(const char *name);    // Get a pointer to an existing counter
                                       // in the calling process' address space
long  incCounter(const char *name);    // Increment an existing counter
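For illustration, a minimal POSIX sketch of the same idea using shm_open() and a named semaphore (names and error handling are simplified; link with -lrt -lpthread):

#include <fcntl.h>
#include <semaphore.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create (or open) a counter backed by named shared memory. */
static long *createCounter(const char *name)
{
        int fd = shm_open(name, O_CREAT | O_RDWR, 0666);
        ftruncate(fd, sizeof(long));
        return mmap(NULL, sizeof(long), PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
}

/* Increment a counter under a named semaphore shared across processes. */
static long incCounter(const char *sem_name, long *counter)
{
        sem_t *sem = sem_open(sem_name, O_CREAT, 0666, 1);
        long val;

        sem_wait(sem);
        val = ++(*counter);
        sem_post(sem);
        sem_close(sem);
        return val;
}

int main(void)
{
        long *c = createCounter("/myapp.widgets");
        printf("widgets handled: %ld\n", incCounter("/myapp.widgets.sem", c));
        return 0;
}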
