What are DMA mapping and the DMA engine in the context of the Linux kernel?
When can the DMA mapping API and the DMA engine API be used in a Linux device driver?
A real Linux device driver example as a reference would be great.
What are DMA mapping and the DMA engine in the context of the Linux kernel?
The kernel normally uses virtual addresses. Any address returned by kmalloc(), vmalloc(), or a similar interface is a virtual address and can be stored in a void *. The virtual memory system translates these addresses to CPU physical addresses. Device resources such as registers are managed as physical addresses, but a physical address is not directly useful to a driver: the driver must use ioremap() to map the space and obtain a virtual address.
               CPU                  CPU                  Bus
             Virtual              Physical             Address
             Address              Address               Space
              Space                Space

            +-------+             +------+             +------+
            |       |             |MMIO  |   Offset    |      |
            |       |  Virtual    |Space |   applied   |      |
          C +-------+ --------> B +------+ ----------> +------+ A
            |       |  mapping    |      |   by host   |      |
  +-----+   |       |             |      |   bridge    |      |   +--------+
  |     |   |       |             +------+             |      |   |        |
  | CPU |   |       |             | RAM  |             |      |   | Device |
  |     |   |       |             |      |             |      |   |        |
  +-----+   +-------+             +------+             +------+   +--------+
            |       |  Virtual    |Buffer|   Mapping   |      |
          X +-------+ --------> Y +------+ <---------- +------+ Z
            |       |  mapping    | RAM  |   by IOMMU
            |       |             |      |
            |       |             |      |
            +-------+             +------+
If a device supports DMA, the driver sets up a buffer using kmalloc() or a similar interface, which returns a virtual address (X). The virtual memory system maps X to a physical address (Y) in system RAM. The driver can use virtual address X to access the buffer, but the device itself cannot, because DMA doesn't go through the CPU virtual memory system. On some systems the device can do DMA directly to the physical address Y. On others, IOMMU hardware translates DMA addresses to physical addresses; in the figure above it translates Z to Y.
When can the DMA mapping API be used in a Linux device driver?
The reason to use the DMA mapping API is that the driver can pass virtual address X to an interface like dma_map_single(), which sets up any required IOMMU mapping and returns the DMA address Z. The driver then tells the device to do DMA to Z, and the IOMMU maps it to the buffer at address Y in system RAM.
Reference is taken from this link.
Any real Linux Device Driver example as a reference would be great.
A simple PCI DMA example
Inside the Linux kernel source you can look in drivers/dma for various real drivers.
dmaengine is a generic kernel framework for developing DMA controller drivers.
You can read: dmaengine provider.
You can find numerous examples of dmaengine drivers under drivers/dma.
I have an application running baremetal which controls a peripheral via CAN. In its original form, my application hands messages to the CAN "driver", which is actually a buffering layer. Incoming messages are pulled out of the hardware buffer and either:
pushed onto the queue by an interrupt handler, using a further CAN HAL.
sent to an emergency secondary handler, which runs in the context of the ISR.
Now, I am required to replace the buffering layer and HAL with a SocketCAN-based driver for the peripheral.
+---------------------+                                  |   +---------------------+
|                     |                                  |   |                     |
|  Peripheral driver  |               Emergency          |   |  Peripheral driver  |
|                     |                   ^              |   |                     |
+----------------^----+                   |              |   +----------------^----+
      |Queue     |                        |              |         |Queue     |
      |          |Dequeue                 |              |         |          |Dequeue
+-----v---------------+        +---------------------+   |   +-----v---------------+
|                     |        |                     |   |   |                     |
| CAN driver (queues) <--------|         ISR         |   |   |      SocketCAN      |
|                     |        |                     |   |   |                     |
+---------------------+        +----------^----------+   |   +---------------------+
      | Tx                                |              |
      |                                   |              |
+-----v---------------+                   |              |
|                     |                   |              |
|       CAN HAL       |                Rx |              |
|                     |-------------------               |
+---------------------+                                  |
In the original setup, handling CAN reception in the ISR means that the emergency messages are dealt with as soon as possible. My understanding of SocketCAN is that it (or the World of Sockets, which I am not familiar with) handles the queuing of incoming frames, which means that emergency messages will have to wait until the peripheral driver has pulled out of the queue everything that arrived before the emergency.
Surely there is a way to handle certain messages first. How would I do that?
I'm not very familiar with SocketCAN, but what you are asking for is usually done through hardware filtering. However, SocketCAN doesn't support hardware CAN filters and, after reading some docs, I don't think you can get a filter identifier for a specific CAN message.
But you could theoretically open a second SocketCAN socket on the same device with a separate set of filters and handle the two sockets differently.
The answer I think I was looking for is that you must set up a signal and signal handler, or a handler for SIGIO, for the socket.
The topology would then again look like the original (left-hand side of the diagram).
I am running a test (UVM) with a lot of components. It is a top-level (TL) test, but I am debugging an internal module and I am only interested in the signals of the interfaces connected to that module. Since it is a TL test, it takes a long time to reach the point in time I am interested in. Those signals are produced by other modules, but I am not interested in those right now.
At the moment I am using Questa sim, so I was wondering if there is a way of storing the events from those signals so that I can replay only those. My intention is to change the module, recompile, and apply the stored inputs directly to the new version without having to run the whole test and wait that long.
Inside a big chip company I used to work at, they called it "Save and Restore". I'm not sure what your EDA vendor calls it. You should be able to take a "Value Change Dump" or "VCD" file of the signal snapshot at the end of the bootup simulation and convert that to a bunch of 0-time puts on the wires. You may have to force the wires for a few clocks and then release the forces.
With regard to your comment about interacting with the UVM testing infrastructure, I'm not exactly sure how multiple puts or forces on one node behave. I would guess that the last one wins. However, forces are very node specific. The reset force will win and be latched into the design if it is downstream. If your design looks like this, then the force <path> 0 from the reset code will win, because it is downstream:
            +--------------------------------------------+
            | TopDesign.sv   +------------------------+  |
            |                | SubBlock.sv            |  |
            |                |                        |  |
         1  |       1        | 0  +--------------+ 0  |  |
       ----->---------------->----> register Foo >--  |  |
         ^  |                |  ^ |              |    |  |
 UVM Driver |                |  | +--------------+    |  |
            |                +--|---------------------+  |
            +-------------------|------------------------+
                                |0
                           Reset force
If your UVM infrastructure forces on an interface and then your reset initialization force is on a downstream node, which will synthesize to the same wire, then the downstream node force will win, because this will actually be flopped into the logic.
You still have to take care of initializing the UVM checkers or scoreboards into a post-reset state.
It is said that Java allows multiple threads to run in parallel. It is also said that object creation is cheap, so I should always prefer creating new objects to reusing them. But, to my knowledge, objects are created in a global scope (so that they become subject to GC). So I wonder: is parallelism stopped whenever one of the threads creates an object?
AFAIK, unmanaged languages create objects on the thread stack, so threads keep running independently. The objects are all removed once you exit the subprogram scope. That is, you need not add the objects to a global list and stop the machine to GC them later. You could do the same with Int/String-like immutable objects in Java, because they cannot refer to other objects and so cannot create circular dependencies that need GC to clean up. But, AFAIK, nothing is cleaned up at procedure exit in Java.
Allocation of small objects is quite cheap most of the time because of the TLAB (Thread Local Allocation Buffer). Every thread has a special area in Eden, called a TLAB, that is reserved for its thread-local allocations.
So synchronization is needed only to allocate a new TLAB when the previous one is full. That synchronization is a CAS operation, so it is quite fast.
              Eden                Survivor 1/2
-------------------------------------------------
| T |     | T |                ||       |       |
| L |     | L |                ||       |       |
| A |     | A |                ||       |       |
| B |     | B |                ||       |       |
|   |     |   |                ||       |       |
| 1 |     | 2 |                ||       |       |
-------------------------------------------------
  ^         ^
  |         |
reserved for|thread-1 allocations
            |
            |
reserved for thread-2 allocations
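To make the fast path and the slow path concrete, here is a toy model of the idea in C. It is not HotSpot code: the sizes, the eden array, and the names tlab_refill() and tlab_alloc() are invented for illustration, and alignment and garbage collection are ignored.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define EDEN_SIZE 4096   /* toy sizes, chosen arbitrarily */
#define TLAB_SIZE 1024

static char eden[EDEN_SIZE];
static atomic_size_t eden_top;   /* shared: next free offset in eden */

/* One per thread and never shared, so it needs no synchronization. */
struct tlab {
    size_t pos;   /* next free offset inside this TLAB */
    size_t end;   /* one past the last offset of this TLAB */
};

/* Slow path: carve a fresh TLAB out of eden with a CAS loop.
 * Returns 1 on success, 0 if eden is exhausted (a real JVM would
 * trigger a garbage collection at this point). */
int tlab_refill(struct tlab *t)
{
    size_t old = atomic_load(&eden_top);

    do {
        if (old + TLAB_SIZE > EDEN_SIZE)
            return 0;
        /* On failure the CAS reloads `old` with the current value. */
    } while (!atomic_compare_exchange_weak(&eden_top, &old,
                                           old + TLAB_SIZE));
    t->pos = old;
    t->end = old + TLAB_SIZE;
    return 1;
}

/* Fast path: bump a thread-private pointer, no atomics involved.
 * Any space left in the old TLAB is simply abandoned on refill. */
void *tlab_alloc(struct tlab *t, size_t size)
{
    assert(t != NULL);
    if (size > TLAB_SIZE)
        return NULL;
    if (t->end - t->pos < size && !tlab_refill(t))
        return NULL;

    void *p = &eden[t->pos];
    t->pos += size;
    return p;
}
```

Two threads that each own a struct tlab only touch the shared eden_top when their buffer runs out, which is the reason ordinary allocations do not stop other threads.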
Moreover, some optimizations can help you avoid allocations in compiled code: Escape Analysis and Scalar Replacement. In some scenarios the compiler can eliminate an allocation by placing all the fields of the object on the stack.
Doubt:
If we execute a program, the following is the type of memory allocated to that program.
 __________________
|                  |
|      stack       |
|                  |
 ------------------
|                  |
|  <Un Allocated   |
|      space>      |
 ------------------
|                  |
|                  |
|       Heap       |
|                  |
|                  |
 __________________
|                  |
|       data       |
 __________________
|       text       |
 __________________
Here the data segment plays a vital role. All initialized and uninitialized data is present in the data segment, but I do not know the order in which the data is stored there. I think four kinds of data are present in the data segment: initialized data, uninitialized data, read-only data, and read-write data.
So, in which order will the data be placed in the data segment? For example, is initialized data placed first, at the lowest addresses, followed by uninitialized data at higher addresses?
Thanks in Advance.
The order of global variables in the data segment cannot be determined in advance - it is up to your linker and compiler. Normally linkers preserve the order in which variables appear in the linked object files, but this is not a hard requirement (for example, the linker could put double variables first and char last to conserve the required alignment bytes).
Uninitialized global data are generally present in the .bss segment, which is placed after the .data segment (in your picture, "above" it, since higher portions of your picture = larger addresses). The .bss segment is all zeros and only its size is stored in the executable. This way, we don't need to store long strings of zeros in the binary image.
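A small C sketch to check this on your own toolchain: it prints the addresses of one global of each kind. The variable and function names are made up for the example, and the section named in each comment is only what a typical ELF toolchain does, not something the C standard promises.

```c
#include <assert.h>
#include <stdio.h>

int initialized_global = 42;      /* value stored in the file: typically .data */
int uninitialized_global;         /* only its size is recorded: typically .bss */
const int read_only_global = 7;   /* typically .rodata */

/* Print where each global ended up; the addresses differ per system. */
void print_layout(void)
{
    /* Zero initialization of static storage is guaranteed by the C standard. */
    assert(uninitialized_global == 0);

    printf("read-only     %p\n", (void *)&read_only_global);
    printf("initialized   %p\n", (void *)&initialized_global);
    printf("uninitialized %p\n", (void *)&uninitialized_global);
}
```

Calling print_layout() from main() shows the addresses for your build. On the linked binary, nm should list these three symbols with the types D, B, and R respectively, which is another way to see which section each one landed in.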
I am relatively new to Perl and even newer to threading in Perl. I have a Perl script that takes input from 3 different sources (2 LDAP queries and a file that isn't always there). Because some parts can take longer than others, I decided to use threads and queues. During development, testing individual components of the script worked very well, but after putting it all together the performance seems to degrade.
Basic structure is this
2 threads:(Read file or Read AD entries) -> Queue1 -> 2 threads:(scrub data) -> Queue2 -> 3-4 threads(compare against existing local LDAP entries). Several threads report statistics back to the main script and once all threads are done an email is sent with all the stats and status of that run.
I am using dequeue_nb and I thought that would help but no luck.
The performance hit seems to be in the queues. While looking for tips to improve performance I've run into several articles saying Perl threads are no good and to use Coro, POE, AnyEvent, IO::Async, etc.
This doesn't seem like an "event" problem, so I didn't think AnyEvent or POE would be the way to go, but from what I'm seeing Coro only seems to use one CPU at a time, so I'm not sure this would work either. I thought about using a combination of them, but then my head started to hurt. Does anyone have any suggestions on how to fix or troubleshoot my script, or on how to implement one of the other modules?
A problem with parallelism is synchronization. It is a performance killer, it is bad, it is to be avoided if possible.
OP's architecture
Let's look at your architecture:
+--------------+--------------+
|   Input 1    |   Input 2    |
+--------------+--------------+
|           QUEUE A           |
+--------------+--------------+
|   Scrub 1    |   Scrub 2    |
+--------------+--------------+
|           QUEUE B           |
+---------+---------+---------+
| Compare | Compare | Compare |
+---------+---------+---------+
Discussion
Queue A has to be synchronized across four threads; Queue B across 5-6. Only one thread can access the Queue at any time, so most of the time your threads will be waiting, not working!
Parallel Pipeline Architecture
A somewhat different architecture could look like this:
+-----------+ +-----------+
|  Input 1  | |  Input 2  |
+-----------+ +-----------+
| QUEUE 1A  | | QUEUE 2A  |
+-----------+ +-----------+
|  Scrub 1  | |  Scrub 2  |
+-----------+ +-----------+
| QUEUE 1B  | | QUEUE 2B  |
+-----+-----+ +-----+-----+
| Cmp | Cmp | | Cmp | Cmp |
+-----+-----+ +-----+-----+
Discussion
Here, each A Queue is shared by only two threads (less waiting), and each B Queue by only three. This architecture should perform faster for similar input size/complexity. If Input 2 were considerably shorter, the whole of Pipeline 2 would have finished before Pipeline 1 is even half done. It is, however, far better than using a single process for each Pipeline.
The Lawn Sprinkler Architecture
Concept
An even better architecture would try to distribute the output of a process across multiple Queues. (The reverse, having threads fetch their input from multiple queues, is bad when a queue is empty.)
Each Queue write should go to a different queue:
   +-----------+ +-----------+
   |  Input 1  | |  Input 2  |
   +-----------+ +-----------+
         |     \ /     |
   +-----------+ +-----------+
   | QUEUE 1A  | | QUEUE 2A  |
   +-----------+ +-----------+
   |  Scrub 1  | |  Scrub 2  |
   +-----------+ +-----------+
     /   |   \ \ / /   |   \
+-------+-------+-------+-------+
| Q. 1B | Q. 2B | Q. 3B | Q. 4B |
+-------+-------+-------+-------+
|  Cmp  |  Cmp  |  Cmp  |  Cmp  |
+-------+-------+-------+-------+
This makes sure each thread has the same workload, but it cannot make sure that all threads finish at the same time.
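One simple way to make each write go to a different queue is round-robin selection with a cursor owned by each producer. The sketch below is in C rather than Perl and only shows the selection step; struct sprinkler and sprinkler_pick() are hypothetical names, and round-robin itself is an assumption about how the writes would be spread.

```c
#include <assert.h>
#include <stddef.h>

/* One cursor per producer thread. It is never shared between threads,
 * so picking the next queue needs no lock. */
struct sprinkler {
    size_t nqueues;   /* number of output queues */
    size_t next;      /* index of the queue the next write goes to */
};

/* Return the index of the queue to write to, then advance the cursor. */
size_t sprinkler_pick(struct sprinkler *s)
{
    assert(s->nqueues > 0);

    size_t chosen = s->next;
    s->next = (s->next + 1) % s->nqueues;
    return chosen;
}
```

Starting the two Scrub threads at different offsets, for example 0 and 2 with four queues, keeps them from hitting the same queue at the same moment as long as they run at a similar pace.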
Discussion
All Queues are shared among 3 Threads. The problem is that two Threads will block each other when writing to a queue. If the time between Queue write accesses is significantly larger than the write duration, this should be no problem; otherwise the second architecture can be mixed in.
So whether this architecture makes sense depends on your exact requirements.
It is slower for evenly sized inputs, but performs better on irregular input.
Appendices
On implementing:
Which framework is used is secondary to the architecture. If you only pass around text strings, I strongly advise using pipes. If you have to pass Perl data types or objects, you probably have to accept the additional overhead of using a real Queue: when adding an unshared variable to a queue, a deep copy has to be made (see Leon Timmermans' answer) in addition to all the synchronization overhead.
On scalability:
Architecture 1 and 3 are not fixed in the number of threads. I strongly suggest using this flexibility to benchmark different compositions. A rule of thumb is that you should use n to 2n threads where n is the number of processors (or hardware threads). This can be seen as a maximal sensible number for the threads of one stage. Above that, you only get a memory penalty and no speedup. A performance saturation point may be reached earlier, when a stage can process the input faster than it is supplied.
What kind of data are you putting in the queues? AFAIK simple data is cheaper than complex structures, since it needs to be cloned and copied at least twice. I've been planning to write a faster queue implementation (most of the work is already done actually), but haven't published that yet.