which is better to use in v4l2 framework, user pointer or mmap - linux

After going through these links,
https://linuxtv.org/downloads/v4l-dvb-apis/uapi/v4l/userp.html
https://linuxtv.org/downloads/v4l-dvb-apis/uapi/v4l/mmap.html
I understood that there are two ways to allocate buffers in the V4L2 framework:
User pointer buffers: the buffers are allocated in user space.
Memory-mapped buffers: the buffers are allocated in kernel space.
I am a bit confused about which one to use while doing V4L2 driver development. Which is the better approach in terms of performance and buffer handling?
I will be using DMA-SG (scatter-gather DMA) for data transfer in my hardware.

It depends on your requirements.
Case: Visualization of the video stream.
In this case, you might want to write the video data directly to memory that is accessible to the video driver, saving a copy operation. You will also get the shortest camera-to-display latency. Here, a user pointer would be the go-to choice.
Case: Recording of the video stream.
In this case, you do not care as much about timely delivery, but you do care about not missing frames. Here, you can use memory-mapped acquisition with multiple buffers.
Case: Single image acquisition for processing.
In this case, timely delivery and missed frames are both less important, so you could use either method; buffered (memory-mapped) operation will give the fastest acquisition time, since there is always a buffer with recent image data available.
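For reference, the choice between the two is made when requesting buffers with VIDIOC_REQBUFS; a minimal sketch (the buffer count is an illustrative assumption):

    /* Minimal sketch: the userptr-vs-mmap choice is made at VIDIOC_REQBUFS time.
     * The buffer count is an illustrative assumption. */
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/videodev2.h>

    int request_buffers(int fd, int use_userptr)
    {
        struct v4l2_requestbuffers req;

        memset(&req, 0, sizeof(req));
        req.count  = 4;
        req.type   = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        /* V4L2_MEMORY_USERPTR: the application allocates the buffers itself and
         * passes their addresses with each VIDIOC_QBUF.
         * V4L2_MEMORY_MMAP: the driver allocates the buffers; the application
         * mmap()s them afterwards. */
        req.memory = use_userptr ? V4L2_MEMORY_USERPTR : V4L2_MEMORY_MMAP;

        return ioctl(fd, VIDIOC_REQBUFS, &req);
    }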

Related

Is it practical to use the "rude big hammer" approach to parallelize a MacOS/CoreAudio real-time audio callback?

First, some relevant background info: I've got a CoreAudio-based low-latency audio processing application that does various mixing and special effects on audio that is coming from an input device on a purpose-dedicated Mac (running the latest version of MacOS) and delivers the results back to one of the Mac's local audio devices.
In order to obtain the best/most reliable low-latency performance, this app is designed to hook in to CoreAudio's low-level audio-rendering callback (via AudioDeviceCreateIOProcID(), AudioDeviceStart(), etc) and every time the callback-function is called (from the CoreAudio's realtime context), it reads the incoming audio frames (e.g. 128 frames, 64 samples per frame), does the necessary math, and writes out the outgoing samples.
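Roughly, that hookup looks like the following sketch (error handling and device discovery omitted; deviceID and userData are assumed to already exist):

    // Sketch of the CoreAudio hookup described above (error handling and
    // device discovery omitted; 'deviceID' and 'userData' are assumed to exist).
    #include <CoreAudio/CoreAudio.h>

    static OSStatus MyIOProc(AudioObjectID inDevice,
                             const AudioTimeStamp *inNow,
                             const AudioBufferList *inInputData,
                             const AudioTimeStamp *inInputTime,
                             AudioBufferList *outOutputData,
                             const AudioTimeStamp *inOutputTime,
                             void *inClientData)
    {
        // Runs on CoreAudio's real-time thread: read the incoming frames,
        // do the math, write the outgoing samples. No locks, no allocation.
        return noErr;
    }

    void start_audio(AudioObjectID deviceID, void *userData)
    {
        AudioDeviceIOProcID procID = NULL;
        AudioDeviceCreateIOProcID(deviceID, MyIOProc, userData, &procID);
        AudioDeviceStart(deviceID, procID);
    }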
This all works quite well, but from everything I've read, Apple's CoreAudio implementation has an unwritten de-facto requirement that all real-time audio operations happen in a single thread. There are good reasons for this which I acknowledge (mainly that outside of SIMD/SSE/AVX instructions, which I already use, almost all of the mechanisms you might employ to co-ordinate parallelized behavior are not real-time-safe and therefore trying to use them would result in intermittently glitchy audio).
However, my co-workers and I are greedy, and nevertheless we'd like to do many more math-operations per sample-buffer than even the fastest single core could reliably execute in the brief time-window that is necessary to avoid audio-underruns and glitching.
My co-worker (who is fairly experienced at real-time audio processing on embedded/purpose-built Linux hardware) tells me that under Linux it is possible for a program to requisition exclusive access for one or more CPU cores, such that the OS will never try to use them for anything else. Once he has done this, he can run "bare metal" style code on that CPU that simply busy-waits/polls on an atomic variable until the "real" audio thread updates it to let the dedicated core know it's time to do its thing; at that point the dedicated core will run its math routines on the input samples and generate its output in a (hopefully) finite amount of time, at which point the "real" audio thread can gather the results (more busy-waiting/polling here) and incorporate them back into the outgoing audio buffer.
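The Linux recipe he describes amounts to something like this sketch (the core number and the work function are placeholders; the core itself would be reserved at boot, e.g. via the isolcpus= kernel parameter):

    // Rough sketch of the Linux approach described above: pin a worker thread to
    // an isolated core and busy-wait on an atomic flag set by the audio thread.
    // The core number and do_math() are illustrative assumptions.
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>

    static atomic_int work_ready = 0;   // audio thread sets this to 1
    static atomic_int work_done  = 0;   // worker sets this to 1 when finished

    static void *worker(void *arg)
    {
        cpu_set_t set;

        (void)arg;
        CPU_ZERO(&set);
        CPU_SET(3, &set);                               // core reserved via isolcpus=3
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        for (;;) {
            while (!atomic_load_explicit(&work_ready, memory_order_acquire))
                ;                                       // busy-wait, "bare metal" style
            atomic_store(&work_ready, 0);
            // do_math();                               // heavy per-buffer processing
            atomic_store_explicit(&work_done, 1, memory_order_release);
        }
        return NULL;
    }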
My question is, is this approach worth attempting under MacOS/X? (i.e. can a MacOS/X program, even one with root access, convince MacOS to give it exclusive access to some cores, and if so, will big ugly busy-waiting/polling loops on those cores (including the polling-loops necessary to synchronize the CoreAudio callback-thread relative to their input/output requirements) yield results that are reliably real-time enough that you might someday want to use them in front of a paying audience?)
It seems like something that might be possible in principle, but before I spend too much time banging my head against whatever walls might exist there, I'd like some input about whether this is an avenue worth pursuing on this platform.
can a MacOS/X program, even one with root access, convince MacOS to give it exclusive access to some cores
I don't know about that, but you can use as many cores / real-time threads as you want for your calculations, using whatever synchronisation methods you need to make it work, then pass the audio to your IOProc using a lock free ring buffer, like TPCircularBuffer.
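For illustration, the hand-off could look something like this sketch against TPCircularBuffer's C API (the buffer size and function names are assumptions):

    // Sketch of handing results to the IOProc through a lock-free ring buffer
    // (TPCircularBuffer's C API; sizes and names are illustrative assumptions).
    #include "TPCircularBuffer.h"

    static TPCircularBuffer ring;

    void setup(void) { TPCircularBufferInit(&ring, 512 * 1024); }

    // Worker thread(s): push processed samples in.
    void produce(const float *samples, int count)
    {
        TPCircularBufferProduceBytes(&ring, samples, count * (int32_t)sizeof(float));
    }

    // Inside the IOProc: pull whatever is available, no locks taken.
    void consume(float *out, int count)
    {
        int32_t avail = 0;
        float *src = TPCircularBufferTail(&ring, &avail);
        int have = avail / (int)sizeof(float);
        int n = have < count ? have : count;

        for (int i = 0; i < n; i++)
            out[i] = src[i];
        for (int i = n; i < count; i++)
            out[i] = 0.0f;                  // underrun: pad with silence
        TPCircularBufferConsume(&ring, n * (int32_t)sizeof(float));
    }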
But your question reminded me of a new macOS 11/iOS 14 API I've been meaning to try, the Audio Workgroups API (2020 WWDC Video).
My understanding is that this API lets you "bless" your non-IOProc real-time threads with audio real-time thread properties or at least cooperate better with the audio thread.
The documents distinguish between threads working in parallel (this sounds like your case) and working asynchronously (this sounds like my proposal); I don't know which case fits you better.
I still don't know what happens in practice when you use Audio Workgroups, whether they opt you in to good stuff or opt you out of bad stuff, but if they're not the hammer you're seeking, they may have some useful hammer-like properties.
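For what it's worth, my rough understanding of the workgroup dance is something like the sketch below (untested; the property and function names are as I recall them from the WWDC material and headers, so verify them before relying on this):

    // Untested sketch of joining the device's audio workgroup from a worker
    // thread (macOS 11+). Names are assumptions to verify against
    // <os/workgroup.h> and <CoreAudio/AudioHardware.h>.
    #include <os/workgroup.h>
    #include <CoreAudio/CoreAudio.h>

    void join_device_workgroup(AudioObjectID deviceID)
    {
        os_workgroup_t wg = NULL;
        UInt32 size = sizeof(wg);
        AudioObjectPropertyAddress addr = {
            kAudioDevicePropertyIOThreadOSWorkgroup,
            kAudioObjectPropertyScopeGlobal,
            kAudioObjectPropertyElementMain
        };

        if (AudioObjectGetPropertyData(deviceID, &addr, 0, NULL, &size, &wg) != noErr)
            return;

        os_workgroup_join_token_s token;
        if (os_workgroup_join(wg, &token) == 0) {
            // ... do the per-buffer work on this thread ...
            os_workgroup_leave(wg, &token);
        }
    }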

Circular buffer filling up faster than AVAudioSourceNode render block can read data from it

I am experimenting with AVAudioSourceNode, having connected it to the mixer node for output to the speaker. I am a bit of a newbie to iOS and audio programming so I apologize if this question is ignorant or unclear, but I will do my best to explain.
In the AVAudioSourceNode render block, I am attempting to retrieve received stream data that has been stored in a circular buffer (I currently use a basic FIFO buffer implementation but am considering moving to TPCircularBuffer). I check whether the buffer has enough bytes for me to fill the audio buffer with, and if so I grab those bytes for output; if not, I either wait, or take what I can and fill the missing bytes with zeros.
In debugging, it appears I am running into a situation where the circular buffer fills up a lot faster than the render block makes calls to retrieve data from it. Understandably, after running OK for a few moments, once the circular buffer is full (I'm not even certain how large I should realistically make it, but I guess that's another question), the output becomes garbage.
It is as if the acts of filling the circular buffer with streaming data (and probably other tasks as well) are taking priority over the calls made within the render block. I thought that audio operations involving the audio nodes would automatically be prioritized but it may be that I haven't done what is needed to make this happen.
I have read these threads:
iOS - Streaming and receiving audio from a device to another ends in one only sending and the other only receiving
Synchronising with Core Audio Thread
which appear to raise similar issues in substance, but a little more current guidance and explanation for my level of understanding and situation would be helpful and very much appreciated!
For playing, the audio system will only ask for data at the specified sample rate. If you fill a circular buffer at faster than that sample rate for an extended period of time, then it will overflow.
So you have to make sure your sample generator or incoming data stream complies with the sample rate for which the audio system is configured, no more and no less (other than strictly bounded bursting or latency jitter). The circular buffer needs to be sized large enough to cover the maximum burst size plus maximum latency jitter plus any pre-fill plus a safety margin.
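As a back-of-the-envelope sizing example (all numbers are made up):

    // Back-of-the-envelope circular buffer sizing; all numbers are illustrative.
    #include <stdio.h>

    int main(void)
    {
        double sample_rate    = 48000.0;           // frames per second
        int    bytes_per_frm  = 2 * sizeof(float); // stereo float samples
        double burst_sec      = 0.020;             // max producer burst
        double jitter_sec     = 0.010;             // max latency jitter
        double prefill_sec    = 0.010;             // pre-fill before starting playback
        double safety         = 2.0;               // safety margin factor

        double seconds = (burst_sec + jitter_sec + prefill_sec) * safety;
        size_t bytes   = (size_t)(sample_rate * seconds) * bytes_per_frm;

        printf("circular buffer size: %zu bytes\n", bytes); // ~30 KB here
        return 0;
    }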
Another possible bug is trying to do too much inside the render block callback. This is why Apple recommends not using any code that requires memory management, locks, or semaphores inside real-time audio callbacks.

find a video file in memory dump of a process

I have a player that plays encrypted video files and works like this:
I open an encrypted video file with it
it decrypts the video file and writes it to its memory
and plays the file from the memory after that
and I want to copy the decrypted video file from memory and play it with a regular video player like VLC. So I created a memory dump with Task Manager and hoped to find the video file there. Sadly, I don't know enough to locate a video file in a large chunk of raw memory. I tried to find MP4 patterns in a hex editor and tried every solution I could find online, but nothing worked for me, so I'm hoping someone here has an idea and is willing to help me get this done.
I have uploaded its memory dump here (taken after opening a short encrypted video with it).
Most probably, the software doesn't decode the whole video file in one go, but rather in a streaming fashion. This makes it impossible to catch a moment when all of the decoded video data is available in the memory dump.
If the player software is open source, compile it with debug symbols and run it under debugger. Otherwise, resort to reverse engineering.
I don't think the question is on-topic for Stack Overflow in general, including but not limited to the specific matter of reversing a software solution intended for digital rights management. However, I will still leave an answer.
First of all, as the comments suggest, the topic in question is the reversal of a specific solution provided by a commercial vendor. The ability to recover a media file from a memory dump depends highly on the implementation of this solution and on the methods the vendor used to complicate reversal. Only the simplest and most straightforward solutions are easy to reverse; the more effort the developer puts into covering their traces, the harder - exponentially so - the reversal becomes.
Recovering the original file in full from memory (through memory dump analysis) is unlikely to be possible for any media playback application, even one that does not do any decryption. Media playback is typically streaming: the data is loaded from disk, storage, network etc. as needed for playback, not as a full download. Decryption is applied to the pieces of data needed at that moment, and a decent DRM-enabled application would immediately erase the ephemeral clear data once it is no longer needed. That is, a memory dump would, at best, contain a ridiculously small amount of media data.
To capture or restore the original media file, one would typically have to place themselves as a middleman in the media streaming pipeline and copy the data as it is being streamed during playback. A static memory dump is of little help here.

Linux PCIe DMA Driver (Xilinx XDMA)

I am currently working with the Xilinx XDMA driver (see here for source code: XDMA Source), and am attempting to get it to run (before you ask: I have contacted my technical support point of contact and the Xilinx forum is riddled with people having the same issue). However, I may have found a snag in Xilinx's code that might be a deal breaker for me. I am hoping there is something that I'm not considering.
First off, there are two primary modes of the driver, AXI-Memory Mapped (AXI-MM) and AXI-Streaming (AXI-ST). For my particular application, I require AXI-ST, since data will continuously be flowing from the device.
The driver is written to take advantage of scatter-gather lists. In AXI-MM mode, this works because reads are rather random events (i.e., there isn't a flow of data out of the device; instead, the userspace application simply requests data when it needs to). As such, the DMA transfer is built up, the data is transferred, and the transfer is then torn down. This is a combination of get_user_pages(), pci_map_sg(), and pci_unmap_sg().
For AXI-ST, things get weird, and the source code is far from orthodox. The driver allocates a circular buffer into which the data is meant to continuously flow. This buffer is generally sized to be somewhat large (mine is set on the order of 32MB), since you want to be able to handle transient events where the userspace application forgot about the driver and can then later work off the incoming data.
Here's where things get wonky... the circular buffer is allocated using vmalloc32() and the pages from that allocation are mapped in the same way as the userspace buffer is in AXI-MM mode (i.e., using the pci_map_sg() interface). As a result, because the circular buffer is shared between the device and CPU, every read() call requires me to call pci_dma_sync_sg_for_cpu() and pci_dma_sync_sg_for_device(), which absolutely destroys my performance (I can not keep up with the device!), since this works on the entire buffer. Funny enough, Xilinx never included these sync calls in their code, so I first knew I had a problem when I edited their test script to attempt more than one DMA transfer before exiting and the resulting data buffer was corrupted.
As a result, I'm wondering how I can fix this. I've considered rewriting the code to build up my own buffer allocated using pci_alloc_consistent()/dma_alloc_coherent(), but this is easier said than done. Namely, the code is architected to assume using scatter-gather lists everywhere (there appears to be a strange, proprietary mapping between the scatter-gather list and the memory descriptors that the FPGA understands).
Are there any other API calls I should be made aware of? Can I use the "single" variants (i.e., pci_dma_sync_single_for_cpu()) via some translation mechanism to avoid syncing the entire buffer? Alternatively, is there perhaps some function that can make the circular buffer allocated with vmalloc() coherent?
Alright, I figured it out.
Basically, my assumptions and/or understanding of the kernel documentation regarding the sync API were totally incorrect. Namely, I was wrong on two key assumptions:
If the buffer is never written to by the CPU, you don't need to sync for the device. Removing this call doubled my read() throughput.
You don't need to sync the entire scatterlist. Instead, now in my read() call, I figure out what pages are going to be affected by the copy_to_user() call (i.e., what is going to be copied out of the circular buffer) and only sync those pages that I care about. Basically, I can call something like pci_dma_sync_sg_for_cpu(lro->pci_dev, &transfer->sgm->sgl[sgl_index], pages_to_sync, DMA_FROM_DEVICE) where sgl_index is where I figured the copy will start and pages_to_sync is how large the data is in number of pages.
With the above two changes my code now meets my throughput requirements.
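A simplified sketch of that partial sync (it assumes every scatterlist entry maps exactly one page of the vmalloc'd ring buffer and that the sg table is flat, i.e. not chained, as in this driver):

    /* Simplified sketch of the partial sync described above. It assumes every
     * scatterlist entry covers exactly one PAGE_SIZE chunk of the vmalloc'd
     * ring buffer and that the sg table is flat (not chained). */
    #include <linux/mm.h>
    #include <linux/pci.h>
    #include <linux/scatterlist.h>

    static void sync_pages_for_cpu(struct pci_dev *pdev, struct scatterlist *sgl,
                                   size_t offset, size_t len)
    {
        unsigned int first_page    = offset >> PAGE_SHIFT;
        unsigned int last_page     = (offset + len - 1) >> PAGE_SHIFT;
        unsigned int pages_to_sync = last_page - first_page + 1;

        /* Only sync the entries backing the bytes about to be copy_to_user()'d. */
        pci_dma_sync_sg_for_cpu(pdev, &sgl[first_page], pages_to_sync,
                                DMA_FROM_DEVICE);
    }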
I think XDMA was originally written for x86, in which case the sync functions do nothing.
It does not seem likely that you can use the single sync variants unless you modify the circular buffer. Replacing the circular buffer with a list of buffers to send seems like a good idea to me. You pre-allocate a number of such buffers and have a list of buffers to send and a free list for your app to reuse.
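A rough sketch of that buffer-list idea using coherent allocations (names, counts, and the owning structures are placeholders):

    /* Rough sketch of the buffer-list idea: pre-allocate N coherent DMA buffers
     * and cycle them between a free list and a ready list, so no sg-wide syncs
     * are needed. Names and counts are placeholders. */
    #include <linux/dma-mapping.h>
    #include <linux/list.h>
    #include <linux/slab.h>

    struct xfer_buf {
        struct list_head node;
        void       *vaddr;   /* CPU view, coherent with the device */
        dma_addr_t  dma;     /* bus address handed to the DMA engine */
        size_t      len;
    };

    static int alloc_buf_pool(struct device *dev, struct list_head *free_list,
                              int count, size_t size)
    {
        int i;

        for (i = 0; i < count; i++) {
            struct xfer_buf *b = kzalloc(sizeof(*b), GFP_KERNEL);

            if (!b)
                return -ENOMEM;
            b->vaddr = dma_alloc_coherent(dev, size, &b->dma, GFP_KERNEL);
            if (!b->vaddr) {
                kfree(b);
                return -ENOMEM;
            }
            b->len = size;
            list_add_tail(&b->node, free_list); /* available for the next transfer */
        }
        return 0;
    }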
If you're using a Zynq FPGA, you could connect the DMA engine to the ACP port so that FPGA memory access will be coherent. Alternatively, you can map the memory regions as uncached/buffered instead of cached.
Finally, in my FPGA applications, I map the control registers and buffers into the application process and only implement mmap() and poll() in the driver, to give apps more flexibility in how they do DMA. I generally implement my own DMA engines.
Pete, I am the original developer of the driver code (before the X of XDMA came into place).
The ringbuffer was always an unorthodox thing and indeed meant for cache-coherent systems and disabled by default. Its initial purpose was to get rid of the DMA (re)start latency; even with full asynchronous I/O support (even with zero-latency descriptor chaining in some cases) we had use cases where this could not be guaranteed, and where a true hardware ringbuffer/cyclic/loop mode was required.
There is no equivalent to a ringbuffer API in Linux, so it's open-coded a bit.
I am happy to re-think the IP/driver design.
Can you share your fix?

How to query the available data size in Linux DVB kernel demux buffers?

I am using a Linux-DVB frontend/demux driver pair to get a program stream remuxed from a live broadcast TS into user land. I am using the poll/read combination; however, to keep the context-switching and kernel-to-user-space copy penalties to a minimum, I only want to read data if more than a certain amount is available.
I couldn't find any way to query the available data size in the demux buffers nor any option to specify the poll notify size.
Is anybody aware of any such functionality? If not, is it unreasonable to have such a feature request on the DVB api?
Regards,
