Does v4l2 support multi-map? - linux

I'm trying to share frames (images) that I receive from a USB camera (Logitech C270) between two processes so that I can avoid a memcpy. I'm using the memory-mapping streaming I/O method described here, and I can successfully get frames from the camera after using v4l2_mmap. However, I have another process (for image processing) which has to use the image buffers after the dequeue and then signal the first process to queue the buffer again.
Searching online, I found that opening a video device multiple times is allowed, but when I try to map the buffers in the second process (I tried both v4l2_mmap and plain mmap) after a successful v4l2_open, I get an EINVAL error.
I found this pdf which talks about implementing multi-map in v4l2 (not official) and was wondering whether this has been implemented. I have also tried the user pointer streaming I/O method, whose documentation explicitly states that shared memory can be implemented with it, but I get an EINVAL when I request buffers (according to the documentation on linuxtv.org, this means the camera doesn't support user pointer streaming I/O).
Note: I want to keep the code modular, hence two processes. If this is not possible, doing all the work in a single process (multiple threads and a global frame buffer) is still an option.
Using standard shared memory function calls is not possible, as the two processes have to map the video device file (/dev/video0) and I cannot have it under /dev/shm.

The main problem with multi-consumer mmap is that this needs to be implemented on the device driver side. That is: even if some devices might support multi-map, others might not.
So unless you can control the camera that is being used with your application, you will eventually come across one that does not, in which case your application would not work.
So in any case, your application should provide means to handle non multi-map devices.
Btw, you do not need multiple processes to keep your code modular.
Multiple processes have their merits (e.g. privilege separation, crash resilience, ...), but might also encourage code duplication.

This may not be relevant now.....
You don't need the full multi-consumer mechanism to do this. I have used Python to hand off the processing of the mmap buffers to multiple processes (Python multi-threading only allows one thread at a time to execute).
If you're running multi-threaded, then worker threads can pick up the buffer and process it independently when triggered by the master thread.
Since my code is very Python-specific (it relies on Python's multiprocessing support), I won't post it here, as it wouldn't make sense in other languages.
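The thread variant, at least, is easy to sketch. What follows is only an illustration of the hand-off pattern, not the code referred to above: the bytearrays stand in for the driver's mmap'ed buffers, and the comments mark where VIDIOC_DQBUF and VIDIOC_QBUF would go in a real capture loop. All names (buffers, worker, master) are made up for the example.

```python
import queue
import threading

NUM_BUFFERS = 4
FRAME_SIZE = 16  # tiny stand-in for a real frame size

# Hypothetical stand-ins for the driver's mmap'ed capture buffers.
buffers = [bytearray(FRAME_SIZE) for _ in range(NUM_BUFFERS)]


def worker(ready, release, results):
    """Process each buffer in place, then hand its index back."""
    while True:
        index = ready.get()
        if index is None:              # sentinel: no more frames
            break
        # memoryview gives access to the buffer without copying the pixels.
        with memoryview(buffers[index]) as view:
            results.append((index, sum(view)))
        release.put(index)             # tell the master it may re-queue


def master(num_frames):
    """Dequeue frames, trigger the worker, re-queue when it is done."""
    ready, release, results = queue.Queue(), queue.Queue(), []
    thread = threading.Thread(target=worker, args=(ready, release, results))
    thread.start()
    for frame in range(num_frames):
        index = frame % NUM_BUFFERS
        # Stand-in for VIDIOC_DQBUF: pretend the driver filled this buffer.
        buffers[index][:] = bytes([frame % 256]) * FRAME_SIZE
        ready.put(index)
        # Stand-in for VIDIOC_QBUF: only re-queue once the worker is done.
        release.get()
    ready.put(None)
    thread.join()
    return results
```

Calling master(3) processes three fake frames. In this sketch the master waits for each buffer before taking the next one; a real loop would keep several buffers in flight and re-queue them as their indices come back.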

Related

Can CUDA unified memory be written to by another CPU thread?

I am writing a program that retrieves images from a camera and processes them with CUDA. In order to gain the best performance, I'm passing a CUDA unified memory buffer to the image acquisition library, which writes to the buffer in another thread.
This causes all sorts of weird results where the program hangs in library code that I do not have access to. If I use a normal memory buffer and then copy to CUDA, the problem is fixed. So I became suspicious that writing from another thread might not be allowed, but google as I might, I could not find a definitive answer.
So is accessing the unified memory buffer from another CPU thread allowed or not?
There should be no problem writing to a unified memory buffer from multiple threads.
However, keep in mind the restrictions imposed when the concurrentManagedAccess device property is not true. In that case, when you have a managed buffer, and you launch a kernel, no CPU/host thread access of any kind is allowed, to that buffer, or any other managed buffer, until you perform a cudaDeviceSynchronize() after the kernel call.
In a multithreaded environment, this might take some explicit effort to enforce.
I think this is similar to this recital if that is also your posting. Note that TX2 should have this property set to false.
Note that this general rule in the non-concurrent case can be modified through careful use of streams. However the restrictions still apply to buffers attached to streams that have a kernel launched in them (or buffers not explicitly attached to any stream): when the property mentioned above is false, access by any CPU thread is not possible.
The motivation for this behavior is roughly as follows. The CUDA runtime does not know the relationship between managed buffers, regardless of where those buffers were created. A buffer created in one thread could easily have objects in it with embedded pointers, and there is nothing to prevent or restrict those pointers from pointing to data in another managed buffer. Even a buffer that was created later. Even a buffer that was created in another thread. The safe assumption is that any linkages could be possible, and therefore, without any other negotiation, the managed memory subsystem in the CUDA runtime must move all managed buffers to the GPU, when a kernel is launched. This makes all managed buffers, without exception, inaccessible to CPU threads (any thread, anywhere). In the normal program flow, access is restored at the next occurrence of a cudaDeviceSynchronize() call. Once the CPU thread that issues that call completes the call and moves on, then managed buffers are once again visible to (all) CPU threads. Another kernel launch (anywhere) repeats the process, and interrupts the accessibility. To repeat, this is the mechanism that is in effect when the concurrentManagedAccess property on the GPU is not true, and this behavior can be somewhat modified via the aforementioned stream attach mechanism.

Calling between multiple Matlab instances or threads

Basically, I need a way to call Matlab functions from an indefinitely-long separate thread.
First, I'm aware that I could use the TCPIP or UDP functionality to communicate between two instances of Matlab. I'll explain why that doesn't really help.
Background: I've written a Matlab class that acts as an interface for a USB device. Matlab was chosen because I need it to run on Mac/Linux/Windows, and the target users are only familiar with Matlab. Because of some inconsistencies in Matlab across platforms, I'm not using the BytesAvailableFcn or BytesAvailableFcnMode (I need as near realtime as possible, and with the aforementioned there can be delays of up to hundreds of milliseconds to send and receive data), and am instead sending and polling the port at a fixed interval using a timer. This introduces some overhead, and, if the user holds onto the main thread, the sending/receiving will also stop. Now, one of the most important functions of the class is to set callbacks that are based on the input received from the device. The user sets their function and a given condition to match, and the object will call it automatically.
Problem: This object works well, completely in the background. However, as mentioned, it consumes some resources on the Matlab thread. I'm curious about making just the serial wrapper and callback functionality run on its own thread. However, if I compile it as a standalone application (for all 3 platforms) I believe my only solution will be TCPIP/UDP communication. Which then requires the object running on the main thread to poll the port in order to handle the callbacks in realtime - thus negating the benefit of moving it to a standalone application.
Any suggestions?
Threading in Matlab is a nightmare. Doing anything in realtime, with the kind of latencies you're describing, is not advised. Under the hood, Matlab uses Java for all its platform independence. If you want to do this right, you'll write your app natively in Java, and call your Java from Matlab (to deal with the fact that your users are incapable of installing a JRE, but can install Matlab).
That said, there is a better way to handle callbacks than what you are doing. My preferred architecture in this scenario is to have one thread service the hardware, and communicate with other threads via message queues (one for input, one for output, and one for command/control if you need to get super fancy.) Basically, the hardware thread then just focuses on servicing the queues. You have a second thread handle the callbacks. It reads the output queue of the hardware thread, and services the callbacks. I've never done this in matlab (see first paragraph) but it works very well in Java contexts.
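To make the architecture concrete, here is a minimal sketch of the two-thread, two-queue layout. It is written in Python purely for illustration (not Matlab or Java), and the fake device function, the callback condition and every name in it are assumptions made up for the example.

```python
import queue
import threading


def hardware_thread(to_device, from_device, read_device):
    """Service the device: take commands in, push readings out."""
    while True:
        command = to_device.get()
        if command is None:             # shutdown request
            from_device.put(None)       # pass it on to the callback thread
            return
        from_device.put(read_device(command))


def callback_thread(from_device, callbacks):
    """Read the hardware thread's output queue and fire matching callbacks."""
    while True:
        reading = from_device.get()
        if reading is None:
            return
        for condition, action in callbacks:
            if condition(reading):
                action(reading)


def run_demo():
    to_device, from_device = queue.Queue(), queue.Queue()
    seen = []
    # One user callback: record every reading greater than 10.
    callbacks = [(lambda reading: reading > 10, seen.append)]

    def fake_read(command):             # hypothetical device: doubles input
        return command * 2

    threads = [
        threading.Thread(target=hardware_thread,
                         args=(to_device, from_device, fake_read)),
        threading.Thread(target=callback_thread,
                         args=(from_device, callbacks)),
    ]
    for thread in threads:
        thread.start()
    for command in (1, 7, 20):
        to_device.put(command)
    to_device.put(None)
    for thread in threads:
        thread.join()
    return seen
```

The point of the split is that a slow user callback only delays the callback thread; the hardware thread keeps draining its queues.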

Can single thread do everything that multithread can do?

My 1st Question: As per the title.
I am asking this because I came across a StackExchange question: What can multiple threads do that a single thread cannot?
One of the solutions given in that link states that whatever multiple threads can do can be done by a single thread as well.
However, I don't think this is true. My argument is this: suppose we build a simple chat program with socket programming and run it from the command console. If the chat program is single-threaded, it is effectively half-duplex: we cannot listen and talk concurrently, so at any time only one party can talk while the other has to listen. For both parties to be able to send and receive messages concurrently, we have to implement it with multiple threads.
My 2nd Question: Is my argument correct? Or did I miss out some points here, and therefore a single thread still can do everything multithread does?
Let's consider the computer as a whole, and more precisely consider your chat application together with the kernel (or the whole OS) as one piece that we will call "the software".
Now consider that this "software" runs on a single core (say an i386).
Then you can see that, even if you wrote your chat application using threads (which is probably overkill), the software as a whole runs on a single CPU core, which means that at any given moment it performs one single thing, even if there seem to be parallel things happening.
This is nothing more than a Turing machine (using a single tape) https://en.wikipedia.org/wiki/Turing_machine
The parallelism is an illusion created by the kernel, because it can switch between tasks fast enough. It is just like a film, which seems to be a continuous picture on screen when there are actually just 24 images per second, and this is enough to fool our brain.
So I would say that anything a multithreaded program does, a single-threaded one could do.
Nevertheless, we now all use multi-core CPUs, which can be seen to some extent as running on multiple computers at the same time (parallel computing), so you can probably find software that works on a multi-core machine but would not work on a single-core one.
A good example is device drivers (in the kernel). With a poor implementation on a non-preemptive kernel, you can create a busy loop that waits for an event indefinitely. This usually deadlocks on a single core (you prevent the kernel from scheduling another task, and thus prevent the event from being sent). But it can work on multiple cores, as the event is usually eventually sent by another thread running on another core (hopefully).
I want to amend the existing answer (+1):
You absolutely can run multiple parallel IOs on a single thread. An IO is nothing more than a kernel data structure. When you start the IO, the OS talks to the hardware and tells it to do something. Then the CPU is free to do whatever it wants. The hardware calls back into the OS when it's done: it issues an interrupt, which hijacks a CPU core to process the completion notification.
This is called async IO and all OSes provide it.
In fact, this is how socket programs with many connections run. They use async IO to multiplex large numbers of connections onto a small pool of threads.
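As a rough illustration of the chat case, the sketch below drives both ends of a connection from one thread using readiness-based multiplexing with Python's selectors module (a close cousin of the async IO described above, not the same kernel mechanism). The socketpair stands in for a real network connection, and the function name and the 8-byte chunk size are arbitrary choices for the example.

```python
import selectors
import socket


def chat_single_thread(a_text: bytes, b_text: bytes):
    """Send in both directions over one connection using a single thread."""
    a, b = socket.socketpair()
    for sock in (a, b):
        sock.setblocking(False)

    pending = {a: a_text, b: b_text}            # bytes still to be sent
    received = {a: b"", b: b""}
    wanted = {a: len(b_text), b: len(a_text)}   # bytes each side expects

    sel = selectors.DefaultSelector()
    for sock in (a, b):
        sel.register(sock, selectors.EVENT_READ | selectors.EVENT_WRITE)

    while any(len(received[s]) < wanted[s] for s in (a, b)):
        for key, events in sel.select(timeout=1.0):
            sock = key.fileobj
            if events & selectors.EVENT_READ:
                received[sock] += sock.recv(4096)
            if events & selectors.EVENT_WRITE and pending[sock]:
                # Small chunks so both directions are in flight together.
                sent = sock.send(pending[sock][:8])
                pending[sock] = pending[sock][sent:]
            if not pending[sock]:
                # Nothing left to send: only wait for incoming data.
                sel.modify(sock, selectors.EVENT_READ)

    result = (received[a], received[b])
    sel.close()
    a.close()
    b.close()
    return result
```

Neither side ever waits for the other to finish talking; the single thread simply handles whichever socket is ready next.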
The core reason why this argument is incorrect is subtle. While it's true that with only a single thread, or single core, or single network interface, that particular component can only be handling a send or a receive at any given time, if it's not the critical path, it does not make sense to describe the overall system as half duplex.
Consider a network link that is full-duplex and takes 1ms to move a chunk of data from one end to the other. Now imagine we have a device that puts data on the link or removes data from the link but cannot do both at the same time. So long as it takes much less than 1ms to process a send or a receive, this single file path that data in both directions must go through does not somehow make the link half-duplex. There will still be data moving in both directions at the same time.
In any realistic chat application, the CPU will not be the limiting factor. So its inability to do more than one thing at a time can't make the system half-duplex. There can still be data moving in both directions at the same time.
For a typical chat application under typical load, the behavior of the system will not be significantly different whether the implementation uses a single thread or has multiple threads with infinite CPU resources. The CPU just won't be the limiting factor.

"Multi-process" vs. "single-process multi-threading" for software modules communicating via messaging

We need to build a software framework (or middleware) that will enable messaging between different software components (or modules) running on a single machine. This framework will provide the following features:
Communication between modules is through 'messaging'.
Each module will have its own message queue and message handler thread that will synchronously handle each incoming message.
With the above requirements, which of the following approaches is the correct one (and why)?
Option-1: Implementing modules as processes, and messaging through shared memory.
Option-2: Implementing modules as threads in a single process, and messaging by pushing message objects to the destination module's message queue.
Of course, there are some apparent pros and cons:
In Option-2, if one module causes segmentation fault, the process (thus the whole application) will crash. And one module can access/mutate another module's memory directly, which can lead to difficult-to-debug runtime errors.
But with Option-1, you need to handle the states where a module you need to communicate with has just crashed. If there are N modules in the software, there can be 2^N alive/crashed states of the system that affect the algorithms running on the modules.
Again in Option-1, sender cannot assume that the receiver has received the message, because it might have crashed at that moment. (But the system can alert all the modules that a particular module has crashed; that way, sender can conclude that the receiver will not be able to handle the message, even though it has successfully received it)
I am in favor of Option-2, but I am not sure whether my arguments are solid enough or not. What are your opinions?
EDIT: Upon requests for clarification, here are more specification details:
This is an embedded application that is going to run on Linux OS.
Unfortunately, I cannot tell you about the project itself, but I can say that there are multiple components of the project, each component will be developed by its own team (of 3-4 people), and it has been decided that the communication between these components/modules will be through some kind of messaging framework.
C/C++ will be used as programming language.
What the 'Module Interface API' will automatically provide to the developers of a module is: (1) a message/event handler thread loop, (2) a synchronous message queue, (3) a function pointer member variable where you can set your message handler function.
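For illustration, the three pieces of that 'Module Interface API' could look roughly like the sketch below under Option-2 (threads in one process). The project itself is C/C++; this is Python only to keep the example short, and the Module class, its method names and the _STOP sentinel are all hypothetical.

```python
import queue
import threading

_STOP = object()   # private sentinel used to end a module's loop


class Module:
    """A module with its own message queue and handler thread."""

    def __init__(self, name, handler=None):
        self.name = name
        self.handler = handler            # (3) settable message handler
        self._inbox = queue.Queue()       # (2) the module's message queue
        self._thread = threading.Thread(target=self._loop)

    def start(self):
        self._thread.start()

    def send(self, message):
        """Called by other modules: push a message object to this module."""
        self._inbox.put(message)

    def stop(self):
        self._inbox.put(_STOP)
        self._thread.join()

    def _loop(self):                      # (1) the handler thread loop
        while True:
            message = self._inbox.get()
            if message is _STOP:
                return
            if self.handler is not None:
                self.handler(self, message)   # one message at a time


def demo():
    log = []
    logger = Module("logger", lambda module, msg: log.append(msg))
    doubler = Module("doubler", lambda module, msg: logger.send(msg * 2))
    logger.start()
    doubler.start()
    doubler.send(1)
    doubler.send(2)
    doubler.stop()     # doubler drains its queue first, then logger does
    logger.stop()
    return log
```

Because the message objects are passed by reference, this shows both the convenience and the risk mentioned above: nothing stops one module from mutating an object another module still holds.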
Here is what I could come up with:
Multi-process(1) vs. Single-process, multi-threaded(2):
Impact of segmentation faults: In (2), if one module causes a segmentation fault, the whole application crashes. In (1), modules have different memory regions and thus only the module that causes the segmentation fault will crash.
Message delivery guarantee: In (2), you can assume that message delivery is guaranteed. In (1), the receiving module may crash before receipt or during handling of the message.
Sharing memory between modules: In (2), the whole memory is shared by all modules, so you can directly send message objects. In (1), you need to use 'Shared Memory' between modules.
Messaging implementation: In (2), you can send message objects between modules, in (1) you need to use either of network socket, unix socket, pipes, or message objects stored in a Shared Memory. For the sake of efficiency, storing message objects in a Shared Memory seems to be the best choice.
Pointer usage between modules: In (2), you can use pointers in your message objects. The ownership of heap objects (accessed by pointers in the messages) can be transferred to the receiving module. In (1), you need to manually manage the memory (with custom malloc/free functions) in the 'Shared Memory' region.
Module management: In (2), you are managing just one process. In (1), you need to manage a pool of processes each representing one module.
Sounds like you're implementing Communicating Sequential Processes. Excellent!
Tackling threads vs processes first, I would stick to threads; the context switch times are faster (especially on Windows where process context switches are quite slow).
Second, shared memory vs a message queue; if you're doing full synchronous message passing it'll make no difference to performance. The shared memory approach involves a shared buffer that gets copied to by the sender and copied from by the reader. That's the same amount of work as is required for a message queue. So for simplicity's sake I would stick with the message queue.
In fact, you might like to consider using a pipe instead of a message queue. You have to write code to make the pipe synchronous (they're normally asynchronous, which would be Actor Model; message queues can often be set to zero length, which does what you want for it to be synchronous and properly CSP), but then you could just as easily use a socket instead. Your program can then become multi-machine distributed should the need arise, but you've not had to change the architecture at all. Also, named pipes between processes are an equivalent option, so on platforms where process context switch times are good (e.g. Linux) the whole thread vs process question goes away. So working a bit harder to use a pipe gives you very significant scalability options.
Regarding crashing; if you go the multiprocess route and you want to be able to gracefully handle the failure of a process you're going to have to do a bit of work. Essentially you will need a thread at each end of the messaging channel simply to monitor the responsiveness of the other end (perhaps by bouncing a keep-awake message back and forth between themselves). These threads need to feed status info into their corresponding main thread to tell it when the other end has failed to send a keep-awake on schedule. The main thread can then act accordingly. When I did this I had the monitor thread automatically reconnect as and when it could (e.g. the remote process has come back to life), and tell the main thread that too. This means that bits of my system can come and go and the rest of it just copes nicely.
Finally, your actual application processes will end up as a loop, with something like select() at the top to wait for message inputs from all the different channels (and monitor threads) that it is expecting to hear from.
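A bare-bones version of such a loop might look like the sketch below. It uses Python's select.select() on pipe file descriptors (which works on Unix, as discussed), and the channel names and the demo messages are invented for the example.

```python
import os
import select


def event_loop(channels):
    """Wait on several read ends at once until every channel is closed.

    channels maps a channel name to a readable file descriptor.
    Returns the (name, data) pairs in the order they were read.
    """
    open_fds = {fd: name for name, fd in channels.items()}
    log = []
    while open_fds:
        # Block until at least one channel has data or has hit EOF.
        readable, _, _ = select.select(list(open_fds), [], [])
        for fd in readable:
            data = os.read(fd, 4096)
            if data:
                log.append((open_fds[fd], data))
            else:                         # EOF: the writer closed its end
                os.close(fd)
                del open_fds[fd]
    return log


def demo():
    sensor_r, sensor_w = os.pipe()
    monitor_r, monitor_w = os.pipe()
    os.write(sensor_w, b"42")
    os.write(monitor_w, b"peer alive")
    os.close(sensor_w)
    os.close(monitor_w)
    return event_loop({"sensor": sensor_r, "monitor": monitor_r})
```

In a real process the write ends would belong to other modules and to the monitor threads, and the loop would dispatch each message instead of just logging it.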
By the way, this sort of thing is frustratingly hard to implement in Windows. There's just no proper equivalent of select() anywhere in any Microsoft language. There is a select() for sockets, but you can't use it on pipes, etc. like you can in Unix. The Cygwin guys had real problems implementing their version of select(). I think they ended up with a polling thread per file descriptor; massively inefficient.
Good luck!
Your question lacks a description of how the "modules" are implemented and what they do, and possibly a description of the environment in which you are planning to implement all of this.
For example:
If the modules themselves have some requirements which makes them hard to implement as threads (e.g. they use non-thread-safe 3rd party libraries, have global variables, etc.), your message delivery system will also not be implementable with threads.
If you are using an environment such as Python which does not handle thread parallelism very well (because of its global interpreter lock), and running on Linux, you will not gain any performance benefits with threads over processes.
There are more things to consider. If you are just passing data between modules, who says your system needs to use either multiple threads or multiple processes? There are other architectures which do the same thing without either of them, such as event-driven with callbacks (a message receiver can register a callback with your system, which is invoked when a message generator generates a message). This approach will be absolutely the fastest in any case where parallelism isn't important and where receiving code can be invoked in the execution context of the caller.
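As a rough sketch of that event-driven style, with no threads or processes at all: receivers register callbacks per topic, and publishing a message simply calls them in the publisher's own execution context. The Dispatcher class and the topic names are made up for the example.

```python
from collections import defaultdict


class Dispatcher:
    """Deliver messages by calling registered callbacks directly."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def register(self, topic, callback):
        """A message receiver registers interest in a topic."""
        self._handlers[topic].append(callback)

    def publish(self, topic, message):
        """Invoke every callback for the topic; return how many ran."""
        handlers = self._handlers.get(topic, ())
        for callback in handlers:
            callback(message)      # runs in the caller's execution context
        return len(handlers)


def demo():
    received = []
    dispatcher = Dispatcher()
    dispatcher.register("temperature", received.append)
    dispatcher.register("temperature", lambda m: received.append(m + 0.5))
    delivered = dispatcher.publish("temperature", 21.0)
    ignored = dispatcher.publish("pressure", 1013)   # nobody listens
    return received, delivered, ignored
```

There is no queue and no context switch here, which is why it is fast, but a slow callback blocks the publisher for as long as it runs.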
tl;dr: you have only scratched the surface with your question :)

How do programs communicate with each other?

How do processes communicate with each other? Using everything I've learnt about programming so far, I'm unable to explain how sockets, file systems and the other things involved in sending messages between programs work.
By the way, I use a Linux-based OS, if you're going to add anything OS-specific. Thanks in advance. The question's been bugging me for ages. I'm also guessing the kernel has something to do with it.
In case of most IPC (InterProcess Communication) mechanisms, the general answer to your question is this: process A calls the kernel passing a pointer to a buffer with data to be transferred to process B, process B calls the kernel (or is already blocked on a call to the kernel) passing a pointer to a buffer to be filled with data from process A.
This general description is true for sockets, pipes, System V message queues, ordinary files etc. As you can see the cost of communication is high since it involves at least one context switch.
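Here is a small sketch of that hand-over using a pipe. It assumes a Unix-like system (it relies on os.fork), and the function name is invented: the child plays process A and writes its buffer into the kernel, while the parent plays process B and blocks in a read until the kernel has data for it.

```python
import os


def send_through_kernel(payload: bytes) -> bytes:
    """Move payload from a child process to the parent through a pipe."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child = process A: hand its buffer to the kernel, then exit.
        try:
            os.close(read_fd)
            os.write(write_fd, payload)
        finally:
            os._exit(0)

    # Parent = process B: close the unused write end so EOF can be seen.
    os.close(write_fd)
    chunks = []
    while True:
        chunk = os.read(read_fd, 4096)   # blocks in the kernel until data/EOF
        if not chunk:
            break
        chunks.append(chunk)
    os.close(read_fd)
    os.waitpid(pid, 0)                   # reap the child
    return b"".join(chunks)
```

The data is copied twice (A's buffer into the kernel, the kernel into B's buffer), and B is put to sleep and woken again, which is the cost the answer refers to.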
Signals constitute an asynchronous IPC mechanism in which one process can send a simple notification to another process triggering a handler registered by the second process (alternatively doing nothing, stopping or killing that process if no handler is registered, depending on the signal).
For transferring large amounts of data, one can use System V shared memory, in which case two processes can access the same portion of main memory. Note that even in this case, one needs to employ a synchronization mechanism, like System V semaphores, which results in context switches as well.
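The sketch below shows the same idea with different tools, so treat it as an analogy rather than System V code: an anonymous shared mmap region plays the role of the shared memory segment, and a one-byte write on a pipe plays the role of the semaphore. It assumes a Unix-like system (os.fork), and the function name is made up.

```python
import mmap
import os


def share_value(value: int) -> int:
    """Pass an integer from child to parent through shared memory."""
    shared = mmap.mmap(-1, 8)            # anonymous mapping, shared on fork
    ready_r, ready_w = os.pipe()         # used only for synchronization
    pid = os.fork()
    if pid == 0:
        # Child: write into the shared region, then signal "data is ready".
        try:
            os.close(ready_r)
            shared[:8] = value.to_bytes(8, "little")
            os.write(ready_w, b"1")
        finally:
            os._exit(0)

    os.close(ready_w)
    os.read(ready_r, 1)                  # block until the child has signalled
    result = int.from_bytes(shared[:8], "little")
    os.waitpid(pid, 0)
    os.close(ready_r)
    shared.close()
    return result
```

The payload itself never passes through the kernel, but the parent still has to block and be woken up, so the synchronization step carries the context switch mentioned above.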
This is why when processes need to communicate often, it is better to make them threads in a single process.
