My question regards the pipe() function in Linux: http://linux.die.net/man/2/pipe
My question is: "is there only ONE pipe in Linux?". I mean, if I have multiple processes that write to the pipe, is it the same pipe, meaning that once I read data from the pipe, I may get data from different processes in the same read() operation?
No. The pipe() function creates a new pipe with two ends.
What can happen is that a pipe's file descriptors get duplicated. The dup() and dup2() functions can do this, and fork() does it too.
If you somehow have two programs with duplicated pipe file descriptors, then yes, writes from both of them will show up in the pipe's output.
It is the same thing as a terminal window showing output from programs running in the foreground and background.
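For illustration, here is a minimal sketch (not from the original answer) of one pipe inherited by two children through fork(): both children write to the same write end, so the parent's read() may return data from either of them.

    /* One pipe, two child writers: both inherit fds[1] via fork(), so the
     * parent's single read() may return bytes from either child. */
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        int fds[2];
        if (pipe(fds) == -1) { perror("pipe"); return 1; }

        for (int i = 0; i < 2; i++) {
            if (fork() == 0) {                 /* child: write one tagged message */
                close(fds[0]);                 /* child only writes */
                char msg[32];
                int n = snprintf(msg, sizeof msg, "child %d says hi\n", i);
                write(fds[1], msg, n);
                _exit(0);
            }
        }

        close(fds[1]);                         /* parent keeps only the read end */
        char buf[256];
        ssize_t n;
        while ((n = read(fds[0], buf, sizeof buf)) > 0)
            fwrite(buf, 1, n, stdout);         /* may contain data from both children */
        while (wait(NULL) > 0) ;
        return 0;
    }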
Read not only pipe(2), but also pipe(7) and, most importantly, Advanced Linux Programming.
I mean, if I have multiple processes that write to pipe
generally, you should not make that possible...
is it the same pipe, meaning that once I read data from the pipe, I may get data from different processes in the same read() operation?
Yes, but you usually don't do that.
I want to do the following on Linux:
1. Spawn a child process, run it to completion, and save its stdout.
2. Later, write that saved stdout to a file.
The issue is that I want to do step 1 a few thousand times with different processes in a thread pool before doing step 2.
What's the most efficient way of doing this?
The normal way of doing this would be to have a pipe that the child process writes to, and then call sendfile() to send it to the output file (avoiding the copy to/from userspace). But this won't work for a few reasons. First of all, it would require me to have thousands of fds open at a time, which isn't supported in all Linux configurations. Secondly, it would cause the child processes to block when their pipes fill up, and I want them to run to completion.
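For reference, a hedged sketch of that pipe-to-output-file draining step; note that it uses splice(2) rather than sendfile(2), since splice is the call that accepts a pipe as its input side (sendfile needs an mmap-able input file). The names pipe_rd and out_fd are placeholders, not from the original post.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Move everything from the pipe's read end into out_fd without copying
     * through userspace; returns 0 on success, -1 on error. */
    int drain_pipe_to_file(int pipe_rd, int out_fd) {
        for (;;) {
            ssize_t n = splice(pipe_rd, NULL, out_fd, NULL, 1 << 16, SPLICE_F_MOVE);
            if (n == 0)  return 0;             /* write end closed and pipe drained */
            if (n < 0) { perror("splice"); return -1; }
        }
    }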
I considered using memfd_create to create the stdout fd for the child process. That solves the pipe-filling issue, but not the fd limit one. vmsplice looked promising: I could splice from a pipe to user memory, but according to the man page:
vmsplice() really supports true splicing only from user memory to a pipe. In the opposite direction, it actually just copies the data to user space.
Is there a way of doing this without copying to/from userspace in the parent process, and without having a high number of fds open at once?
I've read that you can pass a file descriptor to another process there, which seems perfect for what I want. Any chance that's doable in Haskell in any way?
To be clear, I'm not forking and I can't pre-open the file. I actually need a way to pass a file descriptor (mainly stdin) from a bunch of processes to a daemon, to avoid having to keep processes up just to forward their input; that would fill the process list quite fast and would probably eat resources for no reason.
Thanks!
You can get the file descriptor of STDIN from the unix package, and UNIX-domain sockets from the network package.
I've never tried passing a file descriptor between processes, but it should work the same in Haskell as any other language.
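For what it's worth, the underlying mechanism is a sendmsg(2) carrying an SCM_RIGHTS control message over an AF_UNIX socket; a C sketch of the sending side is below (a Haskell binding would wrap the same calls). Here 'sock' is assumed to be an already-connected UNIX-domain socket.

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Send the descriptor fd_to_pass over the connected AF_UNIX socket 'sock'.
     * The receiver gets its own descriptor referring to the same open file. */
    int send_fd(int sock, int fd_to_pass) {
        char dummy = 'x';                              /* at least one byte of real data */
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };

        union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } ctrl;
        memset(&ctrl, 0, sizeof ctrl);

        struct msghdr msg = {
            .msg_iov = &iov, .msg_iovlen = 1,
            .msg_control = ctrl.buf, .msg_controllen = sizeof ctrl.buf,
        };

        struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
        cm->cmsg_level = SOL_SOCKET;
        cm->cmsg_type  = SCM_RIGHTS;
        cm->cmsg_len   = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cm), &fd_to_pass, sizeof(int));

        return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;   /* 1 byte of ordinary data sent */
    }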
Let's say I know that a file descriptor fd is open for reading in my process. I would like to pipe data from this fd into a FIFO that is available for reading outside of my process, in a way that avoids calling poll or select on fd and manually reading/forwarding data. Can this be done?
You mean ask the OS to do that behind the scenes on an ongoing basis from now on? Like an I/O redirection?
No, you can't do that.
You could spawn a thread that does nothing but read the file fd and write the pipe fd, though. To avoid the overhead of copying memory around in read(2) and write(2) system calls, you can use sendfile(out_fd, in_fd, NULL, 4096) to tell the kernel to copy a page from in_fd to out_fd. See the man page.
You might have better results with splice(2), since it's designed for use with files and pipes. sendfile(2) used to require out_fd to be a socket (it was designed for zero-copy sending of static data on TCP sockets, e.g. from a web server).
Linux does have asynchronous I/O, so you can queue up a read or write to happen in the background. That's not a good choice here, because you can't queue up a copy from one fd to another (there is no async splice(2) or sendfile(2)). Even if there were, it would have a specific request size rather than a fire-and-forget "keep copying forever". AFAIK, threads have become the preferred way to do async I/O, rather than the POSIX AIO facilities.
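A rough sketch of the forwarding-thread idea, using splice(2) since the destination FIFO is a pipe; the fds array and the function name are placeholders introduced here, not an established API.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Thread body: keep moving data from fds[0] (the source fd) into fds[1]
     * (the FIFO's write end) until the source hits EOF or an error occurs. */
    static void *forwarder(void *arg) {
        int *fds = arg;
        for (;;) {
            ssize_t n = splice(fds[0], NULL, fds[1], NULL, 1 << 16, SPLICE_F_MOVE);
            if (n == 0) break;                     /* end of input */
            if (n < 0) { perror("splice"); break; }
        }
        close(fds[1]);                             /* readers of the FIFO now see EOF */
        return NULL;
    }

    /* usage: pthread_create(&tid, NULL, forwarder, fds); */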
I am trying to understand the behavior of the three streams - stdout, stdin and stderr. I couldn't get the answer from any textbook, so I came here.
I know that these three are stored in file descriptor table with file descriptors 0 (stdin), 1 (stdout) and 2 (stderr). I am also aware that these are not merely file descriptors but I/O streams which can be redirected. Ok, so how about sharing?
Consider the three cases:
When fork() is called: the child process and the parent process share file descriptors, but do they have the same stdin, stdout and stderr?
When a thread is created: threads share file descriptors, but do they share the I/O streams?
When execl() is called: the present process image is overwritten with a new process image. If I do execl("./a.out", "a.out", NULL);, will this new executable get a fresh copy of stdin, stdout and stderr?
All wise answers are welcome.
In order to understand what's going on, consider that these are communication channels across the process boundaries. I'll avoid calling them streams, because those are used in different contexts, but they are related.
Now, firstly, the file descriptor is just an index into a process-specific table that represents these channels; it is basically a kind of opaque handle. However, here is the first answer: since threads are part of a process, they also share these descriptors, so if you write from two threads into the same channel, the data leaves through that same channel, and outside the process the two threads are indistinguishable.
Then, when fork() is called, the process is effectively copied. This is done with copy-on-write optimizations, but it still means that parent and child have separate tables representing these communication channels: the entry with index 2 in one process is a distinct table entry from the one with index 2 in the fork, even though both initially refer to the same underlying channel. The same goes for any structure inside the process: if you created a C FILE* or a C++ std::stream on top of a descriptor, it gets copied too, along with the rest of the data.
When execl() is called, the process still "owns" certain channels to the outside. These are assigned to it by the OS, which manages processes. This means that index 2 can still be used to communicate with the outside world. On startup, the runtime library will then create, e.g., the FILE* objects used in C for the three well-known channels stdin, stdout and stderr.
The question that remains is what happens to the channels to the outside when a process is forked. Here the answer is simple: each channel is either inherited or closed, and this can be configured per channel (on a plain fork() everything is inherited; the close-on-exec flag controls what survives a later exec()). If a channel is inherited, it remains usable in the child process, and anything written to it will end up wherever the output from the parent process would have ended up.
Concerning the stdin of a forked process: it is inherited like any other descriptor, but having several processes read the same input rarely makes sense, so the child typically closes it or is given something else. Also, I never found the need to actually process input from stdin in a child process, unless that input was specifically provided by the parent process (similar to pipes in the shell, although there the processes are siblings rather than parent and child).
Note: I'm not sure if this description is clear, please don't hesitate to ask and I will try to improve things for your understanding.
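A tiny experiment (not part of the original answer) that illustrates the fork() point: the child gets its own descriptor table, but the inherited entry for fd 1 still refers to the same open file description, so redirecting stdout to a file shows both processes writing through it and sharing the file offset.

    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        const char child_msg[]  = "child:  writes through the inherited fd 1\n";
        const char parent_msg[] = "parent: same open file description, offset already advanced\n";

        if (fork() == 0) {                    /* child has its own copy of the fd table */
            write(1, child_msg, sizeof child_msg - 1);
            _exit(0);
        }
        wait(NULL);
        write(1, parent_msg, sizeof parent_msg - 1);   /* appends after the child's line */
        return 0;
    }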
Let's assume the converse.
If they did not share the same locations (which is essentially what a file descriptor is), then these scenarios would have to conjure something up out of nothing. Is that possible? On a deterministic machine one would conclude it is not.
There lies the answer: yes, they share the same location.
This question is meant to be language and connection method independent. Actually finding methods is the question.
I know that I can directly pipe two processes through a call like prog1 | prog2 in the shell, and I've read something about RPC and sockets. But everything was a little too abstract to really get a grip on it. For example, it's not clear to me how the sockets are created, whether each process needs to create its own socket, whether many processes can use the same socket to transfer messages to each other, or whether I can get rid of the sockets completely.
Can someone explain how interprocess communication in Linux really works and what options I have?
Pipe
In a producer-consumer scenario you can use pipes; a pipe is an IPC mechanism. A pipe is just what the name suggests: it connects a source and a sink together. In the shell the source is the standard output and the sink the standard input, so cmd1 | cmd2 just connects the output of cmd1 to the input of cmd2.
Calling pipe() gives you two file descriptors: one for the source (the write end) and one for the sink (the read end). Once the pipe is created, you fork(), and one process uses one of the file descriptors while the other process uses the other.
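As an illustration (assuming the standard ls and wc commands), this is roughly what the shell does for cmd1 | cmd2 with the pipe-then-fork pattern just described: each child closes the pipe end it doesn't use and dup2()s the other end onto its stdout or stdin.

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        int fds[2];
        if (pipe(fds) == -1) { perror("pipe"); return 1; }

        if (fork() == 0) {                    /* "cmd1": its stdout becomes the write end */
            dup2(fds[1], 1);
            close(fds[0]); close(fds[1]);
            execlp("ls", "ls", (char *)NULL);
            _exit(127);
        }
        if (fork() == 0) {                    /* "cmd2": its stdin becomes the read end */
            dup2(fds[0], 0);
            close(fds[0]); close(fds[1]);
            execlp("wc", "wc", "-l", (char *)NULL);
            _exit(127);
        }

        close(fds[0]); close(fds[1]);         /* parent keeps neither end, so wc sees EOF */
        while (wait(NULL) > 0) ;
        return 0;
    }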
Other IPC
There are various IPC mechanisms: pipes (in memory), named pipes (through the filesystem), sockets, shared memory, semaphores, message queues, signals, etc. All have pros and cons. There is a lot of literature about them online and in books; describing them all here would be difficult.
Basically you have to understand that each process has its own memory, separate from the other processes'. So you need to find a shared resource through which to exchange data. A resource can be "physical", like a network (for sockets) or mass storage (for files), or "abstract", like a pipe or a signal.
If one of the processes is a producer and the other is a consumer, then you can go for shared-memory communication. You need a semaphore for this: one process locks the semaphore, then writes to the shared memory, and the other locks the semaphore and reads the value. Since you use a semaphore, dirty reads/writes are prevented.
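A minimal sketch of that idea (not from the original answer): an anonymous shared mapping holds the data plus a process-shared, unnamed semaphore; the producer posts after writing and the consumer waits before reading. Real code doing repeated exchanges would need a second semaphore or a ring buffer; this shows a single handoff. Compile with -pthread.

    #include <semaphore.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    struct shared {
        sem_t ready;                  /* posted by the producer once data is valid */
        char  data[64];
    };

    int main(void) {
        struct shared *shm = mmap(NULL, sizeof *shm, PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (shm == MAP_FAILED) { perror("mmap"); return 1; }
        sem_init(&shm->ready, 1, 0);  /* pshared=1: usable across processes */

        if (fork() == 0) {                         /* producer */
            strcpy(shm->data, "value produced in the child");
            sem_post(&shm->ready);
            _exit(0);
        }

        sem_wait(&shm->ready);                     /* consumer blocks until posted */
        printf("consumer read: %s\n", shm->data);
        wait(NULL);
        return 0;
    }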