I've read that you can pass a file descriptor to another process there, which seems perfect for what I want. Is there any chance that's doable in Haskell?
To be clear, I'm not forking and I can't pre-open the file. I need a way to pass a file descriptor (mainly stdin) from a bunch of processes to a daemon, so that I don't have to keep those processes alive just to forward their input; that would fill the process list quite fast and would probably eat resources for no reason.
Thanks!
You can get the file descriptor of STDIN from the unix package and UNIX-domain sockets from network.
I've never tried passing a file descriptor between processes, but it should work the same in Haskell as any other language.
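For what it's worth, the underlying mechanism is language-neutral: the descriptor travels as SCM_RIGHTS ancillary data over a UNIX-domain socket. Here is a minimal sketch in Python (3.9 or later) rather than Haskell, only to show the shape of it. The function name pass_fd_demo is made up, and a socketpair inside one process stands in for the real client/daemon connection, which is an assumption made to keep the example self-contained.

```python
import os
import socket


def pass_fd_demo() -> bytes:
    # A connected pair of UNIX-domain sockets stands in for the client/daemon link.
    client, daemon = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()
    try:
        # "Client" side: send the descriptor as SCM_RIGHTS ancillary data.
        socket.send_fds(client, [b"fd"], [write_end])
        # "Daemon" side: receive a new descriptor number for the same open file.
        _msg, fds, _flags, _addr = socket.recv_fds(daemon, 1024, 1)
        received = fds[0]
        os.close(write_end)  # the sender no longer needs its own copy
        os.write(received, b"hello via passed fd")
        os.close(received)
        return os.read(read_end, 1024)
    finally:
        client.close()
        daemon.close()
        os.close(read_end)
```

Once the descriptor has been received, the sending process can exit; the daemon's copy keeps the file open.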
My question is about the pipe() function in Linux: http://linux.die.net/man/2/pipe
My question is: "is there only ONE pipe in linux?". I mean, if I have multiple processes that write to pipe, is it the same pipe, meaning that once I read data from the pipe, I may get data from different processes in the same read() operation?
No. The pipe() function creates a new pipe with two ends.
What can happen is that the file descriptor can be duplicated. The dup, dup2 functions can do this. fork does it too.
If you somehow have two programs with duplicated pipe file descriptors then yes, both of them will show up in the pipe's output.
It is the same thing as a terminal window showing output from programs running in the foreground and background.
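A small sketch of the duplicated-descriptor case, in Python for brevity (os.pipe, os.dup, os.write and os.read are thin wrappers around the system calls of the same names). The function name is hypothetical, and both writers live in one process here just to keep it short.

```python
import os


def shared_pipe_demo() -> bytes:
    read_end, write_end = os.pipe()     # one new pipe with two ends
    second_writer = os.dup(write_end)   # a second descriptor for the SAME write end
    os.write(write_end, b"first;")
    os.write(second_writer, b"second;")
    os.close(write_end)
    os.close(second_writer)
    data = os.read(read_end, 1024)      # both writes arrive on the one pipe
    os.close(read_end)
    return data
```

A second call to os.pipe() would create a completely separate pipe; only duplicated descriptors share one.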
Read not only pipe(2), but also pipe(7) and most importantly Advanced Linux Programming
I mean, if I have multiple processes that write to pipe
generally, you should not make that possible...
is it the same pipe, meaning that once I read data from the pipe, I may get data from different processes in the same read() operation?
Yes, but you usually don't do that.
I am trying to understand the behavior of the three streams - stdout, stdin and stderr. I couldn't get the answer from any textbook, so I came here.
I know that these three are stored in file descriptor table with file descriptors 0 (stdin), 1 (stdout) and 2 (stderr). I am also aware that these are not merely file descriptors but I/O streams which can be redirected. Ok, so how about sharing?
Consider the three cases:
When fork() is called: The child process and parent process share file descriptors, but do they have the same stdin, stdout and stderr?
When a thread is created: Threads share file descriptors, but what about I/O streams?
When execl() is called: In this case the present process image is overwritten with a new process image. If I do execl("./a.out", "a.out", NULL); will this new executable get a fresh copy of stdin, stderr and stdout?
All wise answers are welcome.
In order to understand what's going on, consider that these are communication channels across the process boundaries. I'll avoid calling them streams, because those are used in different contexts, but they are related.
Now, firstly, the file descriptor is just an index into a process-specific table that represents these channels; descriptors are basically a kind of opaque handle. However, here is the first answer: Since threads are part of a process, they also share these descriptors, so if you write from two threads into the same channel, the data leaves through the same channel, and outside of the process the two threads are indistinguishable.
Then, when fork() is called, the process is effectively copied. This is done with copy-on-write optimizations, but still, it means that the two processes have different tables representing these communication channels. The entry with index 2 in one process is not the same table entry as the one with index 2 in the fork, even though both initially refer to the same underlying channel. This is the same for any structure that is inside the process: if you created a C FILE* or a C++ stream object on top of a descriptor, it gets copied, too, along with the other data.
When execl() is called, the process still "owns" certain channels to the outside. These are assigned to it by the OS, which manages processes. This means that the index 2 can still be used to communicate with the outside world. On startup, the runtime library will then create e.g. the FILE* for use in C for the three well-known channels stdin, stdout and stderr.
The question that remains is what happens to the channels to the outside when a process is forked. Here, the answer is simple: either the channel is closed or it is inherited, which can be configured on a per-channel basis. If it is inherited, it remains usable in the child process. Anything written to an inherited channel will end up wherever the output from the parent process would have ended up, too.
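To make the inherited case concrete, here is a rough sketch in Python (os.fork and os.pipe wrap the system calls directly). The function name is made up, and a pipe stands in for "a channel to the outside", which is an assumption for the sake of a self-contained example.

```python
import os


def fork_inherits_channel() -> bytes:
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: works on its own copy of the descriptor table.
        os.close(read_end)               # only changes the child's table
        os.write(write_end, b"child;")   # lands in the same pipe the parent holds
        os._exit(0)
    os.waitpid(pid, 0)                   # let the child finish first
    os.write(write_end, b"parent;")      # the parent's descriptors are still open
    os.close(write_end)
    data = os.read(read_end, 1024)       # read_end is still valid in the parent
    os.close(read_end)
    return data
```

The child closing its read end does not disturb the parent, yet the child's output ends up in the same place as the parent's.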
Concerning the stdin of a forked process, I'm actually not sure, I think that the input is by default one of those that are closed, because sending input to multiple targets doesn't make sense. Also, I never found the need to actually process input from stdin in a child process, unless that input was specifically provided by the parent process (similar to pipes in the shell, although those are siblings rather than parent and child).
Note: I'm not sure if this description is clear, please don't hesitate to ask and I will try to improve things for your understanding.
Let's assume the converse.
If they did not share the same locations (that is essentially what a file descriptor is) then these scenarios will have to conjure up something? Is that possible - one would conclude from a deterministic machine it is not.
There lies the answer. Yes they share the same location.
I'm working on a multithreaded application where multiple threads may want exclusive access to the same file. I'm looking for a way of serializing these operations. I was planning to use flock, lockf, or fcntl locking. However, it appears that with these methods an attempt to lock a file by a second thread when a first thread already owns the lock will be granted, because the two threads are in the same process. This is according to the manpages for flock and fcntl (and I guess in Linux lockf is implemented with fcntl). It is also supported by this other question. So, are there other ways of locking a file in Linux that work at a thread level instead of a process level?
Some alternatives that I came up with which I do not like are:
1) Use a lockfile (xxx.lock) opened with O_CREAT | O_EXCL flags. This call will succeed only in one thread if there is contention. The problem with this is that then other threads have to spin on the call until they achieve the lock, meaning that I have to _yield() or sleep() which makes me think this is not a great option.
2) Keep a mutex'ed list of all open files. When a thread wants to open/close a file it has to lock the list first. When opening a file, it searches the list to see if it's open. This sounds particularly inefficient because it requires a significant amount of work even if the file is not owned yet.
Are there other ways of doing this?
Edit:
I just discovered this text in my system's manpages which isn't in the online man pages:
If a process uses open(2) (or similar) to obtain more than one descriptor for the same file, these descriptors are treated independently by flock(). An attempt to lock the file using one of these file descriptors may be denied by a lock that the calling process has already placed via another descriptor.
I'm not happy about the words "may be denied", I'd prefer "will be denied" but I guess it's time to test that.
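A quick way to run that test, sketched in Python (fcntl.flock wraps flock(2)). The helper name and the temporary file are assumptions of the sketch. It opens the same file twice, takes a non-blocking exclusive lock through the first descriptor, and reports whether the attempt through the second one is refused.

```python
import fcntl
import os
import tempfile


def second_flock_denied() -> bool:
    with tempfile.NamedTemporaryFile() as tmp:
        # Two separate open() calls give two independent descriptors for one file.
        fd1 = os.open(tmp.name, os.O_RDWR)
        fd2 = os.open(tmp.name, os.O_RDWR)
        try:
            fcntl.flock(fd1, fcntl.LOCK_EX | fcntl.LOCK_NB)
            try:
                # Non-blocking, so a conflict raises instead of hanging.
                fcntl.flock(fd2, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except BlockingIOError:
                return True   # denied by the lock held through fd1
            return False      # granted despite the existing lock
        finally:
            os.close(fd1)
            os.close(fd2)
```

If the manpage text holds on your system, this should return True, which would mean each thread can serialize access by opening the file itself instead of sharing one descriptor. Results may differ on network filesystems.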
This question is meant to be language and connection method independent. Actually finding methods is the question.
I know that I can directly pipe two processes through a call like prog1 | prog2 in the shell, and I've read something about RPC and Sockets. But everything was a little too abstract to really get a grip on it. For example it's not clear to me, how the sockets are created and if each process needs to create a socket or if many processes can use the same socket to transfer messages to each other or if I can get rid of the sockets completely.
Can someone explain how Interprocess-Communication in Linux really works and what options I have?
Pipe
In a producer-consumer scenario, you can use a pipe, which is one form of IPC. A pipe is just what the name suggests: it connects a sink and a source together. In the shell the source is the standard output and the sink the standard input, so cmd1 | cmd2 just connects the output of cmd1 to the input of cmd2.
Creating a pipe gives you two file descriptors. You can use one for the sink and the other one for the source. Once the pipe is created, you fork, and one process uses one of the file descriptors while the other process uses the other.
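The pipe-then-fork pattern can be sketched like this in Python (os.pipe and os.fork map onto the system calls). The function name and the choice of the child as producer are arbitrary assumptions of the example.

```python
import os


def produce_and_consume(items: list[bytes]) -> bytes:
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child is the producer: it only needs the write end.
        os.close(read_end)
        for item in items:
            os.write(write_end, item)
        os.close(write_end)
        os._exit(0)
    # Parent is the consumer: close the unused write end so EOF can be seen.
    os.close(write_end)
    received = b""
    while True:
        chunk = os.read(read_end, 4096)
        if not chunk:          # empty read means every write end is closed
            break
        received += chunk
    os.close(read_end)
    os.waitpid(pid, 0)
    return received
```

Each side closing the end it does not use matters: the consumer only sees end-of-file once every copy of the write end is closed.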
Other IPC
There are various kinds of IPC: pipe (in memory), named pipe (through a file), socket, shared memory, semaphore, message queue, signals, etc. All have pros and cons. There is a lot of literature online and in books about them. Describing them all here would be difficult.
Basically you have to understand that each process has its own memory, separated from other processes. So you need to find shared resources through which to exchange data. A resource can be "physical" like a network (for sockets) or mass storage (for files), or "abstract" like a pipe or a signal.
If one of the processes is a producer and the other is a consumer, then you can go for shared memory communication. You need a semaphore for this. One process will lock the semaphore and then write to the shared memory, and the other will lock the semaphore and read the value. Because access is guarded by the semaphore, dirty reads and writes are avoided.
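A rough sketch of the shared-memory idea in Python. Everything named here is an assumption for illustration: the function name, the 64-byte anonymous shared mapping from mmap, and a semaphore from multiprocessing's "fork" context. Note that the semaphore is used as a simple "data is ready" signal, which is a slightly simpler pattern than both sides locking around every access.

```python
import mmap
import multiprocessing
import os


def share_via_memory(message: bytes) -> bytes:
    # The message must fit in the 64-byte region (a limit chosen for the sketch).
    ctx = multiprocessing.get_context("fork")
    ready = ctx.Semaphore(0)      # starts at 0: nothing to read yet
    shm = mmap.mmap(-1, 64)       # anonymous mapping, shared across fork on Unix
    pid = os.fork()
    if pid == 0:
        # Producer: write into shared memory, then signal the consumer.
        shm[:len(message)] = message
        ready.release()
        os._exit(0)
    # Consumer: block until the producer has finished writing.
    ready.acquire()
    data = bytes(shm[:len(message)])
    os.waitpid(pid, 0)
    shm.close()
    return data
```

Without the semaphore the consumer could read the region before the producer has written it, which is exactly the dirty read mentioned above.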
My question: in Linux (and in FreeBSD, and generally in UNIX), is it possible/legal to read a single file descriptor simultaneously from two threads?
I did some searching but found nothing, although a lot of people ask similar questions about reading from and writing to a socket fd at the same time (meaning reading while another thread is writing, not reading while another is reading). I have also read some man pages and got no clear answer to my question.
Why I ask: I tried to implement a simple program that counts lines in stdin, like wc -l. I was actually testing my home-made C++ I/O engine for overhead, and discovered that wc is 1.7 times faster. I trimmed down some C++ and came closer to wc's speed but didn't reach it. Then I experimented with the input buffer size and optimized it, but wc is still clearly a bit faster. Finally I created 2 threads which read the same STDIN_FILENO in parallel, and this at last was faster than wc! But the line count became incorrect, so I suppose some unexpected junk comes from the reads. Doesn't the kernel care which process reads?
Edit: I did some research and discovered only that calling read directly via syscall does not change anything. The kernel code seems to do some synchronization handling, but I didn't understand much of it (read_write.c).
That behavior is unspecified. POSIX says:
The read() function shall attempt to read nbyte bytes from the file
associated with the open file descriptor, fildes, into the buffer
pointed to by buf. The behavior of multiple concurrent reads on the
same pipe, FIFO, or terminal device is unspecified.
About accessing a single file descriptor concurrently (i.e. from multiple threads or even processes), I'm going to cite POSIX.1-2008 (IEEE Std 1003.1-2008), Subsection 2.9.7 Thread Interactions with Regular File Operations:
2.9.7 Thread Interactions with Regular File Operations
All of the following functions shall be atomic with respect to each other in the effects specified in POSIX.1-2008 when they operate on regular files or symbolic links:
[…] read() […]
If two threads each call one of these functions, each call shall either see all of the specified effects of the other call, or none of them. […]
At first glance, this looks quite good. However, I hope you did not miss the restriction when they operate on regular files or symbolic links.
@jarero cites:
The behavior of multiple concurrent reads on the same pipe, FIFO, or terminal device is unspecified.
So, implicitly, we're agreeing, I assume: it depends on the type of the file you are reading. You said you read from STDIN. Well, if your STDIN is a plain file, you can use concurrent access. Otherwise you shouldn't.
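If you want to check at run time which case you are in, something along these lines should work. It is sketched in Python, where os.fstat wraps fstat(2); the helper name is made up.

```python
import os
import stat


def is_regular_file(fd: int) -> bool:
    # S_ISREG is true for plain files, false for pipes, FIFOs and terminals.
    return stat.S_ISREG(os.fstat(fd).st_mode)
```

Calling is_regular_file(0) tells you whether stdin was redirected from a plain file (prog < file) or is a pipe or terminal.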
When used with a descriptor (fd), read() and write() rely on the internal state of the fd to know the "current offset" at which the read and write will occur. As a result, they aren't thread-safe.
To allow a single descriptor to be used by multiple threads simultaneously, pread() and pwrite() are provided. With those interfaces, the descriptor and the desired offset are specified, so the "current offset" in the descriptor isn't used.
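A small sketch of the pread() approach, in Python (os.pread wraps pread(2)). The function name, the temporary file and the two-way split are assumptions of the example. Two threads read different halves of the same descriptor, each passing an explicit offset.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_halves_concurrently(payload: bytes) -> bytes:
    with tempfile.TemporaryFile() as tmp:
        tmp.write(payload)
        tmp.flush()
        fd = tmp.fileno()
        half = len(payload) // 2
        # (length, offset) pairs: each thread gets its own region of the file.
        regions = [(half, 0), (len(payload) - half, half)]
        with ThreadPoolExecutor(max_workers=2) as pool:
            # pread takes the offset explicitly and leaves the fd's offset alone.
            parts = list(pool.map(lambda r: os.pread(fd, r[0], r[1]), regions))
        return b"".join(parts)
```

Because no thread depends on or moves the shared "current offset", the order in which the threads run does not affect the result. This only applies to seekable files, not to pipes or terminals.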