This is an extract from Advanced Linux Programming:
Semaphores continue to exist even after all processes using them have terminated.
The last process to use a semaphore set must explicitly remove it to ensure that the
operating system does not run out of semaphores. To do so, invoke semctl with the
semaphore identifier, the number of semaphores in the set, IPC_RMID as the third argument,
and any union semun value as the fourth argument (which is ignored). The
effective user ID of the calling process must match that of the semaphore’s allocator
(or the caller must be root). Unlike shared memory segments, removing a semaphore
set causes Linux to deallocate immediately.
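For reference, here is a minimal sketch of that removal call (my own, not from the book); sem_id is assumed to come from an earlier semget(), and the set size of 1 is just for illustration:

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <stdio.h>

/* On Linux the caller must define union semun itself (see semctl(2)). */
union semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/* Remove a semaphore set previously obtained with semget(). */
int remove_semaphore_set(int sem_id)
{
    union semun ignored = { 0 };   /* fourth argument is ignored for IPC_RMID */

    if (semctl(sem_id, 1, IPC_RMID, ignored) == -1) {
        perror("semctl(IPC_RMID)");
        return -1;
    }
    return 0;
}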
If a process allocates a shared memory segment, and many processes use it and none of them ever marks it for deletion (with shmctl), then when they all terminate the shared segment remains available. (We can see this with ipcs.)
If some process did call shmctl to mark it for deletion, then when the last process detaches, the system will deallocate the shared memory.
So far so good (I guess, if not, correct me).
What I don't understand from the quote above is that first it says:
"Semaphores continue to exist even after all processes using them have terminated."
and then:
"Unlike shared memory segments, removing a semaphore set causes Linux to deallocate immediately."
The two statements don't contradict each other...
The first statement says that the semaphore will continue to exist unless/until some program explicitly deletes it (i.e. it won't be auto-deleted when the last program stops using it).
The second statement says that when a program deletes a semaphore set, Linux will deallocate the semaphore set immediately (as opposed to, say, waiting for all other programs to stop using it first).
So I'm trying to use mutex_init(), mutex_lock(), mutex_unlock() for thread synchronization.
I am currently trying to schedule threads in a round-robin fashion (but more than one thread could be running at a time), and I set the current state of a thread to TASK_INTERRUPTIBLE, followed by waking up another thread whose PID I have in a list.
I need to iterate over this list for my logic.
As I understand it, I need to lock this list as I access its elements, or another thread might miss a new entry while I'm making changes to it. Also, once one thread has locked the mutex protecting a resource, no other thread can unlock it until the locking thread releases it.
But, I'm still not sure if I'm locking it correctly. (I release the lock before I call schedule(), and re-lock after that)
I declare a mutex locally within a thread and lock the list. After my current thread locks
mutex_lock(&lock);
and I iterate over the list till I find something (or reach the end if I don't find anything), then unlock:
mutex_unlock(&lock);
I assume locking while I iterate is legal. I have never seen examples of this though.
Also, is it normal for the process to have a state of (TASK_UNINTERRUPTIBLE) while it holds a mutex lock?
EDIT : I am adding some more information based on the answer below.
It is possible my program may be run on a virtual machine with a single core. Therefore, I do not want to risk infinite polling using spin_lock().
I am trying to maintain scheduling between threads that have a certain id. For example, if there are 4 threads, 2 in set 'A' and 2 in set 'B', I allow only 1 thread to run in each set, but I switch between threads within a given set. However, a thread in set 'A' should not switch to any thread in set 'B'
(I know the kernel scheduler won't be perfect, so an approximate switching will do).
My Reasoning for TASK_STATE's:
1) Initial thread that gets created is running.
2) If another thread in the same set is running (and this one hasn't executed for a given time), set the other thread to TASK_INTERRUPTIBLE while calling schedule(). Note: there can be more than 2 threads in each set, but let's keep it simple by considering only 2 for now.
3) If it has executed for enough time, set this task to TASK_INTERRUPTIBLE and set the other task in the same set to TASK_RUNNING, while calling schedule().
All this logic happens while I am accessing certain data structures which are locked by a (now) Global Mutex. I unlock the mutex just before I call schedule(), and instantly re-lock afterward. After my logic part is done, I completely unlock the mutex.
Is there anything fundamentally wrong with the approach?
As I understand it, I need to lock this list as I access its elements
Yes, that is true. But if you use a mutex, you're going to be really sad because a call to lock/unlock is a call to the scheduler. Therefore, calling it from inside the scheduler should result in deadlock. What you need to do depends on whether your processor is multi-core or (the mythical) single-core. (Is this a virtual system?) On a single-core processor you can disable interrupts. On a multi-core processor, disabling interrupts is not sufficient (it only disables interrupts for that one core, and another core may still be interrupted). The simplest thing to do on a multi-core is to use a spinlock. Unlike the mutex, both of these locking mechanisms can be unlocked from different threads.
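For illustration, a minimal sketch of the spinlock approach (the lock and function names are made up; spin_lock_irqsave() takes the lock and also disables local interrupts, which covers the single-core case too):

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(run_list_lock);   /* hypothetical lock for the run list */

static void touch_run_list(void)
{
    unsigned long flags;

    spin_lock_irqsave(&run_list_lock, flags);
    /* ... walk or modify the shared list here; keep this section short ... */
    spin_unlock_irqrestore(&run_list_lock, flags);
}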
I set the current state of a thread to TASK_INTERRUPTIBLE
Is the thread being taken off the CPU? If so, it's not running, so I suspect that TASK_INTERRUPTIBLE is the wrong state. It would be helpful if you could list the possible states for me or if you could describe what the state is supposed to indicate. Because to me "TASK_INTERRUPTIBLE" sounds like a running task.
I declare a mutex locally within a thread and lock the list
Local mutexes are a red flag! The resource you are locking should be protected by a mutex with the same scope. If the list is global, it should have a global mutex to protect it. Threads that want to use the list must first acquire its mutex. Of course, as I already talked about, you probably want to use a different kind of locking to protect the list of ready-to-run processes.
I assume locking while I iterate is legal
It is perfectly legal (assuming of course that your mutual exclusion scheme is bug-free). In fact, it's required. If another thread were allowed to, for example, remove a node from the list while you were reading it, you could end up dereferencing a deleted node.
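As an illustration, iterating a kernel linked list under a single, global mutex might look roughly like this (a sketch with made-up names; in the scheduler context discussed above you would use a spinlock instead):

#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/types.h>

struct pid_node {
    pid_t pid;
    struct list_head entry;
};

static LIST_HEAD(pid_list);
static DEFINE_MUTEX(pid_list_lock);

/* Walk the list under the mutex and return the matching node, if any. */
static struct pid_node *find_pid(pid_t target)
{
    struct pid_node *pos, *found = NULL;

    mutex_lock(&pid_list_lock);
    list_for_each_entry(pos, &pid_list, entry) {
        if (pos->pid == target) {
            found = pos;
            break;
        }
    }
    mutex_unlock(&pid_list_lock);
    return found;
}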
Also, is it normal for the process to have a state of TASK_UNINTERRUPTIBLE while it holds a mutex lock?
No, not while it holds the lock if the process is currently running on a CPU. A mutex is available to user code. If holding a mutex made the process uninterruptible, that would mean that a process could hijack the system by simply locking a mutex and never releasing it. Now, you will find that the lock and unlock functions need to be uninterruptible on a single-core processor. However, it doesn't make sense to set the state for the process because it's actually the scheduler that must not be interrupted.
Example:
A thread finishes writing to a shared variable, then unlocks the mutex protecting it, but continues to use that variable's value (without changing it).
And immediately, another thread successfully locks that mutex and reads the shared variable.
From my (mis-)understanding, some things could be happening in this situation:
On the WRITER thread:
A compiler optimization could make the write occur only at some later point
The written value could be retained in the current CPU core's cache, and flushed to the memory at some later point
On the READER thread:
The value of the variable may have been read before the mutex lock(), and because of some compiler optimization or just the usual work of the CPU cache, still be considered "already read from memory" and thus, not fetched from the memory again.
Thus, the value we have here is not the updated one from the other thread.
Do the pthread mutex lock/unlock() functions execute any code to "flush" the current cache to memory, and anything else needed to make sure the current thread is synchronized with everything else (I cannot think of anything other than the cache), or is that just not needed (at least on all known architectures)?
Because if all the mutexes do is just what the name says - mutual exclusion on whatever they protect - then, if I have thousands of threads dealing with the same data and, from my algorithm's point of view, I already know that when one thread is using a variable no other thread will try to use it at the same time, does that mean I don't need a mutex? Or will my code be missing some low-level and architecture-specific method(s) implemented inside the pthread library to avoid the problems above?
The pthreads mutex lock and unlock functions are among the list of functions in POSIX "...that synchronize thread execution and also synchronize memory with respect to other threads". So yes, they do more than just interlock execution.
Whether or not they need to issue additional instructions to the hardware is of course architecture dependent (noting that almost every modern CPU architecture will at least happily reorder reads with respect to each other unless told otherwise), but in every case those functions must act as "compiler barriers" - that is, they ensure that the compiler won't reorder, coalesce or omit memory accesses in situations where it would otherwise be allowed to.
It is allowed to have multiple threads reading a shared value without mutual exclusion though - all you need to ensure is that both the writing and reading threads executed some synchronising function between the write and the read. For example, an allowable situation is to have many reading threads that defer reading the shared state until they have passed a barrier (pthread_barrier_wait()) and a writing thread that performs all its writes to the shared state before it passes the barrier. Reader-writer locks (pthread_rwlock_*) are also built around this idea.
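A rough sketch of that barrier pattern (all the names here are illustrative): the writer finishes its writes before pthread_barrier_wait(), and the readers only look at the shared value after it.

#include <pthread.h>
#include <stdio.h>

static int shared_value;
static pthread_barrier_t barrier;

static void *writer(void *arg)
{
    (void)arg;
    shared_value = 42;                /* all writes happen before the barrier */
    pthread_barrier_wait(&barrier);   /* synchronizes memory as well as execution */
    return NULL;
}

static void *reader(void *arg)
{
    (void)arg;
    pthread_barrier_wait(&barrier);   /* after this, the write above is visible */
    printf("read %d\n", shared_value);
    return NULL;
}

int main(void)
{
    pthread_t w, r1, r2;

    pthread_barrier_init(&barrier, NULL, 3);   /* one writer + two readers */
    pthread_create(&w, NULL, writer, NULL);
    pthread_create(&r1, NULL, reader, NULL);
    pthread_create(&r2, NULL, reader, NULL);
    pthread_join(w, NULL);
    pthread_join(r1, NULL);
    pthread_join(r2, NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}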
I am currently learning about fork() and execv() and I had a question regarding the efficiency of the combination.
I was shown the following standard code:
pid = fork();
if (pid < 0) {
    //handle fork error
}
else if (pid == 0) {
    execv("son_prog", argv_son);
    _exit(127);   //only reached if execv fails
}
else {
    //do father code
}
I know that fork() clones the entire process (copying the entire heap, etc.) and that execv() replaces the current address space with that of the new program. With this in mind, doesn't it make it very inefficient to use this combination? We are copying the entire address space of a process and then immediately overwriting it.
So my question:
What is the advantage that is achieved by using this combo (instead of some other solution) that makes people still use this, even though we have waste?
What is the advantage that is achieved by using this combo (instead of some other solution) that makes people still use this even though we have waste?
You have to create a new process somehow. There are very few ways for a userspace program to accomplish that. POSIX used to have vfork() alongside fork(), and some systems may have their own mechanisms, such as Linux-specific clone(), but since 2008, POSIX specifies only fork() and the posix_spawn() family. The fork + exec route is more traditional, is well understood, and has few drawbacks (see below). The posix_spawn family is designed as a special-purpose substitute for use in contexts that present difficulties for fork(); you can find details in the "Rationale" section of its specification.
This excerpt from the Linux man page for vfork() may be illuminating:
Under Linux, fork(2) is implemented using copy-on-write pages, so the only penalty incurred by fork(2) is the time and memory required to duplicate the parent’s page tables, and to create a unique task structure for the child. However, in the bad old days a fork(2) would require making a complete copy of the caller’s data space, often needlessly, since usually immediately afterwards an exec(3) is done. Thus, for greater efficiency, BSD introduced the vfork() system call, which did not fully copy the address space of the parent process, but borrowed the parent’s memory and thread of control until a call to execve(2) or an exit occurred. The parent process was suspended while the child was using its resources. The use of vfork() was tricky: for example, not modifying data in the parent process depended on knowing which variables are held in a register.
(Emphasis added)
Thus, your concern about waste is not well-founded for modern systems (not limited to Linux), but it was indeed an issue historically, and there were indeed mechanisms designed to avoid it. These days, most of those mechanisms are obsolete.
Another answer states:
However, in the bad old days a fork(2) would require making a complete copy of the caller’s data space, often needlessly, since usually immediately afterwards an exec(3) is done.
Obviously, one person's bad old days are a lot younger than others remember.
The original UNIX systems did not have the memory for running multiple processes, and they did not have an MMU for keeping several processes in physical memory, ready to run, at the same logical address space: they swapped processes that weren't currently running out to disk.
The fork system call was almost entirely the same as swapping out the current process to disk, except for the return value and for not replacing the remaining in-memory copy by swapping in another process. Since you had to swap out the parent process anyway in order to run the child, fork+exec was not incurring any overhead.
It's true that there was a period of time when fork+exec was awkward: when there were MMUs that provided a mapping between logical and physical address space but page faults did not retain enough information for copy-on-write and a number of other virtual-memory/demand-paging schemes to be feasible.
This situation was painful enough, not just for UNIX, that hardware page-fault handling was adapted to become "replayable" pretty fast.
Not any longer. There's something called COW (Copy On Write): only when one of the two processes (parent/child) tries to write to a shared page is it copied.
In the past:
The fork() system call copied the address space of the calling process (the parent) to create a new process (the child).
The copying of the parent's address space into the child was the most expensive part of the fork() operation.
Now:
A call to fork() is frequently followed almost immediately by a call to exec() in the child process, which replaces the child's memory with a new program. This is what the shell typically does, for example. In this case, the time spent copying the parent's address space is largely wasted, because the child process will use very little of its memory before calling exec().
For this reason, later versions of Unix took advantage of virtual memory hardware to allow the parent and child to share the memory mapped into their respective address spaces until one of the processes actually modifies it. This technique is known as copy-on-write. To do this, on fork() the kernel would copy the address space mappings from the parent to the child instead of the contents of the mapped pages, and at the same time mark the now-shared pages read-only. When one of the two processes tries to write to one of these shared pages, the process takes a page fault. At this point, the Unix kernel realizes that the page was really a "virtual" or "copy-on-write" copy, and so it makes a new, private, writable copy of the page for the faulting process. In this way, the contents of individual pages aren't actually copied until they are actually written to. This optimization makes a fork() followed by an exec() in the child much cheaper: the child will probably only need to copy one page (the current page of its stack) before it calls exec().
It turns out all those COW page faults are not at all cheap when the process has a few gigabytes of writable RAM. They're all going to fault once even if the child has long since called exec(). Because the child of fork() is no longer allowed to allocate memory even for the single-threaded case (you can thank Apple for that one), arranging to call vfork()/exec() instead is hardly more difficult now.
The real advantage to the vfork()/exec() model is you can set the child up with an arbitrary current directory, arbitrary environment variables, and arbitrary fs handles (not just stdin/stdout/stderr), an arbitrary signal mask, and some arbitrary shared memory (using the shared memory syscalls) without having a twenty-argument CreateProcess() API that gets a few more arguments every few years.
It turned out the "oops, I leaked handles being opened by another thread" gaffe from the early days of threading was fixable in userspace without process-wide locking thanks to /proc. The same would not be possible in the giant CreateProcess() model without a new OS version, and convincing everybody to call the new API.
So there you have it. An accident of design ended up far better than the directly designed solution.
It's not that expensive (relative to spawning a process directly), especially with copy-on-write forks like you find in Linux, and it's kind of elegant for:
when you really just want to fork off a clone of the current process (I find this to be very useful for testing)
for when you need to do something just before loading the new executable
(redirect file descriptors, play with signal masks/dispositions, uids, etc.)
POSIX now has posix_spawn that effectively allows you to combine fork-and-exec (possibly more efficiently than fork+exec; if it is more efficient, it'll usually be implemented through some cheaper but less robust fork (clone/vfork) followed by exec), but the way it achieves #2 is through a ton of relatively messy options, which can never be as complete, powerful and clean as just allowing you to run arbitrary code just before the new process image is loaded.
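Purely as an illustration (the program, file name and arguments are made up), a posix_spawn() call that uses file actions to redirect the child's stdout might look like this:

#include <spawn.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>

extern char **environ;

int main(void)
{
    pid_t pid;
    char *argv[] = { "ls", "-l", NULL };
    posix_spawn_file_actions_t actions;
    int err;

    posix_spawn_file_actions_init(&actions);
    /* equivalent of open()+dup2() between fork() and exec() */
    posix_spawn_file_actions_addopen(&actions, 1, "ls.out",
                                     O_WRONLY | O_CREAT | O_TRUNC, 0644);

    err = posix_spawn(&pid, "/bin/ls", &actions, NULL, argv, environ);
    if (err != 0)
        fprintf(stderr, "posix_spawn: %s\n", strerror(err));
    else
        waitpid(pid, NULL, 0);

    posix_spawn_file_actions_destroy(&actions);
    return err != 0;
}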
A process created by exec() et al. will inherit its file handles from the parent process (including stdin, stdout, stderr). If the parent changes these after calling fork() but before calling exec(), then it can control the child's standard streams.
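A small sketch of that idea (the file name and command are made up): redirect the child's stdout to a file between fork() and exec().

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid = fork();

    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {                                /* child */
        int fd = open("child.out", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); _exit(127); }
        dup2(fd, STDOUT_FILENO);                   /* child's stdout now goes to the file */
        close(fd);
        execlp("echo", "echo", "hello from the child", (char *)NULL);
        _exit(127);                                /* only reached if exec fails */
    }
    waitpid(pid, NULL, 0);                         /* parent */
    return 0;
}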
I'm writing a basic UNIX program that involves processes sending messages to each other. My idea to synchronize the processes is to simply have an array of flags to indicate whether or not a process has reached a certain point in the code.
For example, I want all the processes to wait until they've all been created. I also want them to wait until they've all finished sending messages to each other before they begin reading their pipes.
I'm aware that a process performs a copy-on-write operation when it writes to a previously defined variable.
What I'm wondering is, if I make an array of flags, will the pointer to that array be copied, or will the entire array be copied (thus making my idea useless).
I'd also like any tips on inter-process communication and process synchronization.
EDIT: The processes are writing to each other process' pipe. Each process will send the following information:
typedef struct MessageCDT {
    pid_t destination;
    pid_t source;
    int num;
} Message;
So, just the source of the message and some random number. Then each process will print out the message to stdout, something along the lines of "process 20 received 5724244 from process 3".
Unix processes have independent address spaces. This means that the memory in one is totally separate from the memory in another. When you call fork(), you get a new copy of the process. Immediately on return from fork(), the only thing different between the two processes is fork()'s return value. All of the data in the two processes are the same, but they are copies. Updating memory in one cannot be known by the other, unless you take steps to share the memory.
There are many choices for interprocess communication (IPC) in Unix, including shared memory, semaphores, pipes (named and unnamed), sockets, message queues and signals. If you Google these things you will find lots to read.
In your particular case, trying to make several processes wait until they all reach a certain point, I might use a semaphore or shared memory, depending on whether there is some master process that started them all or not.
If there is a master process that launches the others, then the master could setup the semaphore with a count equal to the number of processes to synchronize and then launch them. Each child could then decrement the semaphore value and wait for the semaphore value to reach zero.
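For instance, a rough sketch of that master-process scheme with a System V semaphore (the names and child count are made up; with semop(), an operation value of 0 means "wait until the semaphore's value is zero"):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/wait.h>

#define NCHILDREN 4   /* made-up number of processes to synchronize */

union semun { int val; struct semid_ds *buf; unsigned short *array; };

static void barrier_op(int semid, short op)
{
    struct sembuf sb = { 0, op, 0 };
    if (semop(semid, &sb, 1) == -1) { perror("semop"); exit(1); }
}

int main(void)
{
    union semun arg;
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (semid == -1) { perror("semget"); return 1; }

    arg.val = NCHILDREN;
    semctl(semid, 0, SETVAL, arg);     /* count of processes to wait for */

    for (int i = 0; i < NCHILDREN; i++) {
        if (fork() == 0) {
            barrier_op(semid, -1);     /* "I have arrived" */
            barrier_op(semid, 0);      /* block until the value reaches zero */
            printf("child %d past the barrier\n", i);
            _exit(0);
        }
    }
    for (int i = 0; i < NCHILDREN; i++)
        wait(NULL);
    semctl(semid, 0, IPC_RMID);        /* clean up the semaphore set */
    return 0;
}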
If there is no master process, then I might create a shared memory segment that contains a count of processes and a flag for each process. But when you have two or more processes using shared memory, then you also need some kind of locking mechanism (probably a semaphore again) to ensure that two processes do not try to update the shared memory simultaneously.
Keep in mind that reading a pipe that nobody is writing to will block the reader until data appears. I don't know what your processes do, but perhaps that is synchronization enough? One other thing to consider: if you have multiple processes writing to a given pipe, their data may become interleaved if the writes are larger than PIPE_BUF. The value and location of this macro are system dependent.
The entire array of flags will seem to be copied. It will not actually be copied until one process or another writes to it of course. But that's an implementation detail and transparent to the individual processes. As far as each process is concerned, they each get a copy of the array.
There are ways to make this not happen. You can use mmap with the MAP_SHARED option for the memory used for your flags. Then each sub-process will share the same region of memory. There's also Posix shared memory (which I, BTW, think is an awful hack). To find out about Posix shared memory, look at the shm_overview(7) man page.
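A minimal sketch of that mmap approach (the array size and the meaning of the flags are made up for illustration):

#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define NPROCS 4   /* made-up number of child processes */

int main(void)
{
    /* a shared, anonymous mapping set up before fork() stays genuinely shared */
    int *flags = mmap(NULL, NPROCS * sizeof(int),
                      PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (flags == MAP_FAILED) { perror("mmap"); return 1; }

    for (int i = 0; i < NPROCS; i++) {
        if (fork() == 0) {
            flags[i] = 1;     /* visible to the parent and all siblings */
            _exit(0);
        }
    }
    for (int i = 0; i < NPROCS; i++)
        wait(NULL);
    for (int i = 0; i < NPROCS; i++)
        printf("flag %d = %d\n", i, flags[i]);
    return 0;
}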
But using memory in this way isn't really a good idea. On multi-core systems it's not always the case that when one process (or thread) writes to an area of shared memory, all other processes will see the value written right away. Frequently the value will hang out for a while in the L2 cache and not be immediately flushed.
If you want to communicate using shared memory, you will have to use mutexes or the C++11 atomic operations to ensure that writes are properly seen by the other processes.
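One way to do that (a sketch, with made-up names) is to put a pthread mutex in the shared mapping itself and mark it PTHREAD_PROCESS_SHARED, so that locking and unlocking also provide the memory synchronization between processes:

#include <pthread.h>
#include <sys/mman.h>

struct shared_state {
    pthread_mutex_t lock;
    int value;
};

/* Map the state and initialize a process-shared mutex in it; call this once,
   before fork(), and both processes then lock s->lock around s->value. */
static struct shared_state *create_shared_state(void)
{
    pthread_mutexattr_t attr;
    struct shared_state *s = mmap(NULL, sizeof *s,
                                  PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (s == MAP_FAILED)
        return NULL;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&s->lock, &attr);
    pthread_mutexattr_destroy(&attr);
    return s;
}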
Just a quick question: if I clone a process, the PID of the cloned process is the same, yes? fork() creates a child process where the PID differs, but everything else is the same. vfork() creates a child process with the same PID. exec() works to change a process currently in execution to something else.
Am I correct in all of these statements ?
Not quite. If you clone a process via fork/exec, or vfork/exec, you will get a new process id. fork() will give you the new process with a new process id, and exec() replaces that process with a new process, but maintaining the process id.
From here:
The vfork() function differs from fork() only in that the child process can share code and data with the calling process (parent process). This speeds cloning activity significantly at a risk to the integrity of the parent process if vfork() is misused.
Neither fork() nor vfork() keeps the same PID, although clone() can in one scenario (*a). They are all different ways to achieve roughly the same end, the creation of a distinct child.
clone() is like fork() but there are many things shared by the two processes and this is often used to enable threading.
vfork() is a variant of clone in which the parent is halted until the child process exits or executes another program. It's more efficient in those cases since it doesn't involve copying page tables and such. Basically, everything is shared between the two processes for as long as it takes the child to load another program.
Contrast that last option with the normal copy-on-write where memory itself is shared (until one of the processes writes to it) but the page tables that reference that memory are copied. In other words, vfork() is even more efficient than copy-on-write, at least for the fork-followed-by-immediate-exec use case.
But, in most cases, the child has a different process ID to the parent.
*a Things become tricky when you clone() with CLONE_THREAD. At that stage, the processes still have different identifiers but what constitutes the PID begins to blur. At the deepest level, the Linux scheduler doesn't care about processes, it schedules threads.
A thread has a thread ID (TID) and a thread group ID (TGID). The TGID is what you get from getpid().
When a thread is cloned without CLONE_THREAD, it's given a new TID and it also has its TGID set to that value (i.e., a brand new PID).
With CLONE_THREAD, it's given a new TID but the TGID (hence the reported process ID) remains the same as the parent so they really have the same PID. However, they can distinguish themselves by getting the TID from gettid().
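You can see this from user space with a quick sketch (my own; older glibc has no gettid() wrapper, so it goes through syscall()): every pthread reports the same getpid() (the TGID) but a different thread ID.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

static void *report(void *arg)
{
    (void)arg;
    /* older glibc has no gettid() wrapper, so use syscall() directly */
    printf("pid (TGID) = %ld, tid = %ld\n",
           (long)getpid(), (long)syscall(SYS_gettid));
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, report, NULL);
    pthread_create(&t2, NULL, report, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    report(NULL);   /* in the main thread the tid equals the pid */
    return 0;
}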
There's quite a bit of trickery going on there with regard to parent process IDs and delivery of signals (both to the threads within a group and the SIGCHLD to the parent), all of which can be examined in the clone() man page.
It deserves some explanation. And it's as simple as rain.
Consider this. A program has to do some things at the same time. Say your program prints "hello world!" each second, until somebody enters "hello, Mike"; then, each second, it prints that string instead, waiting for John to change it in the future.
How do you write this the standard way? In your program, which basically prints "hello", you must create another branch that waits for user input.
You create two processes, one outputting those strings, and another one waiting for user input. And the only way to create a new process in UNIX was calling the system call fork(), like this:
ret = fork();
if (ret > 0)       { /* parent, continue waiting */ }
else if (ret == 0) { /* child */ }
else               { /* fork failed */ }
This scheme posed numerous problems. The user enters "Mike" but you have no simple way to pass that string to the parent process so that it'd be able to print that, because each process has its own view of memory that isn't shared with the child.
When the processes are created by fork(), each one receives a copy of the memory existing at that moment, and if that memory really changes later, the mapping that was identical for those memory segments will be changed at once (it's called a copy-on-write mechanism).
Other things to share between the child and the parent are, for example, open file descriptors, descriptors of shared memory, input/output stuff, etc., that also wouldn't survive after fork().
So. The very fork() call had to be extended to include shared memory/signals etc. But how? This was the idea behind clone(). That call takes a flag indicating what exactly you would share with the child. For example, the memory, the signal handlers, etc. And if you call this with flag=0, this will be identical to fork(), up to the args they take. And when POSIX pthreads are created, that flag will reflect the attributes you have indicated in pthread_attr.
From the kernel point of view, there's no difference between the processes created this way, and no special semantics to differentiate the "processes". The kernel does not even know what that "thread" is; it creates a new process, but it simply combines it as belonging to the process group of the parent that called it, taking care of what that process may do. So, you have different processes (that share the same pid) combined in a process group, each assigned a different "TID" (that starts from the PID of the parent).
Note that clone() does exactly that. You may pass it whatever you need (as a matter of fact, the old vfork() call will do). Are you going to share memory? Handlers? You may tune everything; just be sure you don't clash with the pthreads library, which is written right around this very call.
An important thing: the kernel version is quite outrageous; it expects just 2 out of 4 parameters to be passed, the user stack and the options.
Since the PID is a unique identifier for a process, there's no way to have two distinct processes with the same PID.
Threads (which have the same visible 'pid') are implemented with the clone() call. When the flag CLONE_THREAD is supplied, the new process (a 'thread') shares the Thread Group Identifier (TGID) with its creator process. getpid actually returns the TGID.
See the clone manpage for more details.
In summary the real PID, as seen by the kernel is always different. The visible PID is the same for threads.