What happens to allocated memory of other threads when forking - linux

I have a huge application that needs to fork itself at some point. The application is multithreaded and has about 200MB of allocated memory. What I want to do now, to ensure that the data allocated by the process won't get duplicated, is to start a new thread and fork inside of this thread. From what I have read, only the thread that calls fork will be duplicated, but what will happen to the allocated memory? Will that still be there? The purpose of this is to restart the application with other startup parameters: once it's forked, it will call main with my new parameters, thus hopefully getting a new process of the same program. Now before you ask: I cannot guarantee that the binary of that process will still be in the same place as when I started the process, otherwise I could just fork and exec what's in /proc/self/exe.

Threads are execution units inside the big bag of resources that a process is. A process is the whole thing that you can access from any thread in the process: all the threads, all the file descriptors, all the other resources. So memory is absolutely not tied to a thread, and forking from a thread has no useful effect. Everything still needs to be copied over since the point of forking is creating a new process.
That said, Linux has some tricks to make it faster. Copying 2 gigabytes worth of RAM is neither fast nor efficient. So when you fork, Linux actually gives the new process the same memory (at first), but it uses the virtual memory system to mark it as copy-on-write: as soon as one process needs to write to that memory, the kernel intercepts the write and allocates distinct memory so that the other process isn't affected.
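For concreteness, here is a minimal sketch in C (assuming POSIX threads; error handling omitted) of the point above: memory allocated by one thread is still visible in a child forked from another thread, because the memory belongs to the process, not to any particular thread.

    /* Build with: cc -pthread demo.c
     * Memory allocated by a worker thread is still there after fork(),
     * even though only the forking thread exists in the child. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static char *shared_data;            /* allocated by a worker thread */

    static void *allocator(void *arg) {
        (void)arg;
        shared_data = malloc(64);
        strcpy(shared_data, "allocated in another thread");
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, allocator, NULL);
        pthread_join(t, NULL);

        pid_t pid = fork();              /* only the calling thread is duplicated */
        if (pid == 0) {
            /* The child still sees the allocation; the pages are shared
             * copy-on-write until either process writes to them. */
            printf("child sees: %s\n", shared_data);
            _exit(0);
        }
        waitpid(pid, NULL, 0);
        return 0;
    }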

Related

Is a coroutine a kind of thread that is managed by the user-program itself (rather than managed by the kernel)?

In my opinion,
Kernel is an alias for a running program whose program text is in the kernel area and which can access all memory spaces;
Process is an alias for a running program whose program has an independent memory space in the user memory area. Which process gets the CPU is completely managed by the kernel;
Thread is an alias for a running program whose program text is in the memory space of a process and which completely shares that memory space with the other threads of the same process. Which thread gets the CPU is completely managed by the kernel;
Coroutine is an alias for a running program whose program text is in the memory space of a process. It is a user thread that the process itself (not the kernel) decides how to schedule, while the kernel is only responsible for allocating CPU resources to the process.
Since the process itself has no right to schedule the way the kernel does, coroutines can only be concurrent, not parallel.
Am I correct in saying the above?
process is an alias for a running program...
The modern way to think of a process is to think of it as a container for a collection of threads and the resources that those threads need to execute.
Every process (except for "zombie" processes that can exist in some systems) must have at least one thread. It also has a virtual address space, open file handles, and maybe sockets and other resources that are shared by the threads.
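As a small illustration of that resource sharing, here is a C sketch (assuming POSIX threads; /etc/hostname is just an arbitrary readable file): a file descriptor opened by one thread can be used from another, because open files belong to the process.

    /* Build with: cc -pthread fd_demo.c
     * A descriptor opened in one thread is usable from the main thread. */
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static int fd = -1;                  /* shared by all threads of the process */

    static void *opener(void *arg) {
        (void)arg;
        fd = open("/etc/hostname", O_RDONLY);   /* arbitrary readable file */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, opener, NULL);
        pthread_join(t, NULL);

        char buf[128];
        ssize_t n = read(fd, buf, sizeof buf - 1);   /* main thread reads it */
        if (n > 0) {
            buf[n] = '\0';
            printf("read via fd opened in another thread: %s", buf);
        }
        close(fd);
        return 0;
    }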
Thread is an alias for a running program...
The problem with saying that is, "running program" sounds too much like "process," and a thread is most definitely not a process. (E.g., a thread can only exist in a process.)
A computer scientist might tell you that a thread is one particular execution of the application's code. I like to think of a thread as an independent agent who executes the code.
coroutine...is a user thread...
I'm going to mostly leave that one alone. "Coroutine" seems to mean something different from the highly formalized, and not particularly useful, coroutines that I learned about more than forty years ago. What people call "coroutines" today seem to have something in common with what I call "green threads," but there are details of how and when and why they are used that I don't yet understand.
Green threads (a.k.a. "user mode threads") simply are threads that the kernel doesn't know about. They are pretty much just like the threads that the kernel does know about, except that the kernel scheduler never preempts them because, duh, it doesn't know about them. Context switches between green threads can only happen at specific points where the application allows it (e.g., by calling a yield() function, or by calling some library function that is a documented yield point).
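As a rough illustration of those explicit yield points, here is a minimal cooperative-switching sketch in C using the POSIX ucontext(3) routines (deprecated in newer POSIX but still provided by glibc); the stack size and loop counts are arbitrary choices.

    /* Two contexts hand control back and forth only at explicit swapcontext()
     * calls -- nothing ever preempts them, which is the green-thread model. */
    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, green_ctx;
    static char green_stack[64 * 1024];          /* stack for the "green thread" */

    static void green_body(void) {
        for (int i = 0; i < 3; i++) {
            printf("green thread: step %d\n", i);
            swapcontext(&green_ctx, &main_ctx);  /* explicit yield point */
        }
    }

    int main(void) {
        getcontext(&green_ctx);
        green_ctx.uc_stack.ss_sp = green_stack;
        green_ctx.uc_stack.ss_size = sizeof green_stack;
        green_ctx.uc_link = &main_ctx;           /* where to go if green_body returns */
        makecontext(&green_ctx, green_body, 0);

        for (int i = 0; i < 3; i++) {
            printf("main: resuming green thread\n");
            swapcontext(&main_ctx, &green_ctx);  /* cooperative context switch */
        }
        return 0;
    }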
kernel is an alias for a running program...
The kernel also is most definitely not a process.
I don't know every detail about every operating system, but the bulk of kernel code does not run independently of the applications that the kernel serves. It only runs when an application thread enters kernel mode by making a system call. The thread that runs the kernel code still belongs to the application process, but the code that determines the thread's behavior at that point is written or chosen by the kernel developers, not by the application developers.

What are the disadvantages of threads over processes?

-Interview Question
I was asked the disadvantages of threads, and in what scenarios we shouldn't use threads but use processes instead.
I couldn't think of much except invalid memory access in some cases.
Threads, spawned by the same process, all share the same memory. Processes all run in their own memory context.
In Linux (I don't know what the behavior under Windows is like) a newly spawned child process will usually receive a copy of certain parts of the parent process's memory context and is therefore more expensive memory-wise at runtime and CPU/MMU-wise at creation. Also, context switching - (off)loading the process from or to the CPU (this happens when a process or thread has nothing to do and is pushed to a queue in favor of processes or threads with actual work) - might be more expensive with a process.
On the other hand, processes might be much more secure, since their memory is isolated from the memory of their sibling processes.
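To make the memory point concrete, here is a small C sketch (assuming POSIX threads and fork(); the counter is purely illustrative): a write made by a thread is seen by the whole process, while a write made by a forked child stays in the child's own copy.

    /* Build with: cc -pthread share_demo.c */
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int counter = 0;

    static void *thread_body(void *arg) {
        (void)arg;
        counter = 42;                    /* threads share the process's memory */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, thread_body, NULL);
        pthread_join(t, NULL);
        printf("after thread: counter = %d\n", counter);   /* prints 42 */

        pid_t pid = fork();
        if (pid == 0) {                  /* child: gets its own copy-on-write copy */
            counter = 99;
            _exit(0);
        }
        waitpid(pid, NULL, 0);
        printf("after fork:   counter = %d\n", counter);   /* still 42 */
        return 0;
    }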

Node.js Clustering- Forking, how much memory is actually used?

http://stackabuse.com/setting-up-a-node-js-cluster/
says
" To be clear, forking in Node is very different than a POISIX fork in that it doesn't actually clone the current process, but it does start up a new V8 instance.
Although this is one of the easiest ways to multi-thread, it should be used with caution. Just because you're able to spawn 1,000 workers doesn't mean you should. Each worker takes up system resources, so only spawn those that are really needed. The Node docs state that since each child process is a new V8 instance, you need to expect a 30ms startup time for each and at least 10mb of memory per instance."
But https://nodejs.org/api/cluster.html says
"There is no routing logic in Node.js, or in your program, and no shared state between the workers. Therefore, it is important to design your program such that it does not rely too heavily on in-memory data objects for things like sessions and login."
If the workers (forked processes) aren't actually clones of the master process, then how is it that there is also no shared state?
I was under the impression that if the master process has a one gigabyte JSON string, then all the child processes would also have clones of that one gigabyte JSON string. So with two children there would be 3gb of memory used. What actually happens?
On Linux et al., fork() uses copy-on-write semantics, i.e. all of the memory pages of the forked process are shared (not copied), and only those pages that a process wants to modify are copied before the modification is done. So it's possible to use very little extra memory even if you have a lot of forked processes, provided the data you modify is close together, i.e. it touches a small number of actual memory pages.
See:
http://man7.org/linux/man-pages/man2/fork.2.html
https://en.wikipedia.org/wiki/Fork_(system_call)
https://en.wikipedia.org/wiki/Copy-on-write
http://obvious.services.net/2011/01/history-of-copy-on-write-memory.html
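If you want to see the copy-on-write behaviour for yourself, here is a rough, Linux-specific sketch in C: it reads the Private_Dirty figure from /proc/self/smaps_rollup (available since Linux 4.14; the 64 MiB buffer size is an arbitrary choice). Before the child writes, its private dirty footprint is small because the pages are still shared with the parent; after the write, the kernel has copied the pages and the figure grows by roughly the buffer size.

    /* Rough sketch of observing copy-on-write after fork() on Linux. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void print_private_dirty(const char *label) {
        FILE *f = fopen("/proc/self/smaps_rollup", "r");
        char line[256];
        while (f && fgets(line, sizeof line, f))
            if (strncmp(line, "Private_Dirty:", 14) == 0)
                printf("%s %s", label, line);
        if (f) fclose(f);
    }

    int main(void) {
        size_t size = 64u << 20;                 /* 64 MiB */
        char *buf = malloc(size);
        memset(buf, 1, size);                    /* parent touches every page */

        if (fork() == 0) {
            print_private_dirty("child before write:");  /* pages still shared */
            memset(buf, 2, size);                /* writing forces the kernel to copy */
            print_private_dirty("child after write: ");
            _exit(0);
        }
        wait(NULL);
        return 0;
    }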

What happens if a process keeps creating threads?

What happens if a process keeps creating threads especially when the number of threads exceeds the limit of the OS? What will Windows and Linux do?
If the threads aren't doing any work (i.e. you don't start them), then on Windows you're subject to resource limitations as pointed out in the blog post that Hans linked. A Linux system, too, will have some limit on the number of threads it can create; after all, your computer doesn't have infinite virtual memory, so at some point the call to create a thread is going to fail.
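As a rough illustration of that failure, here is a small C sketch (assuming POSIX threads) that keeps creating blocked threads until pthread_create() reports an error; the count at which it fails depends on ulimits, kernel settings such as threads-max, and available virtual memory.

    /* Build with: cc -pthread flood.c
     * Each new thread just blocks forever so that it keeps its resources. */
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static void *sleeper(void *arg) {
        (void)arg;
        pause();                         /* block indefinitely */
        return NULL;
    }

    int main(void) {
        unsigned long count = 0;
        for (;;) {
            pthread_t t;
            int err = pthread_create(&t, NULL, sleeper, NULL);
            if (err != 0) {
                fprintf(stderr, "pthread_create failed after %lu threads: %s\n",
                        count, strerror(err));
                return 1;
            }
            count++;
        }
    }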
If the threads are actually doing work, what usually happens is that the system starts thrashing. Each thread (including the program's main thread) gets a small timeslice (typically measured in tens of milliseconds), and then it gets swapped out for the next available thread. With so many threads, their combined working sets are large enough to occupy all available RAM, so every context switch requires that pages of the outgoing thread's working set be written out to swap (disk) and pages for the incoming thread be read back in. The system ends up spending more time doing context switches than actually running the threads.
The threads will continue to execute, but very very slowly, and eventually you will run out of virtual memory. However, it's likely that it would take an exceedingly long time to create that many threads. You would probably give up and shut the machine off.
Most often, a machine that's suffering from this type of thrashing acts exactly like a machine that's stuck in an infinite loop on all cores. Even pressing Control+Break (or similar) won't take effect immediately because the thread that's handling that signal has to be in memory and running in order to process it. And after the thread does respond to such a signal, it takes an exceedingly long time for it to terminate all of the threads and clean up virtual memory.

In Linux, when a process is about to be swapped or terminated, what state should its threads be in?

By swapped and terminated, I mean: the process is about to be swapped out to swap space, or terminated (by the OOM killer) to free up memory.
What algorithm does the Linux kernel follow?
For instance, Process A needs extra memory and Process B has been chosen to be swapped or killed (if swap space is already occupied), but Process B still has a blocked thread.
a.) Does Process B get swapped or killed regardless of the blocked thread?
b.) If not, how is this kind of case handled?
If my example is an unlikely case, any insights would be appreciated.
Yeah - you need to read up on paged virtual memory, as suggested by #CL. Processes are not swapped out in their entirety, and swapping != termination.
If the OS needs to terminate a process, either because of a specific API request or because of its OOM algorithm, the OS stops all its threads first. Blocked threads are easy to 'stop' because they are not running anyway - it's only necessary to change their state to ensure that they are never run again. Threads that are actually running on cores have to be stopped by means of an inter-core comms driver that can hardware-interrupt the cores running them. Once none of the threads are running, the resources allocated to the process, including all user-space memory, can be freed and the OS thread/process management structs released. The process then no longer exists.
