Memory Model of Process and thread - multithreading

All:
I am pretty new to operating systems, and most beginner tutorials about processes and threads only discuss their relationship from a conceptual perspective. Could someone use a simple example to illustrate how a process is created and laid out in memory, how threads are created inside a process and laid out in memory, and how a task actually gets executed?
I also wonder: is it true that the process itself is not the working unit, that it is the threads which actually run the task, and that the process is just a namespace-like container holding all the threads and shared variables?
Thanks

Related

Can we say that a thread is a process?

Can we say that any individual thread (which is defined as an instance of a process) is a process itself?
No. Threads and processes are fundamentally different things. A "thread" is a context of execution which takes a sequence of computational steps. A "process" is a container that typically consists of things like a view of memory, file descriptors, and so on and can contain one or more threads.
These concepts sometimes get confused because many systems in the past had a one-to-one correspondence between threads and processes, that is, each process had exactly one thread. As a result, they called the thing that got scheduled for execution a "process".
Later, when support for processes with more than one thread was added, that meant creating more than one thing that gets scheduled for execution, and those were called "processes". This has mostly been cleaned up, but you will still see systems, code, and papers from that era that do not quite align with modern usage because of these kinds of transitions.
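To make the container-versus-execution-context distinction concrete, here is a minimal Python sketch (the names `run_threads` and `report` are made up for this illustration). Every thread reports the same process ID because they all live inside one process, yet each has its own thread identifier and its own local variables, and they all append to one shared list. The barrier only keeps all threads alive at the same moment, since the identifier of a finished thread may be reused.

```python
import os
import threading

def run_threads(n=3):
    """Start n threads in this process and collect what each one observes."""
    seen = []                        # one list, shared by every thread
    seen_lock = threading.Lock()     # protects the shared list
    all_alive = threading.Barrier(n) # keeps every thread alive until all have reported

    def report(index):
        # 'index' and 'record' are locals: they live on this thread's own stack.
        record = (index, os.getpid(), threading.get_ident())
        with seen_lock:
            seen.append(record)
        all_alive.wait()

    threads = [threading.Thread(target=report, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen

if __name__ == "__main__":
    for index, pid, tid in sorted(run_threads()):
        print(f"thread {index}: pid={pid} thread_id={tid}")
```

The process ID printed on every line is identical, while the thread IDs differ: one container, several contexts of execution.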

(Python3) Can I spawn a single/many child process(es) inside a thread of a multithreaded program?

I have a use case, where a program spawns multiple threads, viz. one for network communication, one for modifying a couple of JSON files, another for querying and writing to a database. These are spawned in multiple threads because all of them are I/O bound tasks.
The code for the network communication thread, the JSON file handler, and the database handler will be written by me. The database handling can be significantly optimized if I use multiple processes, as I have a multi-core machine.
I want to understand, from a Python perspective, how spawning multiple processes inside a thread would work (if it works at all).
After some more searching, I found a page which closely answers my own question.
As described in this post, it is not a good idea to start a process from a thread. Mutexes that are held at the moment of the fork are duplicated in the child process, with no way to release them there. There are also many data races that can occur.
However, I like Solomon's idea, posted in the comments on my question, and I will try to go ahead with it, or maybe change my architecture.
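The duplicated-mutex problem mentioned above can be reproduced with a small sketch, assuming a POSIX system where `os.fork()` is available (the function name `child_can_take_lock` is invented for this example). A worker thread holds a lock while the main thread forks. Only the forking thread exists in the child, so the copied lock stays locked there and the child's attempt to acquire it times out. Recent Python versions warn about forking a multithreaded process, which is why the warning is silenced here.

```python
import os
import threading
import warnings

def child_can_take_lock(timeout=0.5):
    """Fork while another thread holds a lock; report whether the child can acquire it."""
    lock = threading.Lock()
    holding = threading.Event()
    release = threading.Event()

    def worker():
        with lock:              # the worker owns the lock during the fork
            holding.set()
            release.wait()

    t = threading.Thread(target=worker)
    t.start()
    holding.wait()              # make sure the lock is really held

    with warnings.catch_warnings():
        warnings.simplefilter("ignore", DeprecationWarning)
        pid = os.fork()
    if pid == 0:
        # Child: the worker thread was not copied, so nobody will ever unlock.
        got_it = lock.acquire(timeout=timeout)
        os._exit(0 if got_it else 1)

    _, status = os.waitpid(pid, 0)
    release.set()               # let the worker finish in the parent
    t.join()
    return os.WEXITSTATUS(status) == 0

if __name__ == "__main__":
    print("child acquired the lock:", child_can_take_lock())
```

Without the timeout the child would block forever, which is the deadlock people warn about.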

Forking vs Threading

I have used threading before in my applications and know its concepts well, but recently in my operating systems lecture I came across fork(), which seems similar to threading.
I searched Google for the difference between them and learned that:
A fork is nothing but a new process that looks exactly like the old (parent) process, but it is still a different process with a different process ID and its own memory.
Threads are lightweight processes which have less overhead.
But, there are still some questions in my mind.
When should you prefer fork() over threading and vice versa?
If I want to call an external application as a child, then should I use fork() or threads to do it?
While searching Google I found people saying it is a bad thing to call fork() inside a thread. Why would people want to call fork() inside a thread when the two do similar things?
Is it true that fork() cannot take advantage of a multiprocessor system because the parent and child processes don't run simultaneously?
The main difference between the forking and threading approaches is one of operating system architecture. Back in the days when Unix was designed, forking was an easy, simple mechanism that best answered mainframe and server-type requirements, and so it became popular on Unix systems. When Microsoft re-architected the NT kernel from scratch, it focused more on the threading model. As a result, there is still a notable difference today, with Unix systems being efficient at forking and Windows more efficient with threads. You can see this most notably in Apache, which uses the prefork strategy on Unix and thread pooling on Windows.
Specifically to your questions:
When should you prefer fork() over threading and vice versa?
On a Unix system where you're doing a far more complex task than just instantiating a worker, or you want the implicit security sandboxing of separate processes.
If I want to call an external application as a child, then should I use fork() or threads to do it?
If the child will do an identical task to the parent, with identical code, use fork. For smaller subtasks use threads. For separate external processes use neither, just call them with the proper API calls.
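For the last case, "the proper API calls" in Python would typically mean the standard `subprocess` module rather than a hand-written fork or a thread. A minimal sketch (the helper name `run_external` is made up here, and the child command is just the current Python interpreter so the example is self-contained):

```python
import subprocess
import sys

def run_external(argv):
    """Run an external program, wait for it, and return its stdout as text."""
    completed = subprocess.run(
        argv,
        capture_output=True,   # collect stdout and stderr instead of inheriting them
        text=True,             # decode bytes to str
        check=True,            # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout.strip()

if __name__ == "__main__":
    print(run_external([sys.executable, "-c", "print(6 * 7)"]))
```

The library takes care of creating the child and replacing its program image, so the calling code never touches fork() directly.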
While searching Google I found people saying it is a bad thing to call fork() inside a thread. Why would people want to call fork() inside a thread when the two do similar things?
Not entirely sure, but forking a multithreaded process is risky: under POSIX only the thread that calls fork() is duplicated in the child, so any lock held by another thread at that moment stays locked in the child with nobody left to release it.
Is it true that fork() cannot take advantage of a multiprocessor system because the parent and child processes don't run simultaneously?
This is false, fork creates a new process which then takes advantage of all features available to processes in the OS task scheduler.
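A small way to see that parent and child are both live, independently scheduled processes is a round trip over two pipes. This is only a sketch for POSIX systems (`ping_pong` is a made-up name). It shows that neither process has to finish before the other runs; whether they land on different cores is up to the OS scheduler.

```python
import os

def ping_pong(message=b"ping"):
    """Send a message to a forked child and return the child's reply."""
    to_child_r, to_child_w = os.pipe()
    to_parent_r, to_parent_w = os.pipe()

    pid = os.fork()
    if pid == 0:
        # Child: wait for the parent's message, answer in upper case, exit.
        os.close(to_child_w)
        os.close(to_parent_r)
        received = os.read(to_child_r, 64)
        os.write(to_parent_w, received.upper())
        os._exit(0)

    # Parent: keeps running while the child is alive and waiting.
    os.close(to_child_r)
    os.close(to_parent_w)
    os.write(to_child_w, message)
    reply = os.read(to_parent_r, 64)
    os.waitpid(pid, 0)
    os.close(to_child_w)
    os.close(to_parent_r)
    return reply

if __name__ == "__main__":
    print(ping_pong())
```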
A forked process is often called a heavyweight process, whereas a thread is called a lightweight process.
The following are the differences between them:
A forked process is a child of the process that created it, whereas the threads inside one process are siblings.
A forked process gets its own copy of the parent's data, heap, and stack (made lazily on systems with copy-on-write), whereas a thread shares code, data, and heap with the other threads of its process but has its own stack.
Switching between processes always involves the OS kernel; switching between user-level threads does not, although kernel-level threads are still scheduled by the OS.
Creating multiple processes is resource intensive, whereas creating multiple threads is less so.
Each process runs independently in its own address space, whereas one thread can read and write another thread's data.
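The sharing difference in the list above can be observed directly. In this sketch (POSIX only, with the invented name `thread_vs_fork`), a thread's write to a dictionary is visible to the rest of the process, while a forked child's write lands in the child's own copy and the parent never sees it.

```python
import os
import threading

def thread_vs_fork():
    """Return what the parent sees after a thread writes, then after a child writes."""
    data = {"written_by": "nobody"}

    # A thread runs inside the same address space, so this write is shared.
    t = threading.Thread(target=lambda: data.update(written_by="thread"))
    t.start()
    t.join()
    after_thread = data["written_by"]

    # A forked child works on its own copy of the parent's memory.
    pid = os.fork()
    if pid == 0:
        data["written_by"] = "child"   # changes only the child's copy
        os._exit(0)
    os.waitpid(pid, 0)
    after_fork = data["written_by"]

    return after_thread, after_fork

if __name__ == "__main__":
    print(thread_vs_fork())   # ('thread', 'thread')
```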
Thread and process lecture
fork() spawns a new copy of the process, as you've noted. What isn't mentioned above is the exec() call which often follows. This replaces the program running in the current process with a new executable, and as such fork()/exec() is the standard means of spawning a new program from an existing process.
For example, that's how your shell invokes a program from the command line. You specify your program (ls, say) and the shell forks and then execs ls.
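The fork-then-exec sequence a shell performs can be sketched in a few lines of Python on a POSIX system. The function name `fork_exec_wait` is made up for this example, and exit code 127 for a failed exec simply follows a common shell convention.

```python
import os
import sys

def fork_exec_wait(argv):
    """Run argv the way a shell would: fork, exec in the child, wait in the parent."""
    pid = os.fork()
    if pid == 0:
        # Child: replace this process image with the requested program.
        try:
            os.execvp(argv[0], argv)   # returns only if the exec fails
        finally:
            os._exit(127)              # reached only on failure
    # Parent: block until the child terminates, then extract its exit code.
    # (Assumes the child exits normally rather than being killed by a signal.)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)

if __name__ == "__main__":
    code = fork_exec_wait([sys.executable, "-c", "import sys; sys.exit(7)"])
    print("child exited with", code)
```

After a successful exec the child has the same process ID it got from fork(), but it is running a different program.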
Note that this operates at a very different level from threading. Threading runs multiple lines of execution intra-process. Forking is a means of creating new processes.
As #2431234123412341234123 said, on Linux, thanks to COW, processes are not much heavier than threads, and the choice boils down to how you use them. COW (copy on write) means that a memory page of the forked process gets copied only when one of the processes writes to it; until then, the OS maps the page in both processes to the same physical memory.
As a programming example, say you have a big data structure in heap memory, a 2D array[2000000][100] of one-byte elements (200 MB), and the kernel's page size is 4 KB. When the process is forked, no new memory is allocated for this array. If one particular row (100 bytes) is changed (in either the parent or the child), only the corresponding page (4 KB, or 8 KB if the row straddles two pages) is copied and updated for the process that made the change.
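The arithmetic in this example can be checked with a short sketch. It assumes one-byte elements, a contiguous array that starts on a page boundary, and 4096-byte pages; the function names are invented for illustration.

```python
ROWS = 2_000_000
ROW_BYTES = 100        # 100 one-byte elements per row (assumption)
PAGE_BYTES = 4096      # assumed page size

def total_pages():
    """Number of pages the whole array occupies, rounded up."""
    total_bytes = ROWS * ROW_BYTES            # 200,000,000 bytes, about 200 MB
    return -(-total_bytes // PAGE_BYTES)      # ceiling division

def pages_copied_for_row(row):
    """Pages that must be copied when every byte of one row is rewritten."""
    first_page = (row * ROW_BYTES) // PAGE_BYTES
    last_page = (row * ROW_BYTES + ROW_BYTES - 1) // PAGE_BYTES
    return last_page - first_page + 1

if __name__ == "__main__":
    print(total_pages())             # 48829
    print(pages_copied_for_row(0))   # 1: bytes 0..99 sit inside page 0
    print(pages_copied_for_row(40))  # 2: bytes 4000..4099 straddle pages 0 and 1
```

So a single-row write after fork() costs at most two page copies out of roughly 48,829 pages, which is why the forked process stays cheap until it starts writing widely.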
The other portions of memory behave in forked processes much as they do with threads (the code is the same; registers and call stack are separate).
On Windows, as #Niels Keurentjes said, threads might be better from a performance point of view, but on Linux it depends more on the use case.

QProcess, QEventLoop - of any use for parallel-processing

I wonder whether I could use QEventLoop (or QProcess?) to parallelize multiple calls to the same function with Qt. What precisely is the difference from QtConcurrent or QThread? What, more precisely, are a process and an event loop? I read that QCoreApplication must call exec() as early as possible in main(), so I wonder how it differs from the main thread.
Could you point me to some good references on processes and threads with Qt? I went through the official docs and these things remain unclear.
Thanks and regards.
Process and thread are not Qt-specific concepts. You can search for "process vs. thread" anywhere for that distinction to be explained. For instance: What resources are shared between threads?
Though related concepts, spawning a new process is a more "heavyweight" form of parallelism than spawning a new thread within your existing process. Processes are protected from each other by default, while threads of execution within a process can read and write each other's memory directly. The protection you get from spawning processes comes at a greater run-time cost...and since independent processes can't read each other's memory, you have to share data between them using methods of inter-process communication.
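The inter-process communication requirement can be illustrated outside Qt. The following Python sketch (not Qt code; the names are made up, and the "fork" start method assumes a POSIX system) runs a computation in a child process. Because the child cannot write into the parent's memory, the result travels back through a queue.

```python
import multiprocessing

def _square_all(numbers, results):
    # Runs in the child process; the only way back to the parent is the queue.
    results.put([n * n for n in numbers])

def squares_in_child(numbers):
    """Compute squares in a separate process and return them via IPC."""
    ctx = multiprocessing.get_context("fork")   # assumption: POSIX platform
    results = ctx.Queue()
    child = ctx.Process(target=_square_all, args=(numbers, results))
    child.start()
    answer = results.get()   # read before join so the child is never blocked on a full pipe
    child.join()
    return answer

if __name__ == "__main__":
    print(squares_in_child([1, 2, 3]))   # [1, 4, 9]
```

With threads the worker could simply append to a list owned by the caller; the queue, pickling, and extra process are the run-time cost paid for isolation.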
Odds are that you want threads, because they're simpler to use in a case where one is writing all the code in a program. Given all the complexities in multithreaded programming, I'd suggest looking at a good book or reading some websites to start with. See: What are some good resources for learning threaded programming?
But if you want to dive in and just get a feel for how threading in Qt looks, you can spend time looking at the examples:
http://qt-project.org/doc/qt-4.8/examples-threadandconcurrent.html
QtConcurrent is an abstraction library that makes it easier to implement some kinds of parallel programming patterns. It's built on top of the QThread abstractions, and there's nothing it can do that you couldn't code yourself by writing to QThread directly. But it might make your code easier to write and less prone to errors.
As for an event loop...that is merely a generic term for how any given thread of execution in your program waits for work items to process, processes them, and can decide when it is no longer needed. If a thread's job were merely to start up, do some math, and exit...then it wouldn't need an event loop. But starting and stopping a thread takes time and churns resources. So typically threads live for longer periods of time, and have an event loop that knows how to wait for events it needs to respond to.
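Stripped of any framework, an event loop of the kind described above is just a long-lived thread that blocks on a queue, handles each item, and exits when told to. A minimal Python sketch of the idea (this is not how Qt implements QEventLoop; all names are invented for illustration):

```python
import queue
import threading

_STOP = object()   # sentinel meaning "no more events, shut down"

def event_loop(inbox, handled):
    """Wait for events, process each one, and return when asked to stop."""
    while True:
        event = inbox.get()          # blocks until something arrives
        if event is _STOP:
            break                    # the loop decides it is no longer needed
        handled.append(event.upper())

def run_demo(events):
    inbox = queue.Queue()
    handled = []
    worker = threading.Thread(target=event_loop, args=(inbox, handled))
    worker.start()                   # one thread start, reused for every event
    for event in events:
        inbox.put(event)
    inbox.put(_STOP)
    worker.join()
    return handled

if __name__ == "__main__":
    print(run_demo(["click", "keypress"]))   # ['CLICK', 'KEYPRESS']
```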
If you build on top of QtConcurrent, you won't have to worry about an event loop in your worker threads because they are managed automatically in a thread pool. The word count example is pretty simple to see:
http://qt-project.org/doc/qt-4.8/qtconcurrent-wordcount-main-cpp.html
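For readers who do not have Qt at hand, the same map-then-reduce shape can be sketched with Python's standard thread pool. This is only an analogy and not the Qt API; the function names are made up. The pool owns the worker threads, so the calling code never starts, stops, or loops a thread itself.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words(text):
    """Map step: count the words in one piece of text."""
    return Counter(text.split())

def count_words_parallel(texts, workers=4):
    """Run the map step on a thread pool, then reduce the partial counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_words, texts):
            total.update(partial)    # reduce step, done in the calling thread
    return total

if __name__ == "__main__":
    print(count_words_parallel(["a b a", "b c"]))
```

In CPython the threads in this sketch take turns on CPU-bound work because of the global interpreter lock, so it demonstrates the programming pattern rather than a speed-up.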

Why not use a full list of runnable threads, as opposed to just threads that are runnable but not running?

I'm considering the concepts behind multiprocessing, and I'm trying to come up with some reason why a ready list is used that contains all runnable threads that aren't running, as opposed to a list of all runnable threads with the head of the data structure being the running thread(s)?
Thanks for your opinions.
EDIT: Let me clarify. As far as I know, thread packages use a ready list to identify those processes that are ready to run, while the running process is identified by a separate variable. Why don't they just include the running processes in the ready list data structure with the running thread at the head of the structure, making the thread package all inclusive. Would multiprocessing cause problems in this design scheme?
Because a thread can only run on one processor (core) at a time. The list (queue, really) of threads that are ready to run is used primarily by the scheduler when it's looking for what thread it should run; if a thread is already running on one CPU, it can't be run on another CPU at the same time, so the scheduler does not want to look at it (at that time -- sometime later when it's not running and eligible to run again, it will care about it again...)
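A toy model makes the point visible. In this sketch (a simplified round-robin scheduler invented for illustration, not taken from any real thread package), a thread is removed from the ready queue as soon as a CPU starts running it, so a second CPU looking for work can never pick the same thread.

```python
from collections import deque

class Scheduler:
    """Ready queue of runnable-but-not-running threads plus one running slot per CPU."""

    def __init__(self, thread_names, cpus):
        self.ready = deque(thread_names)
        self.running = {cpu: None for cpu in range(cpus)}

    def dispatch(self, cpu):
        """Preempt whatever this CPU is running and start the next ready thread."""
        previous = self.running[cpu]
        if previous is not None:
            self.ready.append(previous)   # still runnable, goes to the back of the queue
        self.running[cpu] = self.ready.popleft() if self.ready else None
        return self.running[cpu]

if __name__ == "__main__":
    s = Scheduler(["A", "B", "C"], cpus=2)
    print(s.dispatch(0), s.dispatch(1), list(s.ready))   # A B ['C']
```

If running threads stayed at the head of the list, every dispatch would first have to skip over them, and with several CPUs there would be several such heads to skip.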

Resources