How does the kernel separate threads from processes


Suppose I have a browser process like Firefox with pid = 123. Firefox has 5 open tabs, each running in a separate thread, so in total it has 5 threads.
I want to know in depth how the kernel separates the process into threads for execution: is that done in struct task_struct or in thread_info?
struct task_struct is the task descriptor in the task list.
Where does struct task_struct contain a reference or a link to these five threads?
Does the struct thread_struct of a process like Firefox contain a reference to all 5 threads,
OR
is each thread treated like a process inside the Linux kernel?

Unlike Windows, Linux does not have an implementation of "threads" in the kernel. The kernel gives us what are sometimes called "lightweight processes", which are a generalization of the concepts of "processes" and "threads", and can be used to implement either.
It may be confusing when you read kernel code and see things like thread_struct on the one hand, and pid (process ID) on the other. In reality, both are one and the same. Don't be confused by the terminology.
Each lightweight process has a completely different thread_info and task_struct (with embedded thread_struct). You seem to think that the task_struct of one lightweight process should hold pointers to the task_structs of the other (userspace) "threads" in the same (userspace) "process". This is not the case: there is no per-process object that owns a list of its threads. Apart from bookkeeping links such as the group_leader pointer, each "thread" is a separate process inside the kernel, and the scheduler deals with each one separately.
Linux has a system call called clone which is used to create new lightweight processes. When you call clone, you must provide various flags which indicate what will be shared between the new process and the existing process. They can share their address space, or they can each have a different address space. They can share their open files, or they can each have their own list of open files. They can share their signal handlers, or they can each have their own signal handlers. They can be in the same "thread group", or they can be in different thread groups. And so on...
Although "threads" and "processes" are the same thing in Linux, you can implement what we normally think of as "processes" by using clone to create processes which do not share their address space, open files, signal handlers, etc.
You can also implement what we normally think of as "threads" by using clone to create processes which DO share their address space, open files, signal handlers, etc.
If you look at the definition of task_struct, you will find that it has pointers to other structs such as mm_struct (address space), files_struct (open files), sighand_struct (signal handlers), and so on. When you clone a new "process", all of these structs will be copied. When you clone a new "thread", these structs will be shared between the new and old task_structs -- they will both point to the same mm_struct, the same files_struct, and so on. Either way, you are just providing different flags to clone to tell it what to copy, and what to share.
I just mentioned "thread groups" above, so you might wonder about that. In short, each "thread" in a "process" has its own PID, but they all share the same TGID (thread group ID). The TGIDs are all equal to the PID of the first program thread. Userspace "PIDs", like those shown in ps, or in /proc, are actually "TGIDs" in the kernel. Naturally, clone has a flag to determine whether a new lightweight process will have a new TGID (thus putting it in a new "thread group") or not.
UNIX processes also have "parents" and "children". There are pointers in a Linux task_struct which implement the parent-child relationships. And, as you might have guessed, clone has a flag to determine what the parent of a new lightweight process will be. It can either be the process which called clone, OR the parent of the process which called clone. Can you figure out which is used when creating a "process", and which is used when creating a "thread"?
Look at the manpage for clone; it will be very educational. Also try strace on a program which uses pthreads to see clone in use.
(A lot of this was written from memory; others should feel free to edit in corrections as necessary)

Related

Process Control Block , Process Descriptor in Linux and task_struct?

I am having trouble understanding the difference between a Process Control Block and Process Descriptor in Linux?
I have seen both of these structures referred to as a task_struct, and the terms seem to be used interchangeably - what is the difference between the two?
Many thanks for your help!

Neither of those terms ("Process Control Block" or "Process Descriptor") is considered a "term of art" in Linux kernel development. Of course, there is no official Linux kernel glossary, so people are free to call things whatever makes sense to them.
In contrast, however, task_struct is a specific C structure that is used by the Linux kernel to maintain state about a task. A task in Linux corresponds roughly to a thread.
Each user process has at least one thread so each process maps to one or more task_structs. More particularly, a process is one or more tasks that happen to share certain resources -- file descriptors, address space / memory map, signal handling, process and process group IDs, etc. Each thread in a process has its own individual version of certain other resources: registers/execution context, scheduling parameters, and so forth.
It's quite common for a process to have only a single thread. In that case, you could consider a process to be represented by a single task_struct.

Fork a fresh Linux process with all attributes reset. Attributes are fds, signal handlers, and everything else in the task_struct

TL;DR How do I fork a fresh process without inheriting all the attributes (file descriptors, memory maps, working directory, fancy new kernel features, ...) from the parent?
The two traditional ways on a Linux system to create new processes are fork and clone. The libc wrappers are very thin wrappers around the raw syscalls fork and clone. Having a look at all other syscalls (disregarding vfork), no other system call seems to spawn a new process.
For this question, let's define a new process as the creation of a new task_struct in the kernel.
Question 1) Is it correct that fork, vfork, and clone are the only system calls which create a new process? (Considering kernel 4.x)
A process has attributes, namely everything which is stored in the task_struct. I selected the name "attributes" as it is used in the execve man page. Attributes include file descriptors, signal handlers, seccomp context, capabilities, memory mappings, the complete virtual memory setup ... It is a decade-old problem that Linux programs may leak file descriptors into their children. But since fork and clone copy the task_struct of the parent, more than just file descriptors are leaked: namely everything.
Let's define a fresh process as a new process where all process attributes are not inherited from the parent but sane default values are chosen. For example, pwd is the home of the user, no filedescriptors except 0,1,2 are inherited, there are no mapped memory areas, a fresh stack is used, ...
Question 2) Is it possible to get a fresh process on Linux with just one system call?
Question 3) Is it possible to get a fresh process on some BSD or POSIX system with just one system call?
The intention behind my question is that I don't want to leak anything to my child. But Linux adds new attributes to the task_struct from time to time. I don't want to clean up in userspace because that cleanup would depend on the kernel version. Also, I want to create a fresh process from a high-level language, for example Haskell, where the runtime (which is not under my control) has polluted the parent process with many attributes. This also depends on the version of the language runtime. In short, I don't know which attributes are used and which attributes need to be cleaned in userspace.
My idea of a fresh process sounds dangerous from a security point of view: Linux relies on the concept that seccomp filters and capability bounding sets are always passed to the children. That means, a process cannot increase its permissions by creating new process. A fresh process would subvert this security concept.
Question 4) What is the best way to get a fresh process on Linux (possibly with some cleanup in the userspace)?
Question 5) Are there different answers to Question 4 depending on whether I want to execve in the new fresh process?

Question 4) What is the best way to get a fresh process on Linux (possibly with some cleanup in the userspace)?
One way is to create a special process at the very beginning of the program (before opening files, changing signal handlers and so on). Then you can use this process as a factory, asking it to create new processes for you.
Because the factory process is created at the very beginning, it is a "fresh" process, and the processes it creates will also be "fresh".
But you cannot sidestep the security aspects this way. Then again, the whole point of those security mechanisms is that they cannot be sidestepped.
By the way, the Linux kernel itself uses a special thread ("kthreadd") to create kernel threads.
A disadvantage of this approach is that all new processes will have the same start function. But you want the stack of the new process to be "fresh", don't you?
Question 5) Are there different answers to Question 4 depending on whether I want to execve in the new fresh process?
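execve() by itself creates a nearly fresh process: the address space and stack are replaced and caught signals are reset to their defaults. What it still inherits are the open file descriptors, along with attributes such as the working directory, the signal mask and ignored signals, and the credentials. Descriptors are closed automatically at exec time only if they carry the close-on-exec flag (O_CLOEXEC or FD_CLOEXEC), and that flag has to be set on each descriptor; I know of no simple way to close all of them automatically in the child.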

Is linux fork insecure

I was reading this article
It says that fork creates a copy of the calling process, and the fork man page says so too:
"The entire virtual address space of the parent is replicated in the child."
Does this mean the child process can read all of my process's memory state?
Can the child process dump the entire parent memory state, so that it can be analysed to extract the parent's variables and their values?
But the article also says that two processes cannot read each other's data.
So I am confused.

Yes, the child process can read a pristine copy of all of the parent process state (but when writing, only its own address space is affected) just after a fork(2). However, most of the time, the child would eventually use execve(2) to start a new program, and that would "clear" and replace the copy of the original parent's address space (by a fresh address space). Notice that execve and mmap(2) (see also shared memory in shm_overview(7)...) are the common ways to change the address space in virtual memory of some process (and how the kernel handles page faults).
The kernel uses (and sets up the MMU for) lazy copy on write machinery to make the child's address space a copy of the parent's one, so fork is quite efficient in practice.
Read also proc(5), then type the following commands:
cat /proc/self/maps
cat /proc/$$/maps
sudo cat /proc/1/maps
and understand what is happening
Read also the wikipage on fork, and the Advanced Linux Programming book.
There is no insecurity, because if the child is changing some data (e.g. a variable, a heap or stack location, ...) it does not affect the parent process.
If the program doing the fork is keeping some password in some virtual memory location, the child process would be able to read that location as long as it is executing the same program. Once the child did a successful execve (which is the common situation, and what any shell is doing) the previous address space is gone and replaced by a new one, described in the ELF executable of that exec-ed program.
There is no "lie" or "insecurity" in that Unix model. But contrary to several other operating systems, Unix & POSIX have two separate system calls for creating a new process (fork) and executing a new program (execve). Other systems might have some single spawn operation mixing the two abilities. posix_spawn is often implemented by a mixture of fork & execve (and so are system(3) & popen(3), also using waitpid(2) & /bin/sh....).
The advantage of that Unix approach (having separated fork & execve) is that after the fork and before the execve in the child you can do a lot of useful things (e.g. closing useless file descriptors, ...). Operating Systems not separating the two features may need to have a quite complex spawning primitive.
There are rare occasions where a fork is not followed by some execve. Some MPI implementations might do that, and you might also do that. But then you know that you are able to read all the parent's address space through your own copy, so what you felt was an insecurity becomes a useful feature. In the old days you had the obsolete vfork, which blocked the parent. There is no need to use it today; actually, fork is often implemented through clone(2), which you should not use directly in practice (see futex(7)...) but only through POSIX pthreads. But thinking of fork as a magical cloner of your process might help.
When coding (even in C) don't forget to test against failure of fork and of execve. See perror(3)
PS. the fork syscall is as difficult to understand as the multiverse idea. Both are "forking" the time!

When you call fork(), the new process gets access to a copy of the parent process's memory (i.e. variables, file descriptors etc).
This is in contrast with threads, where all threads share the same memory space, i.e. a variable modified in one thread is seen with its new value by all other threads.
So if, after forking, the parent process modifies its memory, the child process will not see that change: because the memory has been copied, the child process's memory is not altered.

Thread control block in linux

What is the structure used for saving thread state like PC, SP and registers during thread context switch in linux? The equivalent of TCB in freebsd. If possible please point to the source file here.
Note that PCB itself is not enough, as we have PC, SP etc. per thread not per process.

It's actually task_struct. In Linux, a task can be a thread, a process, or something in between. A thread is just the name you give to a task that shares most things (VMA's, file descriptors, etc...) with other tasks.
This is much in line with the idea that a thread is just a particular kind of process, and can be handled via the same functions, etc... Plan 9's rfork() and Linux's clone() let you create a process with a customizable level of sharing, so you end up using the same machinery to create processes and threads.

Perhaps you want setcontext and friends (but your code won't be very portable, and tricky to get right)?
Or are you talking from inside the kernel? Then perhaps task_struct could be what you look for??

If I have a process, and I clone it, is the PID the same?

Just a quick question, if I clone a process, the PID of the cloned process is the same, yes ? fork() creates a child process where the PID differs, but everything else is the same. Vfork() creates a child process with the same PID. Exec works to change a process currently in execution to something else.
Am I correct in all of these statements ?

Not quite. If you clone a process via fork/exec or vfork/exec, you will get a new process ID. fork() gives you a new process with a new process ID, and exec() then replaces the program running in that process while keeping the process ID.
From here:
"The vfork() function differs from fork() only in that the child process can share code and data with the calling process (parent process). This speeds cloning activity significantly at a risk to the integrity of the parent process if vfork() is misused."

Neither fork() nor vfork() keep the same PID although clone() can in one scenario (*a). They are all different ways to achieve roughly the same end, the creation of a distinct child.
clone() is like fork(), but many things can be shared between the two processes, and it is often used to implement threading.
vfork() is a variant of clone in which the parent is halted until the child process exits or executes another program. It's more efficient in those cases since it doesn't involve copying page tables and such. Basically, everything is shared between the two processes for as long as it takes the child to load another program.
Contrast that last option with the normal copy-on-write where memory itself is shared (until one of the processes writes to it) but the page tables that reference that memory are copied. In other words, vfork() is even more efficient than copy-on-write, at least for the fork-followed-by-immediate-exec use case.
But, in most cases, the child has a different process ID to the parent.
*a Things become tricky when you clone() with CLONE_THREAD. At that stage, the processes still have different identifiers but what constitutes the PID begins to blur. At the deepest level, the Linux scheduler doesn't care about processes, it schedules threads.
A thread has a thread ID (TID) and a thread group ID (TGID). The TGID is what you get from getpid().
When a thread is cloned without CLONE_THREAD, it's given a new TID and it also has its TGID set to that value (i.e., a brand new PID).
With CLONE_THREAD, it's given a new TID but the TGID (hence the reported process ID) remains the same as the parent so they really have the same PID. However, they can distinguish themselves by getting the TID from gettid().
There's quite a bit of trickery going on there with regard to parent process IDs and delivery of signals (both to the threads within a group and the SIGCHLD to the parent), all which can be examined from the clone() man page.

It deserves some explanation, and it's simple as rain.
Consider this. A program has to do several things at the same time. Say your program prints "hello world!" each second until somebody enters "hello, Mike"; then it prints that string each second, waiting for John to change it in the future.
How do you write this the standard way? In your program, which basically prints "hello", you must create another branch that waits for user input.
You create two processes, one outputting those strings and another one waiting for the user input. The only way to create a new process in UNIX was calling the system call fork(), like this:
pid_t ret = fork();
if (ret < 0) { perror("fork"); }            /* fork failed */
else if (ret > 0) { /* parent, continue waiting */ }
else { /* child */ }
This scheme posed numerous problems. The user enters "Mike", but you have no simple way to pass that string to the parent process so that it can print it, because each process has its own view of memory that isn't shared with the other.
When a process is created by fork(), parent and child each see a copy of the memory as it existed at that moment. The pages are initially shared, and as soon as either side writes to one, that process gets its own private copy of the page (this is the copy-on-write mechanism).
The same goes for other things you might want to share between the child and the parent, for example the table of open file descriptors, shared memory descriptors, input/output state, etc.: fork() hands the child a copy, so changes made after the fork (such as a newly opened file) are not seen by the other process.
So the fork() call had to be generalized to allow sharing memory, signal handlers, etc. But how? This was the idea behind clone(). That call takes flags indicating exactly what you want to share with the child: for example the memory, the signal handlers, and so on. If you call it with no sharing flags, it behaves much like fork(), apart from the arguments it takes. When POSIX pthreads are created, the library passes flags that share the address space, open files and signal handlers and place the new task in the caller's thread group.
From the kernel's point of view, there is no difference between tasks created this way, and no special semantics to differentiate the "processes". The kernel does not even know what a "thread" is: it creates a new process and simply records it as belonging to the thread group of the caller. So you have different processes (that share the same PID) combined in a thread group, each with a different "TID" (the first one's TID is equal to the PID).
That is exactly what clone() does. You may pass it whatever you need (as a matter of fact, the old vfork() behaviour can be expressed this way too). Are you going to share memory? Handlers? You may tune everything, just be sure you don't clash with the pthreads library, which is written around this very call.
An important point: the raw kernel interface is quite different from the libc wrapper. It takes no function pointer or argument for the child; essentially it expects just the options and the user stack, with the remaining parameters optional.

Since a PID is a unique identifier for a process, there's no way to have two distinct processes with the same PID.

Threads (which have the same visible 'pid') are implemented with the clone() call. When the flag CLONE_THREAD is supplied, the new process (a 'thread') shares the Thread Group Identifier (TGID) with its creator process. getpid actually returns the TGID.
See the clone manpage for more details.
In summary the real PID, as seen by the kernel is always different. The visible PID is the same for threads.
