Process-specific data in kernel - linux

Say I have some process calling a file device operation like read. Before this read, the process also called a syscall (defined by me), providing me with some information relevant to the read (and possibly to other future reads done by this process). What is the best way of achieving this sort of information flow in the kernel? Is there any good way to store process-specific information other than making some pid-indexed list?
I'd like the syscall information stored in the kernel to be inherited by children of that process too. Would it be possible to achieve that without (somehow) traversing the process child-parent tree (which wouldn't give me the inheritance I want anyway, because after forking I don't want changes in the parent to affect the child)?

Just as the kernel has the global init_task variable, which is the task_struct of the initial task and the head of the task list, accessible anywhere in kernel space (but not from user space), you can add a variable of your own that your system call sets to the appropriate value and that your read (or other appropriate) methods then access.

Related

Is there a way to force a process to share address space from another process?

This is a purely theoretical question. As far as I know, each process has a different address space, and the threads inside one process share the same memory space?
Is there a way, especially in some UNIX system, to change that behavior? To be more clear, to make two processes share the same address space?
Or to make two threads of the same process have different address spaces?
Yes. Search for gVisor or rump for examples of how to do this. The short story is that you start with a Mother process, which calls fork() to create new children. These children are managed by ptrace(), which isolates them from the kernel. The Mother process then manipulates the address space(s) of the children as it sees fit; making them identical is one option.
There is usually a bit of a bootstrapping trick involved, so when a child calls fork(), the Mother forks and execs a known binary (re: aspace layout), then proceeds to clone the original fork()er's aspace into the new one.
I think you should probably read this.
Shared memory — is memory that may be simultaneously accessed by multiple programs with an intent to provide communication among them or avoid redundant copies. Shared memory is an efficient means of passing data between programs.
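As a minimal sketch of the shared-memory idea (not of sharing a whole address space), the C function below maps one anonymous MAP_SHARED page before fork() so that a write by the child is visible to the parent. The function name shared_page_demo and the value 42 are made up for this illustration, and MAP_ANONYMOUS is assumed to be available (it is on Linux and the BSDs).

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on glibc in strict modes */
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: returns the value the parent reads from the shared
 * page after the child wrote to it, or -1 on error. */
int shared_page_demo(void)
{
    /* MAP_SHARED | MAP_ANONYMOUS: a page that stays shared across fork(). */
    int *shared = mmap(NULL, sizeof *shared, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED)
        return -1;
    *shared = 0;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(shared, sizeof *shared);
        return -1;
    }
    if (pid == 0) {
        *shared = 42;          /* child writes into the shared page */
        _exit(0);
    }

    if (waitpid(pid, NULL, 0) < 0) {
        munmap(shared, sizeof *shared);
        return -1;
    }
    int seen = *shared;        /* parent observes the child's write */
    munmap(shared, sizeof *shared);
    return seen;
}
```

Everything outside that one mapping is still private to each process, so this shares a region, not the address space.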

Time waste of execv() and fork()

I am currently learning about fork() and execv() and I had a question regarding the efficiency of the combination.
I was shown the following standard code:
pid = fork();
if (pid < 0) {
    // handle fork error
}
else if (pid == 0) {
    // son: replace this process image with son_prog
    execv("son_prog", argv_son);
    // reached only if execv failed
}
// do father code
I know that fork() clones the entire process (copying the entire heap, etc) and that execv() replaces the current address space with that of the new program. With this in mind, doesn't it make it very inefficient to use this combination? We are copying the entire address space of a process and then immediately overwrite it.
So my question:
What is the advantage that is achieved by using this combo (instead of some other solution) that makes people still use this, even though we have waste?
You have to create a new process somehow. There are very few ways for a userspace program to accomplish that. POSIX used to have vfork() alongside fork(), and some systems may have their own mechanisms, such as Linux-specific clone(), but since 2008, POSIX specifies only fork() and the posix_spawn() family. The fork + exec route is more traditional, is well understood, and has few drawbacks (see below). The posix_spawn family is designed as a special purpose substitute for use in contexts that present difficulties for fork(); you can find details in the "Rationale" section of its specification.
This excerpt from the Linux man page for vfork() may be illuminating:
Under Linux, fork(2) is implemented using copy-on-write pages, so the only penalty incurred by fork(2) is the time and memory required to duplicate the parent’s page tables, and to create a unique task structure for the child. However, in the bad old days a fork(2) would require making a complete copy of the caller’s data space, often needlessly, since usually immediately afterwards an exec(3) is done. Thus, for greater efficiency, BSD introduced the vfork() system call, which did not fully copy the address space of the parent process, but borrowed the parent’s memory and thread of control until a call to execve(2) or an exit occurred. The parent process was suspended while the child was using its resources. The use of vfork() was tricky: for example, not modifying data in the parent process depended on knowing which variables are held in a register.
(Emphasis added)
Thus, your concern about waste is not well-founded for modern systems (not limited to Linux), but it was indeed an issue historically, and there were indeed mechanisms designed to avoid it. These days, most of those mechanisms are obsolete.
Another answer states:
However, in the bad old days a fork(2) would require making a complete copy of the caller’s data space, often needlessly, since usually immediately afterwards an exec(3) is done.
Obviously, one person's bad old days are a lot younger than others remember.
The original UNIX systems did not have the memory for running multiple processes and they did not have an MMU for keeping several processes in physical memory ready-to-run at the same logical address space: they swapped out to disk any process that was not currently running.
The fork system call was almost entirely the same as swapping out the current process to disk, except for the return value and for not replacing the remaining in-memory copy by swapping in another process. Since you had to swap out the parent process anyway in order to run the child, fork+exec was not incurring any overhead.
It's true that there was a period of time when fork+exec was awkward: when there were MMUs that provided a mapping between logical and physical address space but page faults did not retain enough information for copy-on-write and a number of other virtual-memory/demand-paging schemes to be feasible.
This situation was painful enough, not just for UNIX, that page fault handling of the hardware was adapted to become "replayable" pretty fast.
Not any longer. With COW (copy-on-write), a shared page is copied only when one of the two processes (parent or child) tries to write to it.
In the past:
The fork() system call copied the address space of the calling process (the parent) to create a new process (the child).
The copying of the parent's address space into the child was the most expensive part of the fork() operation.
Now:
A call to fork() is frequently followed almost immediately by a call to exec() in the child process, which replaces the child's memory with a new program. This is what the shell typically does, for example. In this case, the time spent copying the parent's address space is largely wasted, because the child process will use very little of its memory before calling exec().
For this reason, later versions of Unix took advantage of virtual memory hardware to allow the parent and child to share the memory mapped into their respective address spaces until one of the processes actually modifies it. This technique is known as copy-on-write. To do this, on fork() the kernel would copy the address space mappings from the parent to the child instead of the contents of the mapped pages, and at the same time mark the now-shared pages read-only. When one of the two processes tries to write to one of these shared pages, the process takes a page fault. At this point, the Unix kernel realizes that the page was really a "virtual" or "copy-on-write" copy, and so it makes a new, private, writable copy of the page for the faulting process. In this way, the contents of individual pages aren't actually copied until they are actually written to. This optimization makes a fork() followed by an exec() in the child much cheaper: the child will probably only need to copy one page (the current page of its stack) before it calls exec().
It turns out all those COW page faults are not at all cheap when the process has a few gigabytes of writable RAM. They're all gonna fault once even if the child has long since called exec(). Because the child of fork() is no longer allowed to allocate memory even for the single threaded case (you can thank Apple for that one), arranging to call vfork()/exec() instead is hardly more difficult now.
The real advantage to the vfork()/exec() model is you can set the child up with an arbitrary current directory, arbitrary environment variables, and arbitrary fs handles (not just stdin/stdout/stderr), an arbitrary signal mask, and some arbitrary shared memory (using the shared memory syscalls) without having a twenty-argument CreateProcess() API that gets a few more arguments every few years.
It turned out the "oops I leaked handles being opened by another thread" gaffe from the early days of threading was fixable in userspace without process-wide locking, thanks to /proc. The same fix would not have been possible in the giant CreateProcess() model without a new OS version and convincing everybody to call the new API.
So there you have it. An accident of design ended up far better than the directly designed solution.
It's not that expensive (relative to spawning a process directly), especially with copy-on-write forks like you find in Linux, and it's kind of elegant:
1. when you really just want to fork off a clone of the current process (I find this to be very useful for testing)
2. when you need to do something just before loading the new executable (redirect file descriptors, play with signal masks/dispositions, uids, etc.)
POSIX now has posix_spawn, which effectively lets you combine fork and exec (possibly more efficiently than fork+exec; if it is more efficient, it'll usually be implemented through some cheaper but less robust fork (clone/vfork) followed by exec), but the way it achieves #2 is through a ton of relatively messy options, which can never be as complete and powerful and clean as just allowing you to run arbitrary code just before the new process image is loaded.
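For comparison, here is a small sketch of the posix_spawn route in C. It passes NULL for both the file actions and the attributes, so it shows only the plain "start a program and wait for it" case; the helper name spawn_and_wait is an assumption made for this example.

```c
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;   /* the caller's environment, handed to the child */

/* Hypothetical helper: starts `path` with `argv` via posix_spawn() and
 * returns its exit status, or -1 if spawning or waiting failed. */
int spawn_and_wait(const char *path, char *const argv[])
{
    pid_t pid;

    /* No file actions and no attributes: the child inherits the parent's
     * descriptors, signal mask, and so on unchanged. */
    int err = posix_spawn(&pid, path, NULL, NULL, argv, environ);
    if (err != 0)
        return -1;        /* posix_spawn returns an error number, not -1 */

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Anything that fork+exec code would do between the two calls has to be expressed here through the posix_spawn_file_actions_t and posix_spawnattr_t arguments instead of ordinary code.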
A program started by exec() et al. inherits its file handles (including stdin, stdout, stderr) from the process that called exec(). If the child changes these after fork() returns but before calling exec(), it controls the new program's standard streams.
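The "do something between fork() and exec()" step can be sketched in C as follows: the child points its stdout at a pipe with dup2() before calling execv(), and the parent reads what the program printed. The function name capture_stdout and the buffer handling are assumptions for this illustration, and error handling is kept to a minimum.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: runs argv[0] with stdout redirected into a pipe and
 * stores up to cap-1 bytes of its output in buf (NUL-terminated).
 * Returns the number of bytes stored, or -1 on error. */
ssize_t capture_stdout(char *const argv[], char *buf, size_t cap)
{
    int fds[2];
    if (cap == 0 || pipe(fds) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: rewire stdout after fork() and before exec. */
        close(fds[0]);
        if (fds[1] != STDOUT_FILENO) {
            if (dup2(fds[1], STDOUT_FILENO) < 0)
                _exit(127);
            close(fds[1]);
        }
        execv(argv[0], argv);
        _exit(127);            /* reached only if execv failed */
    }

    close(fds[1]);             /* parent keeps only the read end */
    size_t total = 0;
    ssize_t n;
    while (total < cap - 1 &&
           (n = read(fds[0], buf + total, cap - 1 - total)) > 0)
        total += (size_t)n;
    buf[total] = '\0';
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return (ssize_t)total;
}
```

The executed program needs no knowledge of the pipe; it simply writes to file descriptor 1 as usual.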

Is linux fork insecure

I was reading this article
It says that fork creates a copy of the calling process, and the fork man page says so too:
The entire virtual address space of the parent is replicated in the child
Does this mean the child process can read all of my process's memory state?
Can the child process dump the entire parent memory state so that it can be analysed to extract the parent's variables and their values?
But the article also says that two processes cannot read each other's data.
So I am confused.
Yes, the child process can read a pristine copy of all of the parent process state (but when writing, only its own address space is affected) just after a fork(2). However, most of the time, the child would eventually use execve(2) to start a new program, and that would "clear" and replace the copy of the original parent's address space (by a fresh address space). Notice that execve and mmap(2) (see also shared memory in shm_overview(7)...) are the common ways to change the address space in virtual memory of some process (and how the kernel handles page faults).
The kernel uses (and sets up the MMU for) lazy copy on write machinery to make the child's address space a copy of the parent's one, so fork is quite efficient in practice.
Read also proc(5), then type the following commands:
cat /proc/self/maps
cat /proc/$$/maps
sudo cat /proc/1/maps
and understand what is happening
Read also the wikipage on fork, and the Advanced Linux Programming book.
There is no insecurity, because if the child is changing some data (e.g. a variable, a heap or stack location, ...) it does not affect the parent process.
If the program doing the fork is keeping some password in some virtual memory location, the child process would be able to read that location as long as it is executing the same program. Once the child did a successful execve (which is the common situation, and what any shell is doing) the previous address space is gone and replaced by a new one, described in the ELF executable of that exec-ed program.
There is no "lie" or "insecurity" in that Unix model. But contrary to several other operating systems, Unix & POSIX have two separate system calls for creating a new process (fork) and executing a new program (execve). Other systems might have some single spawn operation mixing the two abilities. posix_spawn is often implemented by a mixture of fork & execve (and so are system(3) & popen(3), also using waitpid(2) & /bin/sh....).
The advantage of that Unix approach (having separated fork & execve) is that after the fork and before the execve in the child you can do a lot of useful things (e.g. closing useless file descriptors, ...). Operating Systems not separating the two features may need to have a quite complex spawning primitive.
There are rare occasions where a fork is not followed by some execve. Some MPI implementations might do that, and you might also do that. But then you know that you are able to read all the parent's address space thru your own copy - so what you felt was an insecurity is becoming a useful feature. In the old days you had the obsolete vfork which blocked the parent. There is no need to use it today; actually, fork is often implemented thru clone(2) which you should not use directly in practice (see futex(7)...) but only thru POSIX pthreads. But thinking of fork as a magical cloner of your process might help.
When coding (even in C) don't forget to test against failure of fork and of execve. See perror(3)
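A minimal sketch of that advice in C, with every call checked and perror(3) used for reporting. The helper name run_program and the exit code 127 for a failed execv are conventions chosen for this example, not something the answer prescribes.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: runs `path` with `argv` and returns its exit status,
 * 127 if execv failed in the child, or -1 if fork/waitpid failed. */
int run_program(const char *path, char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return -1;
    }
    if (pid == 0) {
        execv(path, argv);
        perror("execv");       /* execv returns only on failure */
        _exit(127);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0) {
        perror("waitpid");
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The child uses _exit() rather than exit() after a failed execv so that it does not flush stdio buffers it inherited from the parent.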
PS. the fork syscall is as difficult to understand as the multiverse idea. Both are "forking" the time!
When you call fork(), the new process will get access to the copy of the parent process memory (i.e. variables, file descriptors etc).
This is in contrast with threads, where all threads share the same memory space, i.e. a variable modified in one thread will get a new value in all other threads.
So if, after forking, parent process modifies memory, the child process will not see that change - because the memory has been copied, the child process' memory would not get altered.
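The snapshot behaviour described here can be checked with a short C sketch: the parent changes a variable after fork(), then lets the child report what its own copy contains. The function name child_view_after_parent_write and the use of a pipe as a "go" signal are choices made for this illustration.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: returns the value the child sees in its copy of
 * `value` after the parent changed its own copy, or -1 on error. */
int child_view_after_parent_write(void)
{
    int value = 1;
    int go[2];                 /* pipe used only to order the two steps */
    if (pipe(go) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(go[0]);
        close(go[1]);
        return -1;
    }
    if (pid == 0) {
        char c;
        close(go[1]);
        /* Block until the parent has modified its copy. */
        if (read(go[0], &c, 1) != 1)
            _exit(255);
        _exit(value);          /* report the child's own copy */
    }

    close(go[0]);
    value = 2;                 /* parent's change, made after fork() */
    int sent = write(go[1], "x", 1) == 1;
    close(go[1]);

    int status;
    if (waitpid(pid, &status, 0) < 0 || !sent)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The child reports 1, the value at the time of the fork, even though it reads the variable only after the parent has set its own copy to 2.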

do_mmap_pgoff for other processes

In a linux kernel syscall, I want to map a region of memory in a similar manner as calling mmap from user mode. If I wanted to map the region for the current process, I could simply use do_mmap_pgoff. Instead, however, I want to map the region in a different process while running in kernel mode. do_mmap_pgoff assumes/knows it is mapping for the current process and does not allow for anything else.
What I am planning on doing is replicating do_mmap_pgoff to take extra arguments specifying the task_struct and mm_struct of whatever process I want to map. However, this is very undesirable as I must manually traverse through many functions in the kernel source and essentially make duplicates of those functions so that they no longer assume they are doing work on behalf of current.
Is there a better way to map memory in a process other than current while operating in kernel mode?
It's no surprise that those functions in kernel source assume that they change the mapping of the current process, and that it hasn't changed in the 20 years Linux exists. There's a reason why processes don't change memory mappings of other processes.
It's very "un-UNIXy".
If you elaborate on what you are trying to accomplish then perhaps people can suggest a more UNIX-y way for it.
Anyway, to focus on the question at hand, if you wouldn't like to perform hefty modifications to mm/* code, then I suggest you implement a workaround:
Find a context in which you can make your kernel code run in the context of the target process. For example, in a modular way - a /sys or /proc file. Or, in a non-modular way: modify a system call that is being called frequently, or another code path - for example the signal handling code.
Implement an "RPC": the source process can queue a request for the change of mapping. Then it can sleep until the target process enters that context and picks up the request, waking up the source process when it is done modifying its own mapping. This is effectively an emulation of a "remote" call to do_mmap_pgoff(), and it can be implemented using mechanisms exposed in linux/wait.h.

Access data from FUSE filesystem

Is there any way that I can access the data created by my FUSE filesystem process?
e.g.
in prefix_write() I store some data in memory and would like to access those data from another process.
Shared memory should work. But I'm looking for a more elegant solution, such as a custom field in fuse_operations that I access as a function from other processes. But as far as I know, the fields in fuse_operations need to be from POSIX, so it's probably impossible to do so. Please correct me if I'm wrong.
thanks
Is the other process that you are speaking of a process forked by another process? If yes, then it should be pretty easy to send data. Before forking, create a pipe and then fork, so that the fds returned by pipe() are inherited by the child process. You can then use these fds for data transfer; since a pipe carries data in one direction only, create two pipes if you need bi-directional transfer.
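A minimal C sketch of the pipe-before-fork approach, assuming the process that holds the data is the parent and the reader is its child. The function name send_to_child is hypothetical, and the child reports the number of bytes it received through its exit status only to keep the example short (so it is meaningful for messages up to 255 bytes).

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: the parent writes `msg` into a pipe created before
 * fork(); the child reads until EOF and exits with the byte count.
 * Returns that count, or -1 on error. */
int send_to_child(const char *msg)
{
    int fds[2];
    if (pipe(fds) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        char buf[64];
        ssize_t n;
        long total = 0;
        close(fds[1]);         /* child only reads */
        while ((n = read(fds[0], buf, sizeof buf)) > 0)
            total += n;
        _exit(total > 255 ? 255 : (int)total);
    }

    close(fds[0]);             /* parent only writes */
    size_t len = strlen(msg);
    ssize_t written = write(fds[1], msg, len);
    close(fds[1]);             /* signals EOF to the child */

    int status;
    if (waitpid(pid, &status, 0) < 0 || written != (ssize_t)len)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Each side closes the pipe end it does not use; otherwise the reader never sees end-of-file.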
If this is not your use case, can you explain why you want a foreign process to access another process's data?

Resources