Prevent file descriptor inheritance during Linux fork()

How do you prevent a file descriptor from being copy-inherited across fork() system calls (without closing it, of course)?
I am looking for a way to mark a single file descriptor as NOT to be (copy-)inherited by children at fork(): something like FD_CLOEXEC, but for forks (an FD_DONTINHERIT flag, if you like). Has anybody done this? Or looked into it and has a hint for me on where to start?
Thank you
UPDATE:
I could use libc's __register_atfork
__register_atfork(NULL, NULL, fdcleaner, NULL)
to close the FDs in the child just before fork() returns. However, the FDs are still being copied first, so this sounds like a silly hack to me. The question is how to skip the dup()-ing of the unneeded FDs into the child in the first place.
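For reference, a minimal sketch of that idea using the portable pthread_atfork() (which __register_atfork() backs in glibc); the dont_inherit_fds table and install_fdcleaner() are hypothetical names used only for illustration:
#include <pthread.h>
#include <unistd.h>
/* hypothetical bookkeeping: descriptors the child should not keep */
extern int dont_inherit_fds[];
extern int dont_inherit_count;
static void fdcleaner(void)
{
    /* runs in the child right after fork(); the FDs were still copied,
       we merely close them before the child's code sees them */
    for (int i = 0; i < dont_inherit_count; i++)
        close(dont_inherit_fds[i]);
}
static void install_fdcleaner(void)
{
    pthread_atfork(NULL, NULL, fdcleaner); /* prepare, parent, child handlers */
}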
I'm thinking of some scenarios where a fcntl(fd, F_SETFD, FD_DONTINHERIT) would be needed:
fork() will copy an event FD (e.g. an epoll FD); sometimes this isn't wanted. FreeBSD, for example, marks kqueue() event FDs as being of KQUEUE_TYPE, and FDs of that type are explicitly skipped when the FD table is copied across fork(); if a child wants to use a kqueue FD, it must fork with a shared FD table.
fork() will copy 100k unneeded FDs just to spawn a child for some CPU-intensive task (suppose the need to fork() is probabilistically very low and the programmer doesn't want to maintain a pool of children for something that normally never happens).
Some descriptors we want to be copied (0, 1, 2); some (most of them?) not. I think full FD table duplication is here for historical reasons, but I am probably wrong.
How silly does this sound:
patch fcntl() to support a dontinherit flag on file descriptors (not sure whether the flag should be kept per-FD or in a per-table fd_set, the way the close-on-exec flags are kept)
modify dup_fd() in the kernel to skip copying dontinherit FDs, the same way FreeBSD does for kqueue FDs
Consider the program:
#include <stdio.h>
#include <unistd.h>
#include <err.h>
#include <stdlib.h>
#include <fcntl.h>
#include <time.h>
#include <sys/wait.h>

/* glibc-internal registration used below; declared here because no public
   header exposes it (pthread_atfork() is the portable equivalent) */
extern int __register_atfork(void (*prepare)(void), void (*parent)(void),
                             void (*child)(void), void *dso_handle);

static int fds[NUMFDS];
clock_t t1;

static void cleanup(int i)
{
    while (i-- > 0)
        close(fds[i]);
}

void clk_start(void)
{
    t1 = clock();
}

void clk_end(void)
{
    double tix = (double)clock() - t1;
    double secs = tix / CLOCKS_PER_SEC;
    printf("fork_cost(%d fds)=%fticks(%f seconds)\n", NUMFDS, tix, secs);
}

int main(int argc, char **argv)
{
    pid_t pid;
    int i;

    /* time the fork itself: the prepare handler runs right before fork(),
       the parent handler right after fork() returns in the parent */
    __register_atfork(clk_start, clk_end, NULL, NULL);

    for (i = 0; i < NUMFDS; i++) {
        fds[i] = open("/dev/null", O_RDONLY);
        if (fds[i] == -1) {
            cleanup(i);
            errx(EXIT_FAILURE, "open_fds:");
        }
    }

    pid = fork();
    if (pid < 0)
        errx(EXIT_FAILURE, "fork:");

    if (pid == 0) {
        cleanup(NUMFDS);
        exit(0);
    } else {
        wait(&i);
        cleanup(NUMFDS);
    }
    return 0;
}
Of course, this can't be considered a real benchmark, but anyhow:
root@pinkpony:/home/cia/dev/kqueue# time ./forkit
fork_cost(100 fds)=0.000000ticks(0.000000 seconds)
real 0m0.004s
user 0m0.000s
sys 0m0.000s
root@pinkpony:/home/cia/dev/kqueue# gcc -DNUMFDS=100000 -o forkit forkit.c
root@pinkpony:/home/cia/dev/kqueue# time ./forkit
fork_cost(100000 fds)=10000.000000ticks(0.010000 seconds)
real 0m0.287s
user 0m0.010s
sys 0m0.240s
root@pinkpony:/home/cia/dev/kqueue# gcc -DNUMFDS=100 -o forkit forkit.c
root@pinkpony:/home/cia/dev/kqueue# time ./forkit
fork_cost(100 fds)=0.000000ticks(0.000000 seconds)
real 0m0.004s
user 0m0.000s
sys 0m0.000s
forkit ran on a Dell Inspiron 1520 (Intel(R) Core(TM)2 Duo CPU T7500 @ 2.20GHz, 4GB RAM); average load = 0.00

If you fork with the purpose of calling an exec function, you can use fcntl with FD_CLOEXEC to have the file descriptor closed once you exec:
int fd = open(...);
fcntl(fd, F_SETFD, FD_CLOEXEC);
Such a file descriptor will survive a fork but not functions of the exec family.
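On Linux you can also request this atomically at open() time with the O_CLOEXEC flag (available since kernel 2.6.23), which avoids the race window between open() and fcntl() in threaded programs. A minimal sketch, using /dev/null purely as an example path:
#include <fcntl.h>
#include <unistd.h>
int open_cloexec_demo(void)
{
    /* descriptor is marked close-on-exec atomically at creation */
    int fd = open("/dev/null", O_RDONLY | O_CLOEXEC);
    return fd; /* survives fork(), but is closed across exec*() */
}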

No. Close them yourself, since you know which ones need to be closed.

There's no standard way of doing this to my knowledge.
If you're looking to implement it properly, probably the best way to do it would be to add a system call to mark the file descriptor as close-on-fork, and to intercept the sys_fork system call (syscall number 2) to act on those flags after calling the original sys_fork.
If you don't want to add a new system call, you might be able to get away with intercepting sys_ioctl (syscall number 54) and just adding a new command to it for marking a file descriptor close-on-fork.
Of course, if you can control what your application is doing, then it might be better to maintain user-level tables of all file descriptors you want closed on fork and call your own myfork instead. This would fork, then go through the user-level table closing those file descriptors so marked.
You wouldn't have to fiddle around in the Linux kernel then; that is probably only necessary if you don't have control over the fork (say, if a third-party library is doing the fork() calls).
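A minimal user-space sketch of that last approach; the registration table and the myfork() name are made up for illustration, and error handling is omitted:
#include <sys/types.h>
#include <unistd.h>
#define MAX_NOINHERIT 1024
static int noinherit_fds[MAX_NOINHERIT];
static int noinherit_count;
/* mark a descriptor as "close in the child after fork" */
void fd_dontinherit(int fd)
{
    if (noinherit_count < MAX_NOINHERIT)
        noinherit_fds[noinherit_count++] = fd;
}
/* fork wrapper: the child closes every marked descriptor before returning */
pid_t myfork(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        for (int i = 0; i < noinherit_count; i++)
            close(noinherit_fds[i]);
    }
    return pid;
}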

Related

Linux kernel: invoke a callback function in user space from kernel space

I am writing a Linux user-space application in which I want to invoke a registered callback function in user space from kernel space.
I.e., an interrupt arrives on a GPIO pin (switch press event) and a registered function gets called in user space.
Is there any method available to do this?
Thanks
I found the code below after a lot of digging and it works perfectly for me.
Handling interrupts from GPIO
In many cases, a GPIO input can be configured to generate an interrupt when it
changes state, which allows you to wait for the interrupt rather than polling in
an inefficient software loop. If the GPIO bit can generate interrupts, the file edge
exists. Initially, it has the value none, meaning that it does not generate interrupts.
To enable interrupts, you can set it to one of these values:
• rising: Interrupt on rising edge
• falling: Interrupt on falling edge
• both: Interrupt on both rising and falling edges
• none: No interrupts (default)
You can wait for an interrupt using the poll() function with POLLPRI as the event. If
you want to wait for a rising edge on GPIO 48, you first enable interrupts:
#echo 48 > /sys/class/gpio/export
#echo rising > /sys/class/gpio/gpio48/edge
Then, you use poll() to wait for the change, as shown in this code example:
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <poll.h>

int main(void)
{
    int f;
    struct pollfd poll_fds[1];
    int ret;
    char value[4];
    int n;

    /* poll() for edges works on the "value" attribute of the exported GPIO */
    f = open("/sys/class/gpio/gpio48/value", O_RDONLY);
    if (f == -1) {
        perror("Can't open gpio48");
        return 1;
    }

    poll_fds[0].fd = f;
    poll_fds[0].events = POLLPRI | POLLERR;

    while (1) {
        printf("Waiting\n");
        ret = poll(poll_fds, 1, -1);
        if (ret > 0) {
            /* rewind and re-read the attribute after each event */
            lseek(f, 0, SEEK_SET);
            n = read(f, &value, sizeof(value));
            printf("Button pressed: read %d bytes, value=%c\n", n, value[0]);
        }
    }
    return 0;
}
You have to implement a handler in a kernel module that signals user space through, e.g., a character device; user space can then access it by polling or via ioctl() calls. It seems that this is the only way at the moment.
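For completeness, a rough sketch of what that kernel side could look like: a misc character device whose poll() method wakes up when a GPIO interrupt fires. Everything here is an illustrative assumption (the gpiodemo name, GPIO 48, the rising-edge trigger, the legacy gpio_to_irq() API); GPIO setup (gpio_request(), direction) and most error handling are omitted:
#include <linux/atomic.h>
#include <linux/fs.h>
#include <linux/gpio.h>
#include <linux/interrupt.h>
#include <linux/miscdevice.h>
#include <linux/module.h>
#include <linux/poll.h>

static DECLARE_WAIT_QUEUE_HEAD(gpiodemo_wq);
static atomic_t gpiodemo_event = ATOMIC_INIT(0);
static int gpiodemo_irq;

/* interrupt handler: note the event and wake any process sleeping in poll() */
static irqreturn_t gpiodemo_isr(int irq, void *dev_id)
{
    atomic_set(&gpiodemo_event, 1);
    wake_up_interruptible(&gpiodemo_wq);
    return IRQ_HANDLED;
}

static unsigned int gpiodemo_poll(struct file *file, poll_table *wait)
{
    poll_wait(file, &gpiodemo_wq, wait);
    return atomic_xchg(&gpiodemo_event, 0) ? POLLPRI : 0;
}

static const struct file_operations gpiodemo_fops = {
    .owner = THIS_MODULE,
    .poll  = gpiodemo_poll,
};

static struct miscdevice gpiodemo_dev = {
    .minor = MISC_DYNAMIC_MINOR,
    .name  = "gpiodemo",            /* appears as /dev/gpiodemo */
    .fops  = &gpiodemo_fops,
};

static int __init gpiodemo_init(void)
{
    int ret = misc_register(&gpiodemo_dev);
    if (ret)
        return ret;
    gpiodemo_irq = gpio_to_irq(48); /* GPIO 48, matching the sysfs example */
    return request_irq(gpiodemo_irq, gpiodemo_isr, IRQF_TRIGGER_RISING,
                       "gpiodemo", NULL);
}

static void __exit gpiodemo_exit(void)
{
    free_irq(gpiodemo_irq, NULL);
    misc_deregister(&gpiodemo_dev);
}

module_init(gpiodemo_init);
module_exit(gpiodemo_exit);
MODULE_LICENSE("GPL");
User space would then open /dev/gpiodemo and poll() it for POLLPRI, much like the sysfs example above.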

How to make mprotect() make forward progress after handling a page fault exception? [duplicate]

I want to write a signal handler to catch SIGSEGV.
I protect a block of memory for read or write using
char *buffer;
char *p;
char a;
int pagesize = 4096;
mprotect(buffer, pagesize, PROT_NONE);
This protects pagesize bytes of memory starting at buffer against any reads or writes.
Second, I try to read the memory:
p = buffer;
a = *p;
This will generate a SIGSEGV, and my handler will be called.
So far so good. My problem is that, once the handler is called, I want to change the access rights of the memory by doing
mprotect(buffer,pagesize,PROT_READ);
and continue normal functioning of my code. I do not want to exit the function.
On future writes to the same memory, I want to catch the signal again and modify the write rights and then record that event.
Here is the code:
#include <signal.h>
#include <stdio.h>
#include <malloc.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/mman.h>

#define handle_error(msg) \
    do { perror(msg); exit(EXIT_FAILURE); } while (0)

char *buffer;
int flag = 0;

static void handler(int sig, siginfo_t *si, void *unused)
{
    printf("Got SIGSEGV at address: 0x%lx\n", (long) si->si_addr);
    printf("Implements the handler only\n");
    flag = 1;
    //exit(EXIT_FAILURE);
}

int main(int argc, char *argv[])
{
    char *p; char a;
    int pagesize;
    struct sigaction sa;

    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sa.sa_sigaction = handler;
    if (sigaction(SIGSEGV, &sa, NULL) == -1)
        handle_error("sigaction");

    pagesize = 4096;

    /* Allocate a buffer aligned on a page boundary;
       initial protection is PROT_READ | PROT_WRITE */
    buffer = memalign(pagesize, 4 * pagesize);
    if (buffer == NULL)
        handle_error("memalign");

    printf("Start of region: 0x%lx\n", (long) buffer);
    printf("Start of region: 0x%lx\n", (long) buffer + pagesize);
    printf("Start of region: 0x%lx\n", (long) buffer + 2 * pagesize);
    printf("Start of region: 0x%lx\n", (long) buffer + 3 * pagesize);

    //if (mprotect(buffer + pagesize * 0, pagesize, PROT_NONE) == -1)
    if (mprotect(buffer + pagesize * 0, pagesize, PROT_NONE) == -1)
        handle_error("mprotect");

    //for (p = buffer ; ; )
    if (flag == 0)
    {
        p = buffer + pagesize / 2;
        printf("It comes here before reading memory\n");
        a = *p; //trying to read the memory
        printf("It comes here after reading memory\n");
    }
    else
    {
        if (mprotect(buffer + pagesize * 0, pagesize, PROT_READ) == -1)
            handle_error("mprotect");
        a = *p;
        printf("Now i can read the memory\n");
    }

    /* for (p = buffer; p <= buffer + 4 * pagesize; p++)
    {
        //a = *(p);
        *(p) = 'a';
        printf("Writing at address %p\n", p);
    } */

    printf("Loop completed\n"); /* Should never happen */
    exit(EXIT_SUCCESS);
}
The problem is that only the signal handler runs and I can't return to the main function after catching the signal.
When your signal handler returns (assuming it doesn't call exit or longjmp or something that prevents it from actually returning), the code will continue at the point the signal occurred, reexecuting the same instruction. Since at this point, the memory protection has not been changed, it will just throw the signal again, and you'll be back in your signal handler in an infinite loop.
So to make it work, you have to call mprotect in the signal handler. Unfortunately, as Steven Schansker notes, mprotect is not async-safe, so you can't safely call it from the signal handler. So, as far as POSIX is concerned, you're screwed.
Fortunately on most implementations (all modern UNIX and Linux variants as far as I know), mprotect is a system call, so is safe to call from within a signal handler, so you can do most of what you want. The problem is that if you want to change the protections back after the read, you'll have to do that in the main program after the read.
Another possibility is to do something with the third argument to the signal handler, which points at an OS and arch specific structure that contains info about where the signal occurred. On Linux, this is a ucontext structure, which contains machine-specific info about the $PC address and other register contents where the signal occurred. If you modify this, you change where the signal handler will return to, so you can change the $PC to be just after the faulting instruction so it won't re-execute after the handler returns. This is very tricky to get right (and non-portable too).
edit
The ucontext_t structure is defined in <ucontext.h>. Within the ucontext_t, the field uc_mcontext contains the machine context, and within that, the array gregs contains the general-register context. So in your signal handler:
ucontext_t *u = (ucontext_t *)unused;
unsigned char *pc = (unsigned char *)u->uc_mcontext.gregs[REG_RIP]; /* REG_RIP is x86-64; use REG_EIP on 32-bit x86 */
will give you the pc where the exception occurred. You can read it to figure out what instruction it
was that faulted, and do something different.
As far as the portability of calling mprotect in the signal handler is concerned, any system that follows either the SVID spec or the BSD4 spec should be safe -- they allow calling any system call (anything in section 2 of the manual) in a signal handler.
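To illustrate, here is a minimal sketch of that approach: the handler uses si_addr to find the faulting page, unprotects it with mprotect(), and returns so the faulting instruction re-executes. The page-size lookup and the PROT_READ | PROT_WRITE choice are assumptions for the example, error handling is trimmed, and (as discussed above) mprotect() is not formally async-signal-safe even though it works on Linux:
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static long page_size;

static void segv_handler(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    /* round the faulting address down to its page and unprotect it;
       when the handler returns, the faulting instruction re-executes */
    uintptr_t addr = (uintptr_t)si->si_addr & ~(uintptr_t)(page_size - 1);
    if (mprotect((void *)addr, page_size, PROT_READ | PROT_WRITE) == -1)
        _exit(1); /* cannot recover; _exit() is async-signal-safe */
}

int main(void)
{
    struct sigaction sa;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sa.sa_sigaction = segv_handler;
    sigaction(SIGSEGV, &sa, NULL);

    page_size = sysconf(_SC_PAGESIZE);
    char *buf = mmap(NULL, page_size, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    buf[0] = 'x';  /* faults once; the handler unprotects the page */
    printf("wrote '%c' after recovering from SIGSEGV\n", buf[0]);
    return 0;
}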
You've fallen into the trap that all people do when they first try to handle signals. The trap? Thinking that you can actually do anything useful with signal handlers. From a signal handler, you are only allowed to call asynchronous and reentrant-safe library calls.
See this CERT advisory as to why and a list of the POSIX functions that are safe.
Note that printf(), which you are already calling, is not on that list.
Nor is mprotect. You're not allowed to call it from a signal handler. It might work, but I can promise you'll run into problems down the road. Be really careful with signal handlers, they're tricky to get right!
EDIT
Since I'm being a portability douchebag at the moment already, I'll point out that you also shouldn't write to shared (i.e. global) variables without taking the proper precautions.
You can recover from SIGSEGV on linux. Also you can recover from segmentation faults on Windows (you'll see a structured exception instead of a signal). But the POSIX standard doesn't guarantee recovery, so your code will be very non-portable.
Take a look at libsigsegv.
You should not return from the signal handler, as then behavior is undefined. Rather, jump out of it with longjmp.
This is only okay if the signal is generated in an async-signal-safe function. Otherwise, behavior is undefined if the program ever calls another async-signal-unsafe function. Hence, the signal handler should only be established immediately before it is necessary, and disestablished as soon as possible.
In fact, I know of very few uses of a SIGSEGV handler:
use an async-signal-safe backtrace library to log a backtrace, then die.
in a VM such as the JVM or CLR: check if the SIGSEGV occurred in JIT-compiled code. If not, die; if so, then throw a language-specific exception (not a C++ exception), which works because the JIT compiler knew that the trap could happen and generated appropriate frame unwind data.
clone() and exec() a debugger (do not use fork() – that calls callbacks registered by pthread_atfork()).
Finally, note that any action that triggers SIGSEGV is probably UB, as this is accessing invalid memory. However, this would not be the case if the signal was, say, SIGFPE.
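A minimal sketch of the longjmp route, under those caveats: the handler is installed only around the risky access, and probe_read() is a made-up name for illustration:
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>

static sigjmp_buf recover_point;

static void segv_handler(int sig)
{
    (void)sig;
    /* do not return into the faulting instruction; jump back instead */
    siglongjmp(recover_point, 1);
}

int probe_read(const volatile char *p)
{
    struct sigaction sa, old;
    int ok = 0;

    sa.sa_flags = 0;
    sa.sa_handler = segv_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old);          /* establish only around the access */

    if (sigsetjmp(recover_point, 1) == 0) {
        (void)*p;   /* may fault */
        ok = 1;     /* reached only if the read succeeded */
    }
    sigaction(SIGSEGV, &old, NULL);         /* disestablish as soon as possible */
    return ok;
}

int main(void)
{
    char c = 'a';
    printf("valid: %d, invalid: %d\n", probe_read(&c), probe_read(NULL));
    return 0;
}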
There is a compilation problem using ucontext_t or struct ucontext (present in /usr/include/sys/ucontext.h)
http://www.mail-archive.com/arch-general@archlinux.org/msg13853.html

Calling clone() on Linux but it seems to malfunction

A simple test program; I expect clone() to fork a child process, and each process to execute till its end:
#include <stdio.h>
#include <sched.h>
#include <unistd.h>
#include <sys/types.h>
#include <errno.h>

int f(void *arg)
{
    pid_t pid = getpid();
    printf("child pid=%d\n", pid);
}

char buf[1024];

int main()
{
    printf("before clone\n");
    int pid = clone(f, buf, CLONE_VM | CLONE_VFORK, NULL);
    if (pid == -1) {
        printf("%d\n", errno);
        return 1;
    }
    waitpid(pid, NULL, 0);
    printf("after clone\n");
    printf("father pid=%d\n", getpid());
    return 0;
}
Run it:
$g++ testClone.cpp && ./a.out
before clone
It didn't print what I expected. It seems that after clone() the program is in an unknown state and then quits. I tried gdb and it prints:
Breakpoint 1, main () at testClone.cpp:15
(gdb) n
before clone
(gdb) n
waiting for new child: No child processes.
(gdb) n
Single stepping until exit from function clone@plt,
which has no line number information.
If I remove the line with "waitpid", then gdb prints another kind of weird information:
(gdb) n
before clone
(gdb) n
Detaching after fork from child process 26709.
warning: Unexpected waitpid result 000000 when waiting for vfork-done
Cannot remove breakpoints because program is no longer writable.
It might be running in another process.
Further execution is probably impossible.
0x00007fb18a446bf1 in clone () from /lib64/libc.so.6
ptrace: No such process.
Where did I get wrong in my program?
You should never call clone in a user-level program -- there are way too many restrictions on what you are allowed to do in the cloned process.
In particular, calling any libc function (such as printf) is a complete no-no (because libc doesn't know that your clone exists, and has not performed any setup for it).
As K. A. Buhr points out, you also pass too small a stack, and the wrong end of it. Your stack is also not properly aligned.
In short, even though K. A. Buhr's modification appears to work, it doesn't really.
TL;DR: clone, just don't use it.
The second argument to clone is a pointer to the child's stack. As per the manual page for clone(2):
Stacks grow downward on all processors that run Linux (except the HP PA processors), so child_stack usually points to the topmost address of the memory space set up for the child stack.
Also, 1024 bytes is a paltry amount for a stack. The following modified version of your program appears to run correctly:
// #define _GNU_SOURCE // may be needed if compiled as C instead of C++
#include <stdio.h>
#include <sched.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <errno.h>

int f(void *arg)
{
    pid_t pid = getpid();
    printf("child pid=%d\n", pid);
    return 0;
}

char buf[1024*1024]; // *** allocate more stack ***

int main()
{
    printf("before clone\n");
    int pid = clone(f, buf + sizeof(buf), CLONE_VM | CLONE_VFORK, NULL);
    // *** in previous line: pointer is to *end* of stack ***
    if (pid == -1) {
        printf("%d\n", errno);
        return 1;
    }
    waitpid(pid, NULL, 0);
    printf("after clone\n");
    printf("father pid=%d\n", getpid());
    return 0;
}
Also, @Employed Russian is right -- you probably shouldn't use clone except if you're trying to have some fun. Either fork or vfork are more sensible interfaces to clone whenever they meet your needs.

ptrace one thread from another

Experimenting with the ptrace() system call, I am trying to trace another thread of the same process. According to the man page, both the tracer and the tracee are specific threads (not processes), so I don't see a reason why it should not work. So far, I have tried the following:
use PTRACE_TRACEME from the clone()d child: the call succeeds, but does not do what I want, probably because the parent of the to-be-traced thread is not the thread that called clone()
use PTRACE_ATTACH or PTRACE_SEIZE from the parent thread: this always fails with EPERM, even if the process runs as root and with prctl(PR_SET_DUMPABLE, 1)
In all cases, waitpid(-1, &status, __WALL) fails with ECHILD (same when passing the child pid explicitly).
What should I do to make it work?
If it is not possible at all, is it by design or a bug in the kernel (I am using version 3.8.0)? In the former case, could you point me to the right bit of the documentation?
As @mic_e pointed out, this is a known fact about the kernel - not quite a bug, but not quite correct either. See the kernel mailing list thread about it. To provide an excerpt from Linus Torvalds:
That "new" (last November) check isn't likely going away. It solved
so many problems (both security and stability), and considering that
(a) in a year, only two people have ever even noticed
(b) there's a work-around as per above that isn't horribly invasive
I have to say that in order to actually go back to the old behaviour,
we'd have to have somebody who cares deeply, go back and check every
single special case, deadlock, and race.
The solution is to actually start the process that is being traced in a subprocess - you'll need to make the ptracing process be the parent of the other.
Here's an outline of doing this based on another answer that I wrote:
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/types.h>

// this number is arbitrary - find a better one.
#define STACK_SIZE (1024 * 1024)

int main_thread(void *ptr) {
    // do work for main thread
}

int main(int argc, char *argv[]) {
    void *vstack = malloc(STACK_SIZE);
    pid_t v;
    if (clone(main_thread, vstack + STACK_SIZE, CLONE_PARENT_SETTID | CLONE_FILES | CLONE_FS | CLONE_IO, NULL, &v) == -1) { // you'll want to check these flags
        perror("failed to spawn child task");
        return 3;
    }
    long ptv = ptrace(PTRACE_SEIZE, v, NULL, NULL);
    if (ptv == -1) {
        perror("failed monitor seize");
        return 1;
    }
    // do actual ptrace work
}

Linux synchronization with FIFO waiting queue

Are there locks in Linux where the waiting queue is FIFO? This seems like such an obvious thing, and yet I just discovered that pthread mutexes aren't FIFO, and semaphores apparently aren't FIFO either (I'm working on kernel 2.4 (homework))...
Does Linux have a lock with FIFO waiting queue, or is there an easy way to make one with existing mechanisms?
Here is a way to create a simple queueing "ticket lock", built on pthreads primitives. It should give you some ideas:
#include <pthread.h>

typedef struct ticket_lock {
    pthread_cond_t cond;
    pthread_mutex_t mutex;
    unsigned long queue_head, queue_tail;
} ticket_lock_t;

#define TICKET_LOCK_INITIALIZER { PTHREAD_COND_INITIALIZER, PTHREAD_MUTEX_INITIALIZER }

void ticket_lock(ticket_lock_t *ticket)
{
    unsigned long queue_me;

    pthread_mutex_lock(&ticket->mutex);
    queue_me = ticket->queue_tail++;
    while (queue_me != ticket->queue_head)
    {
        pthread_cond_wait(&ticket->cond, &ticket->mutex);
    }
    pthread_mutex_unlock(&ticket->mutex);
}

void ticket_unlock(ticket_lock_t *ticket)
{
    pthread_mutex_lock(&ticket->mutex);
    ticket->queue_head++;
    pthread_cond_broadcast(&ticket->cond);
    pthread_mutex_unlock(&ticket->mutex);
}
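Usage is the same as with a plain mutex; a quick sketch (assuming the ticket_lock_t definitions above are in scope) with a shared counter:
#include <pthread.h>
#include <stdio.h>

static ticket_lock_t lock = TICKET_LOCK_INITIALIZER;
static long counter;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        ticket_lock(&lock);   /* threads acquire in arrival (FIFO) order */
        counter++;
        ticket_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);  /* expect 400000 */
    return 0;
}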
If you are asking what I think you are asking, the short answer is no. Threads/processes are controlled by the OS scheduler. One random thread is going to get the lock; the others aren't. Well, potentially more than one if you are using a counting semaphore, but that's probably not what you are asking.
You might want to look at pthread_setschedparam, but it's not going to get you where I suspect you want to go.
You could probably write something yourself, but I suspect it would end up being inefficient and defeat using threads in the first place, since you would just end up randomly yielding each thread until the one you want gets control.
Chances are good you are just thinking about the problem in the wrong way. You might want to describe your goal and get better suggestions.
I had a similar requirement recently, except dealing with multiple processes. Here's what I found:
If you need 100% correct FIFO ordering, go with caf's pthread ticket lock.
If you're happy with 99% and favor simplicity, a semaphore or a mutex can do really well actually.
A ticket lock can be made to work across processes:
You need to use shared memory, a process-shared mutex and condition variable, and handle processes dying with the mutex locked (-> robust mutex)... which is a bit overkill here; all I need is that the different instances don't get scheduled at the same time and that the order is mostly fair.
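For what it's worth, a rough sketch of that process-shared setup, reusing the ticket_lock_t from caf's answer; the anonymous shared mapping assumes the processes are related by fork() (unrelated processes would use shm_open() instead), and EOWNERDEAD handling for the robust mutex is omitted:
#include <pthread.h>
#include <sys/mman.h>

static ticket_lock_t *shared_ticket_init(void)
{
    ticket_lock_t *t = mmap(NULL, sizeof(*t), PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    pthread_mutexattr_t ma;
    pthread_condattr_t ca;

    pthread_mutexattr_init(&ma);
    pthread_mutexattr_setpshared(&ma, PTHREAD_PROCESS_SHARED);
    pthread_mutexattr_setrobust(&ma, PTHREAD_MUTEX_ROBUST); /* survive a holder dying */
    pthread_mutex_init(&t->mutex, &ma);

    pthread_condattr_init(&ca);
    pthread_condattr_setpshared(&ca, PTHREAD_PROCESS_SHARED);
    pthread_cond_init(&t->cond, &ca);

    t->queue_head = t->queue_tail = 0;
    return t;
}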
Using a semaphore:
#include <errno.h>
#include <fcntl.h>
#include <semaphore.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static sem_t *sem = NULL;

/* fail() is the author's error helper (print the message and exit); its definition is omitted here */

void fifo_init()
{
    sem = sem_open("/server_fifo", O_CREAT, 0600, 1);
    if (sem == SEM_FAILED) fail("sem_open");
}

void fifo_lock()
{
    int r;
    struct timespec ts;
    if (clock_gettime(CLOCK_REALTIME, &ts) == -1) fail("clock_gettime");
    ts.tv_sec += 5; /* 5s timeout */
    while ((r = sem_timedwait(sem, &ts)) == -1 && errno == EINTR)
        continue; /* Restart if interrupted */
    if (r == 0) return;
    if (errno == ETIMEDOUT) fprintf(stderr, "timeout ...\n");
    else fail("sem_timedwait");
}

void fifo_unlock()
{
    /* If we somehow end up with more than one token, don't increment the semaphore... */
    int val;
    if (sem_getvalue(sem, &val) == 0 && val <= 0)
        if (sem_post(sem)) fail("sem_post");
    usleep(1); /* Yield to other processes */
}
Ordering is almost 100% FIFO.
Note: This is with a 4.4 Linux kernel, 2.4 might be different.
