How to prevent page faults after child exits? - linux

A nice way of creating a snapshot of a process is to use fork() to create a child process. The memory of the child process will be a copy of the parent process.
Instead of eagerly copying all the memory, the OS simply marks the pages as copy-on-write: a page is cloned only when one of the processes writes to it. This saves both time and space, which is great.
In the event the child process exits, the copy-on-write behavior should be deactivated. However, I'm getting page faults for the whole array -- is there any way of optimizing these page faults? e.g. similar to how MAP_POPULATE avoids page faults for the initial access to the pages of a mapped region.
Below there is a simple benchmark that demonstrates the behavior I'm asking about. I check for page faults via perf stat -e minor-faults,major-faults ./a.out.
If no child process is created (WITH_CHILD set to false) I have very few page faults (around 125 and constant). However, just by creating and reaping the child process, I get page faults in everything (around 131260, proportional to array size). As the pages are mapped by a single process, I wouldn't expect any page faults to happen! Why do they?
This is a follow-up of Kernel copying CoW pages after child process exit.
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <iostream>
#include <new>
#define ARRAY_SIZE 536870912 // 512MB
#define WITH_CHILD true
using inttype = uint64_t;
constexpr uint64_t NUM_ELEMS() {
return ARRAY_SIZE / sizeof(inttype);
}
int main() {
// allocate array
void *arraybuf = mmap(nullptr, ARRAY_SIZE, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
assert(arraybuf != MAP_FAILED); // mmap reports failure with MAP_FAILED, not nullptr
std::array<inttype, NUM_ELEMS()> *array =
new (arraybuf) std::array<inttype, NUM_ELEMS()>();
#if WITH_CHILD
// spawn checkpointing process
int pid = fork();
assert(pid != -1);
// child process -- do nothing, just exit
if (pid == 0) {
exit(0);
}
// wait for child process to exit (call kept outside assert so it survives NDEBUG)
pid_t reaped = waitpid(pid, nullptr, 0);
assert(reaped == pid);
#endif
// write to array -- this shouldn't generate page faults, right? :(
std::fill(array->begin(), array->end(), 0);
// cleanup
munmap(array, ARRAY_SIZE);
}
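The fault counts can also be sampled from inside the program, rather than through perf stat, by reading the minor-fault counter from getrusage(2) before and after the region of interest. The sketch below is only an illustration of that measurement: the helper names minor_faults and faults_while_touching are made up here, and the numbers it reports will vary with kernel version and transparent huge page settings.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

/* Hypothetical helper: minor (soft) page faults taken by this process so far,
 * or -1 if getrusage fails. */
static long minor_faults(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_minflt;
}

/* Hypothetical helper: map len bytes WITHOUT MAP_POPULATE, write to every
 * byte, and return how many minor faults the writes caused (-1 on error). */
static long faults_while_touching(size_t len)
{
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    long before = minor_faults();
    memset(buf, 1, len);          /* first touch of each page */
    long after = minor_faults();

    munmap(buf, len);
    if (before < 0 || after < 0)
        return -1;
    return after - before;
}
```

Calling minor_faults() right before and right after the std::fill in the benchmark above would give the same kind of delta without leaving the process.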

Related

Calling "clone()" on linux but it seems to malfunction

A simple test program, I expect it will "clone" to fork a child process, and each process can execute till its end
#include<stdio.h>
#include<sched.h>
#include<unistd.h>
#include<sys/types.h>
#include<errno.h>
int f(void*arg)
{
pid_t pid=getpid();
printf("child pid=%d\n",pid);
}
char buf[1024];
int main()
{
printf("before clone\n");
int pid=clone(f,buf,CLONE_VM|CLONE_VFORK,NULL);
if(pid==-1){
printf("%d\n",errno);
return 1;
}
waitpid(pid,NULL,0);
printf("after clone\n");
printf("father pid=%d\n",getpid());
return 0;
}
Run it:
$g++ testClone.cpp && ./a.out
before clone
It didn't print what I expected. Seems after "clone" the program is in unknown state and then quit. I tried gdb and it prints:
Breakpoint 1, main () at testClone.cpp:15
(gdb) n
before clone
(gdb) n
waiting for new child: No child processes.
(gdb) n
Single stepping until exit from function clone@plt,
which has no line number information.
If I remove the line of "waitpid", then gdb prints another kind of weird information.
(gdb) n
before clone
(gdb) n
Detaching after fork from child process 26709.
warning: Unexpected waitpid result 000000 when waiting for vfork-done
Cannot remove breakpoints because program is no longer writable.
It might be running in another process.
Further execution is probably impossible.
0x00007fb18a446bf1 in clone () from /lib64/libc.so.6
ptrace: No such process.
Where did I go wrong in my program?
You should never call clone in a user-level program -- there are way too many restrictions on what you are allowed to do in the cloned process.
In particular, calling any libc function (such as printf) is a complete no-no (because libc doesn't know that your clone exists, and has not performed any setup for it).
As K. A. Buhr points out, you also pass too small a stack, and the wrong end of it. Your stack is also not properly aligned.
In short, even though K. A. Buhr's modification appears to work, it doesn't really.
TL;DR: clone, just don't use it.
The second argument to clone is a pointer to the child's stack. As per the manual page for clone(2):
Stacks grow downward on all processors that run Linux (except the HP PA processors), so child_stack usually points to the topmost address of the memory space set up for the child stack.
Also, 1024 bytes is a paltry amount for a stack. The following modified version of your program appears to run correctly:
// #define _GNU_SOURCE // may be needed if compiled as C instead of C++
#include <stdio.h>
#include <sched.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <errno.h>
int f(void*arg)
{
pid_t pid=getpid();
printf("child pid=%d\n",pid);
return 0;
}
char buf[1024*1024]; // *** allocate more stack ***
int main()
{
printf("before clone\n");
int pid=clone(f,buf+sizeof(buf),CLONE_VM|CLONE_VFORK,NULL);
// *** in previous line: pointer is to *end* of stack ***
if(pid==-1){
printf("%d\n",errno);
return 1;
}
waitpid(pid,NULL,0);
printf("after clone\n");
printf("father pid=%d\n",getpid());
return 0;
}
Also, @Employed Russian is right -- you probably shouldn't use clone except if you're trying to have some fun. fork and vfork are more sensible interfaces to clone whenever they meet your needs.
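For comparison, here is a rough sketch of the same test written with fork instead of clone. The child gets its own copy of the address space and a properly set up stack, so calling printf in it is fine. The helper name run_child_with_fork is made up for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run the "child" body in a forked process and return
 * the child's exit status, or -1 if fork/waitpid fails or the child was
 * killed by a signal. */
static int run_child_with_fork(void)
{
    fflush(stdout);               /* don't let the child inherit pending output */

    pid_t pid = fork();
    if (pid == -1) {
        perror("fork");
        return -1;
    }
    if (pid == 0) {               /* child: separate address space, libc is usable */
        printf("child pid=%d\n", (int)getpid());
        fflush(stdout);
        _exit(0);
    }

    int status;                   /* parent: reap the child */
    if (waitpid(pid, &status, 0) != pid) {
        perror("waitpid");
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Replacing the clone and waitpid calls in main with a single call to run_child_with_fork() should print the child line followed by the "after clone" and "father pid" lines.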

How to kill a process in a system call?

I found out that sys_kill can be used to kill a process from a system call, but when I compile the following code, I get the following error:
error: implicit declaration of function ‘sys_kill’ [-Werror=implicit-function-declaration]
long kill = sys_kill(pid,SIGKILL);
#define _POSIX_SOURCE
#include <linux/kernel.h>
#include <linux/unistd.h>
#include <linux/module.h>
#include <linux/init.h>
#include <linux/sched.h>
#include <linux/cred.h>
asmlinkage long sys_killa(pid_t pid)
{
printk(KERN_INFO "Current UID = %u\n",get_current_user()->uid);
printk(KERN_WARNING "The process to be killed is %d \n", pid);
long kill = sys_kill(pid,SIGKILL);
printk(KERN_WARNING "sys kill returned %ld\n", kill);
return 0;
}
It is often not possible to call the entry point of a system call from the kernel, since the API is for use from user space. Sometimes the functionality is provided in code close to the implementation code. It is usually made available throughout the kernel via an EXPORT_SYMBOL() macro.
For the kill() system call there is the internal kernel function kill_pid, with the declaration
int kill_pid(struct pid *pid, int sig, int priv)
You need to pass a pointer to the target process's struct pid, the signal number, and 1 for priv. Look at other kernel code that makes this call to see how it is done.

Ptrace reset a breakpoint

I am having trouble resetting a process after I have hit a breakpoint with Ptrace. I am essentially wrapping this code in python.
I am running this on 64 bit Ubuntu.
I understand the concept of resetting the data at the location and decrementing the instruction pointer, but after I get the trap signal and do that, my process is not finishing.
Code snippet:
# Continue to bp
res = libc.ptrace(PTRACE_CONT,pid,0,0)
libc.wait(byref(wait_status))
if _wifstopped(wait_status):
print('Breakpoint hit. Signal: %s' % (strsignal(_wstopsig(wait_status))))
else:
print('Error process failed to stop')
exit(1)
# Reset Instruction pointer
data = get_registers(pid)
print_rip(data)
data.rip -= 1
res = set_registers(pid,data)
# Verify rip
print_rip(get_registers(pid))
# Reset Instruction
out = set_text(pid,c_ulonglong(addr),c_ulonglong(initial_data))
if out != 0:
print_errno()
print_text(c_ulonglong(addr),c_ulonglong(get_text(c_void_p(addr))))
And I run a PTRACE_DETACH right after returning from this code.
When I run this, it hits the breakpoint and the parent process returns successfully, but the child does not resume and finish its code.
If I comment out the call to the breakpoint function it just attaches ptrace to the process and then detaches it, and the program runs fine.
The program itself is just a small c program that prints 10 times to a file.
Full code is in this paste
Is there an error anyone sees with my breakpoint code?
I ended up writing a C program that was as exact a duplicate of the python code as possible:
#include <stdio.h>
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>
#include <signal.h>
#include <syscall.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/reg.h>
#include <sys/user.h>
#include <unistd.h>
#include <errno.h>
#include <time.h>
void set_unset_bp(pid_t pid){
int wait_status;
struct user_regs_struct regs;
unsigned long long addr = 0x0000000000400710;
unsigned long long data = ptrace(PTRACE_PEEKTEXT,pid,(void *)addr,0);
printf("Orig data: 0x%016llx\n",data); /* %llx matches unsigned long long */
unsigned long long trap = (data & 0xFFFFFFFFFFFFFF00) | 0xCC;
ptrace(PTRACE_POKETEXT,pid,(void *)addr,(void *)trap);
ptrace(PTRACE_CONT,pid,0,0);
wait(&wait_status);
if(WIFSTOPPED(wait_status)){
printf("Signal received: %s\n",strsignal(WSTOPSIG(wait_status)));
}else{
perror("wait");
}
ptrace(PTRACE_POKETEXT,pid,(void *)addr,(void *)data);
ptrace(PTRACE_GETREGS,pid,0,&regs);
regs.rip -=1;
ptrace(PTRACE_SETREGS,pid,0,&regs);
data = ptrace(PTRACE_PEEKTEXT,pid,(void *)addr,0);
printf("Data after resetting bp data: 0x%016llx\n",data);
ptrace(PTRACE_CONT,pid,0,0);
}
int main(void){
//Fork child process
int pid = fork();
if(pid ==0){//Child
ptrace(PTRACE_TRACEME,0,0,0);
int out = execl("/home/chris/workspace/eliben-debugger/print","/home/chris/workspace/eliben-debugger/print",(char *)NULL);
if(out != 0){
printf("Error Value is: %s\n", strerror(errno));
}
}else{ //Parent
wait(0);
printf("Got stop signal, we just execv'd\n");
set_unset_bp(pid);
printf("Finished setting and unsetting\n");
wait(0);
printf("Got signal, detaching\n");
ptrace(PTRACE_DETACH,pid,0,0);
wait(0);
printf("Parent exiting after waiting for child to finish\n");
}
exit(0);
}
After comparing the output to my Python output I noticed that according to python my original data was 0xfffffffffffe4be8 and 0x00000000fffe4be8.
This led me to believe that my return data was getting truncated to a 32-bit value.
I changed my get and set methods to something like this, setting the return type to a void pointer:
def get_text(addr):
restype = libc.ptrace.restype
libc.ptrace.restype = c_void_p
out = libc.ptrace(PTRACE_PEEKTEXT,pid,addr, 0)
libc.ptrace.restype = restype
return out
def set_text(pid,addr,data):
return libc.ptrace(PTRACE_POKETEXT,pid,addr,data)
Can't tell you how it works yet, but I was able to get the child process executing successfully after the trap.
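A likely explanation: ctypes assumes a foreign function returns a C int unless restype is set, while ptrace returns a long, so on a 64-bit system the upper 32 bits of the peeked word are dropped and the rest is sign-extended when it is widened again. The C sketch below mimics that round trip. The helper name through_c_int and the upper half of the sample input are made up for illustration, and the narrowing cast relies on the two's-complement wraparound that gcc and clang implement.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: what a 64-bit word looks like after being returned
 * through a 32-bit signed int and then widened back to 64 bits. */
static uint64_t through_c_int(uint64_t word)
{
    /* Keep only the low 32 bits, reinterpreted as a signed int. */
    int32_t narrowed = (int32_t)(uint32_t)word;

    /* Widening a negative int sign-extends it, filling the top half with 1s. */
    return (uint64_t)(int64_t)narrowed;
}
```

With an input whose low half is 0xfffe4be8, the result is 0xfffffffffffe4be8 regardless of the original upper half, so writing that value back with PTRACE_POKETEXT would corrupt the instruction bytes that follow the breakpoint.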

What are some conditions that may cause fork() or system() calls to fail on Linux?

And how can one find out whether any of them are occurring, and leading to an error returned by fork() or system()? In other words, if fork() or system() returns with an error, what are some things in Linux that I can check to diagnose why that particular error is happening?
For example:
Just plain out of memory (results in errno ENOMEM) - check memory use with 'free' etc.
Not enough memory for kernel to copy page tables and other accounting information of parent process (results in errno EAGAIN)
Is there a global process limit? (results in errno EAGAIN also?)
Is there a per-user process limit? How can I find out what it is?
...?
And how can one find out whether any of them are occurring?
Check the errno value if the result (return value) is -1
From the man page on Linux:
RETURN VALUE
On success, the PID of the child process is returned in the parent, and 0 is returned in the child. On failure, -1 is returned in the parent, no child process is created, and errno is set appropriately.
ERRORS
EAGAIN
fork() cannot allocate sufficient memory to copy the parent's page tables and allocate a task structure for the child.
EAGAIN
It was not possible to create a new process because the caller's RLIMIT_NPROC resource limit was encountered. To exceed this limit, the process must have either the CAP_SYS_ADMIN or the CAP_SYS_RESOURCE capability.
ENOMEM
fork() failed to allocate the necessary kernel structures because memory is tight.
CONFORMING TO
SVr4, 4.3BSD, POSIX.1-2001.
nproc in /etc/security/limits.conf can limit the number of processes per user.
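To see which per-user process limit actually applies to a running process, one option is to query RLIMIT_NPROC with getrlimit(2); in bash, ulimit -u reports the same soft limit. This is a minimal sketch, and the helper names print_limit and print_nproc_limit are made up for the example.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: print one limit value, handling "unlimited". */
static void print_limit(const char *label, rlim_t value)
{
    if (value == RLIM_INFINITY)
        printf("%s: unlimited\n", label);
    else
        printf("%s: %llu\n", label, (unsigned long long)value);
}

/* Hypothetical helper: print the soft and hard RLIMIT_NPROC values for the
 * calling process. Returns 0 on success, -1 if getrlimit fails. */
static int print_nproc_limit(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NPROC, &rl) != 0) {
        perror("getrlimit");
        return -1;
    }
    print_limit("RLIMIT_NPROC soft", rl.rlim_cur);
    print_limit("RLIMIT_NPROC hard", rl.rlim_max);
    return 0;
}
```

If fork() fails with EAGAIN and the number of processes owned by the user is close to the soft limit printed here, the limit is a plausible culprit.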
You can check for failure by examining the return from fork. A 0 means you are in the child, a positive number is the pid of the child and means you are in the parent, and a negative number means the fork failed. When fork fails it sets the external variable errno. You can use the functions in errno.h to examine it. I normally just use perror to print the error (with some text prepended to it) to stderr.
#include <stdio.h>
#include <errno.h>
#include <unistd.h>
int main(int argc, char** argv) {
pid_t pid;
pid = fork();
if (pid == -1) {
perror("Could not fork"); /* perror appends ": " and the error text itself */
return 1;
} else if (pid == 0) {
printf("in child\n");
return 0;
}
printf("in parent, child is %d\n", pid);
return 0;
}

Return code when OS kills your process

I wanted to test whether, with multiple processes, I can use more than 4GB of RAM on a 32-bit OS (mine: Ubuntu with 1GB of RAM).
So I wrote a small program that mallocs slightly less than 1GB and does some work on that array, and ran 5 instances of this program via fork.
The thing is, I suspect that the OS killed 4 of them, and only one survived and displayed its "PID: I've finished" message.
(I tried it with small arrays and got 5 printouts. Also, when I look at the running processes with top, I see only one instance.)
The weird thing is this: I received return code 0 (success?) from ALL of the instances, including the ones that were allegedly killed by the OS.
I didn't get any message stating that processes were killed.
Is this return code normal for this situation?
(If so, it reduces my trust in 'return codes'...)
thanks.
Edit: some of the answers suggested possible errors in the small program, so here it is. the larger program that forks and saves return codes is larger, and I have trouble uploading it here, but I think (and hope) it's fine.
Also I've noticed that if instead of running it with my forking program, I run it with terminal using './a.out & ./a.out & ./a.out & ./a.out &' (when ./a.out is the binary of the small program attached)
I do see some 'Killed' messages.
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
#define SMALL_SIZE 10000
#define BIG_SIZE 1000000000
#define SIZE BIG_SIZE
#define REAPETS 1
int
main()
{
pid_t my_pid = getpid();
char * x = malloc(SIZE*sizeof(char));
if (x == NULL)
{
printf("Malloc failed!");
return(EXIT_FAILURE);
}
int x2=0;
for(x2=0;x2<REAPETS;++x2)
{
int y;
for(y=0;y<SIZE;++y)
x[y] = (y+my_pid)%256;
}
printf("%d: I'm over.\n",my_pid);
return(EXIT_SUCCESS);
}
Well, if your process is unable to malloc() the 1GB of memory, the OS will not kill the process. All that happens is that malloc() returns NULL. So depending on how you wrote your code, it's possible that the process could return 0 anyway - if you wanted it to return an error code when a memory allocation fails (which is generally good practice), you'd have to program that behavior into it.
What signal was used to kill the processes?
Exit codes between 0 and 127, inclusive, can be used freely. When a shell reports a code above 128, it means the process was terminated by a signal, and the reported code is
128 + the number of the signal used
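The 128 + signal value is something the shell computes from the wait status; the kernel reports the exit code and the terminating signal separately. A small sketch of that mapping follows, with the made-up helper names shell_style_code and run_and_report.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: turn a wait status into the number a shell would
 * show in $?. Returns -1 for states it does not handle (e.g. stopped). */
static int shell_style_code(int status)
{
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    if (WIFSIGNALED(status))
        return 128 + WTERMSIG(status);
    return -1;
}

/* Hypothetical helper: fork a child that kills itself with sig (when sig is
 * non-zero) or exits with code, then report its shell-style code. */
static int run_and_report(int code, int sig)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        if (sig != 0)
            raise(sig);           /* for SIGKILL this never returns */
        _exit(code);
    }

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return shell_style_code(status);
}
```

On Linux SIGKILL is signal 9, so run_and_report(0, SIGKILL) yields 137, the value a shell shows for a process killed by the OOM killer.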
A process' return status (as returned by wait, waitpid and system) contains more or less the following:
Exit code, only applies if process terminated normally
Whether normal/abnormal termination occurred
Termination signal, only applies if process was terminated by a signal
The exit code is utterly meaningless if your process was killed by the OOM killer (which sends it a SIGKILL signal).
For more information, see the man page for wait(2).
This code shows how to get the termination status of a child:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>
int
main (void)
{
pid_t pid = fork();
if (pid == -1)
{
perror("fork()");
}
/* parent */
else if (pid > 0)
{
int status;
printf("Child has pid %ld\n", (long)pid);
if (wait(&status) == -1)
{
perror("wait()");
}
else
{
/* did the child terminate normally? */
if(WIFEXITED(status))
{
printf("%ld exited with return code %d\n",
(long)pid, WEXITSTATUS(status));
}
/* was the child terminated by a signal? */
else if (WIFSIGNALED(status))
{
printf("%ld terminated because it didn't catch signal number %d\n",
(long)pid, WTERMSIG(status));
}
}
}
/* child */
else
{
sleep(10);
exit(0);
}
return 0;
}
Have you checked the return value from fork()? There's a good chance that if fork() can't allocate enough memory for the new process' address space, then it will return an error (-1). A typical way to call fork() is:
pid_t pid;
switch(pid = fork())
{
case 0:
// I'm the child process
break;
case -1:
// Error -- check errno
fprintf(stderr, "fork: %s\n", strerror(errno));
break;
default:
// I'm the parent process
break;
}
The exit code is only "valid" when the WIFEXITED macro evaluates to true. See the waitpid(2) man page.
You can use the WIFSIGNALED macro to see whether your program was terminated by a signal.

Resources