ptrace: STEPINTO returns breakpoint signal - linux

Yet another ptrace question:
The process is paused for the initial SIGTRAP, then i call ptrace with STEPINTO to step the next instruction. After the ptrace call, i call waitpid(...) and sure enough it does receive a SIGTRAP signal! Great!
however, when i examine the siginfo_t structure by calling
ptrace(PTRACE_GETSIGINFO, pid, 0, &siginfo)
the fields are as follows:
si_signo = SIGTRAP
si_code = 1
however, 1 is the signal code for a breakpoint. No breakpoint was hit in this case:
/* `si_code' values for SIGTRAP signal. */
enum
{
TRAP_BRKPT = 1, /* Process breakpoint. */
# define TRAP_BRKPT TRAP_BRKPT
TRAP_TRACE /* Process trace trap. */
# define TRAP_TRACE TRAP_TRACE
};
What's going on?
All calls involved return success (0) and errno is 0, too.
edit: so what happens when i hit a real 0xCC breakpoint?
A SIGTRAP is fired and the si_code field is 0x80 ! 0x80 isnt even defined for SIGTRAP! :(

Related

where is the context switching finally happening in the linux kernel source?

In linux, process scheduling occurs after all interrupts (timer interrupt, and other interrupts) or when a process relinquishes CPU(by calling explicit schedule() function). Today I was trying to see where context switching occurs in linux source (kernel version 2.6.23)
(I think I checked this several years ago but I'm not sure now..I was looking at sparc arch then.)
I looked it up from the main_timer_handler(in arch/x86_64/kernel/time.c), but couldn't find it.
Finally I found it in ./arch/x86_64/kernel/entry.S.
ENTRY(common_interrupt)
XCPT_FRAME
interrupt do_IRQ
/* 0(%rsp): oldrsp-ARGOFFSET */
ret_from_intr:
cli
TRACE_IRQS_OFF
decl %gs:pda_irqcount
leaveq
CFI_DEF_CFA_REGISTER rsp
CFI_ADJUST_CFA_OFFSET -8
exit_intr:
GET_THREAD_INFO(%rcx)
testl $3,CS-ARGOFFSET(%rsp)
je retint_kernel
...(omit)
GET_THREAD_INFO(%rcx)
jmp retint_check
#ifdef CONFIG_PREEMPT
/* Returning to kernel space. Check if we need preemption */
/* rcx: threadinfo. interrupts off. */
ENTRY(retint_kernel)
cmpl $0,threadinfo_preempt_count(%rcx)
jnz retint_restore_args
bt $TIF_NEED_RESCHED,threadinfo_flags(%rcx)
jnc retint_restore_args
bt $9,EFLAGS-ARGOFFSET(%rsp) /* interrupts off? */
jnc retint_restore_args
call preempt_schedule_irq
jmp exit_intr
#endif
CFI_ENDPROC
END(common_interrupt)
At the end of the ISR is a call to preempt_schedule_irq! and the preempt_schedule_irq is defined in kernel/sched.c as below(it calls schedule() in the middle).
/*
* this is the entry point to schedule() from kernel preemption
* off of irq context.
* Note, that this is called and return with irqs disabled. This will
* protect us against recursive calling from irq.
*/
asmlinkage void __sched preempt_schedule_irq(void)
{
struct thread_info *ti = current_thread_info();
#ifdef CONFIG_PREEMPT_BKL
struct task_struct *task = current;
int saved_lock_depth;
#endif
/* Catch callers which need to be fixed */
BUG_ON(ti->preempt_count || !irqs_disabled());
need_resched:
add_preempt_count(PREEMPT_ACTIVE);
/*
* We keep the big kernel semaphore locked, but we
* clear ->lock_depth so that schedule() doesnt
* auto-release the semaphore:
*/
#ifdef CONFIG_PREEMPT_BKL
saved_lock_depth = task->lock_depth;
task->lock_depth = -1;
#endif
local_irq_enable();
schedule();
local_irq_disable();
#ifdef CONFIG_PREEMPT_BKL
task->lock_depth = saved_lock_depth;
#endif
sub_preempt_count(PREEMPT_ACTIVE);
/* we could miss a preemption opportunity between schedule and now */
barrier();
if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
goto need_resched;
}
So I found where the scheduling occurs, but my question is, "where in the source code does the actually context switching happen?". For context switching, the stack, mm settings, registers should be switched and the PC (program counter) should be set to the new task. Where can I find the source code for that? I followed schedule() --> context_switch() --> switch_to(). Below is the context_switch function which calls switch_to() function.(kernel/sched.c)
/*
* context_switch - switch to the new MM and the new
* thread's register state.
*/
static inline void
context_switch(struct rq *rq, struct task_struct *prev,
struct task_struct *next)
{
struct mm_struct *mm, *oldmm;
prepare_task_switch(rq, prev, next);
mm = next->mm;
oldmm = prev->active_mm;
/*
* For paravirt, this is coupled with an exit in switch_to to
* combine the page table reload and the switch backend into
* one hypercall.
*/
arch_enter_lazy_cpu_mode();
if (unlikely(!mm)) {
next->active_mm = oldmm;
atomic_inc(&oldmm->mm_count);
enter_lazy_tlb(oldmm, next);
} else
switch_mm(oldmm, mm, next);
if (unlikely(!prev->mm)) {
prev->active_mm = NULL;
rq->prev_mm = oldmm;
}
/*
* Since the runqueue lock will be released by the next
* task (which is an invalid locking op but in the case
* of the scheduler it's an obvious special-case), so we
* do an early lockdep release here:
*/
#ifndef __ARCH_WANT_UNLOCKED_CTXSW
spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
#endif
/* Here we just switch the register state and the stack. */
switch_to(prev, next, prev); // <---- this line
barrier();
/*
* this_rq must be evaluated again because prev may have moved
* CPUs since it called schedule(), thus the 'rq' on its stack
* frame will be invalid.
*/
finish_task_switch(this_rq(), prev);
}
The 'switch_to' is an assembly code under include/asm-x86_64/system.h.
my question is, is the processor switched to the new task inside the 'switch_to()' function? Then, are the codes 'barrier(); finish_task_switch(this_rq(), prev);' run at some other time later? By the way, this was in interrupt context, so if to_switch() is just the end of this ISR, who finishes this interrupt? Or, if the finish_task_switch runs, how is CPU occupied by the new task?
I would really appreciate if someone could explain and clarify things to me.
Almost all of the work for a context switch is done by the normal SYSCALL/SYSRET mechanism. The process pushes its state on the stack of "current" the current running process. Calling do_sched_yield just changes the value of current, so the return just restores the state of a different task.
Preemption gets trickier, since it doesn't happen at a normal boundary. The preemption code has to save and restore all of the task state, which is slow. That's why non-RT kernels avoid doing preemption. The arch-specific switch_to code is what saves all the prev task state and sets up the next task state so that SYSRET will run the next task correctly. There are no magic jumps or anything in the code, it is just setting up the hardware for userspace.

How to make mprotect() to make forward progress after handling pagefaulte exception? [duplicate]

I want to write a signal handler to catch SIGSEGV.
I protect a block of memory for read or write using
char *buffer;
char *p;
char a;
int pagesize = 4096;
mprotect(buffer,pagesize,PROT_NONE)
This protects pagesize bytes of memory starting at buffer against any reads or writes.
Second, I try to read the memory:
p = buffer;
a = *p
This will generate a SIGSEGV, and my handler will be called.
So far so good. My problem is that, once the handler is called, I want to change the access write of the memory by doing
mprotect(buffer,pagesize,PROT_READ);
and continue normal functioning of my code. I do not want to exit the function.
On future writes to the same memory, I want to catch the signal again and modify the write rights and then record that event.
Here is the code:
#include <signal.h>
#include <stdio.h>
#include <malloc.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/mman.h>
#define handle_error(msg) \
do { perror(msg); exit(EXIT_FAILURE); } while (0)
char *buffer;
int flag=0;
static void handler(int sig, siginfo_t *si, void *unused)
{
printf("Got SIGSEGV at address: 0x%lx\n",(long) si->si_addr);
printf("Implements the handler only\n");
flag=1;
//exit(EXIT_FAILURE);
}
int main(int argc, char *argv[])
{
char *p; char a;
int pagesize;
struct sigaction sa;
sa.sa_flags = SA_SIGINFO;
sigemptyset(&sa.sa_mask);
sa.sa_sigaction = handler;
if (sigaction(SIGSEGV, &sa, NULL) == -1)
handle_error("sigaction");
pagesize=4096;
/* Allocate a buffer aligned on a page boundary;
initial protection is PROT_READ | PROT_WRITE */
buffer = memalign(pagesize, 4 * pagesize);
if (buffer == NULL)
handle_error("memalign");
printf("Start of region: 0x%lx\n", (long) buffer);
printf("Start of region: 0x%lx\n", (long) buffer+pagesize);
printf("Start of region: 0x%lx\n", (long) buffer+2*pagesize);
printf("Start of region: 0x%lx\n", (long) buffer+3*pagesize);
//if (mprotect(buffer + pagesize * 0, pagesize,PROT_NONE) == -1)
if (mprotect(buffer + pagesize * 0, pagesize,PROT_NONE) == -1)
handle_error("mprotect");
//for (p = buffer ; ; )
if(flag==0)
{
p = buffer+pagesize/2;
printf("It comes here before reading memory\n");
a = *p; //trying to read the memory
printf("It comes here after reading memory\n");
}
else
{
if (mprotect(buffer + pagesize * 0, pagesize,PROT_READ) == -1)
handle_error("mprotect");
a = *p;
printf("Now i can read the memory\n");
}
/* for (p = buffer;p<=buffer+4*pagesize ;p++ )
{
//a = *(p);
*(p) = 'a';
printf("Writing at address %p\n",p);
}*/
printf("Loop completed\n"); /* Should never happen */
exit(EXIT_SUCCESS);
}
The problem is that only the signal handler runs and I can't return to the main function after catching the signal.
When your signal handler returns (assuming it doesn't call exit or longjmp or something that prevents it from actually returning), the code will continue at the point the signal occurred, reexecuting the same instruction. Since at this point, the memory protection has not been changed, it will just throw the signal again, and you'll be back in your signal handler in an infinite loop.
So to make it work, you have to call mprotect in the signal handler. Unfortunately, as Steven Schansker notes, mprotect is not async-safe, so you can't safely call it from the signal handler. So, as far as POSIX is concerned, you're screwed.
Fortunately on most implementations (all modern UNIX and Linux variants as far as I know), mprotect is a system call, so is safe to call from within a signal handler, so you can do most of what you want. The problem is that if you want to change the protections back after the read, you'll have to do that in the main program after the read.
Another possibility is to do something with the third argument to the signal handler, which points at an OS and arch specific structure that contains info about where the signal occurred. On Linux, this is a ucontext structure, which contains machine-specific info about the $PC address and other register contents where the signal occurred. If you modify this, you change where the signal handler will return to, so you can change the $PC to be just after the faulting instruction so it won't re-execute after the handler returns. This is very tricky to get right (and non-portable too).
edit
The ucontext structure is defined in <ucontext.h>. Within the ucontext the field uc_mcontext contains the machine context, and within that, the array gregs contains the general register context. So in your signal handler:
ucontext *u = (ucontext *)unused;
unsigned char *pc = (unsigned char *)u->uc_mcontext.gregs[REG_RIP];
will give you the pc where the exception occurred. You can read it to figure out what instruction it
was that faulted, and do something different.
As far as the portability of calling mprotect in the signal handler is concerned, any system that follows either the SVID spec or the BSD4 spec should be safe -- they allow calling any system call (anything in section 2 of the manual) in a signal handler.
You've fallen into the trap that all people do when they first try to handle signals. The trap? Thinking that you can actually do anything useful with signal handlers. From a signal handler, you are only allowed to call asynchronous and reentrant-safe library calls.
See this CERT advisory as to why and a list of the POSIX functions that are safe.
Note that printf(), which you are already calling, is not on that list.
Nor is mprotect. You're not allowed to call it from a signal handler. It might work, but I can promise you'll run into problems down the road. Be really careful with signal handlers, they're tricky to get right!
EDIT
Since I'm being a portability douchebag at the moment already, I'll point out that you also shouldn't write to shared (i.e. global) variables without taking the proper precautions.
You can recover from SIGSEGV on linux. Also you can recover from segmentation faults on Windows (you'll see a structured exception instead of a signal). But the POSIX standard doesn't guarantee recovery, so your code will be very non-portable.
Take a look at libsigsegv.
You should not return from the signal handler, as then behavior is undefined. Rather, jump out of it with longjmp.
This is only okay if the signal is generated in an async-signal-safe function. Otherwise, behavior is undefined if the program ever calls another async-signal-unsafe function. Hence, the signal handler should only be established immediately before it is necessary, and disestablished as soon as possible.
In fact, I know of very few uses of a SIGSEGV handler:
use an async-signal-safe backtrace library to log a backtrace, then die.
in a VM such as the JVM or CLR: check if the SIGSEGV occurred in JIT-compiled code. If not, die; if so, then throw a language-specific exception (not a C++ exception), which works because the JIT compiler knew that the trap could happen and generated appropriate frame unwind data.
clone() and exec() a debugger (do not use fork() – that calls callbacks registered by pthread_atfork()).
Finally, note that any action that triggers SIGSEGV is probably UB, as this is accessing invalid memory. However, this would not be the case if the signal was, say, SIGFPE.
There is a compilation problem using ucontext_t or struct ucontext (present in /usr/include/sys/ucontext.h)
http://www.mail-archive.com/arch-general#archlinux.org/msg13853.html

Process wait using linux system call wait

I am trying to create a process using fork system call and then wait on the child process. I have used the following:
waitpid (pid, &status, 0);
1) The first problem is that the status is 8 bit shifted to the left e.g., if the child process returns 1, the waitpid function returns the value of the status in the status variable to be 256. Please let me know why it is doing that.
2) According to the manual, the waitpid waits for the child process to change state. but then it also says:
"The wait() system call suspends execution of the calling process until
one of its children terminates. The call wait(&status) is equivalent
to:
waitpid(-1, &status, 0);"
I am a bit confused here whether the waitpid and the wait calls wait for state change or for child process termination. Kindly clearify this point.
What does the zero in the third arguement specifies?
3) If i put the child process in sleep state, doesn't the state of the child process changes to be in waiting state by waiting for e.g., 5 secs?
Following is my program:
int main(int argc, char ** argv)
{
pid_t pid = fork();
pid_t ppp;
if (pid==0)
{
sleep(8);
printf ("\n I am the first child and my id is %d \n", getpid());
printf ("The first child process is now exiting now exiting\n\n");
exit (1);
}
else {
int status = 13;
printf ("\nI am now waiting for the child process %d\n", pid);
waitpid (pid, &status, 0);
printf ("\n the status returned by the exiting child is %d\n", status>>8);
}
printf("\nI am now exiting");
exit(0);
}
Thanks
The status parameter encodes more than just the exit code of the child. From man waitpid:
WIFEXITED(status)
returns true if the child terminated normally, that is, by calling exit(3) or _exit(2), or by returning from main().
WEXITSTATUS(status)
returns the exit status of the child. This consists of the least significant 8 bits of the status argument that the child specified in a call to exit(3) or _exit(2) or as the argument for a return statement in main(). This macro should only be employed if WIFEXITED returned true.
main waitpid explains what the third parameter does.
The value of options is an OR of zero or more of the following constants:
WNOHANG
return immediately if no child has exited.
WUNTRACED
also return if a child has stopped (but not traced via ptrace(2)). Status for traced children which have stopped is provided even if this option is not specified.
WCONTINUED (since Linux 2.6.10)
also return if a stopped child has been resumed by delivery of SIGCONT.
State change is very precisely and narrowly defined. From man waitpid:
A state change is considered to be: the child terminated; the child was stopped by a signal; or the child was resumed by a signal.
Going to sleep is not a state change. Being stopped by SIGSTOP/SIGTSTP is.

What is the meaning of the instruction {interrupt do_IRQ} in linux kernel?

What is the meaning of the instruction {interrupt do_IRQ} in linux kernel file arch/x86/kernel/entry_64.S ? Is interrupt a instruction or a macro? Where is the definition? How to use it ?
847 common_interrupt:
848 XCPT_FRAME
849 addq $-0x80,(%rsp) /* Adjust vector to [-256,-1] range */
850 interrupt do_IRQ
851 /* 0(%rsp): old_rsp-ARGOFFSET */
It's declared a short distance above:
/* 0(%rsp): ~(interrupt number) */
.macro interrupt func
/* reserve pt_regs for scratch regs and rbp */
subq $ORIG_RAX-RBP, %rsp
CFI_ADJUST_CFA_OFFSET ORIG_RAX-RBP
call save_args
PARTIAL_FRAME 0
call \func
.endm
I don't know what that does, though. :-)
Interrupt are basically used for suspending all the current processes running on the current interrupted cpu core & then run the generated interrupt related work. & the interrupt related work is done with the handler routine or function which is registered.
Interrupt may be generated by H/W or S/W. and there are basically two types of interrupt as...1-)soft interrupt & 2-)hard interrupt.
so whenever a particular interrupt is generated its handler routine or function is called & this calling is related with the parameter passed in the function do_IRQ(struct pt_regs *regs) which is pt_regs structure type & it basically stores the registers values as...
struct pt_regs{
unsigned long r0;
unsigned long r1;
...
...
};
& for more info u can follow this link https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Reference_Guide/chap-Realtime_Reference_Guide-Hardware_interrupts.html

recvfrom() timeout with alarm()

I'm debugging the following code:
signal(SIGALRM, testt);
alarm(1);
result = recvfrom( listening_socket, buf, maxlen, 0, &from, &fromlen );
printf("stoped\n");
As described by man 3 siginterrupt the alarm should interrupt the system call, but in my case it doesn't. Alarm handler is invoked, but recvfrom is not interrupted.
However, when a new signal handler is specified with the signal(2) function, the system call is interrupted by default.
If I add siginterrupt(SIGALRM, 1); after setting the alarm handler then recvfrom is interrupted as expected.
What am I missing? What's wrong with my code?
NOTES: Replacing signal with sigaction is not what I'm looking for.
The siginterrupt(3) manpage was incorrect (version 3.33 of the Linux manpage set corrects the error). The default behavior of glibc's signal is to establish a signal that does not interrupt system calls. You can see this for yourself with strace:
void handler(int unused) {}
int main(void)
{
signal(SIGALRM, handler);
}
&longrightarrow;
$ strace -e trace=rt_sigaction ./a.out
rt_sigaction(SIGALRM,
{0x4005b4, [ALRM], SA_RESTORER|SA_RESTART, 0x7fe732d3d490},
{SIG_DFL, [], 0}, 8) = 0
Note the SA_RESTART in there. You can either keep doing what you're doing -- call siginterrupt(SIGALRM, 1) after establishing the handler -- or you can switch to using sigaction, which will let you set the flags the way you want in the first place. You said you didn't want to do that (why?) but nonetheless that is what I would recommend.

Resources