What is the meaning of the instruction {interrupt do_IRQ} in linux kernel?

What is the meaning of the instruction {interrupt do_IRQ} in linux kernel? - linux

What is the meaning of the instruction {interrupt do_IRQ} in linux kernel file arch/x86/kernel/entry_64.S ? Is interrupt a instruction or a macro? Where is the definition? How to use it ?
847 common_interrupt:
848 XCPT_FRAME
849 addq $-0x80,(%rsp) /* Adjust vector to [-256,-1] range */
850 interrupt do_IRQ
851 /* 0(%rsp): old_rsp-ARGOFFSET */

It's declared a short distance above:
/* 0(%rsp): ~(interrupt number) */
.macro interrupt func
/* reserve pt_regs for scratch regs and rbp */
subq $ORIG_RAX-RBP, %rsp
CFI_ADJUST_CFA_OFFSET ORIG_RAX-RBP
call save_args
PARTIAL_FRAME 0
call \func
.endm
I don't know what that does, though. :-)

Interrupt are basically used for suspending all the current processes running on the current interrupted cpu core & then run the generated interrupt related work. & the interrupt related work is done with the handler routine or function which is registered.
Interrupt may be generated by H/W or S/W. and there are basically two types of interrupt as...1-)soft interrupt & 2-)hard interrupt.
so whenever a particular interrupt is generated its handler routine or function is called & this calling is related with the parameter passed in the function do_IRQ(struct pt_regs *regs) which is pt_regs structure type & it basically stores the registers values as...
struct pt_regs{
unsigned long r0;
unsigned long r1;
...
...
};
& for more info u can follow this link https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Reference_Guide/chap-Realtime_Reference_Guide-Hardware_interrupts.html

Related

where is the context switching finally happening in the linux kernel source?

In linux, process scheduling occurs after all interrupts (timer interrupt, and other interrupts) or when a process relinquishes CPU(by calling explicit schedule() function). Today I was trying to see where context switching occurs in linux source (kernel version 2.6.23)
(I think I checked this several years ago but I'm not sure now..I was looking at sparc arch then.)
I looked it up from the main_timer_handler(in arch/x86_64/kernel/time.c), but couldn't find it.
Finally I found it in ./arch/x86_64/kernel/entry.S.
ENTRY(common_interrupt)
XCPT_FRAME
interrupt do_IRQ
/* 0(%rsp): oldrsp-ARGOFFSET */
ret_from_intr:
cli
TRACE_IRQS_OFF
decl %gs:pda_irqcount
leaveq
CFI_DEF_CFA_REGISTER rsp
CFI_ADJUST_CFA_OFFSET -8
exit_intr:
GET_THREAD_INFO(%rcx)
testl $3,CS-ARGOFFSET(%rsp)
je retint_kernel
...(omit)
GET_THREAD_INFO(%rcx)
jmp retint_check
#ifdef CONFIG_PREEMPT
/* Returning to kernel space. Check if we need preemption */
/* rcx: threadinfo. interrupts off. */
ENTRY(retint_kernel)
cmpl $0,threadinfo_preempt_count(%rcx)
jnz retint_restore_args
bt $TIF_NEED_RESCHED,threadinfo_flags(%rcx)
jnc retint_restore_args
bt $9,EFLAGS-ARGOFFSET(%rsp) /* interrupts off? */
jnc retint_restore_args
call preempt_schedule_irq
jmp exit_intr
#endif
CFI_ENDPROC
END(common_interrupt)
At the end of the ISR is a call to preempt_schedule_irq! and the preempt_schedule_irq is defined in kernel/sched.c as below(it calls schedule() in the middle).
/*
* this is the entry point to schedule() from kernel preemption
* off of irq context.
* Note, that this is called and return with irqs disabled. This will
* protect us against recursive calling from irq.
*/
asmlinkage void __sched preempt_schedule_irq(void)
{
struct thread_info *ti = current_thread_info();
#ifdef CONFIG_PREEMPT_BKL
struct task_struct *task = current;
int saved_lock_depth;
#endif
/* Catch callers which need to be fixed */
BUG_ON(ti->preempt_count || !irqs_disabled());
need_resched:
add_preempt_count(PREEMPT_ACTIVE);
/*
* We keep the big kernel semaphore locked, but we
* clear ->lock_depth so that schedule() doesnt
* auto-release the semaphore:
*/
#ifdef CONFIG_PREEMPT_BKL
saved_lock_depth = task->lock_depth;
task->lock_depth = -1;
#endif
local_irq_enable();
schedule();
local_irq_disable();
#ifdef CONFIG_PREEMPT_BKL
task->lock_depth = saved_lock_depth;
#endif
sub_preempt_count(PREEMPT_ACTIVE);
/* we could miss a preemption opportunity between schedule and now */
barrier();
if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
goto need_resched;
}
So I found where the scheduling occurs, but my question is, "where in the source code does the actually context switching happen?". For context switching, the stack, mm settings, registers should be switched and the PC (program counter) should be set to the new task. Where can I find the source code for that? I followed schedule() --> context_switch() --> switch_to(). Below is the context_switch function which calls switch_to() function.(kernel/sched.c)
/*
* context_switch - switch to the new MM and the new
* thread's register state.
*/
static inline void
context_switch(struct rq *rq, struct task_struct *prev,
struct task_struct *next)
{
struct mm_struct *mm, *oldmm;
prepare_task_switch(rq, prev, next);
mm = next->mm;
oldmm = prev->active_mm;
/*
* For paravirt, this is coupled with an exit in switch_to to
* combine the page table reload and the switch backend into
* one hypercall.
*/
arch_enter_lazy_cpu_mode();
if (unlikely(!mm)) {
next->active_mm = oldmm;
atomic_inc(&oldmm->mm_count);
enter_lazy_tlb(oldmm, next);
} else
switch_mm(oldmm, mm, next);
if (unlikely(!prev->mm)) {
prev->active_mm = NULL;
rq->prev_mm = oldmm;
}
/*
* Since the runqueue lock will be released by the next
* task (which is an invalid locking op but in the case
* of the scheduler it's an obvious special-case), so we
* do an early lockdep release here:
*/
#ifndef __ARCH_WANT_UNLOCKED_CTXSW
spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
#endif
/* Here we just switch the register state and the stack. */
switch_to(prev, next, prev); // <---- this line
barrier();
/*
* this_rq must be evaluated again because prev may have moved
* CPUs since it called schedule(), thus the 'rq' on its stack
* frame will be invalid.
*/
finish_task_switch(this_rq(), prev);
}
The 'switch_to' is an assembly code under include/asm-x86_64/system.h.
my question is, is the processor switched to the new task inside the 'switch_to()' function? Then, are the codes 'barrier(); finish_task_switch(this_rq(), prev);' run at some other time later? By the way, this was in interrupt context, so if to_switch() is just the end of this ISR, who finishes this interrupt? Or, if the finish_task_switch runs, how is CPU occupied by the new task?
I would really appreciate if someone could explain and clarify things to me.

Almost all of the work for a context switch is done by the normal SYSCALL/SYSRET mechanism. The process pushes its state on the stack of "current" the current running process. Calling do_sched_yield just changes the value of current, so the return just restores the state of a different task.
Preemption gets trickier, since it doesn't happen at a normal boundary. The preemption code has to save and restore all of the task state, which is slow. That's why non-RT kernels avoid doing preemption. The arch-specific switch_to code is what saves all the prev task state and sets up the next task state so that SYSRET will run the next task correctly. There are no magic jumps or anything in the code, it is just setting up the hardware for userspace.

How to make mprotect() to make forward progress after handling pagefaulte exception? [duplicate]

I want to write a signal handler to catch SIGSEGV.
I protect a block of memory for read or write using
char *buffer;
char *p;
char a;
int pagesize = 4096;
mprotect(buffer,pagesize,PROT_NONE)
This protects pagesize bytes of memory starting at buffer against any reads or writes.
Second, I try to read the memory:
p = buffer;
a = *p
This will generate a SIGSEGV, and my handler will be called.
So far so good. My problem is that, once the handler is called, I want to change the access write of the memory by doing
mprotect(buffer,pagesize,PROT_READ);
and continue normal functioning of my code. I do not want to exit the function.
On future writes to the same memory, I want to catch the signal again and modify the write rights and then record that event.
Here is the code:
#include <signal.h>
#include <stdio.h>
#include <malloc.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/mman.h>
#define handle_error(msg) \
do { perror(msg); exit(EXIT_FAILURE); } while (0)
char *buffer;
int flag=0;
static void handler(int sig, siginfo_t *si, void *unused)
{
printf("Got SIGSEGV at address: 0x%lx\n",(long) si->si_addr);
printf("Implements the handler only\n");
flag=1;
//exit(EXIT_FAILURE);
}
int main(int argc, char *argv[])
{
char *p; char a;
int pagesize;
struct sigaction sa;
sa.sa_flags = SA_SIGINFO;
sigemptyset(&sa.sa_mask);
sa.sa_sigaction = handler;
if (sigaction(SIGSEGV, &sa, NULL) == -1)
handle_error("sigaction");
pagesize=4096;
/* Allocate a buffer aligned on a page boundary;
initial protection is PROT_READ | PROT_WRITE */
buffer = memalign(pagesize, 4 * pagesize);
if (buffer == NULL)
handle_error("memalign");
printf("Start of region: 0x%lx\n", (long) buffer);
printf("Start of region: 0x%lx\n", (long) buffer+pagesize);
printf("Start of region: 0x%lx\n", (long) buffer+2*pagesize);
printf("Start of region: 0x%lx\n", (long) buffer+3*pagesize);
//if (mprotect(buffer + pagesize * 0, pagesize,PROT_NONE) == -1)
if (mprotect(buffer + pagesize * 0, pagesize,PROT_NONE) == -1)
handle_error("mprotect");
//for (p = buffer ; ; )
if(flag==0)
{
p = buffer+pagesize/2;
printf("It comes here before reading memory\n");
a = *p; //trying to read the memory
printf("It comes here after reading memory\n");
}
else
{
if (mprotect(buffer + pagesize * 0, pagesize,PROT_READ) == -1)
handle_error("mprotect");
a = *p;
printf("Now i can read the memory\n");
}
/* for (p = buffer;p<=buffer+4*pagesize ;p++ )
{
//a = *(p);
*(p) = 'a';
printf("Writing at address %p\n",p);
}*/
printf("Loop completed\n"); /* Should never happen */
exit(EXIT_SUCCESS);
}
The problem is that only the signal handler runs and I can't return to the main function after catching the signal.

When your signal handler returns (assuming it doesn't call exit or longjmp or something that prevents it from actually returning), the code will continue at the point the signal occurred, reexecuting the same instruction. Since at this point, the memory protection has not been changed, it will just throw the signal again, and you'll be back in your signal handler in an infinite loop.
So to make it work, you have to call mprotect in the signal handler. Unfortunately, as Steven Schansker notes, mprotect is not async-safe, so you can't safely call it from the signal handler. So, as far as POSIX is concerned, you're screwed.
Fortunately on most implementations (all modern UNIX and Linux variants as far as I know), mprotect is a system call, so is safe to call from within a signal handler, so you can do most of what you want. The problem is that if you want to change the protections back after the read, you'll have to do that in the main program after the read.
Another possibility is to do something with the third argument to the signal handler, which points at an OS and arch specific structure that contains info about where the signal occurred. On Linux, this is a ucontext structure, which contains machine-specific info about the $PC address and other register contents where the signal occurred. If you modify this, you change where the signal handler will return to, so you can change the $PC to be just after the faulting instruction so it won't re-execute after the handler returns. This is very tricky to get right (and non-portable too).
edit
The ucontext structure is defined in <ucontext.h>. Within the ucontext the field uc_mcontext contains the machine context, and within that, the array gregs contains the general register context. So in your signal handler:
ucontext *u = (ucontext *)unused;
unsigned char *pc = (unsigned char *)u->uc_mcontext.gregs[REG_RIP];
will give you the pc where the exception occurred. You can read it to figure out what instruction it
was that faulted, and do something different.
As far as the portability of calling mprotect in the signal handler is concerned, any system that follows either the SVID spec or the BSD4 spec should be safe -- they allow calling any system call (anything in section 2 of the manual) in a signal handler.

You've fallen into the trap that all people do when they first try to handle signals. The trap? Thinking that you can actually do anything useful with signal handlers. From a signal handler, you are only allowed to call asynchronous and reentrant-safe library calls.
See this CERT advisory as to why and a list of the POSIX functions that are safe.
Note that printf(), which you are already calling, is not on that list.
Nor is mprotect. You're not allowed to call it from a signal handler. It might work, but I can promise you'll run into problems down the road. Be really careful with signal handlers, they're tricky to get right!
EDIT
Since I'm being a portability douchebag at the moment already, I'll point out that you also shouldn't write to shared (i.e. global) variables without taking the proper precautions.

You can recover from SIGSEGV on linux. Also you can recover from segmentation faults on Windows (you'll see a structured exception instead of a signal). But the POSIX standard doesn't guarantee recovery, so your code will be very non-portable.
Take a look at libsigsegv.

You should not return from the signal handler, as then behavior is undefined. Rather, jump out of it with longjmp.
This is only okay if the signal is generated in an async-signal-safe function. Otherwise, behavior is undefined if the program ever calls another async-signal-unsafe function. Hence, the signal handler should only be established immediately before it is necessary, and disestablished as soon as possible.
In fact, I know of very few uses of a SIGSEGV handler:
use an async-signal-safe backtrace library to log a backtrace, then die.
in a VM such as the JVM or CLR: check if the SIGSEGV occurred in JIT-compiled code. If not, die; if so, then throw a language-specific exception (not a C++ exception), which works because the JIT compiler knew that the trap could happen and generated appropriate frame unwind data.
clone() and exec() a debugger (do not use fork() – that calls callbacks registered by pthread_atfork()).
Finally, note that any action that triggers SIGSEGV is probably UB, as this is accessing invalid memory. However, this would not be the case if the signal was, say, SIGFPE.

There is a compilation problem using ucontext_t or struct ucontext (present in /usr/include/sys/ucontext.h)
http://www.mail-archive.com/arch-general#archlinux.org/msg13853.html

How does linux know when to allocate more pages to a call stack?

Given the program below, segfault() will (As the name suggests) segfault the program by accessing 256k below the stack. nofault() however, gradually pushes below the stack all the way to 1m below, but never segfaults.
Additionally, running segfault() after nofault() doesn't result in an error either.
If I put sleep()s in nofault() and use the time to cat /proc/$pid/maps I see the allocated stack space grows between the first and second call, this explains why segfault() doesn't crash afterwards - there's plenty of memory.
But the disassembly shows there's no change to %rsp. This makes sense since that would screw up the call stack.
I presumed that the maximum stack size would be baked into the binary at compile time (In retrospect that would be very hard for a compiler to do) or that it would just periodically check %rsp and add a buffer after that.
How does the kernel know when to increase the stack memory?
#include <stdio.h>
#include <unistd.h>
void segfault(){
char * x;
int a;
for( x = (char *)&x-1024*256; x<(char *)(&x+1); x++){
a = *x & 0xFF;
printf("%p = 0x%02x\n",x,a);
}
}
void nofault(){
char * x;
int a;
sleep(20);
for( x = (char *)(&x); x>(char *)&x-1024*1024; x--){
a = *x & 0xFF;
printf("%p = 0x%02x\n",x,a);
}
sleep(20);
}
int main(){
nofault();
segfault();
}

The processor raises a page fault when you access an unmapped page. The kernel's page fault handler checks whether the address is reasonably close to the process's %rsp and if so, it allocates some memory and resumes the process. If you are too far below %rsp, the kernel passes the fault along to the process as a signal.
I tried to find the precise definition of what addresses are close enough to %rsp to trigger stack growth, and came up with this from linux/arch/x86/mm.c:
/*
* Accessing the stack below %sp is always a bug.
* The large cushion allows instructions like enter
* and pusha to work. ("enter $65535, $31" pushes
* 32 pointers and then decrements %sp by 65535.)
*/
if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
bad_area(regs, error_code, address);
return;
}
But experimenting with your program I found that 65536+32*sizeof(unsigned long) isn't the actual cutoff point between segfault and no segfault. It seems to be about twice that value. So I'll just stick with the vague "reasonably close" as my official answer.

ptrace: STEPINTO returns breakpoint signal

Yet another ptrace question:
The process is paused for the initial SIGTRAP, then i call ptrace with STEPINTO to step the next instruction. After the ptrace call, i call waitpid(...) and sure enough it does receive a SIGTRAP signal! Great!
however, when i examine the siginfo_t structure by calling
ptrace(PTRACE_GETSIGINFO, pid, 0, &siginfo)
the fields are as follows:
si_signo = SIGTRAP
si_code = 1
however, 1 is the signal code for a breakpoint. No breakpoint was hit in this case:
/* `si_code' values for SIGTRAP signal. */
enum
{
TRAP_BRKPT = 1, /* Process breakpoint. */
# define TRAP_BRKPT TRAP_BRKPT
TRAP_TRACE /* Process trace trap. */
# define TRAP_TRACE TRAP_TRACE
};
What's going on?
All calls involved return success (0) and errno is 0, too.
edit: so what happens when i hit a real 0xCC breakpoint?
A SIGTRAP is fired and the si_code field is 0x80 ! 0x80 isnt even defined for SIGTRAP! :(

Is there something wrong with my spin lock?

Here is my implementation of a spin lock, but it seems it can not protect the critical code. Is there something wrong with my implementation?
static __inline__ int xchg_asm(int* lock, int val)
{
int ret;
__asm__ __volatile__(
LOCK "movl (%1),%%eax;
xchg (%1),%2;
movl %%eax, %0" :"=m" (ret) :"d"(lock), "c"(val)
);
return ret;
}
void spin_init(spinlock_t* sl)
{
sl->val = 0;
}
void spin_lock(spinlock_t* sl)
{
int ret;
do {
ret = xchg_asm(&(sl->val), 1);
} while ( ret==0 );
}
void spin_unlock(spinlock_t* sl)
{
xchg_asm(&(sl->val), 0);
}

Your code equals to:
static __inline__ int xchg_asm(int* lock, int val) {
int save_old_value_at_eax;
save_old_value_at_eax = *lock; /* with a wrong lock prefix */
xchg *lock with val and discard the original value of *lock.
return save_old_value_at_eax; /* but it not the real original value of *lock */
}
You can see from the code, save_old_value_at_eax is no the real original value while the cpu perform xchg. You should get the old/original value by the xchg instruction, not by saving it before perform xchg. ("it is not the real old/original value" means, if another CPU takes the lock after this CPU saves the value but before this CPU performs the xchg instruction, this CPU will get the wrong old value, and it think it took the lock successful, thus, two CPUs enter the C.S. at the same time). You have separated a read-modify-write instruction to three instructions, the whole three instructions are not atomically(even you move the lock prefix to xchg).
I guess you thought the lock prefix will lock the WHOLE three instructions, but actually lock prefix can only be used for the only instruction which it is attached(not all instructions can be attached)
And we don't need lock prefix on SMP for xchg. Quote from linux_kernel_src/arch/x86//include/asm/cmpxchg.h
/*
* Note: no "lock" prefix even on SMP: xchg always implies lock anyway.
* Since this is generally used to protect other memory information, we
* use "asm volatile" and "memory" clobbers to prevent gcc from moving
* information around.
*/
My suggestions:
DON'T REPEAT YOURSELF, please use the spin lock of the linux kernel.
DON'T REPEAT YOURSELF, please use the xchg(), cmpxchg() of the linux kernel if you do want to implement a spin lock.
learn more about instructions. you can also find out how the linux kernel implement it.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string