In a SIGILL handler, how can I skip the offending instruction? - linux

I'm doing JIT code generation, and I want to insert invalid opcodes into the stream in order to perform some meta-debugging. Everything is fine and good until execution hits one of those instructions, at which point the process goes into an infinite loop: illegal instruction, signal handler, and back.
Is there any way I can set the thing to simply skip the bad instruction?

It's very hacky and UNPORTABLE but:
void sighandler(int signo, siginfo_t *si, void *data) {
    ucontext_t *uc = (ucontext_t *)data;
    int instruction_length = /* the length of the "instruction" to skip */;
    uc->uc_mcontext.gregs[REG_RIP] += instruction_length;
}
Install the handler like this:
struct sigaction sa, osa;
sa.sa_flags = SA_ONSTACK | SA_RESTART | SA_SIGINFO;
sa.sa_sigaction = sighandler;
sigemptyset(&sa.sa_mask);
sigaction(SIGILL, &sa, &osa);
That could work if you know how far to skip (and it's an Intel proc) :-)
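For instance, a minimal self-contained sketch (assuming x86-64 Linux, glibc, and that the JIT only ever emits the 2-byte ud2 opcode as its deliberately invalid instruction):

/* Sketch: skip a known-length invalid opcode on x86-64 Linux.
   Assumes the only invalid instruction emitted is ud2 (0x0F 0x0B). */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <ucontext.h>

static void sighandler(int signo, siginfo_t *si, void *data)
{
    ucontext_t *uc = (ucontext_t *)data;
    uc->uc_mcontext.gregs[REG_RIP] += 2;   /* length of ud2 */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = sighandler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGILL, &sa, NULL);

    asm volatile("ud2");                   /* raises SIGILL, handler skips it */
    puts("survived the invalid opcode");
    return 0;
}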

You can also try another approach (if it applies to your case):
you can use a SIGTRAP which is easier to manage.
void sigtrap_handler(int sig) {
    printf("Process %d received sigtrap %d.\n", getpid(), sig);
}
signal(SIGTRAP,sigtrap_handler);
asm("int3"); // causes a SIGTRAP

Related

Do I need a QMutex for a variable that is accessed by a single statement?

In this document, a QMutex is used to protect "number" from being modified by multiple threads at the same time.
I have a code in which a thread is instructed to do different work according to a flag set by another thread.
//In thread1
if (flag)
    dowork1;
else
    dowork2;

//In thread2
void setflag(bool f)
{
    flag = f;
}
I want to know if a QMutex is needed to protect flag, i.e.,
//In thread1
mutex.lock();
if (flag)
{
    mutex.unlock();
    dowork1;
}
else
{
    mutex.unlock();
    dowork2;
}

//In thread2
void setflag(bool f)
{
    mutex.lock();
    flag = f;
    mutex.unlock();
}
The code is different from the document in that flag is accessed (read/written) by a single statement in both threads, and only one thread modifies the value of flag.
PS:
I always see the example in multi-thread programming tutorials where one thread does "count++" and the other thread does "count--", and the tutorials say you should use a mutex to protect the variable "count". I cannot get the point of using a mutex here. Does it mean the execution of the single statement "count++" or "count--" can be interrupted in the middle and produce unexpected results? What unexpected results can occur?
Does it mean the execution of the single statement "count++" or "count--" can be interrupted in the middle and produce unexpected results? What unexpected results can occur?
Just answering this part: yes, the execution can be interrupted in the middle of a statement.
Let's imagine a simple case:
class A {
    void foo() {
        ++a;
    }
    int a = 0;
};
The single statement ++a is translated in assembly to
mov eax, DWORD PTR [rdi]
add eax, 1
mov DWORD PTR [rdi], eax
which can be seen as
eax = a;
eax += 1;
a = eax;
If foo() is called on the same instance of A in 2 different threads (be it on a single core or on multiple cores) you cannot predict the result of the program.
It can behave nicely:
thread 1 > eax = a // eax in thread 1 is equal to 0
thread 1 > eax += 1 // eax in thread 1 is equal to 1
thread 1 > a = eax // a is set to 1
thread 2 > eax = a // eax in thread 2 is equal to 1
thread 2 > eax += 1 // eax in thread 2 is equal to 2
thread 2 > a = eax // a is set to 2
or not:
thread 1 > eax = a // eax in thread 1 is equal to 0
thread 2 > eax = a // eax in thread 2 is equal to 0
thread 2 > eax += 1 // eax in thread 2 is equal to 1
thread 2 > a = eax // a is set to 1
thread 1 > eax += 1 // eax in thread 1 is equal to 1
thread 1 > a = eax // a is set to 1
In a well-defined program, N calls to foo() should result in a == N.
But calling foo() on the same instance of A from multiple threads creates undefined behavior. There is no way to know the value of a after N calls to foo().
It will depend on how you compiled your program, what optimization flags were used, which compiler was used, what the load of your CPU was, the number of cores of your CPU, ...
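If you want to see this concretely, here is a hypothetical pthreads demo (not from the question; compile without optimization): two threads increment a plain int a million times each, and lost updates usually make the final value smaller than 2000000.

/* Hypothetical demo: data race on a plain int (this is undefined
   behavior; the point is to observe the lost updates in practice). */
#include <pthread.h>
#include <stdio.h>

static int counter = 0;

static void *work(void *arg)
{
    for (int i = 0; i < 1000000; i++)
        ++counter;              /* load, add, store: not atomic */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, work, NULL);
    pthread_create(&t2, NULL, work, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %d (expected 2000000)\n", counter);
    return 0;
}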
NB
class A {
public:
    bool check() const { return a == b; }
    int get_a() const { return a; }
    int get_b() const { return b; }
    void foo() {
        ++a;
        ++b;
    }
private:
    int a = 0;
    int b = 0;
};
Now we have a class that, for an external observer, keeps a and b equal at all times.
The optimizer could optimize this class into:
class A {
public:
    bool check() const { return true; }
    int get_a() const { return a; }
    int get_b() const { return b; }
    void foo() {
        ++a;
        ++b;
    }
private:
    int a = 0;
    int b = 0;
};
because it does not change the observable behavior of the program.
However, if you invoke undefined behavior by calling foo() on the same instance of A from multiple threads, you could end up with a == 3, b == 2, and check() still returning true. Your code has lost its meaning; the program is not doing what it is supposed to and can be doing just about anything.
From here you can imagine more complex cases, like if A manages network connections, you can end up sending the data for client #10 to client #6. If your program is running in a factory, you can end up activating the wrong tool.
If you want the definition of undefined behavior you can look here: https://en.cppreference.com/w/cpp/language/ub
and in the C++ standard
For a better understanding of UB you can look for CppCon talks on the topic.
For any standard object (including bool) that is accessed from multiple threads, where at least one of the threads may modify the object's state, you need to protect access to that object using a mutex, otherwise you will invoke undefined behavior.
As a practical matter, for a bool that undefined behavior probably won't come in the form of a crash, but more likely in the form of thread B sometimes not "seeing" changes made to the bool by thread A, due to caching and/or optimization issues (e.g. the optimizer "knows" that the bool can't change during a function call, so it doesn't bother checking it more than once).
If you don't want to guard your accesses with a mutex, the other option is to change flag from a bool to a std::atomic<bool>; the std::atomic<bool> type has exactly the semantics you are looking for, i.e. it can be read and/or written from any thread without invoking undefined behavior.
Look here for an explanation: Do I have to use atomic<bool> for "exit" bool variable?
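For illustration, here is the same idea in C11 (a sketch using <stdatomic.h>; std::atomic<bool> in C++ behaves analogously):

/* Sketch: an atomic flag, the C11 analogue of std::atomic<bool>. */
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool flag = false;

/* In thread2: no mutex needed, the store is atomic and race-free. */
void setflag(bool f)
{
    atomic_store(&flag, f);
}

/* In thread1: the load is likewise atomic. */
void work(void)
{
    if (atomic_load(&flag))
        ;   /* dowork1 */
    else
        ;   /* dowork2 */
}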
To synchronize access to flag you can make it a std::atomic<bool>.
Or you can use a QReadWriteLock together with a QReadLocker and a QWriteLocker. Compared to using a QMutex this gives you the advantage that you do not need to care about the call to QMutex::unlock() if you use exceptions or early return statements.
Alternatively you can use a QMutexLocker if the QReadWriteLock does not match your use case.
QReadWriteLock lock;
...
//In thread1
{
    QReadLocker readLocker(&lock);
    if (flag)
        dowork1;
    else
        dowork2;
}
...
//In thread2
void setflag(bool f)
{
    QWriteLocker writeLocker(&lock);
    flag = f;
}
Keeping your program expressing its intent (i.e. accessing shared vars under locks) is a big win for program maintenance and clarity. You need some pretty good reasons to abandon that clarity for obscure approaches like atomics and carefully devised benign race conditions.
Good reasons include having measured your program spending too much time toggling the mutex. In any decent implementation, the difference between an uncontested mutex and an atomic is minute -- mutex lock and unlock typically employ an optimistic compare-and-swap, returning quickly. If your vendor doesn't provide a decent implementation, you might bring that up with them.
In your example, dowork1 and dowork2 are invoked with the mutex locked; so the mutex isn't just protecting flag, but also serializing these functions. If that is just an artifact of how you posed the question, then race conditions (variants of atomics travesty) are less scary.
In your PS (dup of comment above):
Yes, count++ is best thought of as:
mov $_count, %r1
ld (%r1), %r0
add $1, %r0, %r2
st %r2,(%r1)
Even on machines with a natural atomic increment instruction (x86, 68k, 370, dinosaurs), the compiler might not use it consistently.
So, if two threads do count--; and count++; at close to the same time, the result could be -1, 0, 1. (ignoring the language weenies that say your house might burn down).
barriers:
if CPU0 executes:
store $1 to b
store $2 to c
and CPU1 executes:
load barrier -- discard speculatively read values.
load b to r0
load c to r1
Then CPU1 could read r0,r1 as: (0,0), (1,0), (1,2), (0,2).
This is because the observable order of the memory writes is weak; the processor may make them visible in an arbitrary fashion.
So, we change CPU0 to execute:
store $1 to b
store barrier -- stop storing until all previous stores are visible
store $2 to c
Then, if CPU1 saw that r1 (c) was 2, then r0 (b) has to be 1. The store barrier enforces that.
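In C11 atomics the same pairing is written with release/acquire ordering (a sketch; b and c as in the example above):

#include <stdatomic.h>

static atomic_int b = 0, c = 0;

/* CPU0 */
void producer(void)
{
    atomic_store_explicit(&b, 1, memory_order_relaxed);
    /* release acts as the store barrier: b's store must be
       visible before c's store becomes visible */
    atomic_store_explicit(&c, 2, memory_order_release);
}

/* CPU1 */
void consumer(int *r0, int *r1)
{
    /* acquire acts as the load barrier: if we observe c == 2,
       we are guaranteed to observe b == 1 */
    *r1 = atomic_load_explicit(&c, memory_order_acquire);
    *r0 = atomic_load_explicit(&b, memory_order_relaxed);
}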
To me, it seems more handy to use a mutex here.
In general, not using a mutex when sharing references can lead to problems.
The only downside of using a mutex here seems to be that you will slightly decrease performance, because your threads have to wait for each other.
What kind of errors could happen?
As somebody in the comments said, it is a different situation if you share a fundamental datatype (e.g. int, bool, float) or an object reference. I added a Qt code example which demonstrates two possible problems when NOT using a mutex. Problem #3 is fundamental and is described in detail by Benjamin T in his nice answer.
main.cpp
#include <QCoreApplication>
#include <QThread>
#include <QtDebug>
#include <QTimer>
#include "countingthread.h"
int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);

    int amountThread = 3;
    int counter = 0;
    QString *s = new QString("foo");
    QMutex *mutex = new QMutex();

    //we create all threads
    QList<CountingThread*> threadList;
    for (int i = 0; i < amountThread; i++)
    {
        CountingThread *t = new CountingThread();
#ifdef TEST_ATOMIC_VAR_SHARE
        t->addCounterdRef(&counter);
#endif
#ifdef TEST_OBJECT_VAR_SHARE
        t->addStringRef(s);
        //we add a mutex, which is shared for reading and writing
        //only used with the TEST_OBJECT_SHARE_FIX define uncommented
        t->addMutexRef(mutex);
#endif
        //t->moveToThread(t);
        threadList.append(t);
    }

    //we start all with low priority, otherwise we produce something like a fork bomb
    for (int i = 0; i < amountThread; i++)
        threadList.at(i)->start(QThread::Priority::LowPriority);

    return a.exec();
}
countingthread.h
#ifndef COUNTINGTHREAD_H
#define COUNTINGTHREAD_H

#include <QThread>
#include <QtDebug>
#include <QTimer>
#include <QMutex>

//atomic var is shared
//#define TEST_ATOMIC_VAR_SHARE
//more complex object var is shared
#define TEST_OBJECT_VAR_SHARE
//we add the fix
#define TEST_OBJECT_SHARE_FIX

class CountingThread : public QThread
{
    Q_OBJECT

    int *m_counter;
    QString *m_string;
    QMutex *m_locker;

public:
    void addCounterdRef(int *r);
    void addStringRef(QString *s);
    void addMutexRef(QMutex *m);
    void run() override;
};

#endif // COUNTINGTHREAD_H
countingthread.cpp
#include "countingthread.h"
void CountingThread::run()
{
//forever
while(1)
{
#ifdef TEST_ATOMIC_VAR_SHARE
//first use of counter
int counterUse1Copy= (*m_counter);
//some other operations, here sleep 10 ms
this->msleep(10);
//we will retry to use a second time
int counterUse2Copy= (*m_counter);
if(counterUse1Copy != counterUse2Copy)
qDebug()<<this->thread()->currentThreadId()<<" problem #1 found, counter not like we expect";
//we increment afterwards our counter
(*m_counter) +=1; //this works for fundamental types, like float, int, ...
#endif
#ifdef TEST_OBJECT_VAR_SHARE
#ifdef TEST_OBJECT_SHARE_FIX
m_locker->lock();
#endif
m_string->replace("#","-");
//this will crash here !!, with problem #2,
//segmentation fault, is not handle by try catch
m_string->append("foomaster");
m_string->append("#");
if(m_string->length()>10000)
qDebug()<<this->thread()->currentThreadId()<<" string is: " << m_string;
#ifdef TEST_OBJECT_SHARE_FIX
m_locker->unlock();
#endif
#endif
}//end forever
}
void CountingThread::addCounterdRef(int *r)
{
m_counter = r;
qDebug()<<this->thread()->currentThreadId()<<" add counter with value: " << *m_counter << " and address : "<< m_counter ;
}
void CountingThread::addStringRef(QString *s)
{
m_string = s;
qDebug()<<this->thread()->currentThreadId()<<" add string with value: " << *m_string << " and address : "<< m_string ;
}
void CountingThread::addMutexRef(QMutex *m)
{
m_locker = m;
}
If you follow the code you are able to perform two tests.
If you uncomment TEST_ATOMIC_VAR_SHARE and comment out TEST_OBJECT_VAR_SHARE in countingthread.h, you will see
problem #1: if you use your variable multiple times in your thread, it can be changed in the background by another thread. Contrary to my expectation, there was no app crash or weird exception in my build environment when using an int counter.
If you uncomment TEST_OBJECT_VAR_SHARE and comment out TEST_OBJECT_SHARE_FIX and TEST_ATOMIC_VAR_SHARE in countingthread.h, you will see
problem #2: you get a segmentation fault, which is not possible to handle via try/catch. This happens because multiple threads are using string functions to edit the same object.
If you uncomment TEST_OBJECT_SHARE_FIX too, you see the correct handling via mutex.
problem #3: see the answer from Benjamin T.
What is a mutex?
I really like the chicken explanation which vallabh suggested.
I also found a good explanation here.

How to make mprotect() make forward progress after handling a pagefault exception? [duplicate]

I want to write a signal handler to catch SIGSEGV.
I protect a block of memory against read or write using
char *buffer;
char *p;
char a;
int pagesize = 4096;
mprotect(buffer, pagesize, PROT_NONE);
This protects pagesize bytes of memory starting at buffer against any reads or writes.
Second, I try to read the memory:
p = buffer;
a = *p;
This will generate a SIGSEGV, and my handler will be called.
So far so good. My problem is that, once the handler is called, I want to change the access rights of the memory by doing
mprotect(buffer,pagesize,PROT_READ);
and continue normal functioning of my code. I do not want to exit the function.
On future writes to the same memory, I want to catch the signal again, modify the access rights, and then record that event.
Here is the code:
#include <signal.h>
#include <stdio.h>
#include <malloc.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/mman.h>

#define handle_error(msg) \
    do { perror(msg); exit(EXIT_FAILURE); } while (0)

char *buffer;
int flag = 0;

static void handler(int sig, siginfo_t *si, void *unused)
{
    printf("Got SIGSEGV at address: 0x%lx\n", (long) si->si_addr);
    printf("Implements the handler only\n");
    flag = 1;
    //exit(EXIT_FAILURE);
}

int main(int argc, char *argv[])
{
    char *p; char a;
    int pagesize;
    struct sigaction sa;

    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sa.sa_sigaction = handler;
    if (sigaction(SIGSEGV, &sa, NULL) == -1)
        handle_error("sigaction");

    pagesize = 4096;

    /* Allocate a buffer aligned on a page boundary;
       initial protection is PROT_READ | PROT_WRITE */
    buffer = memalign(pagesize, 4 * pagesize);
    if (buffer == NULL)
        handle_error("memalign");

    printf("Start of region: 0x%lx\n", (long) buffer);
    printf("Start of region: 0x%lx\n", (long) buffer + pagesize);
    printf("Start of region: 0x%lx\n", (long) buffer + 2 * pagesize);
    printf("Start of region: 0x%lx\n", (long) buffer + 3 * pagesize);

    if (mprotect(buffer + pagesize * 0, pagesize, PROT_NONE) == -1)
        handle_error("mprotect");

    //for (p = buffer ; ; )
    if (flag == 0)
    {
        p = buffer + pagesize / 2;
        printf("It comes here before reading memory\n");
        a = *p; //trying to read the memory
        printf("It comes here after reading memory\n");
    }
    else
    {
        if (mprotect(buffer + pagesize * 0, pagesize, PROT_READ) == -1)
            handle_error("mprotect");
        a = *p;
        printf("Now i can read the memory\n");
    }

/*  for (p = buffer; p <= buffer + 4 * pagesize; p++)
    {
        //a = *(p);
        *(p) = 'a';
        printf("Writing at address %p\n", p);
    } */

    printf("Loop completed\n"); /* Should never happen */
    exit(EXIT_SUCCESS);
}
The problem is that only the signal handler runs and I can't return to the main function after catching the signal.
When your signal handler returns (assuming it doesn't call exit or longjmp or something that prevents it from actually returning), the code will continue at the point the signal occurred, reexecuting the same instruction. Since at this point, the memory protection has not been changed, it will just throw the signal again, and you'll be back in your signal handler in an infinite loop.
So to make it work, you have to call mprotect in the signal handler. Unfortunately, as Steven Schansker notes, mprotect is not async-safe, so you can't safely call it from the signal handler. So, as far as POSIX is concerned, you're screwed.
Fortunately on most implementations (all modern UNIX and Linux variants as far as I know), mprotect is a system call, so is safe to call from within a signal handler, so you can do most of what you want. The problem is that if you want to change the protections back after the read, you'll have to do that in the main program after the read.
Another possibility is to do something with the third argument to the signal handler, which points at an OS and arch specific structure that contains info about where the signal occurred. On Linux, this is a ucontext structure, which contains machine-specific info about the $PC address and other register contents where the signal occurred. If you modify this, you change where the signal handler will return to, so you can change the $PC to be just after the faulting instruction so it won't re-execute after the handler returns. This is very tricky to get right (and non-portable too).
edit
The ucontext structure is defined in <ucontext.h>. Within the ucontext the field uc_mcontext contains the machine context, and within that, the array gregs contains the general register context. So in your signal handler:
ucontext_t *u = (ucontext_t *)unused;
unsigned char *pc = (unsigned char *)u->uc_mcontext.gregs[REG_RIP];
will give you the pc where the exception occurred. You can read it to figure out what instruction it was that faulted, and do something different.
As far as the portability of calling mprotect in the signal handler is concerned, any system that follows either the SVID spec or the BSD4 spec should be safe -- they allow calling any system call (anything in section 2 of the manual) in a signal handler.
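Putting that together, a minimal sketch of the "call mprotect in the handler" approach for the code in the question (Linux-specific for the reasons above; assumes pagesize is made a global so the handler can see it, and <stdint.h> for uintptr_t):

/* Re-enable reads on the faulting page; when the handler returns,
   the offending load is re-executed and now succeeds. */
static void handler(int sig, siginfo_t *si, void *unused)
{
    uintptr_t page = (uintptr_t)si->si_addr & ~(uintptr_t)(pagesize - 1);
    mprotect((void *)page, pagesize, PROT_READ);
    flag = 1;   /* record the event; use only async-signal-safe calls here */
}

After the handler returns, the a = *p in main() is restarted and succeeds, and main() can then re-protect the page with mprotect(buffer, pagesize, PROT_NONE) to catch the next access.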
You've fallen into the trap that all people do when they first try to handle signals. The trap? Thinking that you can actually do anything useful with signal handlers. From a signal handler, you are only allowed to call asynchronous and reentrant-safe library calls.
See this CERT advisory as to why and a list of the POSIX functions that are safe.
Note that printf(), which you are already calling, is not on that list.
Nor is mprotect. You're not allowed to call it from a signal handler. It might work, but I can promise you'll run into problems down the road. Be really careful with signal handlers, they're tricky to get right!
EDIT
Since I'm being a portability douchebag at the moment already, I'll point out that you also shouldn't write to shared (i.e. global) variables without taking the proper precautions.
You can recover from SIGSEGV on linux. Also you can recover from segmentation faults on Windows (you'll see a structured exception instead of a signal). But the POSIX standard doesn't guarantee recovery, so your code will be very non-portable.
Take a look at libsigsegv.
You should not return from the signal handler, as then behavior is undefined. Rather, jump out of it with longjmp.
This is only okay if the signal is generated in an async-signal-safe function. Otherwise, behavior is undefined if the program ever calls another async-signal-unsafe function. Hence, the signal handler should only be established immediately before it is necessary, and disestablished as soon as possible.
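A sketch of that pattern with sigsetjmp/siglongjmp (so the handler never returns into the faulting instruction; probe_read is a hypothetical helper):

#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf jump_env;

static void segv_handler(int sig)
{
    siglongjmp(jump_env, 1);        /* jump out instead of returning */
}

/* Returns 1 if *p is readable, 0 if it faults. */
int probe_read(const volatile char *p)
{
    struct sigaction sa, old;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = segv_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old);  /* establish just before use */

    int ok;
    if (sigsetjmp(jump_env, 1) == 0) {
        (void)*p;                   /* may fault */
        ok = 1;
    } else {
        ok = 0;                     /* arrived here via siglongjmp */
    }
    sigaction(SIGSEGV, &old, NULL); /* disestablish as soon as possible */
    return ok;
}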
In fact, I know of very few uses of a SIGSEGV handler:
use an async-signal-safe backtrace library to log a backtrace, then die.
in a VM such as the JVM or CLR: check if the SIGSEGV occurred in JIT-compiled code. If not, die; if so, then throw a language-specific exception (not a C++ exception), which works because the JIT compiler knew that the trap could happen and generated appropriate frame unwind data.
clone() and exec() a debugger (do not use fork() – that calls callbacks registered by pthread_atfork()).
Finally, note that any action that triggers SIGSEGV is probably UB, as this is accessing invalid memory. However, this would not be the case if the signal was, say, SIGFPE.
There is a compilation problem using ucontext_t or struct ucontext (present in /usr/include/sys/ucontext.h):
http://www.mail-archive.com/arch-general@archlinux.org/msg13853.html

perf_event_open Overflow Signal

I want to count the (more or less) exact number of instructions for some piece of code. Additionally, I want to receive a signal after a specific number of instructions has been executed.
For this purpose, I use the overflow signal behaviour provided by
perf_event_open.
I'm using the second way the manpage proposes to achieve overflow signals:
Signal overflow
Events can be set to deliver a signal when a threshold is crossed. The signal handler is set up using the poll(2), select(2), epoll(2) and fcntl(2) system calls.
[...]
The other way is by use of the PERF_EVENT_IOC_REFRESH ioctl. This
ioctl adds to a counter that decrements each time the event overflows.
When nonzero, a POLL_IN signal is sent on overflow, but once the value
reaches 0, a signal is sent of type POLL_HUP and the underlying event
is disabled.
Further explanation of PERF_EVENT_IOC_REFRESH ioctl:
PERF_EVENT_IOC_REFRESH
Non-inherited overflow counters can use this to enable a
counter for a number of overflows specified by the argument,
after which it is disabled. Subsequent calls of this ioctl
add the argument value to the current count. A signal with
POLL_IN set will happen on each overflow until the count
reaches 0; when that happens a signal with POLL_HUP set is
sent and the event is disabled. Using an argument of 0 is
considered undefined behavior.
A very minimal example would look like this:
#define _GNU_SOURCE 1
#include <asm/unistd.h>
#include <fcntl.h>
#include <linux/perf_event.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

long perf_event_open(struct perf_event_attr* event_attr, pid_t pid, int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, event_attr, pid, cpu, group_fd, flags);
}

static void perf_event_handler(int signum, siginfo_t* info, void* ucontext) {
    if (info->si_code != POLL_HUP) {
        // Only POLL_HUP should happen.
        exit(EXIT_FAILURE);
    }
    ioctl(info->si_fd, PERF_EVENT_IOC_REFRESH, 1);
}

int main(int argc, char** argv)
{
    // Configure signal handler
    struct sigaction sa;
    memset(&sa, 0, sizeof(struct sigaction));
    sa.sa_sigaction = perf_event_handler;
    sa.sa_flags = SA_SIGINFO;

    // Setup signal handler
    if (sigaction(SIGIO, &sa, NULL) < 0) {
        fprintf(stderr, "Error setting up signal handler\n");
        perror("sigaction");
        exit(EXIT_FAILURE);
    }

    // Configure perf_event_attr struct
    struct perf_event_attr pe;
    memset(&pe, 0, sizeof(struct perf_event_attr));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(struct perf_event_attr);
    pe.config = PERF_COUNT_HW_INSTRUCTIONS; // Count retired hardware instructions
    pe.disabled = 1;                        // Event is initially disabled
    pe.sample_type = PERF_SAMPLE_IP;
    pe.sample_period = 1000;
    pe.exclude_kernel = 1;                  // exclude events that happen in kernel-space
    pe.exclude_hv = 1;                      // exclude events that happen in the hypervisor

    pid_t pid = 0;                          // measure the current process/thread
    int cpu = -1;                           // measure on any cpu
    int group_fd = -1;
    unsigned long flags = 0;

    int fd = perf_event_open(&pe, pid, cpu, group_fd, flags);
    if (fd == -1) {
        fprintf(stderr, "Error opening leader %llx\n", pe.config);
        perror("perf_event_open");
        exit(EXIT_FAILURE);
    }

    // Setup event handler for overflow signals
    fcntl(fd, F_SETFL, O_NONBLOCK | O_ASYNC);
    fcntl(fd, F_SETSIG, SIGIO);
    fcntl(fd, F_SETOWN, getpid());

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);     // Reset event counter to 0
    ioctl(fd, PERF_EVENT_IOC_REFRESH, 1);   // Enable event until the first overflow

    // Start monitoring
    long loopCount = 1000000;
    long c = 0;
    long i = 0;

    // Some sample payload.
    for (i = 0; i < loopCount; i++) {
        c += 1;
    }

    // End monitoring
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);   // Disable event

    long long counter;
    read(fd, &counter, sizeof(long long)); // Read event counter value
    printf("Used %lld instructions\n", counter);

    close(fd);
}
So basically I'm doing the following:
Set up a signal handler for SIGIO signals
Create a new performance counter with perf_event_open (returns a file descriptor)
Use fcntl to add signal sending behavior to the file descriptor.
Run a payload loop to execute many instructions.
When executing the payload loop, at some point 1000 instructions (the sample_period) will have been executed. According to the perf_event_open manpage, this triggers an overflow, which then decrements an internal counter.
Once this counter reaches zero, "a signal is sent of type POLL_HUP and the underlying event is disabled."
When a signal is sent, the control flow of the current process/thread is stopped, and the signal handler is executed. Scenario:
1000 instructions have been executed.
Event is automatically disabled and a signal is sent.
Signal is immediately delivered, control flow of the process is stopped and the signal handler is executed.
This scenario would mean two things:
The final number of counted instructions would always be equal to that of an example which does not use signals at all.
The instruction pointer which has been saved for the signal handler (and can be accessed through ucontext) would point directly to the instruction which caused the overflow.
Basically you could say the signal behavior can be seen as synchronous.
This is the perfect semantic for what I want to achieve.
However, as far as I understand, the signal I configured is generally rather asynchronous, and some time may pass until it is eventually delivered and the signal handler is executed. This may pose a problem for me.
For example, consider the following scenario:
1000 instructions have been executed.
Event is automatically disabled and a signal is sent.
Some more instructions pass
Signal is delivered, control flow of the process is stopped and the signal handler is executed.
This scenario would mean two things:
The final number of counted instructions would be less than in an example which does not use signals at all.
The instruction pointer which has been saved for the signal handler would point to the instruction which caused the overflow or to any one after it.
So far, I've tested the above example a lot and did not experience missed instructions, which supports the first scenario.
However, I'd really like to know whether I can rely on this assumption or not.
What happens in the kernel?
I want to count the (more or less) exact number of instructions for some piece of code. Additionally, I want to receive a signal after a specific number of instructions has been executed.
You have two tasks which may conflict with each other. When you want counting (exact amounts of some hardware event), just use the performance monitoring unit of your CPU in counting mode (don't set sample_period/sample_freq in the perf_event_attr structure) and place the measurement code in your target program (as was done in your example). In this mode, according to the man page of perf_event_open, no overflows will be generated (a CPU's PMU counters are usually 64 bits wide and don't overflow unless set to a small negative value, which is what sampling mode does):
Overflows are generated only by sampling events (sample_period must have a nonzero value).
To count only part of a program, use the ioctls on the fd returned by perf_event_open, as described in the man page:
perf_event ioctl calls - Various ioctls act on perf_event_open() file descriptors: PERF_EVENT_IOC_ENABLE ... PERF_EVENT_IOC_DISABLE ... PERF_EVENT_IOC_RESET
You can read the current value with rdpmc (on x86) or with the read syscall on the fd, as in the short example from the man page:
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>
#include <asm/unistd.h>

static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                int cpu, int group_fd, unsigned long flags)
{
    int ret;

    ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
                  group_fd, flags);
    return ret;
}

int
main(int argc, char **argv)
{
    struct perf_event_attr pe;
    long long count;
    int fd;

    memset(&pe, 0, sizeof(struct perf_event_attr));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(struct perf_event_attr);
    pe.config = PERF_COUNT_HW_INSTRUCTIONS;
    pe.disabled = 1;
    pe.exclude_kernel = 1;
    pe.exclude_hv = 1;

    fd = perf_event_open(&pe, 0, -1, -1, 0);
    if (fd == -1) {
        fprintf(stderr, "Error opening leader %llx\n", pe.config);
        exit(EXIT_FAILURE);
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    printf("Measuring instruction count for this printf\n");
    /* Place target code here instead of printf */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &count, sizeof(long long));

    printf("Used %lld instructions\n", count);

    close(fd);
}
Additionally, I want to receive a signal after a specific number of instructions has been executed.
Do you really want to get a signal, or do you just need the instruction pointer at every 1000 executed instructions? If you want to collect instruction pointers, use perf_event_open in sampling mode, but do it from another program, so that the event-collection code is not itself measured.
It will also have less negative effect on your target program if you do not take a signal for every overflow (with its huge amount of kernel-tracer interactions and switching from/to the kernel), but instead use the capability of perf_events to collect several overflow events into a single mmap buffer and poll on that buffer. On an overflow interrupt from the PMU, the perf interrupt handler is called to save the instruction pointer into the buffer, the counter is reset, and the program returns to execution. In your example, the perf interrupt handler wakes your program, which does several syscalls and returns to the kernel, and only then does the kernel restart the target code (so the overhead per sample is greater than with mmap and parsing the buffer).
With the precise_ip flag you may activate advanced sampling of your PMU, if it has such a mode: PEBS and PREC_DIST on Intel x86/em64t for some counters like INST_RETIRED, UOPS_RETIRED, BR_INST_RETIRED, BR_MISP_RETIRED, MEM_UOPS_RETIRED, MEM_LOAD_UOPS_RETIRED, MEM_LOAD_UOPS_LLC_HIT_RETIRED (and with a simple hack for cycles too), or IBS of AMD x86/amd64 (there is a paper about PEBS and IBS). With these, the instruction address is saved directly by hardware with low skid. Some very advanced PMUs can do sampling in hardware, storing overflow information for several events in a row with automatic counter reset, without software interrupts (some notes on precise_ip are in the same paper).
I don't know whether it is possible in the perf_events subsystem and in your CPU to have two perf_event tasks active at the same time: counting events in the target process while also sampling it from another process. With an advanced PMU this can be possible in hardware, and perf_events in a modern kernel may allow it. But you give no details on your kernel version or your CPU vendor and family, so we can't answer this part.
You may also try other APIs to access the PMU, like PAPI or likwid (https://github.com/RRZE-HPC/likwid). Some of them read PMU registers (sometimes MSRs) directly and may allow sampling at the same time as counting is enabled.

ptrace one thread from another

Experimenting with the ptrace() system call, I am trying to trace another thread of the same process. According to the man page, both the tracer and the tracee are specific threads (not processes), so I don't see a reason why it should not work. So far, I have tried the following:
use PTRACE_TRACEME from the clone()d child: the call succeeds, but does not do what I want, probably because the parent of the to-be-traced thread is not the thread that called clone()
use PTRACE_ATTACH or PTRACE_SEIZE from the parent thread: this always fails with EPERM, even if the process runs as root and with prctl(PR_SET_DUMPABLE, 1)
In all cases, waitpid(-1, &status, __WALL) fails with ECHILD (same when passing the child pid explicitly).
What should I do to make it work?
If it is not possible at all, is it by design or a bug in the kernel (I am using version 3.8.0)? In the former case, could you point me to the right bit of the documentation?
As #mic_e pointed out, this is a known fact about the kernel - not quite a bug, but not quite correct either. See the kernel mailing list thread about it. To provide an excerpt from Linus Torvalds:
That "new" (last November) check isn't likely going away. It solved
so many problems (both security and stability), and considering that
(a) in a year, only two people have ever even noticed
(b) there's a work-around as per above that isn't horribly invasive
I have to say that in order to actually go back to the old behaviour,
we'd have to have somebody who cares deeply, go back and check every
single special case, deadlock, and race.
The solution is to actually start the process that is being traced in a subprocess - you'll need to make the ptracing process be the parent of the other.
Here's an outline of doing this based on another answer that I wrote:
// this number is arbitrary - find a better one.
#define STACK_SIZE (1024 * 1024)

int main_thread(void *ptr) {
    // do work for main thread
    return 0;
}

int main(int argc, char *argv[]) {
    void *vstack = malloc(STACK_SIZE);
    pid_t v;
    // you'll want to check these flags
    if (clone(main_thread, (char *)vstack + STACK_SIZE,
              CLONE_PARENT_SETTID | CLONE_FILES | CLONE_FS | CLONE_IO,
              NULL, &v) == -1) {
        perror("failed to spawn child task");
        return 3;
    }
    long ptv = ptrace(PTRACE_SEIZE, v, NULL, NULL);
    if (ptv == -1) {
        perror("failed monitor seize");
        return 1;
    }
    // do actual ptrace work
}

Linux synchronization with FIFO waiting queue

Are there locks in Linux where the waiting queue is FIFO? This seems like such an obvious thing, and yet I just discovered that pthread mutexes aren't FIFO, and semaphores apparently aren't FIFO either (I'm working on kernel 2.4 (homework))...
Does Linux have a lock with FIFO waiting queue, or is there an easy way to make one with existing mechanisms?
Here is a way to create a simple queueing "ticket lock", built on pthreads primitives. It should give you some ideas:
#include <pthread.h>

typedef struct ticket_lock {
    pthread_cond_t cond;
    pthread_mutex_t mutex;
    unsigned long queue_head, queue_tail;
} ticket_lock_t;

#define TICKET_LOCK_INITIALIZER { PTHREAD_COND_INITIALIZER, PTHREAD_MUTEX_INITIALIZER }

void ticket_lock(ticket_lock_t *ticket)
{
    unsigned long queue_me;

    pthread_mutex_lock(&ticket->mutex);
    queue_me = ticket->queue_tail++;
    while (queue_me != ticket->queue_head)
    {
        pthread_cond_wait(&ticket->cond, &ticket->mutex);
    }
    pthread_mutex_unlock(&ticket->mutex);
}

void ticket_unlock(ticket_lock_t *ticket)
{
    pthread_mutex_lock(&ticket->mutex);
    ticket->queue_head++;
    pthread_cond_broadcast(&ticket->cond);
    pthread_mutex_unlock(&ticket->mutex);
}
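A brief usage sketch (hypothetical worker code): every thread that calls ticket_lock() is admitted in the order it took its ticket, i.e. FIFO.

ticket_lock_t lock = TICKET_LOCK_INITIALIZER;
long shared = 0;

void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        ticket_lock(&lock);
        shared++;               /* critical section, entered in FIFO order */
        ticket_unlock(&lock);
    }
    return NULL;
}

Note that pthread_cond_broadcast() wakes every waiter, but only the thread whose ticket equals queue_head proceeds; the others go back to sleep. That is what makes the ordering FIFO, at the cost of some thundering-herd wakeups.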
If you are asking what I think you are asking, the short answer is no. Threads/processes are controlled by the OS scheduler. One random thread is going to get the lock, the others aren't. Well, potentially more than one if you are using a counting semaphore, but that's probably not what you are asking.
You might want to look at pthread_setschedparam but it's not going to get you where I suspect you want to go.
You could probably write something but I suspect it will end up being inefficient and defeat using threads in the first place since you will just end up randomly yielding each thread until the one you want gets control.
Chances are good you are just thinking about the problem in the wrong way. You might want to describe your goal and get better suggestions.
I had a similar requirement recently, except dealing with multiple processes. Here's what I found:
If you need 100% correct FIFO ordering, go with caf's pthread ticket lock.
If you're happy with 99% and favor simplicity, a semaphore or a mutex can do really well actually.
The ticket lock can be made to work across processes: you need to use shared memory, a process-shared mutex and condition variable, and handle processes dying with the mutex locked (-> robust mutex)... which is a bit of overkill here; all I need is that the different instances don't get scheduled at the same time and that the order is mostly fair.
Using a semaphore:
static sem_t *sem = NULL;

void fifo_init()
{
    sem = sem_open("/server_fifo", O_CREAT, 0600, 1);
    if (sem == SEM_FAILED) fail("sem_open");
}

void fifo_lock()
{
    int r;
    struct timespec ts;
    if (clock_gettime(CLOCK_REALTIME, &ts) == -1) fail("clock_gettime");
    ts.tv_sec += 5;  /* 5s timeout */

    while ((r = sem_timedwait(sem, &ts)) == -1 && errno == EINTR)
        continue;  /* Restart if interrupted */
    if (r == 0) return;

    if (errno == ETIMEDOUT) fprintf(stderr, "timeout ...\n");
    else fail("sem_timedwait");
}

void fifo_unlock()
{
    /* If we somehow end up with more than one token, don't increment the semaphore... */
    int val;
    if (sem_getvalue(sem, &val) == 0 && val <= 0)
        if (sem_post(sem)) fail("sem_post");
    usleep(1);  /* Yield to other processes */
}
Ordering is almost 100% FIFO.
Note: This is with a 4.4 Linux kernel, 2.4 might be different.
