Uninterruptible read/write calls - linux

At some point during my C programming adventures on Linux, I encountered flags (possibly ioctl/fcntl?), that make reads and writes on a file descriptor uninterruptible.
Unfortunately I cannot recall how to do this, or where I read it. Can anyone shed some light?
Update0
To refine my query, I'm after the same blocking and guarantees that fwrite() and fread() provide, sans userspace buffering.

You can avoid EINTR from read() and write() by ensuring all your signal handlers are installed with the SA_RESTART flag of sigaction().
However this does not protect you from short reads / writes. This is only possible by putting the read() / write() into a loop (it does not require an additional buffer beyond the one that must already be supplied to the read() / write() call.)
Such a loop would look like:
/* If return value is less than `count', then errno == 0 indicates end of file,
* otherwise errno indicates the error that occurred. */
ssize_t hard_read(int fd, void *buf, size_t count)
{
ssize_t rv;
ssize_t total_read = 0;
while (total_read < count)
{
rv = read(fd, (char *)buf + total_read, count - total_read);
if (rv == 0)
errno = 0;
if (rv < 1)
if (errno == EINTR)
continue;
else
break;
total_read += rv;
}
return rv;
}

Do you wish to disable interrupts while reading/writing, or guarantee that nobody else will read/write the file while you are?
For the second, you can use fcntl()'s F_GETLK, F_SETLK and F_SETLKW to acquire, release and test for record locks respectively. However, since POSIX locks are only advisory, Linux does not enforce them - it's only meaningful between cooperating processes.
The first task involves diving into ring zero and disabling interrupts on your local processor (or all, if you're on an SMP system). Remember to enable them again when you're done!

Related

Does every call to write sends switches to kernel mode?

I know that a call to the glibc "write" function calls in it's turn to the sys_call write function which is a kernel function. because sys_call is a kernel function the CPU has to change the ring to zero store the processes registers and so on.
But does it always switches to kernel mode? for example, if i do
write(-1,buffer,LENGTH)
does it still tries to find it in the file descriptors array?
I see in the glibc source code that it does check for fd>0 but i don't see any jump to the sys_call there (it seems like the baracks for main() ends before any call to the alias_write.
/* Write NBYTES of BUF to FD. Return the number written, or -1. */
ssize_t
__libc_write (int fd, const void *buf, size_t nbytes)
{
if (nbytes == 0)
return 0;
if (fd < 0)
{
__set_errno (EBADF);
return -1;
}
if (buf == NULL)
{
__set_errno (EINVAL);
return -1;
}
__set_errno (ENOSYS);
return -1;
}
libc_hidden_def (__libc_write)
stub_warning (write)
weak_alias (__libc_write, __write)
libc_hidden_weak (__write)
weak_alias (__libc_write, write)
#include <stub-tag.h>
So the question is both:
Where does the glibc actually calls the sys_write
Is it true that glibc doesn't call the sys_write if fd<0?
I see in the glibc source code that it does check for fd>0 but i don't see any jump to the sys_call there
You are looking at the wrong code.
There are multiple definitions of __libc_write used under different conditions. The one you looked at is in io/write.c.
The one that is actually used on Linux is generated from sysdeps/unix/syscall-template.S and it does actually execute the switch to kernel mode (and back to user mode) even when fd==-1, etc.

How to make mprotect() to make forward progress after handling pagefaulte exception? [duplicate]

I want to write a signal handler to catch SIGSEGV.
I protect a block of memory for read or write using
char *buffer;
char *p;
char a;
int pagesize = 4096;
mprotect(buffer,pagesize,PROT_NONE)
This protects pagesize bytes of memory starting at buffer against any reads or writes.
Second, I try to read the memory:
p = buffer;
a = *p
This will generate a SIGSEGV, and my handler will be called.
So far so good. My problem is that, once the handler is called, I want to change the access write of the memory by doing
mprotect(buffer,pagesize,PROT_READ);
and continue normal functioning of my code. I do not want to exit the function.
On future writes to the same memory, I want to catch the signal again and modify the write rights and then record that event.
Here is the code:
#include <signal.h>
#include <stdio.h>
#include <malloc.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/mman.h>
#define handle_error(msg) \
do { perror(msg); exit(EXIT_FAILURE); } while (0)
char *buffer;
int flag=0;
static void handler(int sig, siginfo_t *si, void *unused)
{
printf("Got SIGSEGV at address: 0x%lx\n",(long) si->si_addr);
printf("Implements the handler only\n");
flag=1;
//exit(EXIT_FAILURE);
}
int main(int argc, char *argv[])
{
char *p; char a;
int pagesize;
struct sigaction sa;
sa.sa_flags = SA_SIGINFO;
sigemptyset(&sa.sa_mask);
sa.sa_sigaction = handler;
if (sigaction(SIGSEGV, &sa, NULL) == -1)
handle_error("sigaction");
pagesize=4096;
/* Allocate a buffer aligned on a page boundary;
initial protection is PROT_READ | PROT_WRITE */
buffer = memalign(pagesize, 4 * pagesize);
if (buffer == NULL)
handle_error("memalign");
printf("Start of region: 0x%lx\n", (long) buffer);
printf("Start of region: 0x%lx\n", (long) buffer+pagesize);
printf("Start of region: 0x%lx\n", (long) buffer+2*pagesize);
printf("Start of region: 0x%lx\n", (long) buffer+3*pagesize);
//if (mprotect(buffer + pagesize * 0, pagesize,PROT_NONE) == -1)
if (mprotect(buffer + pagesize * 0, pagesize,PROT_NONE) == -1)
handle_error("mprotect");
//for (p = buffer ; ; )
if(flag==0)
{
p = buffer+pagesize/2;
printf("It comes here before reading memory\n");
a = *p; //trying to read the memory
printf("It comes here after reading memory\n");
}
else
{
if (mprotect(buffer + pagesize * 0, pagesize,PROT_READ) == -1)
handle_error("mprotect");
a = *p;
printf("Now i can read the memory\n");
}
/* for (p = buffer;p<=buffer+4*pagesize ;p++ )
{
//a = *(p);
*(p) = 'a';
printf("Writing at address %p\n",p);
}*/
printf("Loop completed\n"); /* Should never happen */
exit(EXIT_SUCCESS);
}
The problem is that only the signal handler runs and I can't return to the main function after catching the signal.
When your signal handler returns (assuming it doesn't call exit or longjmp or something that prevents it from actually returning), the code will continue at the point the signal occurred, reexecuting the same instruction. Since at this point, the memory protection has not been changed, it will just throw the signal again, and you'll be back in your signal handler in an infinite loop.
So to make it work, you have to call mprotect in the signal handler. Unfortunately, as Steven Schansker notes, mprotect is not async-safe, so you can't safely call it from the signal handler. So, as far as POSIX is concerned, you're screwed.
Fortunately on most implementations (all modern UNIX and Linux variants as far as I know), mprotect is a system call, so is safe to call from within a signal handler, so you can do most of what you want. The problem is that if you want to change the protections back after the read, you'll have to do that in the main program after the read.
Another possibility is to do something with the third argument to the signal handler, which points at an OS and arch specific structure that contains info about where the signal occurred. On Linux, this is a ucontext structure, which contains machine-specific info about the $PC address and other register contents where the signal occurred. If you modify this, you change where the signal handler will return to, so you can change the $PC to be just after the faulting instruction so it won't re-execute after the handler returns. This is very tricky to get right (and non-portable too).
edit
The ucontext structure is defined in <ucontext.h>. Within the ucontext the field uc_mcontext contains the machine context, and within that, the array gregs contains the general register context. So in your signal handler:
ucontext *u = (ucontext *)unused;
unsigned char *pc = (unsigned char *)u->uc_mcontext.gregs[REG_RIP];
will give you the pc where the exception occurred. You can read it to figure out what instruction it
was that faulted, and do something different.
As far as the portability of calling mprotect in the signal handler is concerned, any system that follows either the SVID spec or the BSD4 spec should be safe -- they allow calling any system call (anything in section 2 of the manual) in a signal handler.
You've fallen into the trap that all people do when they first try to handle signals. The trap? Thinking that you can actually do anything useful with signal handlers. From a signal handler, you are only allowed to call asynchronous and reentrant-safe library calls.
See this CERT advisory as to why and a list of the POSIX functions that are safe.
Note that printf(), which you are already calling, is not on that list.
Nor is mprotect. You're not allowed to call it from a signal handler. It might work, but I can promise you'll run into problems down the road. Be really careful with signal handlers, they're tricky to get right!
EDIT
Since I'm being a portability douchebag at the moment already, I'll point out that you also shouldn't write to shared (i.e. global) variables without taking the proper precautions.
You can recover from SIGSEGV on linux. Also you can recover from segmentation faults on Windows (you'll see a structured exception instead of a signal). But the POSIX standard doesn't guarantee recovery, so your code will be very non-portable.
Take a look at libsigsegv.
You should not return from the signal handler, as then behavior is undefined. Rather, jump out of it with longjmp.
This is only okay if the signal is generated in an async-signal-safe function. Otherwise, behavior is undefined if the program ever calls another async-signal-unsafe function. Hence, the signal handler should only be established immediately before it is necessary, and disestablished as soon as possible.
In fact, I know of very few uses of a SIGSEGV handler:
use an async-signal-safe backtrace library to log a backtrace, then die.
in a VM such as the JVM or CLR: check if the SIGSEGV occurred in JIT-compiled code. If not, die; if so, then throw a language-specific exception (not a C++ exception), which works because the JIT compiler knew that the trap could happen and generated appropriate frame unwind data.
clone() and exec() a debugger (do not use fork() – that calls callbacks registered by pthread_atfork()).
Finally, note that any action that triggers SIGSEGV is probably UB, as this is accessing invalid memory. However, this would not be the case if the signal was, say, SIGFPE.
There is a compilation problem using ucontext_t or struct ucontext (present in /usr/include/sys/ucontext.h)
http://www.mail-archive.com/arch-general#archlinux.org/msg13853.html

linux wake_up_interruptible() having no effect

I am writing a "sleepy" device driver for an Operating Systems class.
The way it works is, the user accesses the device via read()/write().
When the user writes to the device like so: write(fd, &wait, size), the device is put to sleep for the amount of time in seconds of the value of wait. If the wait time expires then driver's write method returns 0 and the program finishes. But if the user reads from the driver while a process is sleeping on a wait queue, then the driver's write method returns immediately with the number of seconds the sleeping process had left to wait before the timeout would have occurred on its own.
Another catch is that 10 instances of the device are created, and each of the 10 devices must be independent of each other. So a read to device 1 must only wake up sleeping processes on device 1.
Much code has been provided, and I have been charged with the task of mainly writing the read() and write() methods for the driver.
The way I have tried to solve the problem of keeping the devices independent of each other is to include two global static arrays of size 10. One of type wait_head_queue_t, and one of type Int(Bool flags). Both of these arrays are initialized once when I open the device via open(). The problem is that when I call wake_up_interruptible(), nothing happens, and the program terminates upon timeout. Here is my write method:
ssize_t sleepy_write(struct file *filp, const char __user *buf, size_t count, loff_t *f_pos){
struct sleepy_dev *dev = (struct sleepy_dev *)filp->private_data;
ssize_t retval = 0;
int mem_to_be_copied = 0;
if (mutex_lock_killable(&dev->sleepy_mutex))
{
return -EINTR;
}
// check size
if(count != 4) // user must provide 4 byte Int
{
return EINVAL; // = 22
}
// else if the user provided valid sized input...
else
{
if((mem_to_be_copied = copy_from_user(&long_buff[0], buf, count)))
{
return -EFAULT;
}
// check for negative wait time entered by user
if(long_buff[0] > -1)// "long_buff[]"is global,for now only holds 1 value
{
proc_read_flags[MINOR(dev->cdev.dev)] = 0; //****** flag array
retval = wait_event_interruptible_timeout(wqs[MINOR(dev->cdev.dev)], proc_read_flags[MINOR(dev->cdev.dev)] == 1, long_buff[0] * HZ) / HZ;
proc_read_flags[MINOR(dev->cdev.dev)] = 0; // MINOR numbers for each
// device correspond to array indices
// devices 0 - 9
// "wqs" is array of wait queues
}
else
{
printk(KERN_INFO "user entered negative value for sleep time\n");
}
}
mutex_unlock(&dev->sleepy_mutex);
return retval;}
Unlike the many examples on this topic, I am switching the flag back to zero immediately before the call to wait_event_interruptible_timeout() because flag values seem to be lingering between subsequent runs of the program. Here is the code for my read method:
ssize_t sleepy_read(struct file *filp, char __user *buf, size_t count,
loff_t *f_pos){
struct sleepy_dev *dev = (struct sleepy_dev *)filp->private_data;
ssize_t retval = 0;
if (mutex_lock_killable(&dev->sleepy_mutex))
return -EINTR;
// switch the flag
proc_read_flags[MINOR(dev->cdev.dev)] = 1; // again device minor numbers
// correspond to array indices
// TODO: this is not waking up the process in write!
// wake up the queue
wake_up_interruptible(&wqs[MINOR(dev->cdev.dev)]);
mutex_unlock(&dev->sleepy_mutex);
return retval;}
The way I am trying to test the program is to have two main.c's, one for writing to the device and one for reading from the device, and I just ./a.out them in separate consoles in my ubuntu installation in Virtual Box. Another thing, the way it is set up now, neither the writing or reading a.outs return until timeout occurs. I apologize for the spotty formatting of the code. I'm not sure exactly what is going on here, so any help would be much appreciated! Thanks!
Your write method hold sleepy_mutex while wait event. So read method waits on mutex_lock_killable(&dev->sleepy_mutex) while the mutex become unlocked by the writer. It is occured only when writer's timeout exceeds, and write method returns. It is the behaviour you observe.
Usually, wait_event* is executed outside of any critical section. That can be achieved by using _lock-suffixed variants of such macros, or simply wrapping cond argument of such macros with spinlock acquire/release pair:
int check_cond()
{
int res;
spin_lock(&lock);
res = <cond>;
spin_unlock(&lock);
return res;
}
...
wait_event_interruptible(&wq, check_cond());
Unfortunately, wait_event-family macros cannot be used, when condition checking should be protected with a mutex. In that case, you can use wait_woken() function with manual condition checking code. Or rewrite your code without needs of mutex lock/unlock around condition checking.
For achive "reader wake writer, if it is sleep" functionality, you can adopt code from that answer https://stackoverflow.com/a/29765695/3440745.
Writer code:
//Declare local variable at the beginning of the function
int cflag;
...
// Outside of any critical section(after mutex_unlock())
cflag = proc_read_flags[MINOR(dev->cdev.dev)];
wait_event_interruptible_timeout(&wqs[MINOR(dev->cdev.dev)],
proc_read_flags[MINOR(dev->cdev.dev)] != cflag, long_buff[0]*HZ);
Reader code:
// Mutex holding protects this flag's increment from concurrent one.
proc_read_flags[MINOR(dev->cdev.dev)]++;
wake_up_interruptible_all(&wqs[MINOR(dev->cdev.dev)]);

Why my implementation of sbrk system call does not work?

I try to write a very simple os to better understand the basic principles. And I need to implement user-space malloc. So at first I want to implement and test it on my linux-machine.
At first I have implemented the sbrk() function by the following way
void* sbrk( int increment ) {
return ( void* )syscall(__NR_brk, increment );
}
But this code does not work. Instead, when I use sbrk given by os, this works fine.
I have tryed to use another implementation of the sbrk()
static void *sbrk(signed increment)
{
size_t newbrk;
static size_t oldbrk = 0;
static size_t curbrk = 0;
if (oldbrk == 0)
curbrk = oldbrk = brk(0);
if (increment == 0)
return (void *) curbrk;
newbrk = curbrk + increment;
if (brk(newbrk) == curbrk)
return (void *) -1;
oldbrk = curbrk;
curbrk = newbrk;
return (void *) oldbrk;
}
sbrk invoked from this function
static Header *morecore(unsigned nu)
{
char *cp;
Header *up;
if (nu < NALLOC)
nu = NALLOC;
cp = sbrk(nu * sizeof(Header));
if (cp == (char *) -1)
return NULL;
up = (Header *) cp;
up->s.size = nu; // ***Segmentation fault
free((void *)(up + 1));
return freep;
}
This code also does not work, on the line (***) I get segmentation fault.
Where is a problem ?
Thanks All. I have solved my problem using new implementation of the sbrk. The given code works fine.
void* __sbrk__(intptr_t increment)
{
void *new, *old = (void *)syscall(__NR_brk, 0);
new = (void *)syscall(__NR_brk, ((uintptr_t)old) + increment);
return (((uintptr_t)new) == (((uintptr_t)old) + increment)) ? old :
(void *)-1;
}
The first sbrk should probably have a long increment. And you forgot to handle errors (and set errno)
The second sbrk function does not change the address space (as sbrk does). You could use mmap to change it (but using mmap instead of sbrk won't update the kernel's view of data segment end as sbrk does). You could use cat /proc/1234/maps to query the address space of process of pid 1234). or even read (e.g. with fopen&fgets) the /proc/self/maps from inside your program.
BTW, sbrk is obsolete (most malloc implementations use mmap), and by definition every system call (listed in syscalls(2)) is executed by the kernel (for sbrk the kernel maintains the "data segment" limit!). So you cannot avoid the kernel, and I don't even understand why you want to emulate any system call. Almost by definition, you cannot emulate syscalls since they are the only way to interact with the kernel from a user application. From the user application, every syscall is an atomic elementary operation (done by a single SYSENTER machine instruction with appropriate contents in machine registers).
You could use strace(1) to understand the actual syscalls done by your running program.
BTW, the GNU libc is a free software. You could look into its source code. musl-libc is a simpler libc and its code is more readable.
At last compile with gcc -Wall -Wextra -g and use the gdb debugger (you can even query the registers, if you wanted to). Perhaps read the x86/64-ABI specification and the Linux Assembly HowTo.

Linux synchronization with FIFO waiting queue

Are there locks in Linux where the waiting queue is FIFO? This seems like such an obvious thing, and yet I just discovered that pthread mutexes aren't FIFO, and semaphores apparently aren't FIFO either (I'm working on kernel 2.4 (homework))...
Does Linux have a lock with FIFO waiting queue, or is there an easy way to make one with existing mechanisms?
Here is a way to create a simple queueing "ticket lock", built on pthreads primitives. It should give you some ideas:
#include <pthread.h>
typedef struct ticket_lock {
pthread_cond_t cond;
pthread_mutex_t mutex;
unsigned long queue_head, queue_tail;
} ticket_lock_t;
#define TICKET_LOCK_INITIALIZER { PTHREAD_COND_INITIALIZER, PTHREAD_MUTEX_INITIALIZER }
void ticket_lock(ticket_lock_t *ticket)
{
unsigned long queue_me;
pthread_mutex_lock(&ticket->mutex);
queue_me = ticket->queue_tail++;
while (queue_me != ticket->queue_head)
{
pthread_cond_wait(&ticket->cond, &ticket->mutex);
}
pthread_mutex_unlock(&ticket->mutex);
}
void ticket_unlock(ticket_lock_t *ticket)
{
pthread_mutex_lock(&ticket->mutex);
ticket->queue_head++;
pthread_cond_broadcast(&ticket->cond);
pthread_mutex_unlock(&ticket->mutex);
}
If you are asking what I think you are asking the short answer is no. Threads/processes are controlled by the OS scheduler. One random thread is going to get the lock, the others aren't. Well, potentially more than one if you are using a counting semaphore but that's probably not what you are asking.
You might want to look at pthread_setschedparam but it's not going to get you where I suspect you want to go.
You could probably write something but I suspect it will end up being inefficient and defeat using threads in the first place since you will just end up randomly yielding each thread until the one you want gets control.
Chances are good you are just thinking about the problem in the wrong way. You might want to describe your goal and get better suggestions.
I had a similar requirement recently, except dealing with multiple processes. Here's what I found:
If you need 100% correct FIFO ordering, go with caf's pthread ticket lock.
If you're happy with 99% and favor simplicity, a semaphore or a mutex can do really well actually.
Ticket lock can be made to work across processes:
You need to use shared memory, process-shared mutex and condition variable, handle processes dying with the mutex locked (-> robust mutex) ... Which is a bit overkill here, all I need is the different instances don't get scheduled at the same time and the order to be mostly fair.
Using a semaphore:
static sem_t *sem = NULL;
void fifo_init()
{
sem = sem_open("/server_fifo", O_CREAT, 0600, 1);
if (sem == SEM_FAILED) fail("sem_open");
}
void fifo_lock()
{
int r;
struct timespec ts;
if (clock_gettime(CLOCK_REALTIME, &ts) == -1) fail("clock_gettime");
ts.tv_sec += 5; /* 5s timeout */
while ((r = sem_timedwait(sem, &ts)) == -1 && errno == EINTR)
continue; /* Restart if interrupted */
if (r == 0) return;
if (errno == ETIMEDOUT) fprintf(stderr, "timeout ...\n");
else fail("sem_timedwait");
}
void fifo_unlock()
{
/* If we somehow end up with more than one token, don't increment the semaphore... */
int val;
if (sem_getvalue(sem, &val) == 0 && val <= 0)
if (sem_post(sem)) fail("sem_post");
usleep(1); /* Yield to other processes */
}
Ordering is almost 100% FIFO.
Note: This is with a 4.4 Linux kernel, 2.4 might be different.

Resources