Does every call to write switch to kernel mode? - linux

I know that a call to the glibc "write" function in turn calls the write system call (sys_write), which is a kernel function. Because sys_write runs in the kernel, the CPU has to switch to ring 0, save the process's registers, and so on.
But does it always switch to kernel mode? For example, if I do
write(-1,buffer,LENGTH)
does it still try to look the descriptor up in the file descriptor array?
I see in the glibc source code that it checks for fd < 0, but I don't see any jump to the system call there (it seems the braces of the function close before any call to the write alias).
/* Write NBYTES of BUF to FD. Return the number written, or -1. */
ssize_t
__libc_write (int fd, const void *buf, size_t nbytes)
{
if (nbytes == 0)
return 0;
if (fd < 0)
{
__set_errno (EBADF);
return -1;
}
if (buf == NULL)
{
__set_errno (EINVAL);
return -1;
}
__set_errno (ENOSYS);
return -1;
}
libc_hidden_def (__libc_write)
stub_warning (write)
weak_alias (__libc_write, __write)
libc_hidden_weak (__write)
weak_alias (__libc_write, write)
#include <stub-tag.h>
So the question is twofold:
Where does glibc actually call sys_write?
Is it true that glibc doesn't call sys_write if fd < 0?

I see in the glibc source code that it checks for fd < 0, but I don't see any jump to the system call there
You are looking at the wrong code.
There are multiple definitions of __libc_write used under different conditions. The one you looked at is in io/write.c.
The one that is actually used on Linux is generated from sysdeps/unix/syscall-template.S and it does actually execute the switch to kernel mode (and back to user mode) even when fd==-1, etc.
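One way to observe that the kernel, not glibc, rejects the bad descriptor is to bypass the write wrapper and issue the system call through syscall(2). This is only a minimal sketch, assuming Linux with glibc; the helper name raw_write_errno is made up for this example.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: issue write(2) directly through syscall(2) and
 * report the errno the kernel produced (0 if the call succeeded). */
int raw_write_errno(int fd, const void *buf, size_t len)
{
    errno = 0;
    long rv = syscall(SYS_write, fd, buf, len);
    return rv == -1 ? errno : 0;
}
```

Calling raw_write_errno(-1, "x", 1) yields EBADF, and running such a program under strace shows the write(-1, ...) call reaching the kernel.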

Related

Do brk and sbrk round the program break to the nearest page boundary?

My question is as the title says. According to my textbook:
int brk(void *end_data_segment);
The brk() system call sets the program break to the location specified by
end_data_segment. Since virtual memory is allocated in units of pages,
end_data_segment is effectively rounded up to the next page boundary.
Since on Linux sbrk() is implemented as a library function that uses the brk() system call, I expected both functions to round the program break up to the next page boundary. But when I test on an x86_64 Linux machine (Ubuntu), both functions move the program break to exactly the requested position (I also tried brk; the result is the same).
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    void *ori = sbrk(100);
    printf("original program break at %p\n", ori);
    void *now = sbrk(0);
    printf("program break now at %p\n", now);
    return 0;
}
This is the output:
original program break at 0x56491e28f000
program break now at 0x56491e28f064
So what's going on here?
brk allocates and deallocates pages. However, that implementation detail, which follows from the page being the smallest unit of memory management in a virtual memory operating system, is transparent to the caller.
In the Linux kernel, brk saves the unaligned value and uses the aligned value to determine if pages need to be allocated/deallocated:
asmlinkage unsigned long sys_brk(unsigned long brk)
{
[...]
newbrk = PAGE_ALIGN(brk);
oldbrk = PAGE_ALIGN(mm->brk);
if (oldbrk == newbrk)
goto set_brk;
[...]
if (do_brk(oldbrk, newbrk-oldbrk) != oldbrk)
goto out;
set_brk:
mm->brk = brk;
[...]
}
As for sbrk: glibc calls brk and maintains the (unaligned) value of the current program break (__curbrk) in userspace:
void *__curbrk;
[...]
void *
__sbrk (intptr_t increment)
{
void *oldbrk;
if (__curbrk == NULL || __libc_multiple_libcs)
if (__brk (0) < 0) /* Initialize the break. */
return (void *) -1;
if (increment == 0)
return __curbrk;
oldbrk = __curbrk;
[...]
if (__brk (oldbrk + increment) < 0)
return (void *) -1;
return oldbrk;
}
Consequently, the return value of sbrk does not reflect the page alignment that happens in the Linux kernel.
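To make the difference between the stored break and the page-granular mapping concrete, here is a minimal sketch that rounds the value returned by sbrk(0) up to the next page boundary, mirroring what PAGE_ALIGN does in the kernel excerpt above. The helper names are made up for this example, and a power-of-two page size is assumed.

```c
#define _DEFAULT_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Round addr up to the next multiple of page_size.
 * Assumes page_size is a power of two, as it is on Linux. */
uintptr_t page_align_up(uintptr_t addr, uintptr_t page_size)
{
    return (addr + page_size - 1) & ~(page_size - 1);
}

/* Print the break as sbrk reports it next to its page-aligned value. */
void show_break(void)
{
    uintptr_t brk_now = (uintptr_t)sbrk(0);
    uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
    printf("break %#lx, page-aligned %#lx\n",
           (unsigned long)brk_now,
           (unsigned long)page_align_up(brk_now, page));
}
```

With the break from the output above and 4096-byte pages, page_align_up(0x56491e28f064, 4096) gives 0x56491e290000, which is where the mapped heap pages end even though sbrk(0) keeps reporting 0x56491e28f064.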

ioctl() call resets file descriptor to 0

Consider the following code:
file_fd = open(device, O_RDWR);
if (file_fd < 0) {
perror("open");
return -1;
}
printf("File descriptor: %d\n", file_fd);
uint32_t DskSize;
if (ioctl(file_fd, BLKGETSIZE, &DskSize) < 0) {
perror("ioctl");
return -1;
}
printf("File descriptor after: %d\n", file_fd);
This snippet yields this:
File descriptor: 3
File descriptor after: 0
Why does my file descriptor get reset to 0? The program writes the stuff out to stdout instead of my block device.
This should not happen. I expect my file_fd to be non-zero and retain its value.
Looks like you are smashing your stack.
There are only two stack variables, file_fd and DskSize, and changing DskSize changes file_fd. That suggests DskSize must be an unsigned long or size_t (a 64-bit value on this platform), not uint32_t.
Looking at the BLKGETSIZE implementation confirms that the value type is unsigned long.
You may want to run your applications under valgrind, which reports this kind of error.
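A minimal sketch of the corrected call, assuming the fix is simply to give the ioctl a destination as wide as the unsigned long it writes. The helper name get_size_in_sectors is made up for this example.

```c
#include <linux/fs.h>   /* BLKGETSIZE */
#include <sys/ioctl.h>

/* Hypothetical helper: query the device size in 512-byte sectors.
 * BLKGETSIZE stores an unsigned long, so the destination must be
 * that wide. Returns 0 on success, -1 on failure with errno set. */
int get_size_in_sectors(int fd, unsigned long *sectors)
{
    return ioctl(fd, BLKGETSIZE, sectors) < 0 ? -1 : 0;
}
```

In the snippet from the question this would replace the uint32_t DskSize with an unsigned long, so the kernel's 8-byte store on x86_64 no longer spills into the neighbouring file_fd.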

Why does my implementation of the sbrk system call not work?

I am trying to write a very simple OS to better understand the basic principles, and I need to implement a user-space malloc. I want to implement and test it on my Linux machine first.
I started by implementing the sbrk() function in the following way:
void* sbrk( int increment ) {
return ( void* )syscall(__NR_brk, increment );
}
But this code does not work. When I use the sbrk provided by the OS instead, it works fine.
I have tried another implementation of sbrk():
static void *sbrk(signed increment)
{
size_t newbrk;
static size_t oldbrk = 0;
static size_t curbrk = 0;
if (oldbrk == 0)
curbrk = oldbrk = brk(0);
if (increment == 0)
return (void *) curbrk;
newbrk = curbrk + increment;
if (brk(newbrk) == curbrk)
return (void *) -1;
oldbrk = curbrk;
curbrk = newbrk;
return (void *) oldbrk;
}
sbrk is invoked from this function:
static Header *morecore(unsigned nu)
{
char *cp;
Header *up;
if (nu < NALLOC)
nu = NALLOC;
cp = sbrk(nu * sizeof(Header));
if (cp == (char *) -1)
return NULL;
up = (Header *) cp;
up->s.size = nu; // ***Segmentation fault
free((void *)(up + 1));
return freep;
}
This code does not work either: on the line marked (***) I get a segmentation fault.
Where is the problem?
Thanks, all. I have solved my problem using a new implementation of sbrk. The code below works fine.
void* __sbrk__(intptr_t increment)
{
void *new, *old = (void *)syscall(__NR_brk, 0);
new = (void *)syscall(__NR_brk, ((uintptr_t)old) + increment);
return (((uintptr_t)new) == (((uintptr_t)old) + increment)) ? old :
(void *)-1;
}
The first sbrk should probably have a long increment. And you forgot to handle errors (and set errno).
The second sbrk function does not change the address space (as sbrk does). You could use mmap to change it (but using mmap instead of sbrk won't update the kernel's view of the data segment end as sbrk does). You could use cat /proc/1234/maps to query the address space of the process with pid 1234, or even read /proc/self/maps (e.g. with fopen and fgets) from inside your program.
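Reading /proc/self/maps from inside the program, as suggested above, could look roughly like this. It is only a sketch: the helper name print_matching_maps and the 512-byte line buffer are assumptions made for this example.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: print every line of a maps file that contains
 * `needle` (for example "[heap]") and return how many lines matched,
 * or -1 if the file cannot be opened. */
int print_matching_maps(const char *path, const char *needle)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    char line[512];
    int matches = 0;
    while (fgets(line, sizeof line, f) != NULL) {
        if (strstr(line, needle) != NULL) {
            fputs(line, stdout);
            matches++;
        }
    }
    fclose(f);
    return matches;
}
```

Calling print_matching_maps("/proc/self/maps", "[heap]") before and after a call to your sbrk shows whether the heap mapping actually grew.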
BTW, sbrk is obsolete (most malloc implementations use mmap), and by definition every system call (listed in syscalls(2)) is executed by the kernel (for sbrk the kernel maintains the "data segment" limit!). So you cannot avoid the kernel, and I don't even understand why you want to emulate any system call. Almost by definition, you cannot emulate syscalls since they are the only way to interact with the kernel from a user application. From the user application, every syscall is an atomic elementary operation (done by a single machine instruction, such as SYSCALL on x86-64 or SYSENTER on 32-bit x86, with appropriate contents in machine registers).
You could use strace(1) to understand the actual syscalls done by your running program.
BTW, GNU libc is free software. You could look into its source code. musl-libc is a simpler libc and its code is more readable.
Finally, compile with gcc -Wall -Wextra -g and use the gdb debugger (you can even query the registers, if you want to). Perhaps read the x86-64 ABI specification and the Linux Assembly HOWTO.

Accessing another process virtual memory in Linux (debugging)

How does gdb access another process virtual memory on Linux? Is it all done via /proc?
On Linux for reading memory:
1) If the number of bytes to read is fewer than 3 * sizeof (long), or the /proc filesystem is unavailable, or reading from /proc/PID/mem is unsuccessful, then ptrace is used with PTRACE_PEEKTEXT to read data.
These are the conditions in the function linux_proc_xfer_partial():
/* Don't bother for one word. */
if (len < 3 * sizeof (long))
return 0;
/* We could keep this file open and cache it - possibly one per
thread. That requires some juggling, but is even faster. */
xsnprintf (filename, sizeof filename, "/proc/%d/mem",
ptid_get_pid (inferior_ptid));
fd = gdb_open_cloexec (filename, O_RDONLY | O_LARGEFILE, 0);
if (fd == -1)
return 0;
2) If the number of bytes to read is greater than or equal to 3 * sizeof (long) and /proc is available, then pread64() (or lseek() and read()) is used:
static LONGEST
linux_proc_xfer_partial (struct target_ops *ops, enum target_object object,
const char *annex, gdb_byte *readbuf,
const gdb_byte *writebuf,
ULONGEST offset, LONGEST len)
{
.....
/* If pread64 is available, use it. It's faster if the kernel
supports it (only one syscall), and it's 64-bit safe even on
32-bit platforms (for instance, SPARC debugging a SPARC64
application). */
#ifdef HAVE_PREAD64
if (pread64 (fd, readbuf, len, offset) != len)
#else
if (lseek (fd, offset, SEEK_SET) == -1 || read (fd, readbuf, len) != len)
#endif
ret = 0;
else
ret = len;
close (fd);
return ret;
}
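The /proc/PID/mem read path can be tried out in a small standalone sketch. To stay self-contained it reads the calling process's own memory through /proc/self/mem; gdb instead opens the file of the process it has attached to with ptrace. The helper name read_self_mem is made up for this example.

```c
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copy len bytes starting at virtual address addr
 * of the calling process into out, going through /proc/self/mem.
 * The file offset is the virtual address. Returns the number of bytes
 * read, or -1 on failure. */
ssize_t read_self_mem(const void *addr, void *out, size_t len)
{
    int fd = open("/proc/self/mem", O_RDONLY);
    if (fd == -1)
        return -1;
    ssize_t n = pread(fd, out, len, (off_t)(uintptr_t)addr);
    close(fd);
    return n;
}
```

Reading another process this way additionally requires permission to ptrace it, which is why a debugger attaches first.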
On Linux for writing memory:
1) ptrace with PTRACE_POKETEXT or PTRACE_POKEDATA is used.
As for your second question:
where can I find information about ... setting hardware watchpoints
See the Watchpoints page of the GDB Internals wiki: http://sourceware.org/gdb/wiki/Internals%20Watchpoints
Reference:
http://linux.die.net/man/2/ptrace
http://www.alexonlinux.com/how-debugger-works

Uninterruptible read/write calls

At some point during my C programming adventures on Linux, I encountered flags (possibly ioctl/fcntl?), that make reads and writes on a file descriptor uninterruptible.
Unfortunately I cannot recall how to do this, or where I read it. Can anyone shed some light?
Update0
To refine my query, I'm after the same blocking and guarantees that fwrite() and fread() provide, sans userspace buffering.
You can avoid EINTR from read() and write() by ensuring all your signal handlers are installed with the SA_RESTART flag of sigaction().
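A minimal sketch of installing a handler with SA_RESTART follows. The handler and helper names are made up for this example, and the do-nothing handler is only a placeholder.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Placeholder handler: it only exists so that the signal is caught. */
static void noop_handler(int signo)
{
    (void)signo;
}

/* Hypothetical helper: install noop_handler for signo with SA_RESTART,
 * so that a read() or write() interrupted by this signal is restarted
 * instead of failing with EINTR. Returns 0 on success, -1 on failure. */
int install_restarting_handler(int signo)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = noop_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(signo, &sa, NULL);
}
```

Every handler in the program needs the flag; a single handler installed without it can still make the calls fail with EINTR.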
However, this does not protect you from short reads / writes. The only way to handle those is to put the read() / write() into a loop (it does not require an additional buffer beyond the one that must already be supplied to the read() / write() call).
Such a loop would look like:
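#include <errno.h>
#include <unistd.h>

/* If return value is less than `count', then errno == 0 indicates end of file,
 * otherwise errno indicates the error that occurred. */
ssize_t hard_read(int fd, void *buf, size_t count)
{
    ssize_t rv;
    size_t total_read = 0;
    while (total_read < count)
    {
        rv = read(fd, (char *)buf + total_read, count - total_read);
        if (rv == 0)
            errno = 0;      /* end of file */
        if (rv < 1)
        {
            if (errno == EINTR)
                continue;   /* interrupted before any data arrived: retry */
            else
                break;      /* end of file or a real error */
        }
        total_read += (size_t)rv;
    }
    /* Return the running total, not the result of the last read(). */
    return (ssize_t)total_read;
}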
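The same pattern applies to write(). Below is a sketch of a counterpart to hard_read; the name hard_write and its return convention are assumptions made for this example.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical counterpart to hard_read: keep calling write() until all
 * count bytes are written or an error other than EINTR occurs.
 * Returns the number of bytes written; if that is less than count,
 * errno holds the error. */
ssize_t hard_write(int fd, const void *buf, size_t count)
{
    size_t total_written = 0;
    while (total_written < count) {
        ssize_t rv = write(fd, (const char *)buf + total_written,
                           count - total_written);
        if (rv < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before any byte was written */
            break;          /* real error */
        }
        total_written += (size_t)rv;
    }
    return (ssize_t)total_written;
}
```

As with hard_read, no extra buffer is needed: each retry simply continues from the offset reached so far.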
Do you wish to disable interrupts while reading/writing, or guarantee that nobody else will read/write the file while you are?
For the second, you can use fcntl() with F_SETLK or F_SETLKW to acquire and release record locks (F_SETLKW waits until a conflicting lock is released) and F_GETLK to test for them. However, since POSIX locks are only advisory, Linux does not enforce them; they are only meaningful between cooperating processes.
The first task involves diving into ring zero and disabling interrupts on your local processor (or all, if you're on an SMP system). Remember to enable them again when you're done!
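For the record-lock option, a minimal sketch of taking and releasing an advisory lock on a whole file with fcntl() might look like this. The helper name set_whole_file_lock is made up for this example.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: apply an advisory record lock covering the whole
 * file. type is F_RDLCK, F_WRLCK or F_UNLCK. F_SETLK does not block:
 * it fails with EACCES or EAGAIN if another process holds a conflicting
 * lock. Returns 0 on success, -1 on failure. */
int set_whole_file_lock(int fd, short type)
{
    struct flock fl = {0};
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;   /* 0 means "until end of file" */
    return fcntl(fd, F_SETLK, &fl) == -1 ? -1 : 0;
}
```

A cooperating writer would call set_whole_file_lock(fd, F_WRLCK) before its writes and set_whole_file_lock(fd, F_UNLCK) afterwards; processes that never ask for the lock are not held back by it.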
