I am trying to write a very simple OS to better understand the basic principles, and I need to implement a user-space malloc. I want to implement and test it on my Linux machine first.
First I implemented the sbrk() function as follows:
void* sbrk( int increment ) {
return ( void* )syscall(__NR_brk, increment );
}
But this code does not work, whereas the sbrk provided by the OS works fine.
I then tried another implementation of sbrk():
static void *sbrk(signed increment)
{
size_t newbrk;
static size_t oldbrk = 0;
static size_t curbrk = 0;
if (oldbrk == 0)
curbrk = oldbrk = brk(0);
if (increment == 0)
return (void *) curbrk;
newbrk = curbrk + increment;
if (brk(newbrk) == curbrk)
return (void *) -1;
oldbrk = curbrk;
curbrk = newbrk;
return (void *) oldbrk;
}
sbrk is invoked from this function:
static Header *morecore(unsigned nu)
{
char *cp;
Header *up;
if (nu < NALLOC)
nu = NALLOC;
cp = sbrk(nu * sizeof(Header));
if (cp == (char *) -1)
return NULL;
up = (Header *) cp;
up->s.size = nu; // ***Segmentation fault
free((void *)(up + 1));
return freep;
}
This code does not work either: I get a segmentation fault on the line marked (***).
Where is the problem?
Thanks, all. I solved my problem with a new implementation of sbrk. The following code works fine:
void* __sbrk__(intptr_t increment)
{
void *new, *old = (void *)syscall(__NR_brk, 0);
new = (void *)syscall(__NR_brk, ((uintptr_t)old) + increment);
return (((uintptr_t)new) == (((uintptr_t)old) + increment)) ? old :
(void *)-1;
}
The first sbrk should probably take a long increment, and note that the brk system call expects the new break address, not an increment. You also forgot to handle errors (and set errno).
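As a rough illustration of that advice, here is a minimal sketch of an sbrk-like function on top of the raw brk system call that takes an intptr_t increment and sets errno on failure. The name my_sbrk is hypothetical, and the sketch assumes Linux, where the raw brk syscall returns the resulting break (unchanged if the request was refused).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical name; mimics the sbrk(2) contract using the raw brk syscall. */
void *my_sbrk(intptr_t increment)
{
    /* Asking for address 0 is refused, so the kernel reports the current break. */
    uintptr_t old = (uintptr_t)syscall(SYS_brk, 0);
    if (increment == 0)
        return (void *)old;

    uintptr_t want = old + (uintptr_t)increment;
    uintptr_t got = (uintptr_t)syscall(SYS_brk, want);
    if (got != want) {
        errno = ENOMEM;        /* the break did not move to where we asked */
        return (void *)-1;
    }
    return (void *)old;        /* like sbrk: return the previous break */
}
```

Mixing this with the C library's own malloc or sbrk in one process is not safe in general, because the library keeps its own cached copy of the break.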
The second sbrk function does not change the address space the way sbrk does: on Linux the library's brk() returns 0 on success and -1 on error, not an address, so curbrk never holds the real break. You could use mmap to change the address space (but using mmap instead of sbrk won't update the kernel's view of the data segment end as sbrk does). You could use cat /proc/1234/maps to query the address space of the process with pid 1234, or even read /proc/self/maps (e.g. with fopen and fgets) from inside your program.
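Reading /proc/self/maps from inside the program could look roughly like the sketch below. The helper name find_mapping is hypothetical; it only assumes the documented Linux format in which each line starts with "start-end" in hexadecimal. It reports whether a given address lies inside a mapped region, which is a quick way to check a pointer returned by a home-made sbrk before dereferencing it.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: look up the mapping that contains addr.
   Returns 0 and fills *start / *end on success, -1 if addr is unmapped
   or /proc/self/maps cannot be read. */
int find_mapping(const void *addr, unsigned long *start, unsigned long *end)
{
    FILE *f = fopen("/proc/self/maps", "r");
    if (f == NULL)
        return -1;

    char line[4352];           /* room for the fields plus a long path */
    unsigned long a = (unsigned long)(uintptr_t)addr, lo, hi;
    int found = -1;

    while (fgets(line, sizeof line, f) != NULL) {
        /* Each line begins with "start-end" in hexadecimal. */
        if (sscanf(line, "%lx-%lx", &lo, &hi) == 2 && lo <= a && a < hi) {
            *start = lo;
            *end = hi;
            found = 0;
            break;
        }
    }
    fclose(f);
    return found;
}
```

If find_mapping returns -1 for the pointer your sbrk handed back, writing through that pointer will fault.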
BTW, sbrk is obsolete (most malloc implementations use mmap), and by definition every system call (listed in syscalls(2)) is executed by the kernel (for sbrk the kernel maintains the "data segment" limit!). So you cannot avoid the kernel, and I don't even understand why you want to emulate any system call. Almost by definition, you cannot emulate syscalls since they are the only way to interact with the kernel from a user application. From the user application's point of view, every syscall is an atomic elementary operation (done by a single machine instruction such as SYSENTER or SYSCALL, with appropriate contents in machine registers).
You could use strace(1) to understand the actual syscalls done by your running program.
BTW, GNU libc is free software. You could look into its source code. musl-libc is a simpler libc and its code is more readable.
Finally, compile with gcc -Wall -Wextra -g and use the gdb debugger (you can even query the registers if you want to). Perhaps read the x86-64 ABI specification and the Linux Assembly HOWTO.
My question is as the title says. According to my textbook:
int brk(void *end_data_segment);
The brk() system call sets the program break to the location specified by
end_data_segment. Since virtual memory is allocated in units of pages,
end_data_segment is effectively rounded up to the next page boundary.
Since on Linux sbrk() is implemented as a library function that uses the brk() system call, I expected both functions to round the program break up to the next page boundary. But when I tested on an x86_64 Linux machine (Ubuntu), both functions moved the program break to exactly the requested position (I tried brk as well; the result was the same).
int main(int argc, char *argv[])
{
void *ori = sbrk(100);
printf("original program break at %p\n", ori);
void *now = sbrk(0);
printf("program break now at %p\n", now);
return 0;
}
This is the output:
original program break at 0x56491e28f000
program break now at 0x56491e28f064
So what's going on here?
brk allocates/deallocates pages. However, that is an implementation detail, a consequence of the page being the smallest unit of memory management in a virtual-memory operating system, and it is transparent to the caller.
In the Linux kernel, brk saves the unaligned value and uses the aligned value to determine if pages need to be allocated/deallocated:
asmlinkage unsigned long sys_brk(unsigned long brk)
{
[...]
newbrk = PAGE_ALIGN(brk);
oldbrk = PAGE_ALIGN(mm->brk);
if (oldbrk == newbrk)
goto set_brk;
[...]
if (do_brk(oldbrk, newbrk-oldbrk) != oldbrk)
goto out;
set_brk:
mm->brk = brk;
[...]
}
As for sbrk: glibc calls brk and maintains the (unaligned) value of the current program break (__curbrk) in userspace:
void *__curbrk;
[...]
void *
__sbrk (intptr_t increment)
{
void *oldbrk;
if (__curbrk == NULL || __libc_multiple_libcs)
if (__brk (0) < 0) /* Initialize the break. */
return (void *) -1;
if (increment == 0)
return __curbrk;
oldbrk = __curbrk;
[...]
if (__brk (oldbrk + increment) < 0)
return (void *) -1;
return oldbrk;
}
Consequently, the return value of sbrk does not reflect the page alignment that happens in the Linux kernel.
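To make the rounding concrete, here is a small sketch of the alignment arithmetic, in the spirit of the kernel's PAGE_ALIGN. The function name page_align_up is made up for illustration, and the 4096-byte page size used in the note below is an assumption (typical on x86_64; sysconf(_SC_PAGESIZE) reports the real value).

```c
#include <stdint.h>

/* Round addr up to the next multiple of page_size.
   page_size must be a power of two (e.g. 4096). */
uint64_t page_align_up(uint64_t addr, uint64_t page_size)
{
    return (addr + page_size - 1) & ~(page_size - 1);
}
```

With the values from the question and 4096-byte pages, page_align_up(0x56491e28f064, 4096) gives 0x56491e290000: memory is mapped up to that boundary, while sbrk(0) keeps reporting the unaligned 0x56491e28f064.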
I know that a call to the glibc write function in turn calls the sys_write system call, which is a kernel function. Because it is a kernel function, the CPU has to switch to ring 0, store the process's registers, and so on.
But does it always switch to kernel mode? For example, if I do
write(-1,buffer,LENGTH)
does it still try to find it in the file descriptor array?
I see in the glibc source code that it does check for fd < 0, but I don't see any jump to the system call there (it seems like the braces of the function body close before any call to the write alias).
/* Write NBYTES of BUF to FD. Return the number written, or -1. */
ssize_t
__libc_write (int fd, const void *buf, size_t nbytes)
{
if (nbytes == 0)
return 0;
if (fd < 0)
{
__set_errno (EBADF);
return -1;
}
if (buf == NULL)
{
__set_errno (EINVAL);
return -1;
}
__set_errno (ENOSYS);
return -1;
}
libc_hidden_def (__libc_write)
stub_warning (write)
weak_alias (__libc_write, __write)
libc_hidden_weak (__write)
weak_alias (__libc_write, write)
#include <stub-tag.h>
So the question is twofold:
Where does glibc actually call sys_write?
Is it true that glibc doesn't call sys_write if fd < 0?
I see in the glibc source code that it does check for fd < 0, but I don't see any jump to the system call there
You are looking at the wrong code.
There are multiple definitions of __libc_write used under different conditions. The one you looked at is in io/write.c.
The one that is actually used on Linux is generated from sysdeps/unix/syscall-template.S and it does actually execute the switch to kernel mode (and back to user mode) even when fd==-1, etc.
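One way to convince yourself that the kernel, not the library, rejects the bad descriptor is to skip the write() wrapper and issue the system call through syscall(2). The sketch below does that; raw_write_errno is a hypothetical helper name, and the code assumes Linux with glibc (or another libc that provides syscall()).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: perform write via syscall(2), bypassing the
   write() wrapper, and return the errno reported (0 on success). */
int raw_write_errno(int fd, const void *buf, size_t len)
{
    errno = 0;
    long ret = syscall(SYS_write, fd, buf, len);
    return ret == -1 ? errno : 0;
}
```

Calling raw_write_errno(-1, "x", 1) yields EBADF, an error produced on the kernel side of the call. Running the program under strace shows the write system call being made with fd -1.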
I've looked in the kernel source for linux kernel 4.4.0-57-generic and don't see any locks in the writev() source. Is there something I'm missing? I don't see how writev() is atomic or thread-safe.
Not a kernel expert here, but I'll share my point of view anyway. Feel free to spot any mistakes.
Browsing the kernel (v4.9 though I wouldn't expect it to be so different), and trying to trace the writev(2) system call, I can observe subsequent function calls that create the following path:
1. SYSCALL_DEFINE3(writev, ..)
2. do_writev(..)
3. vfs_writev(..)
4. do_readv_writev(..)
Now the path branches, depending on whether a write_iter method is implemented and hooked on the struct file_operations field of the struct file that the system call is referring to.
If it's not NULL, the path is:
5a. do_iter_readv_writev(..), which calls the method filp->f_op->write_iter(..) at this point.
If it is NULL, the path is:
5b. do_loop_readv_writev(..), which calls repeatedly in a loop the method filp->f_op->write at this point.
So, as far as I understand, the writev() system call is as thread safe as the underlying write() (or write_iter()) is, which of course can be implemented in various ways, e.g. in a device driver, and may or may not use locks according to its needs and its design.
EDIT:
In kernel v4.4 the paths look pretty similar:
1. SYSCALL_DEFINE3(writev, ..)
2. vfs_writev(..)
3. do_readv_writev(..)
and then it depends on whether the write_iter method as a field in struct file_operations of the struct file is NULL or not, just like the case in v4.9, described above.
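For orientation, this is roughly what the user-space side of that path looks like: one writev() call handing several buffers to the kernel in a single system call. The helper name write_two is made up for this sketch; whether the combined write is atomic with respect to other writers still depends on the write_iter (or write) implementation behind the file descriptor, as described above.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: write two strings with a single writev() call.
   Returns the number of bytes written, or -1 on error. */
ssize_t write_two(int fd, const char *a, const char *b)
{
    struct iovec iov[2];

    iov[0].iov_base = (void *)a;   /* writev only reads from the buffers */
    iov[0].iov_len = strlen(a);
    iov[1].iov_base = (void *)b;
    iov[1].iov_len = strlen(b);

    return writev(fd, iov, 2);     /* one system call for both buffers */
}
```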
The VFS (Virtual File System) by itself doesn't guarantee atomicity of a writev() call. It just calls the filesystem-specific .write_iter method of struct file_operations.
It is the responsibility of the specific filesystem implementation to make the method write to the file atomically.
For example, in the ext4 filesystem the function ext4_file_write_iter uses
mutex_lock(&inode->i_mutex);
to make writing atomic.
Found it in fs.h:
static inline void file_start_write(struct file *file)
{
if (!S_ISREG(file_inode(file)->i_mode))
return;
__sb_start_write(file_inode(file)->i_sb, SB_FREEZE_WRITE, true);
}
and then in super.c:
/*
* This is an internal function, please use sb_start_{write,pagefault,intwrite}
* instead.
*/
int __sb_start_write(struct super_block *sb, int level, bool wait)
{
bool force_trylock = false;
int ret = 1;
#ifdef CONFIG_LOCKDEP
/*
* We want lockdep to tell us about possible deadlocks with freezing
* but it's it bit tricky to properly instrument it. Getting a freeze
* protection works as getting a read lock but there are subtle
* problems. XFS for example gets freeze protection on internal level
* twice in some cases, which is OK only because we already hold a
* freeze protection also on higher level. Due to these cases we have
* to use wait == F (trylock mode) which must not fail.
*/
if (wait) {
int i;
for (i = 0; i < level - 1; i++)
if (percpu_rwsem_is_held(sb->s_writers.rw_sem + i)) {
force_trylock = true;
break;
}
}
#endif
if (wait && !force_trylock)
percpu_down_read(sb->s_writers.rw_sem + level-1);
else
ret = percpu_down_read_trylock(sb->s_writers.rw_sem + level-1);
WARN_ON(force_trylock & !ret);
return ret;
}
EXPORT_SYMBOL(__sb_start_write);
Thanks again.
I'm trying to use copy_from_user to copy a buffer from user space into a kernel buffer.
Both buffers have been allocated. I use a while loop in case not all the bytes are copied on the first try, but for some reason nothing is copied and the program gets stuck in the loop.
What could be the reason for that?
void my_copy_from_user(const char* source_buff, char* dest_buff, int size_to_copy){
int not_copied = size_to_copy;
int left = size_to_copy;
while( not_copied ){
not_copied = copy_from_user(dest_buff, source_buff, left);
dest_buff += (left - not_copied);
source_buff += (left - not_copied);
left = not_copied;
}
}
It is possible that it is legitimately failing for reasons that you cannot recover from.
Please look at: http://lxr.free-electrons.com/source/arch/x86/lib/usercopy_32.c#L681
unsigned long _copy_from_user(void *to, const void __user *from, unsigned n)
{
if (access_ok(VERIFY_READ, from, n))
n = __copy_from_user(to, from, n);
else
memset(to, 0, n);
return n;
}
This is the underlying implementation of copy_from_user for Linux on x86 processors. It first checks access_ok. If access is not allowed, it fails and immediately returns n (the number of bytes you asked to copy). In your loop not_copied then never decreases, which causes the infinite loop.
Two points:
I do not think you should invoke copy_from_user in a loop like that. If it fails to copy in kernel mode, there is a reason why. This is a different beast from read() functions when reading from sockets, etc, where you are encouraged to read() in a loop.
Are you sure that you are passing in the correct dest_buff to copy_from_user?
Tips:
Use printk to print all the values and see what's happening. Is left changing or not? Probably not.
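If a retry loop is kept at all, it needs a way out when a call makes no progress. The sketch below is a user-space model of that idea, not kernel code: copy_fn, copy_all, fake_copy_ok and fake_copy_fail are all hypothetical names, and the two fake functions merely stand in for copy_from_user, sharing its convention of returning the number of bytes that could not be copied.

```c
#include <errno.h>
#include <string.h>

/* Same return convention as copy_from_user: bytes NOT copied (0 = success). */
typedef unsigned long (*copy_fn)(void *to, const void *from, unsigned long n);

/* Stand-in that always succeeds (user-space model only). */
unsigned long fake_copy_ok(void *to, const void *from, unsigned long n)
{
    memcpy(to, from, n);
    return 0;
}

/* Stand-in that always fails, like a failed access_ok check. */
unsigned long fake_copy_fail(void *to, const void *from, unsigned long n)
{
    (void)to;
    (void)from;
    return n;
}

/* Retry partial copies, but give up with -EFAULT as soon as a call
   makes no progress, so the loop cannot spin forever. */
int copy_all(copy_fn copy, char *dest, const char *src, unsigned long size)
{
    unsigned long left = size;

    while (left > 0) {
        unsigned long not_copied = copy(dest, src, left);
        if (not_copied == left)
            return -EFAULT;        /* nothing copied: bad address */
        dest += left - not_copied;
        src += left - not_copied;
        left = not_copied;
    }
    return 0;
}
```

In an actual kernel module the common idiom is simpler still: call copy_from_user once and return -EFAULT if it returns a non-zero value.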
I was wondering whether functions like strcpy or strcat cause any system call, or whether they are handled internally by the OS.
No system call is involved. In fact, a straightforward implementation looks like this (production libraries often ship optimized versions, but those make no system call either):
char *
strcpy(char *s1, const char *s2) {
char *s = s1;
while ((*s++ = *s2++) != 0) ;
return (s1);
}
strcat is similar:
char *
strcat(char *s1, const char *s2)
{
strcpy(&s1[strlen(s1)], s2);
return s1;
}
On Linux, those functions are implemented by the standard C library (see also glibc). System calls are invocations from user code into kernel code for services such as hardware access or memory allocation; on 32-bit x86 Linux they were traditionally made with interrupt 0x80, while x86-64 uses the syscall instruction.
No OS calls are required for such simple operations; they can be performed entirely in the library.
Note that the OS may still be entered during such calls, e.g. because they trigger a page fault or some other hardware interrupt occurs.