Measure time taken for the Linux kernel from boot-up to userspace

Is there a kernel-instrumentation-based way to measure the time at which the kernel hands over to userspace during boot-up? I could use printk's with timing information, but I wasn't sure where exactly to place such a printk in order to observe when the kernel transfers control to userspace.

start_kernel() is called by architecture-specific code (under arch/<architecture_type>). After the kernel has finished initialising, it starts the first user-space process, i.e. /sbin/init (or systemd on a more recent distribution), from init_post(). Both of these functions are defined in init/main.c.
You might want to read this blog for a detailed description of the boot process.

Create your own init that logs to /dev/kmsg immediately
A simpler alternative to hacking the kernel code with a printk is to use the following init:
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Write a marker straight into the kernel log buffer. */
    FILE *fp = fopen("/dev/kmsg", "w");
    if (fp) {
        fputs("hello init\n", fp);
        fclose(fp);
    }
    /* init must never exit, so sleep forever. */
    while (1)
        sleep(0xFFFFFFFF);
}
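If the init lives in an initramfs there may be no dynamic loader or shared libc available, so it is safest to build it statically (the file names here are just placeholders matching the command line below):
gcc -static -o /path/to/myinit myinit.c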
and use the kernel command line parameters:
init=/path/to/myinit printk.devkmsg=on printk.time=y
Now just after the boot ends and init starts we see a message:
[<timestamp>] hello init
This is not 100% precise as you will lose some CPU cycles for the fopen, but I don't think it will matter much.
Minimal reproducible setup to test it out:
https://github.com/cirosantilli/linux-kernel-module-cheat/tree/88cd83cd020af12b987afab5e990099f7efa2107#custom-init
https://github.com/cirosantilli/linux-kernel-module-cheat/blob/88cd83cd020af12b987afab5e990099f7efa2107/kernel_module/user/init_dev_kmsg.c

mmap flag MAP_UNINITIALIZED not defined

The mmap() docs mention the flag MAP_UNINITIALIZED, but the flag doesn't seem to be defined.
I tried CentOS 7 and Xenial; neither distro has the flag defined in sys/mman.h as the docs suggest.
Astonishingly, the internet doesn't seem to be aware of this. What's the story?
Edit: I understand from the docs that the flag is only honoured on embedded or low-security devices, but that doesn't mean the flag shouldn't be defined... How do you use it in portable code? Google has turned up code where it is defined as 0 when not supported, except in my case it's not defined at all.
In order to understand what to do about the fact that #include <sys/mman.h> does not define MAP_UNINITIALIZED, it is helpful to understand how the interface to the kernel is defined.
To build a kernel module, you need the kernel headers matching the exact kernel version you are building the module for. Since you want to run in userspace, you won't need these.
The headers that define the kernel API for userspace are largely in /usr/include/linux and /usr/include/asm (see this for how they are generated). One of the more important consumers of these headers is the C standard library, e.g., glibc, which must be built against some version of these headers. Since the linux kernel API is backwards compatible, you may have a glibc (or other library implementation) built against an older version of these headers than the kernel you are running. I'm by no means an expert on how all the various distros distribute glibc, but it is my impression that the kernel headers defining its userspace API are generally the version that glibc has been built against.
Finally, glibc defines its API through headers also installed under /usr/include such as /usr/include/sys. I don't know exactly what, if any, backward or forward compatibility is provided for applications built with older or newer glibc headers, but I'm guessing that the library .so version number gets bumped when backward compatibility would be broken.
So now we can understand your problem to be that the glibc headers don't actually define MAP_UNINITIALIZED for the distros/versions that you tried.
However, the linux kernel API has exposed MAP_UNINITIALIZED, as this patch demonstrates. If the glibc headers don't define it for you, you can use the linux kernel API headers and #include <linux/mman.h> if this defines it. Note that you will still need to #include <sys/mman.h> in order to get the prototype for mmap, among other things.
If your linux kernel API headers don't define MAP_UNINITIALIZED but you have a kernel version that implements it, you can define it yourself:
#define MAP_UNINITIALIZED 0x4000000
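For portable code, a minimal sketch of the fallback approach (the helper name is made up; the flag is silently ignored unless the kernel was built with CONFIG_MMAP_ALLOW_UNINITIALIZED):
#include <stddef.h>
#include <sys/mman.h>    /* mmap prototype and the other MAP_* flags */

#ifndef MAP_UNINITIALIZED
#define MAP_UNINITIALIZED 0x4000000   /* value used by the kernel headers */
#endif

/* Illustrative helper: request anonymous memory without zero-filling. */
static void *alloc_uninitialized(size_t len)
{
    return mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_UNINITIALIZED, -1, 0);
}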
You don't have to worry that you are effectively using "newer" headers than your glibc was built with, because the glibc implementation of mmap is very thin:
#include <sys/types.h>
#include <sys/mman.h>
#include <errno.h>
#include <sysdep.h>

#ifndef MMAP_PAGE_SHIFT
#define MMAP_PAGE_SHIFT 12
#endif

__ptr_t
__mmap (__ptr_t addr, size_t len, int prot, int flags, int fd, off_t offset)
{
  if (offset & ((1 << MMAP_PAGE_SHIFT) - 1))
    {
      __set_errno (EINVAL);
      return MAP_FAILED;
    }
  return (__ptr_t) INLINE_SYSCALL (mmap2, 6, addr, len, prot, flags, fd,
                                   offset >> MMAP_PAGE_SHIFT);
}
weak_alias (__mmap, mmap)
It is just passing your flags straight through to the kernel.
The kernel normally needs to clear the memory to protect the privacy of both kernel space and other processes' memory.
Reading on in the man page:
This flag is honored only if the kernel was configured with the CONFIG_MMAP_ALLOW_UNINITIALIZED option. Because of the security implications, that option is normally enabled only on embedded devices (i.e., devices where one has complete control of the contents of user memory).

What linux kernel code creates /sys/devices/system/cpu/cpuX?

I am developing a cpufreq driver (as a loadable kernel module) for the microblaze architecture. I have some FPGA logic that is able to scale the on-system clock and it works quite well. I have followed the information in Documentation/cpu-freq/cpu-drivers.txt and looked at the model in the blackfin cpufreq driver.
I have also made the necessary changes to arch/microblaze/Kconfig in order to have the cpufreq options built into the kernel (not modules).
When I first loaded the driver, cpufreq_register_driver() was returning -ENODEV, which implied that it couldn't find a CPU. I set the driver flag to CPUFREQ_STICKY and was able to insert the module.
However, at this point I realized that /sys/devices/system/cpu/cpu0 isn't present (although /sys/devices/system/cpu/cpufreq is there). So, why is that? What part of the kernel code is responsible for creating that directory?
I discovered where the /sys/devices/system/cpu/cpuX sysfs entry was created by looking at the cpufreq_cpu_callback() in drivers/cpufreq/cpufreq.c. This has a call to get_cpu_sysdev(), which I assumed was the element that I was looking for.
That call is defined in drivers/base/cpu.c, where I also noticed the code that puts together the CPU-specific sysdev entry: register_cpu(). For most architectures, register_cpu() is called from arch/${ARCH}/kernel/setup.c, and I used the blackfin code as an example.
DEFINE_PER_CPU(struct cpu, cpu_data);

static int __init topology_init(void)
{
    unsigned int cpu;

    /* Register a sysfs CPU device for each possible CPU; this is what
       creates /sys/devices/system/cpu/cpuX. */
    for_each_possible_cpu(cpu) {
        register_cpu(&per_cpu(cpu_data, cpu), cpu);
    }
    return 0;
}
subsys_initcall(topology_init);
After adding this code to arch/microblaze/kernel/setup.c, I now have the directory I need and I'm able to make use of the different governors available to talk to my cpufreq driver. Now I just have to make sleep 1 take 1 second at 1/3 the clock rate instead of 3 seconds!

How to allocate a new TLS area with clone system call

Short version of question: what parameter do I need to pass to the clone system call on an x86_64 Linux system if I want to allocate a new TLS area for the thread that I am creating?
Long version:
I am working on a research project and for something I am experimenting with I want to create threads using the clone system call instead of using pthread_create. However, I also want to be able to use thread local storage. I don't plan on creating many threads right now, so it would be fine for me to create a new TLS area for each thread that I create with the clone system call.
I was looking at the man page for clone and it has the following information about the flag for the TLS parameter:
CLONE_SETTLS (since Linux 2.5.32)
The newtls argument is the new TLS (Thread Local Storage) descriptor.
(See set_thread_area(2).)
So I looked at the man page for set_thread_area and noticed the following which looked promising:
When set_thread_area() is passed an entry_number of -1, it uses a
free TLS entry. If set_thread_area() finds a free TLS entry, the value of
u_info->entry_number is set upon return to show which entry was changed.
However, after experimenting with this a bit, it appears that set_thread_area is not implemented on my system (Ubuntu 10.04 on an x86_64 platform). When I run the following code I get an error that says: set_thread_area() failed: Function not implemented
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <linux/unistd.h>
#include <asm/ldt.h>

int main()
{
    struct user_desc u_info;
    u_info.entry_number = -1;

    int rc = syscall(SYS_set_thread_area, &u_info);
    if (rc < 0) {
        perror("set_thread_area() failed");
        exit(-1);
    }
    printf("entry_number is %d\n", u_info.entry_number);
}
I also saw, when using strace to see what happens when pthread_create is called, that there are no calls to set_thread_area. I have also been looking at the nptl pthread source code to try to understand what it does when creating threads. But I don't completely understand it yet, and I think it is more complex than what I'm trying to do, since I don't need something as robust as the pthread implementation. I'm assuming that the set_thread_area system call is for x86 and that a different mechanism is used for x86_64. But for the moment I have not been able to figure out what it is, so I'm hoping this question will help me get some ideas about what I need to look at.
I am working on a research project and for something I am experimenting with I want to create threads using the clone system call instead of using pthread_create
In the exceedingly unlikely scenario where your new thread never calls any libc functions (either directly, or by calling something else which calls libc; this also includes dynamic symbol resolution via the PLT), you can pass whatever TLS storage you desire as the new_tls parameter to clone.
You should ignore all references to set_thread_area -- they only apply to the 32-bit/ix86 case.
If you are planning to use libc in your newly-created thread, you should abandon your approach: libc expects TLS to be set up a certain way, and there is no way for you to arrange for such setup when you call clone directly. Your new thread will intermittently crash when libc discovers that you didn't set up TLS properly. Debugging such crashes is exceedingly difficult, and the only reliable solution is ... to use pthread_create.
The other answer is absolutely correct in that setting up a thread outside of libc's control is guaranteed to cause trouble at a certain point. You can do it, but you can no longer rely on libc's services, definitely not on any of the pthread_* functions or thread-local variables (defined as such using __thread or thread_local).
That being said, you can set either of the segment registers used for TLS (GS and FS) even on x86-64. The system call to look for is arch_prctl(ARCH_SET_GS, ...).
You can see an example comparing setting up TLS registers on i386 and x86-64 in this piece of code.
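For illustration, a minimal sketch of setting the GS base with arch_prctl on x86_64 (the raw syscall is used here on the assumption that glibc provides no wrapper; the tls_area buffer and its contents are made up for the example):
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <asm/prctl.h>      /* ARCH_SET_GS */

static uint64_t tls_area[512];   /* stand-in for a real TLS block */

int main(void)
{
    tls_area[0] = 42;            /* something to read back through %gs */

    /* Point the GS base at our block; glibc keeps using FS for its own TLS. */
    if (syscall(SYS_arch_prctl, ARCH_SET_GS, tls_area) != 0) {
        perror("arch_prctl(ARCH_SET_GS)");
        return 1;
    }

    uint64_t v;
    __asm__ volatile ("movq %%gs:0, %0" : "=r" (v));
    printf("read %llu via %%gs\n", (unsigned long long)v);
    return 0;
}
As far as I understand, handing the same address to clone as its tls argument together with CLONE_SETTLS loads it into the new thread's FS base on x86_64, which is exactly what would conflict with libc's own use of FS.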

Disabling vsyscalls in Linux

I'm working on a piece of software that monitors other processes' system calls using ptrace(2). Unfortunately, most modern operating systems implement some kind of fast user-mode syscalls, which are called vsyscalls in Linux.
Is there any way to disable the use of vsyscalls/vDSO for a single process or, if that is not possible, for the whole operating system?
Try echo 0 > /proc/sys/kernel/vsyscall64
If you're trying to ptrace gettimeofday calls and they aren't showing up, what time source is the system using (pmtimer, acpi, tsc, hpet, etc.)? I wonder if you'd humor me by trying to force your timer to something older like pmtimer. It's possible one of the many gettimeofday timer-specific optimizations is causing your ptrace calls to be avoided, even with vsyscall64 set to zero.
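For reference, the active and available clock sources can be checked and switched through sysfs (assuming the usual clocksource interface is present on your kernel; acpi_pm is only listed if that hardware exists):
cat /sys/devices/system/clocksource/clocksource0/available_clocksource
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
echo acpi_pm > /sys/devices/system/clocksource/clocksource0/current_clocksource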
Is there any way to disable the use of vsyscalls/vDSO for a single process or, if that is not possible, for the whole operating system?
It turns out there IS a way to effectively disable linking vDSO for a single process without disabling it system-wide using ptrace!
All you have to do is stop the traced process before it returns from execve and remove the AT_SYSINFO_EHDR entry from the auxiliary vector (which comes directly after the environment variables in the memory region pointed to by rsp). PTRACE_EVENT_EXEC is a good place to do this.
AT_SYSINFO_EHDR is what the kernel uses to tell the dynamic linker where the vDSO is mapped in the process's address space. If this entry is not present, the loader seems to act as if the system hasn't mapped a vDSO.
Note that this doesn't somehow unmap the vDSO from your process's memory; it merely causes the loader to ignore it when linking other shared libraries. A malicious program will still be able to interact with it if the author really wants to.
I know this answer is a bit late, but I hope this information will spare some poor soul a headache
For newer systems, echo 0 > /proc/sys/kernel/vsyscall64 might not work. On Ubuntu 16.04, the vDSO can be disabled system-wide by adding the kernel parameter vdso=0 to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub.
IMPORTANT: The GRUB_CMDLINE_LINUX_DEFAULT parameter might be overwritten by other configuration files in /etc/default/grub.d/..., so double-check where to add your custom configuration.
Picking up on Tenders McChiken's approach, I created a wrapper that disables the vDSO for an arbitrary binary without affecting the rest of the system: https://github.com/danteu/novdso
The general procedure is quite simple:
use ptrace to wait for return from execve(2)
find the address of the auxvector
overwrite the AT_SYSINFO_EHDR key with AT_IGNORE, which tells the dynamic linker to skip that entry (a rough sketch of this step follows below)
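A minimal sketch of that auxv rewrite, assuming a 64-bit tracee that is already stopped right after execve and whose stack pointer sp was fetched with PTRACE_GETREGS; the helper name is made up and error handling is omitted:
#include <elf.h>            /* AT_NULL, AT_IGNORE, AT_SYSINFO_EHDR */
#include <stddef.h>
#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/types.h>

static void hide_vdso(pid_t pid, uint64_t sp)
{
    uint64_t p = sp;
    /* Initial stack layout: argc, argv[], NULL, envp[], NULL, auxv pairs. */
    uint64_t argc = ptrace(PTRACE_PEEKDATA, pid, (void *)p, NULL);
    p += 8 * (argc + 2);                               /* skip argc, argv, NULL */
    while (ptrace(PTRACE_PEEKDATA, pid, (void *)p, NULL))
        p += 8;                                        /* skip envp entries */
    p += 8;                                            /* skip envp's NULL */

    /* auxv is an array of { uint64_t a_type; uint64_t a_val; } pairs. */
    for (;; p += 16) {
        uint64_t type = ptrace(PTRACE_PEEKDATA, pid, (void *)p, NULL);
        if (type == AT_NULL)
            break;
        if (type == AT_SYSINFO_EHDR) {
            ptrace(PTRACE_POKEDATA, pid, (void *)p, (void *)(uint64_t)AT_IGNORE);
            break;
        }
    }
}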
I know this is an older question, but nobody has mentioned a third useful way of disabling the vDSO on a per-process basis: you can override the libc functions with your own versions that perform the actual system call, using LD_PRELOAD.
A simple shared library for overriding the gettimeofday and time functions, for example, could look like this:
vdso_override.c:
#include <time.h>
#include <sys/time.h>
#include <unistd.h>
#include <sys/syscall.h>

int gettimeofday(struct timeval *restrict tv, struct timezone *restrict tz)
{
    return syscall(__NR_gettimeofday, (long)tv, (long)tz, 0, 0, 0, 0);
}

time_t time(time_t *tloc)
{
    return syscall(__NR_time, (long)tloc, 0, 0, 0, 0, 0);
}
This uses the libc wrapper to issue a raw system call (see syscall(2)), so the vDSO is circumvented. You would have to override every system call that the vDSO exports on your architecture in this way (they are listed in vdso(7)).
Compile with
gcc -fpic -shared -o vdso_override.so vdso_override.c
Then run any program in which you want to disable VDSO calls as follows:
LD_PRELOAD=./vdso_override.so <some program>
This of course only works if the program you are running is not actively trying to circumvent this. While you can override a symbol using LD_PRELOAD, if the target program really wants to, there is a way to find the original symbol and use that instead.

Linux using a driver from inside a driver

I am trying to interface to a microcontroller from my linux box via RS232 serial.
I have written the driver and implemented a protocol between the PC and the microcontroller, which uses a tty device (/dev/ttyS0) already present in the kernel as a module (e.g. by calling open, close, etc.). However, when I try to compile, it says it cannot find references to open, write, read, etc.
How do I just use an existing device driver from within a driver? Is there something else I need to include?
If not, how can I use the serial port easily from within a driver?
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/fs.h>
#include <linux/uaccess.h>
#include <linux/init.h>
#include <linux/slab.h>
#include <linux/cdev.h>
#include <linux/spinlock.h>
#include <linux/termios.h>
#include <linux/fcntl.h>
#include <linux/unistd.h>
Normally you should do such a thing in userspace - implement your device's protocol in a normal, userspace program.
It is possible, but definitely not recommended, to do these things in the kernel. For example, the ppp driver implements a network driver on top of a serial driver. I don't know how it works in that case, but I'd expect that a userspace helper program opens the device, initialises its parameters, etc., then passes the file descriptor into the kernel using some system call.
You cannot call arbitrary library functions from the kernel - or indeed, any library functions at all (except libraries which are actually shipped as part of the kernel). That includes the userspace system call wrappers such as open, read and write. There are in-kernel equivalents which it may be possible to call - for example, filp_open.
In most cases you can't just call the normal syscalls from the kernel anyway, as they expect pointers to userspace data, whereas in the kernel your buffers (allocated via kmalloc, etc.) will normally point to kernel-space data. The two can't be freely mixed.
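For completeness, a minimal sketch of the in-kernel route (which, again, is not the recommended approach): the module and function names are made up, and the kernel_write() signature shown is the post-4.14 one, so older kernels will need adjustments:
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/fcntl.h>
#include <linux/err.h>

static int __init serial_demo_init(void)
{
    struct file *tty;
    loff_t pos = 0;
    static const char msg[] = "hello from the kernel\r\n";

    /* Open the serial device from kernel context instead of via open(2). */
    tty = filp_open("/dev/ttyS0", O_RDWR | O_NOCTTY, 0);
    if (IS_ERR(tty))
        return PTR_ERR(tty);

    /* Write without going through the userspace syscall wrappers. */
    kernel_write(tty, msg, sizeof(msg) - 1, &pos);
    filp_close(tty, NULL);
    return 0;
}

static void __exit serial_demo_exit(void)
{
}

module_init(serial_demo_init);
module_exit(serial_demo_exit);
MODULE_LICENSE("GPL");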
