Implementing a system call for CPU hotplug on RPI3/ModelB - linux

My goal is to implement a system call in linux kernel that enables/disables a CPU core.
First, I implemented a system call that disbales CPU3 in a 4-core system.
The system call code is as follows:
#include <linux/kernel.h>
#include <linux/slab.h>
#include <asm/uaccess.h>
#include <asm/unistd.h>
#include <linux/cpumask.h>
asmlinkage long sys_new_syscall(void)
{
unsigned int cpu3 = 3;
set_cpu_online (cpu3, false) ; /* clears the CPU in the cpumask */
printk ("CPU%u is offline\n", cpu3);
return 0;
}
The system call was registered correctly in the kernel and I enabled 'cpu hotplug' feature during kernel configuration ( See picture )
Kernel configuration:
The kernel was build . But when I check the system call using test.c :
#include <stdio.h>
#include <linux/kernel.h>
#include <sys/syscall.h>
#include <unistd.h>
long new_syscall(void)
{
return syscall(394);
}
int main(int argc, char *argv[])
{
long int a = new_syscall();
printf("System call returned %ld\n", a);
return 0;
}
The OS frezzes !
What am I doing wrong ?

why would you want to implement a dedicated syscall? the standard way of offlining cpus is through writes to sysfs. in the extremely unlikely case there is a valid reason to create a dedicated syscall you will have to check how offlining works under the hood and repeat that.
set_cpu_online (cpu3, false) ; /* clears the CPU in the cpumask */
your own comment strongly suggests this is too simplistic. for instance what if the thread executing this is running on said cpu? what about threads which are queued on it?
and so on

This is kind of an old topic, but you can put a CPU up/down in kernel land by using the functions cpu_up(cpu_id) and cpu_down(cpu_id), from include/linux/cpu.h.
It seems that set_cpu_online is not exported since it doesn't seems to be safe from other kernel parts stand point (it doesn't consider process affinity and other complexities, for example).
So, your system call could be written as:
asmlinkage long sys_new_syscall(void)
{
unsigned int cpu3 = 3;
cpu_down(cpu3) ; /* clears the CPU in the cpumask */
printk ("CPU%u is offline\n", cpu3);
return 0;
}
I have an example module using those methods here: https://github.com/pappacena/cpuautoscaling.

Related

pthread_create allocates a lot of memory in Linux?

Here's a simple example
#include <iostream>
#include <thread>
#include <vector>
#include <chrono>
void* run(void*)
{
while (true)
std::this_thread::sleep_for(std::chrono::seconds(1));
}
int main()
{
std::vector<pthread_t> workers(192);
for (unsigned i = 0; i < workers.size(); ++i)
pthread_create(&workers[i], nullptr, &run, nullptr);
pthread_join(workers.back(), nullptr);
}
top shows 1'889'356 KiB VIRT! I know this isn't resident memory, but still, this is huge amount of memory for a single thread creation.
Is it really so memory-expensive (8MiB in this case) to create a thread? Is this configurable?
Or, maybe and most probably, I have some misunderstanding what virtual memory is?
Details:
I double quadruple-checked the memory usage, using:
generated a core dump of the running exe, it's also 1.6GB;
valgrind --tool=massif also confirms this size;
pmap -x <pid> also confirms the size.
As this size matches the max size of a stack (also confirmed by /proc/<pid>/limits), I tried to make the max size of the stack smaller. Tried with 1 MiB, but this didn't change anything.
Please, put aside the creation and usage of 192 threads, it has a reason behind it.
Sorry for the mixed C and C++ - initially tried with std::thread and the results are the same.
pthread_attr_setstacksize() function is available to set stack size.
This function have to be used with an thread attribute object.
The thread attribute object has to be passed as 2nd argument of pthread_create().
#include <iostream>
#include <thread>
#include <vector>
#include <chrono>
void* run(void*)
{
while (true)
std::this_thread::sleep_for(std::chrono::seconds(1));
}
int main()
{
std::vector<pthread_t> workers(192);
pthread_attr_t attr;
pthread_attr_init(&attr);
pthread_attr_setstacksize(&attr, 16384);
for (unsigned i = 0; i < workers.size(); ++i)
pthread_create(&workers[i], &attr, &run, nullptr);
pthread_join(workers.back(), nullptr);
}

Kernel probe not inserted for system_call function

I can use kprobe mechanism to attach handlers using following example code:
#include <asm/uaccess.h>
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/version.h>
#include <linux/kallsyms.h>
#include <linux/slab.h>
#include <linux/init.h>
#include <linux/kprobes.h>
static struct kprobe kp;
int Pre_Handler(struct kprobe *p, struct pt_regs *regs){
printk("pre_handler\n");
return 0;
}
void Post_Handler(struct kprobe *p, struct pt_regs *regs, unsigned long flags) {
printk("post_handler\n");
}
int __init init (void) {
kp.pre_handler = Pre_Handler;
kp.post_handler = Post_Handler;
kp.addr = (kprobe_opcode_t *)kallsyms_lookup_name("sys_fork");
printk("%d\n", register_kprobe(&kp));
return 0;
}
void __exit cleanup(void) {
unregister_kprobe(&kp);
}
MODULE_LICENSE("GPL");
module_init(init);
module_exit(cleanup);
However, it looks like not all kernel routines can be tracked this way. I've tried to attach handlers to system_call to have them called with any system call execution with following change:
kp.addr = (kprobe_opcode_t *)kallsyms_lookup_name("system_call");
And probes aren't inserted. dmesg shows that register_kprobe returns -22 which is -EINVAL. Why is this function impossible to trace? Is it possible to attach kprobe handler before dispatching any system call?
$ uname -r
3.8.0-29-generic
system_call is protected from kprobes, it is not possible to probe the system_call function.
I think we don't have any useful information that you can get before any actual system call gets invoked. for example if you see the function system_call:
RING0_INT_FRAME # can't unwind into user space anyway
ASM_CLAC
pushl_cfi %eax # save orig_eax
SAVE_ALL
GET_THREAD_INFO(%ebp)
# system call tracing in operation / emulation
testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%ebp)
jnz syscall_trace_entry
cmpl $(NR_syscalls), %eax
jae syscall_badsys
syscall_call:
call *sys_call_table(,%eax,4)
there are a few instructions before your actual system call is invoked. yeah, I am not sure if you need any information in these instructions.

Intercepting a system call [duplicate]

I'm trying to write some simple test code as a demonstration of hooking the system call table.
"sys_call_table" is no longer exported in 2.6, so I'm just grabbing the address from the System.map file, and I can see it is correct (Looking through the memory at the address I found, I can see the pointers to the system calls).
However, when I try to modify this table, the kernel gives an "Oops" with "unable to handle kernel paging request at virtual address c061e4f4" and the machine reboots.
This is CentOS 5.4 running 2.6.18-164.10.1.el5. Is there some sort of protection or do I just have a bug? I know it comes with SELinux, and I've tried putting it in to permissive mode, but it doesn't make a difference
Here's my code:
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/unistd.h>
void **sys_call_table;
asmlinkage int (*original_call) (const char*, int, int);
asmlinkage int our_sys_open(const char* file, int flags, int mode)
{
printk("A file was opened\n");
return original_call(file, flags, mode);
}
int init_module()
{
// sys_call_table address in System.map
sys_call_table = (void*)0xc061e4e0;
original_call = sys_call_table[__NR_open];
// Hook: Crashes here
sys_call_table[__NR_open] = our_sys_open;
}
void cleanup_module()
{
// Restore the original call
sys_call_table[__NR_open] = original_call;
}
I finally found the answer myself.
http://www.linuxforums.org/forum/linux-kernel/133982-cannot-modify-sys_call_table.html
The kernel was changed at some point so that the system call table is read only.
cypherpunk:
Even if it is late but the Solution
may interest others too: In the
entry.S file you will find: Code:
.section .rodata,"a"
#include "syscall_table_32.S"
sys_call_table -> ReadOnly You have to
compile the Kernel new if you want to
"hack" around with sys_call_table...
The link also has an example of changing the memory to be writable.
nasekomoe:
Hi everybody. Thanks for replies. I
solved the problem long ago by
modifying access to memory pages. I
have implemented two functions that do
it for my upper level code:
#include <asm/cacheflush.h>
#ifdef KERN_2_6_24
#include <asm/semaphore.h>
int set_page_rw(long unsigned int _addr)
{
struct page *pg;
pgprot_t prot;
pg = virt_to_page(_addr);
prot.pgprot = VM_READ | VM_WRITE;
return change_page_attr(pg, 1, prot);
}
int set_page_ro(long unsigned int _addr)
{
struct page *pg;
pgprot_t prot;
pg = virt_to_page(_addr);
prot.pgprot = VM_READ;
return change_page_attr(pg, 1, prot);
}
#else
#include <linux/semaphore.h>
int set_page_rw(long unsigned int _addr)
{
return set_memory_rw(_addr, 1);
}
int set_page_ro(long unsigned int _addr)
{
return set_memory_ro(_addr, 1);
}
#endif // KERN_2_6_24
Here's a modified version of the original code that works for me.
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/unistd.h>
#include <asm/semaphore.h>
#include <asm/cacheflush.h>
void **sys_call_table;
asmlinkage int (*original_call) (const char*, int, int);
asmlinkage int our_sys_open(const char* file, int flags, int mode)
{
printk("A file was opened\n");
return original_call(file, flags, mode);
}
int set_page_rw(long unsigned int _addr)
{
struct page *pg;
pgprot_t prot;
pg = virt_to_page(_addr);
prot.pgprot = VM_READ | VM_WRITE;
return change_page_attr(pg, 1, prot);
}
int init_module()
{
// sys_call_table address in System.map
sys_call_table = (void*)0xc061e4e0;
original_call = sys_call_table[__NR_open];
set_page_rw(sys_call_table);
sys_call_table[__NR_open] = our_sys_open;
}
void cleanup_module()
{
// Restore the original call
sys_call_table[__NR_open] = original_call;
}
Thanks Stephen, your research here was helpful to me. I had a few problems, though, as I was trying this on a 2.6.32 kernel, and getting WARNING: at arch/x86/mm/pageattr.c:877 change_page_attr_set_clr+0x343/0x530() (Not tainted) followed by a kernel OOPS about not being able to write to the memory address.
The comment above the mentioned line states:
// People should not be passing in unaligned addresses
The following modified code works:
int set_page_rw(long unsigned int _addr)
{
return set_memory_rw(PAGE_ALIGN(_addr) - PAGE_SIZE, 1);
}
int set_page_ro(long unsigned int _addr)
{
return set_memory_ro(PAGE_ALIGN(_addr) - PAGE_SIZE, 1);
}
Note that this still doesn't actually set the page as read/write in some situations. The static_protections() function, which is called inside of set_memory_rw(), removes the _PAGE_RW flag if:
It's in the BIOS area
The address is inside .rodata
CONFIG_DEBUG_RODATA is set and the kernel is set to read-only
I found this out after debugging why I still got "unable to handle kernel paging request" when trying to modify the address of kernel functions. I was eventually able to solve that problem by finding the page table entry for the address myself and manually setting it to writable. Thankfully, the lookup_address() function is exported in version 2.6.26+. Here is the code I wrote to do that:
void set_addr_rw(unsigned long addr) {
unsigned int level;
pte_t *pte = lookup_address(addr, &level);
if (pte->pte &~ _PAGE_RW) pte->pte |= _PAGE_RW;
}
void set_addr_ro(unsigned long addr) {
unsigned int level;
pte_t *pte = lookup_address(addr, &level);
pte->pte = pte->pte &~_PAGE_RW;
}
Finally, while Mark's answer is technically correct, it'll case problem when ran inside Xen. If you want to disable write-protect, use the read/write cr0 functions. I macro them like this:
#define GPF_DISABLE write_cr0(read_cr0() & (~ 0x10000))
#define GPF_ENABLE write_cr0(read_cr0() | 0x10000)
Hope this helps anyone else who stumbles upon this question.
Note that the following will also work instead of using change_page_attr and cannot be depreciated:
static void disable_page_protection(void) {
unsigned long value;
asm volatile("mov %%cr0,%0" : "=r" (value));
if (value & 0x00010000) {
value &= ~0x00010000;
asm volatile("mov %0,%%cr0": : "r" (value));
}
}
static void enable_page_protection(void) {
unsigned long value;
asm volatile("mov %%cr0,%0" : "=r" (value));
if (!(value & 0x00010000)) {
value |= 0x00010000;
asm volatile("mov %0,%%cr0": : "r" (value));
}
}
If you are dealing with kernel 3.4 and later (it can also work with earlier kernels, I didn't test it) I would recommend a smarter way to acquire the system callы table location.
For example
#include <linux/module.h>
#include <linux/kallsyms.h>
static unsigned long **p_sys_call_table;
/* Aquire system calls table address */
p_sys_call_table = (void *) kallsyms_lookup_name("sys_call_table");
That's it. No addresses, it works fine with every kernel I've tested.
The same way you can use a not exported Kernel function from your module:
static int (*ref_access_remote_vm)(struct mm_struct *mm, unsigned long addr,
void *buf, int len, int write);
ref_access_remote_vm = (void *)kallsyms_lookup_name("access_remote_vm");
Enjoy!
As others have hinted, the whole story is a bit different now on modern kernels. I'll be covering x86-64 here, for syscall hijacking on modern arm64 refer to this other answer of mine. Also NOTE: this is plain and simple syscall hijacking. Non-invasive hooking can be done in a much nicer way using kprobes.
Since Linux v4.17, x86 (both 64 and 32 bit) now uses syscall wrappers that take a struct pt_regs * as the only argument (see commit 1, commit 2). You can see arch/x86/include/asm/syscall.h for the definitions.
Additionally, as others have described already in different answers, the simplest way to modify sys_call_table is to temporarily disable CR0 WP (Write-Protect) bit, which could be done using read_cr0() and write_cr0(). However, since Linux v5.3, [native_]write_cr0 will check sensitive bits that should never change (like WP) and refuse to change them (commit). In order to work around this, we need to write CR0 manually using inline assembly.
Here is a working kernel module (tested on Linux 5.10 and 5.18) that does syscall hijacking on modern Linux x86-64 considering the above caveats and assuming that you already know the address of sys_call_table (if you also want to find that in the module, see Proper way of getting the address of non-exported kernel symbols in a Linux kernel module):
// SPDX-License-Identifier: (GPL-2.0 OR MIT)
/**
* Test syscall table hijacking on x86-64. This module will replace the `read`
* syscall with a simple wrapper which logs every invocation of `read` using
* printk().
*
* Tested on Linux x86-64 v5.10, v5.18.
*
* Usage:
*
* sudo cat /proc/kallsyms | grep sys_call_table # grab address
* sudo insmod syscall_hijack.ko sys_call_table_addr=0x<address_here>
*/
#include <linux/init.h> // module_{init,exit}()
#include <linux/module.h> // THIS_MODULE, MODULE_VERSION, ...
#include <linux/kernel.h> // printk(), pr_*()
#include <asm/special_insns.h> // {read,write}_cr0()
#include <asm/processor-flags.h> // X86_CR0_WP
#include <asm/unistd.h> // __NR_*
#ifdef pr_fmt
#undef pr_fmt
#endif
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
typedef long (*sys_call_ptr_t)(const struct pt_regs *);
static sys_call_ptr_t *real_sys_call_table;
static sys_call_ptr_t original_read;
static unsigned long sys_call_table_addr;
module_param(sys_call_table_addr, ulong, 0);
MODULE_PARM_DESC(sys_call_table_addr, "Address of sys_call_table");
// Since Linux v5.3 [native_]write_cr0 won't change "sensitive" CR0 bits, need
// to re-implement this ourselves.
static void write_cr0_unsafe(unsigned long val)
{
asm volatile("mov %0,%%cr0": "+r" (val) : : "memory");
}
static long myread(const struct pt_regs *regs)
{
pr_info("read(%ld, 0x%lx, %lx)\n", regs->di, regs->si, regs->dx);
return original_read(regs);
}
static int __init modinit(void)
{
unsigned long old_cr0;
real_sys_call_table = (typeof(real_sys_call_table))sys_call_table_addr;
pr_info("init\n");
// Temporarily disable CR0 WP to be able to write to read-only pages
old_cr0 = read_cr0();
write_cr0_unsafe(old_cr0 & ~(X86_CR0_WP));
// Overwrite syscall and save original to be restored later
original_read = real_sys_call_table[__NR_read];
real_sys_call_table[__NR_read] = myread;
// Restore CR0 WP
write_cr0_unsafe(old_cr0);
pr_info("init done\n");
return 0;
}
static void __exit modexit(void)
{
unsigned long old_cr0;
pr_info("exit\n");
old_cr0 = read_cr0();
write_cr0_unsafe(old_cr0 & ~(X86_CR0_WP));
// Restore original syscall
real_sys_call_table[__NR_read] = original_read;
write_cr0_unsafe(old_cr0);
pr_info("goodbye\n");
}
module_init(modinit);
module_exit(modexit);
MODULE_VERSION("0.1");
MODULE_DESCRIPTION("Test syscall table hijacking on x86-64.");
MODULE_AUTHOR("Marco Bonelli");
MODULE_LICENSE("Dual MIT/GPL");

error: undefined reference to `sched_setaffinity' on windows xp

Basically the code below was intended for use on linux and maybe thats the reason I get the error because I'm using windows XP, but I figure that pthreads should work just as well on both machines. I'm using gcc as my compiler and I did link with -lpthread but I got the following error anyways.
|21|undefined reference to sched_setaffinity'|
|30|undefined reference tosched_setaffinity'|
If there is another method to setting the thread affinity using pthreads (on windows) let me know. I already know all about the windows.h thread affinity functions available but I want to keep things multiplatform. thanks.
#include <stdio.h>
#include <math.h>
#include <sched.h>
double waste_time(long n)
{
double res = 0;
long i = 0;
while(i <n * 200000)
{
i++;
res += sqrt (i);
}
return res;
}
int main(int argc, char **argv)
{
unsigned long mask = 1; /* processor 0 */
/* bind process to processor 0 */
if (sched_setaffinity(0, sizeof(mask), &mask) <0)//line 21
{
perror("sched_setaffinity");
}
/* waste some time so the work is visible with "top" */
printf ("result: %f\n", waste_time (2000));
mask = 2; /* process switches to processor 1 now */
if (sched_setaffinity(0, sizeof(mask), &mask) <0)//line 30
{
perror("sched_setaffinity");
}
/* waste some more time to see the processor switch */
printf ("result: %f\n", waste_time (2000));
}
sched_getaffinity() and sched_setaffinity() are strictly Linux-specific calls. Windows provides its own set of specific Win32 API calls that affect scheduling. See this answer for sample code for Windows.

linux ptrace() get function information

i want to catch information from user defined function using ptrace() calls.
but function address is not stable(because ASLR).
how can i get another program's function information like gdb programmatically?
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <sys/ptrace.h>
#include <dlfcn.h>
#include <errno.h>
void error(char *msg)
{
perror(msg);
exit(-1);
}
int main(int argc, char **argv)
{
long ret = 0;
void *handle;
pid_t pid = 0;
struct user_regs_struct regs;
int *hackme_addr = 0;
pid = atoi(argv[1]);
ret = ptrace(PTRACE_ATTACH, pid, NULL, NULL);
if(ret<0)
{
error("ptrace() error");
}
ret = waitpid(pid, NULL, WUNTRACED);
if(ret<0)
{
error("waitpid ()");
}
ret = ptrace(PTRACE_GETREGS, pid, NULL, &regs);
if(ret<0)
{
error("GETREGS error");
}
printf("EIP : 0x%x\n", (int)regs.eip);
ptrace(PTRACE_DETACH, pid, NULL, NULL);
return 0;
}
ptrace is a bit ugly, but it can be useful.
Here's a ptrace example program; it's used to make I/O-related system calls pause.
http://stromberg.dnsalias.org/~strombrg/slowdown/
You could of course also study gdb, but ISTR it's pretty huge.
You might also check out strace and ltrace, perhaps especially ltrace since it lists symbols.
HTH
You probably want to call a function that resides in a specific executable (probably, a shared object). So, first, you will have to find the base address this executable is mapped on using
/proc/pid/maps
After that, you need to find the local offset of the function you are interested in, and you can do this in two ways:
Understand the ELF file format (Linux native executable format), and searching the desired function using the mapped file (This requires some specialty)
Using a ready to use elfparser (probably readelf tool) to get the function offset under the executable. Note that you will have to figure out the real local offset since this tool usually gives you the address as if the executable was mapped to a specific address

Resources